This isn't an especially SQL or PL/SQL question, more a general programming one, but I'd be grateful for any help.
I have a table containing lots of words, called SEARCH_DICTIONARY. Consider it to be just a list of every English word that exists.
I have users submitting search phrases using free-form text
I need to write something that will suggest alternative search phrases to those submitted when those submitted contain a spelling error.
I have a 'spellcheck' function which lists word alternatives based on utl_match.edit_distance_similarity
I now write the following function:
create or replace function search_suggest (searchphrase varchar2)
return varchar2 is
loopcount number;
loopmax number;
v_count number;
goodword varchar2(2000);
badword varchar2(2000);
searchterm varchar2(2000);
goodphrase varchar2(32000);
begin
loopcount := 1;
select listlen(searchphrase,' ') into loopmax from dual;
loop
select listgetat(searchphrase,loopcount,' ') into searchterm from dual;
select count(1) into v_count from search_dictionary where mainterm=searchterm;
if v_count > 0 then
goodword := searchterm;
goodphrase := goodphrase||' '||goodword;
else
badword := spellcheck(searchterm,75);
goodphrase := goodphrase||' '||badword;
end if;
loopcount := loopcount+1;
exit when loopcount>loopmax;
end loop;
RETURN goodphrase;
end;
This sort-of works:
select search_suggest('man with riod tomaghto') from dual;
ALT_SEARCH_SUGGEST('MANWITHRIO
-------------------------------
man with rind,riot,rod,rood tomato
That's correctly spotting that 'tomaghto' should be 'tomato', and although it's not getting "red" for 'riod', I'm not asking for miracles. It will do, since at least it's suggesting valid English words as an alternative.
The key thing is that my SPELLCHECK function spits out a comma-separated list of near-matches for any word it's fed. In this case, when I say that the 'edit-distance-similarity' should be greater than 75, the function tells me that RIND, RIOT, ROD and ROOD are alternatives for the 'riod' originally supplied. It also correctly spots that 'tomato' is the only near-equivalent for 'tomaghto'. The code also correctly spots that 'Man' and 'with' are valid dictionary words, and therefore doesn't pass them to the SPELLCHECK function at all. As I say, it's not perfect, but it will do.
Or rather, it
would do if only I could output my results in the form:
man with rind tomato
man with riot tomato
man with rod tomato
man with rood tomato
-i.e, as four phrases containing all words and all combinations of words, rather than as one phrase containing a chunk of bad word alternatives buried in the middle of the good words as at present.
I am simply getting very confused keeping track of what words are good (and therefore don't need a lookup with the SPELLCHECK function) and which are bad, and therefore am having great difficulty being able to repeat all the good words (in the right order!) for each of the bad ones found by my SPELLCHECK function.
Can anyone suggest a re-working of my code to achieve the desired output, please?