Oracle Text 11 - Auto Lexer Issue, Language Detection, Alternate Spelling
880175, Aug 5 2011 (edited Aug 10 2011)
Hello,
I'm trying to set up an auto lexer to index documents in English, French, German, Italian and maybe Spanish. I have run into several problems, though some may be due to my own misuse of Oracle Text, as I'm still learning. I would greatly welcome advice and help from the experts here.
-- Version
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
PL/SQL Release 11.2.0.2.0 - Production
CORE 11.2.0.2.0 Production
TNS for Solaris: Version 11.2.0.2.0 - Production
NLSRTL Version 11.2.0.2.0 - Production
--Create a Table
create table mytest(id number primary key, docs clob, lang VARCHAR2(30));
--Populate the Table with Data
INSERT INTO mytest VALUES(1, 'Je vais mourir', 'french');
INSERT INTO mytest VALUES(2, 'Le chien est mort', 'french');
INSERT INTO mytest VALUES(3, 'Le chien était mort', 'french');
INSERT INTO mytest VALUES(4, 'Il est content', 'french');
INSERT INTO mytest VALUES(5, 'Nous sommes heureux', 'french');
INSERT INTO mytest VALUES(6, 'Il fait beau aujourd''hui', 'french');
INSERT INTO mytest VALUES(7, 'Rotes Auto', 'german');
INSERT INTO mytest VALUES(8, 'Roter Zug', 'german');
INSERT INTO mytest VALUES(9, 'Grün, Blau, Rot', 'german');
INSERT INTO mytest VALUES(10, 'Ich bin zufrieden', 'german');
INSERT INTO mytest VALUES(11, 'des seins', 'french');
INSERT INTO mytest VALUES(12, 'Hauptbahnhof', 'german');
INSERT INTO mytest VALUES(13, 'Lokomotivführer', 'german');
commit;
-- Create Index
begin
ctx_ddl.create_preference('single_lexer','auto_lexer');
ctx_ddl.set_attribute('single_lexer','mixed_case','no');
ctx_ddl.set_attribute('single_lexer','base_letter','no');
ctx_ddl.set_attribute('single_lexer','base_letter_type','SPECIFIC');
ctx_ddl.set_attribute('single_lexer', 'index_stems', 'YES');
ctx_ddl.set_attribute('single_lexer', 'german_decompound', 'YES');
ctx_ddl.set_attribute('single_lexer', 'alternate_spelling', 'GERMAN');
end;
/
drop index myindex;
create index myindex on mytest(docs)
indextype is ctxsys.context
parameters ('LANGUAGE COLUMN lang LEXER single_lexer');
-- Problems
1) Id#7 is indexed under the Spanish word 'rotar' instead of the German word 'rot', unlike Id#8 and Id#9. I believed that by specifying a language column I would force the lexer to use the correct stemming dictionary. Is it a bug? Or did I miss something?
2) The following query does not return the expected result:
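For anyone wanting to reproduce the observation in 1), here is a minimal sketch of how the indexed tokens can be listed. It assumes the index is named MYINDEX and was created in the current schema, so that its token table is DR$MYINDEX$I. Plain text tokens have TOKEN_TYPE 0; stem tokens produced by index_stems should show up under a different TOKEN_TYPE.

```sql
-- Assumption: index MYINDEX exists in the current schema,
-- so Oracle Text stores its tokens in the table DR$MYINDEX$I.
SELECT token_text, token_type, token_count
FROM dr$myindex$i
ORDER BY token_text, token_type;
```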
SELECT SCORE(1), id, lang, docs
FROM mytest
WHERE CONTAINS(docs, '<query><textquery lang="german">Gruen</textquery></query>', 1) > 0;
--> No result
This query should return one row (namely Id#9), because with alternate_spelling=GERMAN the spelling 'ue' should match 'ü'. Is it a bug? Or did I miss something?
3) The German compound words in Id#12 (Hauptbahnhof) and Id#13 (Lokomotivführer) should be decomposed into "Haupt", "Bahnhof", "Lokomotiv", and "Führer" according to the german_decompound=YES parameter. In fact, only Id#13 (Lokomotivführer) is correctly decomposed. Previously I had no problem decomposing both words when using the composite=GERMAN parameter of the BASIC_LEXER. Is it a bug? Or did I miss something?
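To rule out a preference that was not applied as intended, the attribute values actually stored for the index can be checked. This is a sketch only, assuming the index name MYINDEX from above and using the standard CTX_USER_INDEX_VALUES view.

```sql
-- Assumption: the index is named MYINDEX and is owned by the current user.
-- Lists the lexer attributes recorded for the index (e.g. ALTERNATE_SPELLING).
SELECT ixv_object, ixv_attribute, ixv_value
FROM ctx_user_index_values
WHERE ixv_index_name = 'MYINDEX'
  AND ixv_class = 'LEXER'
ORDER BY ixv_attribute;
```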
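For comparison, the BASIC_LEXER setup mentioned in 3) would look roughly like the sketch below. The preference name basic_german_lexer is hypothetical and chosen only for illustration; the attributes are the ones described above.

```sql
-- Sketch of the earlier BASIC_LEXER configuration used for comparison.
-- 'basic_german_lexer' is a hypothetical preference name.
begin
  ctx_ddl.create_preference('basic_german_lexer', 'basic_lexer');
  ctx_ddl.set_attribute('basic_german_lexer', 'composite', 'GERMAN');
  ctx_ddl.set_attribute('basic_german_lexer', 'alternate_spelling', 'GERMAN');
end;
/
```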
Thank you very much for any hints or answers that would bring me closer to a solution.
Frederic