I am testing full-text search for our Windows application by creating and querying test data through Developer, loaded from actual documents in our system. Although I have years of SQL query experience with both SQL Server and Oracle, full-text indexing is completely new to me.

We currently store some file attachments in an NCLOB as a base64 string, which works because the data can be any type of file. Going forward we plan to use varbinary(max) in SQL Server and a BLOB in Oracle.

SQL Server is fairly straightforward here: it detects the language of the PDFs stored in the database and returns the results I expect for Arabic, Chinese, English, and French. Oracle also returns results, but I am confused about which lexer I need. If I create the index with the world or auto lexer, I only get results for single-byte languages when I query in the language of the document for a word I know is in the document. If I use the Chinese lexer, I get results for all 4 languages. Note that I have tried this with and without a language and charset column specified in the index parameters, and I seem to get the correct results without these columns.
I expected the name "World" to imply a larger character set than "Chinese". Is this result from the Chinese lexer expected, or am I doing something wrong with the World lexer?
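For readers unfamiliar with the syntax, an index that names language and charset columns might look roughly like the sketch below. The column names doc_lang and doc_charset are hypothetical (they do not appear in the original post), and the sketch assumes the MYLEXER preference from the statements that follow already exists. It only illustrates where the columns go in the parameter string; it does not settle whether they affect the results with a single lexer.

```sql
-- Sketch only: doc_lang and doc_charset are assumed (hypothetical) columns on my_docs.
-- Any existing index named my_docs_doc_idx has to be dropped before this runs.
CREATE INDEX my_docs_doc_idx ON my_docs(doc)
INDEXTYPE IS CTXSYS.CONTEXT
PARAMETERS ('LEXER MYLEXER LANGUAGE COLUMN doc_lang CHARSET COLUMN doc_charset');
```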
exec ctx_ddl.create_preference('MYLEXER', 'WORLD_LEXER');
-- World lexer: returns only English and French results
CREATE INDEX my_docs_doc_idx ON my_docs(doc)
INDEXTYPE IS CTXSYS.CONTEXT
PARAMETERS ('LEXER MYLEXER');
-- Drop the first index before recreating it under the same name with another lexer
DROP INDEX my_docs_doc_idx;
exec ctx_ddl.create_preference('CHINESE', 'CHINESE_LEXER');
-- Chinese lexer: returns Arabic, Chinese, English, and French results
CREATE INDEX my_docs_doc_idx ON my_docs(doc)
INDEXTYPE IS CTXSYS.CONTEXT
PARAMETERS ('LEXER CHINESE');
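A minimal sketch of how each index could be queried and inspected after it is built, so the two lexers can be compared on the same data. The search term is a placeholder, not a word from the actual documents. The table DR$MY_DOCS_DOC_IDX$I is the token table Oracle Text generates for an index named my_docs_doc_idx; looking at it can show whether the lexer produced any Arabic or Chinese tokens at all.

```sql
-- Count documents matching a known word ('known_word' is a placeholder).
SELECT COUNT(*)
FROM my_docs
WHERE CONTAINS(doc, 'known_word') > 0;

-- Check whether any documents failed during filtering or indexing.
SELECT err_timestamp, err_textkey, err_text
FROM ctx_user_index_errors
WHERE err_index_name = 'MY_DOCS_DOC_IDX'
ORDER BY err_timestamp DESC;

-- Look at a sample of the tokens the lexer actually stored.
SELECT token_text, token_count
FROM dr$my_docs_doc_idx$i
WHERE ROWNUM <= 50;
```

Running the same three queries after each CREATE INDEX makes it easier to tell a lexer that produced no tokens for a language apart from a query that simply did not match.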