Tokenization in Oracle Text

Sayantan Chatterjee-Oracle, Oct 18 2024 (edited Oct 18 2024)

Hi Community,
My team is exploring the potential use of Oracle Text in our search functionality.
While evaluating whether it fits our exact use case, I have run into a few questions.

With the lexer config below, I can retain the entire word during indexing, but the smaller blocks separated by special characters aren't tokenized. Without the BASIC_LEXER preference, I get only the smaller blocks and not the whole string.
Example:
Without BASIC_LEXER: aab_cdf/e → aab cdf e
With BASIC_LEXER: aab_cdf/e → aab_cdf/e
What I want: aab_cdf/e → aab cdf e aab_cdf/e
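
To make the desired output concrete, here is a rough Python sketch of the behaviour I am after. This is not Oracle Text code; the function name `desired_tokens` and the delimiter set `./_:` (the special characters from my lexer config) are assumptions made purely for illustration.

```python
import re

# Hypothetical delimiter set: the special characters used in the lexer config in this post.
DELIMITERS = "./_:"

def desired_tokens(text, delimiters=DELIMITERS):
    """Return the sub-tokens of each word, followed by the whole word if it was split."""
    splitter = re.compile("[" + re.escape(delimiters) + "]")
    tokens = []
    for word in text.split():
        # Break the word on every delimiter and drop empty fragments.
        parts = [p for p in splitter.split(word) if p]
        tokens.extend(parts)
        # Also keep the original word when splitting changed it.
        if len(parts) > 1 or (parts and parts[0] != word):
            tokens.append(word)
    return tokens

print(desired_tokens("aab_cdf/e"))  # ['aab', 'cdf', 'e', 'aab_cdf/e']
```

In other words, a search for either `cdf` or `aab_cdf/e` should match the same document.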

Lexer config used:
exec ctx_ddl.create_preference('quote_lexer', 'BASIC_LEXER');
exec ctx_ddl.set_attribute('quote_lexer', 'printjoins', './_:');
exec ctx_ddl.set_attribute('quote_lexer', 'whitespace', './_: ');
exec ctx_ddl.set_attribute('quote_lexer', 'index_themes', 'NO');
exec ctx_ddl.set_attribute('quote_lexer', 'index_text', 'YES');

Can we create ngrams during tokenization (for English text)?
For context, this is what I mean by ngrams:
An ngram is a contiguous sequence of n characters from a given sequence of text. The ngram parser tokenizes a sequence of text into a contiguous sequence of n characters. For example, you can tokenize "abcd" for different values of n using the ngram full-text parser.
n=1: 'a', 'b', 'c', 'd'
n=2: 'ab', 'bc', 'cd'
n=3: 'abc', 'bcd'
n=4: 'abcd'
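
The same definition as a minimal Python sketch, only to pin down what I mean by character ngrams. It is not tied to any Oracle Text feature, and the helper name `char_ngrams` is hypothetical.

```python
def char_ngrams(text, n):
    """Return every contiguous run of n characters in text, left to right."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # There are len(text) - n + 1 windows; none if n is longer than the text.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

for n in range(1, 5):
    print(n, char_ngrams("abcd", n))
# 1 ['a', 'b', 'c', 'd']
# 2 ['ab', 'bc', 'cd']
# 3 ['abc', 'bcd']
# 4 ['abcd']
```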

When using a MULTI_COLUMN_DATASTORE preference, what is the relevance of naming one of the columns when creating the index, even though the data from all the columns is concatenated?

Is a search query against an indexed column tokenized the same way the data was tokenized at indexing time before the search runs, or is a blind search done? And can we control this?

What scoring mechanism is used to rank results, and can we control it?

Any help or a pointer in the right direction would be greatly appreciated.
