Database Software

Announcement

For appeals, questions and feedback about Oracle Forums, please email oracle-forums-moderators_us@oracle.com. Technical questions should be asked in the appropriate category. Thank you!

Text Index Search Unicode Character in HMTL

User_QX1CQOct 13 2016 — edited Oct 14 2016

Oracle 12.1.0.2. I have HTML documents that I am text indexing. There are some cases where I want to search for a check box in the HTML rendered result. When the HTML uses Wingdings characters x or ý or þ for the check box, I can search for those using Oracle Text successfully by simply referencing CHR(120), CHR(253), CHR(254) in the CONTAINS. I know that's not really correct because we're talking about different character sets, but it works well as Oracle Text seems to index those HTML escaped characters as their simple database character set ASCII equivalents using the same number, regardless of what character set the HTML says they are. But when the HTML uses Unicode ☑ or ☒ or ⌧ I cannot search for it (those are all Unicode check boxes). Oracle Text seems to interpret these as non-printable characters and will not index nor search them (or it converts Unicode to its character set and the result is non-printable non-indexable), although it does seem to put some kind of a placeholder for them in the index. Below is a sample to try it. If you replace the '☒' in the HTML with 'þ', then search for CHR(254), it finds a match. But the ☒ gives funky results: if I use CHR(9746) it finds a match. But it finds a match for CHR(9745), CHR(9744), etc. as well (in addition to searching on CHR(43), CHR(18) and similar characters which also match). Basically any non-regular-language character will match. Oracle converts CHR(9746) to CHR(18). But the search will match on CHR(17) (which is what Oracle converts CHR(9745) to) and it matches to CHR(16) (which is what Oracle converts CHR(9744) to) as well. I know I am mixing character sets. I'm just trying to find a way to make a Text Index search match for these few Unicode characters.

How can I get Oracle to index these HTML tagged Unicode characters and then search on them? It's not really a matter of syncing character sets, since these are HTML escaped characters, not the actual Unicode characters themselves. My only workaround for now is to search/replace the underlying HTML document and replace all '☒' with an 'X' or similar (X), which does work (I tried it) and the HTML rendered text will look close with an 'X' instead of a check box. I'd rather not modify the HTML, but this is the only way I can think of to do it. Any ideas that leave the HTML as is?

My database character set:

NLS_NCHAR_CHARACTERSET	AL16UTF16	NCHAR Character set
NLS_CHARACTERSET	WE8MSWIN1252	Character set

Test sample:

create table temp_clob(clob_col clob);

INSERT INTO temp_clob

(clob_col)

VALUES

('<html><head><body><td style="WIDTH: 18pt; VERTICAL-ALIGN: top; align: right">

</td>

<div style="text-align: left; text-indent: 0.8pt; font-family: ''Times New Roman'', Times, serif; font-size: 10pt;">Text after the checkbox</div>

</td>');

BEGIN

--ctx_ddl.drop_preference('SEC_MASTERIDX_BASICLEXER');

ctx_ddl.create_preference('SEC_TEMP_BASICLEXER', 'BASIC_LEXER');

END;

BEGIN

--ctx_ddl.drop_section_group('SEC_MASTERIDX_HTMLSPGROUP');

ctx_ddl.create_section_group('SEC_TEMP_HTMLSPGROUP', 'HTML_SECTION_GROUP');

ctx_ddl.add_special_section('SEC_TEMP_HTMLSPGROUP', 'SENTENCE');

ctx_ddl.add_special_section('SEC_TEMP_HTMLSPGROUP', 'PARAGRAPH');

END;

CREATE INDEX temp_ctxidx ON temp_clob(clob_col)

INDEXTYPE IS ctxsys.context

PARAMETERS('DATASTORE CTXSYS.DEFAULT_DATASTORE

SECTION GROUP SEC_MASTERIDX_HTMLSPGROUP

LEXER SEC_MASTERIDX_BASICLEXER');

--This matches

SELECT tc.clob_col

FROM temp_clob tc

WHERE CONTAINS(tc.clob_col, CHR(9746) || ' text after', 1) > 0

--But so does this

SELECT tc.clob_col

FROM temp_clob tc

WHERE CONTAINS(tc.clob_col, CHR(9745) || ' text after', 1) > 0

--And this

SELECT tc.clob_col

FROM temp_clob tc

WHERE CONTAINS(tc.clob_col, CHR(43) || ' text after', 1) > 0

Again, if I change the ☒ in the HTML to X then I can search on CHR(88) || ' text after', or on 'x text after' and it matches. But it does not match (correctly so) to CHR(87) or similar. So that works. So I know Oracle is trying to index the HTML escaped characters when it can. But it doesn't work with Unicode characters because it doesn't seem to know how to interpret those.

This post has been answered by Bud Light on Oct 13 2016

Jump to Answer

Locked Post

New comments cannot be posted to this locked post.

Locked on Nov 11 2016

Added on Oct 13 2016

#database-key-features, #text

5 comments

1,397 views