Oracle 12.1.0.2. I have HTML documents that I am text indexing. There are some cases where I want to search for a check box in the HTML rendered result. When the HTML uses Wingdings characters x or ý or þ for the check box, I can search for those using Oracle Text successfully by simply referencing CHR(120), CHR(253), CHR(254) in the CONTAINS. I know that's not really correct because we're talking about different character sets, but it works well as Oracle Text seems to index those HTML escaped characters as their simple database character set ASCII equivalents using the same number, regardless of what character set the HTML says they are. But when the HTML uses Unicode ☑ or ☒ or ⌧ I cannot search for it (those are all Unicode check boxes). Oracle Text seems to interpret these as non-printable characters and will not index nor search them (or it converts Unicode to its character set and the result is non-printable non-indexable), although it does seem to put some kind of a placeholder for them in the index. Below is a sample to try it. If you replace the '☒' in the HTML with 'þ', then search for CHR(254), it finds a match. But the ☒ gives funky results: if I use CHR(9746) it finds a match. But it finds a match for CHR(9745), CHR(9744), etc. as well (in addition to searching on CHR(43), CHR(18) and similar characters which also match). Basically any non-regular-language character will match. Oracle converts CHR(9746) to CHR(18). But the search will match on CHR(17) (which is what Oracle converts CHR(9745) to) and it matches to CHR(16) (which is what Oracle converts CHR(9744) to) as well. I know I am mixing character sets. I'm just trying to find a way to make a Text Index search match for these few Unicode characters.
How can I get Oracle to index these HTML tagged Unicode characters and then search on them? It's not really a matter of syncing character sets, since these are HTML escaped characters, not the actual Unicode characters themselves. My only workaround for now is to search/replace the underlying HTML document and replace all '☒' with an 'X' or similar (X), which does work (I tried it) and the HTML rendered text will look close with an 'X' instead of a check box. I'd rather not modify the HTML, but this is the only way I can think of to do it. Any ideas that leave the HTML as is?
My database character set:
NLS_NCHAR_CHARACTERSET | AL16UTF16 | NCHAR Character set |
NLS_CHARACTERSET | WE8MSWIN1252 | Character set |
Test sample:
create table temp_clob(clob_col clob);
INSERT INTO temp_clob
(clob_col)
VALUES
('<html><head><body><td style="WIDTH: 18pt; VERTICAL-ALIGN: top; align: right">
<div style="text-align: left; text-indent: 0.8pt; font-family: ''Segoe UI Symbol'', sans-serif; font-size: 10pt;">☒</div>
</td>
<td style="WIDTH: auto; VERTICAL-ALIGN: top; align: left">
<div style="text-align: left; text-indent: 0.8pt; font-family: ''Times New Roman'', Times, serif; font-size: 10pt;">Text after the checkbox</div>
</td>');
BEGIN
--ctx_ddl.drop_preference('SEC_MASTERIDX_BASICLEXER');
ctx_ddl.create_preference('SEC_TEMP_BASICLEXER', 'BASIC_LEXER');
END;
BEGIN
--ctx_ddl.drop_section_group('SEC_MASTERIDX_HTMLSPGROUP');
ctx_ddl.create_section_group('SEC_TEMP_HTMLSPGROUP', 'HTML_SECTION_GROUP');
ctx_ddl.add_special_section('SEC_TEMP_HTMLSPGROUP', 'SENTENCE');
ctx_ddl.add_special_section('SEC_TEMP_HTMLSPGROUP', 'PARAGRAPH');
END;
CREATE INDEX temp_ctxidx ON temp_clob(clob_col)
INDEXTYPE IS ctxsys.context
PARAMETERS('DATASTORE CTXSYS.DEFAULT_DATASTORE
SECTION GROUP SEC_MASTERIDX_HTMLSPGROUP
LEXER SEC_MASTERIDX_BASICLEXER');
--This matches
SELECT tc.clob_col
FROM temp_clob tc
WHERE CONTAINS(tc.clob_col, CHR(9746) || ' text after', 1) > 0
--But so does this
SELECT tc.clob_col
FROM temp_clob tc
WHERE CONTAINS(tc.clob_col, CHR(9745) || ' text after', 1) > 0
--And this
SELECT tc.clob_col
FROM temp_clob tc
WHERE CONTAINS(tc.clob_col, CHR(43) || ' text after', 1) > 0
Again, if I change the ☒ in the HTML to X then I can search on CHR(88) || ' text after', or on 'x text after' and it matches. But it does not match (correctly so) to CHR(87) or similar. So that works. So I know Oracle is trying to index the HTML escaped characters when it can. But it doesn't work with Unicode characters because it doesn't seem to know how to interpret those.