USER_FILTER and database character set

Stm, Nov 25 2008 (edited Nov 26 2008)

I'm currently working on integrating a PDF filtering tool into Oracle Text. My current approach is to call a command-line tool via a USER_FILTER preference, and this works fine as long as the database character set is AL32UTF8. The tool produces the filtered text as UTF-8.

I'm now struggling with the case where the database character set is not Unicode, for example LATIN1. I had hoped I could specify a chain of filters when creating the index: first a USER_FILTER to extract the text from the document, then a CHARSET_FILTER to convert the filtered text from UTF-8 into the database character set. This is my attempt to set this up:

execute ctx_ddl.create_preference('my_pdf_datastore', 'FILE_DATASTORE');
execute ctx_ddl.create_preference('my_pdf_filter', 'USER_FILTER');
execute ctx_ddl.set_attribute('my_pdf_filter', 'command', 'tetfilter.bat');
execute ctx_ddl.create_preference('my_cs_filter', 'CHARSET_FILTER');
execute ctx_ddl.set_attribute('my_cs_filter', 'charset', 'UTF8');
create index tetindex on pdftable (pdffile) indextype is ctxsys.context parameters ('datastore my_pdf_datastore filter my_pdf_filter filter my_cs_filter');

These are the error messages I'm getting (sorry, German Windows):
FEHLER in Zeile 1:
ORA-29855: Fehler bei Ausführung der Routine ODCIINDEXCREATE
ORA-20000: Oracle Text-Fehler:
DRG-11004: Doppelter oder unvereinbarer Wert für FILTER
ORA-06512: in "CTXSYS.DRUE", Zeile 160

The relevant message is DRG-11004, which translates to "duplicate or incompatible value for FILTER".

So here is my question:

Do I understand correctly that USER_FILTER always expects the filtered text in the database character set, so the tool must produce its output in that character set? Or are there any alternatives?
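
One alternative I can think of, as a rough sketch only, would be to wrap the tool in a small script that does the conversion itself and to register that wrapper as the USER_FILTER command. The Python below assumes that the filter command is called with an input file name and an output file name, that tetfilter.bat accepts the same two arguments and writes UTF-8, and that "iso-8859-1" is the right codec for the database character set. The function name, the temporary file name, and the codec choice are made up for illustration and would need checking against the actual setup.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Assumption: the extraction tool takes "<input> <output>" and writes UTF-8.
EXTRACT_CMD = ["tetfilter.bat"]
# Assumption: Python codec corresponding to the database character set.
TARGET_ENCODING = "iso-8859-1"


def transcode_utf8_file(src, dst, target_encoding=TARGET_ENCODING):
    """Re-encode a UTF-8 text file into target_encoding.

    "utf-8-sig" accepts input with or without a byte order mark.
    Characters that do not exist in the target encoding become '?'.
    """
    text = Path(src).read_text(encoding="utf-8-sig")
    Path(dst).write_bytes(text.encode(target_encoding, errors="replace"))


def main(argv):
    # Hypothetical wrapper entry point: argv[1] is the document to filter,
    # argv[2] is the file the converted text should be written to.
    input_path, output_path = argv[1], argv[2]
    with tempfile.TemporaryDirectory() as tmp:
        utf8_path = Path(tmp) / "filtered.txt"
        # Run the real filter first, then convert its UTF-8 output.
        subprocess.run(EXTRACT_CMD + [input_path, str(utf8_path)], check=True)
        transcode_utf8_file(utf8_path, output_path)


# Only run as a filter when called with exactly two file arguments.
if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv)
```

The obvious drawback is that any character missing from the target character set is silently replaced, so text outside LATIN1 would no longer be searchable.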
