Skip to Main Content

Database Software

Announcement

For appeals, questions and feedback about Oracle Forums, please email oracle-forums-moderators_us@oracle.com. Technical questions should be asked in the appropriate category. Thank you!

Parse and extract information from pdf files in Oracle 12c

Colin ZhengMay 19 2015 — edited May 21 2015

Hi all,

I am new to the Oracle database and Oracle Text. Sorry to have you read the long question but I just want to make it clear.

There are some pdf files stored as BLOB type in the database. All these files have a table of contents which contains some subtitles for the document. For example, the table of contents says the document has a “description”, a “preparation” and an “operation”. What I need to do is to extract the contents below those subtitles respectively and put them into a table with three columns “description”, “preparation” and “operation”.

My first approach is to use ctx_doc.snippet to try to get a snippet around the keywords (like “description”). But the size of the snippet is kind of fixed and I was not able to get a precise result. (when applying “description” as keyword against content like “...ABCDEFG. Description: abcdefg...”, the snippet method returns “DEFG. Description: abcd” but what i want is “abcdefg...”).

The second approach I tried is to customize upon the ctx_doc.markup method. Since this method will return an html version of the pdf file with the keyword surrounded by some special tag(like <<< and >>>), I am able to use the instr method against the html file and get the precise location of each subtitle. What I did is to use the substr method to extract the text among these locations.The following is my tryout on ctx_doc.markup for test purpose:

set serveroutput on;

declare

  mklob clob;

begin

  ctx_doc.markup(

index_name => ‘pdf_index’,

textkey => ‘1’,

text_query => ‘description’,

restab => mklob);

dbms_output.put_line(mklob);

dbms_lob.freetemporary(mklob);

end;

/

It will be able to print out the html version of the pdf if it is just about 3 to 4 pages of text. But when it turns to a bit larger file about 8 to 10 pages of pure text, it seems that the temporary clob mklob is not able to store the html version of the pdf and returns the error as below:

Error report:

ORA-06502: PL/SQL: numeric or value error

ORA-06512: at line 9

06502. 00000 -  "PL/SQL: numeric or value error%s"

Theoretically the clob should be able to hold up to 4GB of text. The Oracle documentation says there is a difference between in-line declared clob and out-of-line declared clob, and it says that an in-line declared clob can just hold up to 4kb of data, anything exceeds that size should be stored as out-of-line clob. But will it be an issue if just storing the data in a temporary clob as what I did in the code above? If it is, anyone has an idea to fix that?

Also I will appreciate any great idea of improving upon my current approach or other smarter ways to attack the problem of extracting information from this kind of pdf files. (I see some folks doing customized filter instead of the auto_filter provided by Oracle which sometimes mess up the order the text in the pdf files. I will appreciate any reference of how to customize a filter to improve the performance.)


Comments
Locked Post
New comments cannot be posted to this locked post.
Post Details
Locked on Jun 18 2015
Added on May 19 2015
8 comments
7,117 views