Thread: Including/parsing out text from within document in query..


Permlink Replies: 7 - Pages: 1 - Last Post: Jan 19, 2006 8:24 PM Last Post By: iamnoel
iamnoel

Posts: 62
Registered: 05/02/01
Including/parsing out text from within document in query..
Posted: Jan 13, 2006 4:22 PM
Click to report abuse...   Click to reply to this thread Reply
I would like to know if there is a way to include text from inside a document in the data yielded by an Oracle Text CONTAINS query. I've got a table with the fields: id, item (in which 'item' is a blob including word docs and pdf files), in which case I could execute this query:

select id from items
where contains(item, '%content%', 1) > 0;

I want to then, in this case, collect the e-mail address(es) in each document yielded by the query to provide back to the client. I'd imagine I could write an expression to filter these out, illustrated not very well perhaps in this [pseudo]query:

select id, (item ~ '^.@.
..+$') into email from items
where contains(item, '%content%', 1) > 0;

In this case, 'email' would include each instance of text within the doc which conformed to the expression in the query. I'm not certain if this is even close to what I'd need to do; but is there a way to do this?

Currently installed: Oracle Database 10g Express Edition 10.2.0.1.0 - Beta

Thank you.
iamnoel

Posts: 62
Registered: 05/02/01
Re: Including/parsing out text from within document in query..
Posted: Jan 18, 2006 7:09 AM   in response to: iamnoel in response to: iamnoel
Click to report abuse...   Click to reply to this thread Reply
If this can't be done; I'd be interested in why - conceptually, the content's been indexed and is therefore identifiable within the database, in which case it's odd to me that I couldn't parse out certain parts of it.

Is there documentation on this which could provide insight on why I'd be incorrect on that? Thanks.

In either case, for now what I'm thinking I'll do is parse out the content of interest with xpdf/pdftotext prior to importing, than include this with the blob; but it just feels inefficient/wrong..
Barbara Boehmer

Posts: 2,840
Registered: 04/28/00
Re: Including/parsing out text from within document in query..
Posted: Jan 18, 2006 11:21 AM   in response to: iamnoel in response to: iamnoel
Click to report abuse...   Click to reply to this thread Reply
If you specify printjoins for "@" and ".", then your emails will be stored in the token_text column of dr$your_index_name$i. Otherwise an email address of person@company.org would be parsed and stored as separate tokens of person and company and org.

The biggest part of the problem is identifying which row of dr$your_index_name$i is associated with which row of your table, so that you can join them and extract the email from the correct token_text. There is a dr$your_index_name$k that contains the rowid in the testkey column that corresponds to the rowid in your table and also contains a docid that seems to correspond to the range identified by the token_first and token_last columns in the dr$your_index_name$i. The problem is that, since it is a range, if you only use the first and last for joining, then you may miss valid values inbetween and if you use the whole range, then you may get additional invalid values. So, although these things must be readily available to Oracle internally somehow, they are not readily available to the user. Please see the demonstration below that shows the problem.

For this demonstration, the contents of the word documents are as follows:

-- contents of c:\oracle\word_do1.doc:
some content from somebody@hotmail.com

--- contents of c:\oracle\word_do2.doc:
some content from anybody@whatever.org and somebody@hotmail.com

-- contents of c:\oracle\word_do3.doc:
some content from nobody@hotmail.com

-- contents of c:\oracle\word_do4.doc:
Some content from somebody@hotmail.com and everybody@yahoo.com

scott@ORA92> -- set up test data:
scott@ORA92> CREATE TABLE items
  2    (id NUMBER,
  3  	item BLOB)
  4  / 
 
Table created.
 
scott@ORA92> INSERT ALL
  2  INTO items (id, item) VALUES (10, EMPTY_BLOB())
  3  INTO items (id, item) VALUES (20, EMPTY_BLOB())
  4  INTO items (id, item) VALUES (30, EMPTY_BLOB())
  5  INTO items (id, item) VALUES (40, EMPTY_BLOB())
  6  SELECT * FROM DUAL
  7  / 
 
4 rows created.
 
scott@ORA92> CREATE OR REPLACE DIRECTORY my_dir AS 'c:\oracle'
  2  / 
 
Directory created.
 
scott@ORA92> DECLARE
  2    v_blob  BLOB;
  3    v_bfile BFILE;
  4  BEGIN
  5    SELECT item
  6    INTO   v_blob
  7    FROM   items
  8    WHERE  id = 10
  9    FOR UPDATE;
 10    v_bfile := BFILENAME ('MY_DIR', 'word_doc1.doc');
 11    DBMS_LOB.FILEOPEN (v_bfile, DBMS_LOB.FILE_READONLY);
 12    DBMS_LOB.LOADFROMFILE (v_blob, v_bfile, DBMS_LOB.GETLENGTH (v_bfile));
 13    DBMS_LOB.FILECLOSE (v_bfile);
 14  
 15    SELECT item
 16    INTO   v_blob
 17    FROM   items
 18    WHERE  id = 20
 19    FOR UPDATE;
 20    v_bfile := BFILENAME ('MY_DIR', 'word_doc2.doc');
 21    DBMS_LOB.FILEOPEN (v_bfile, DBMS_LOB.FILE_READONLY);
 22    DBMS_LOB.LOADFROMFILE (v_blob, v_bfile, DBMS_LOB.GETLENGTH (v_bfile));
 23    DBMS_LOB.FILECLOSE (v_bfile);
 24  
 25    SELECT item
 26    INTO   v_blob
 27    FROM   items
 28    WHERE  id = 30
 29    FOR UPDATE;
 30    v_bfile := BFILENAME ('MY_DIR', 'word_doc3.doc');
 31    DBMS_LOB.FILEOPEN (v_bfile, DBMS_LOB.FILE_READONLY);
 32    DBMS_LOB.LOADFROMFILE (v_blob, v_bfile, DBMS_LOB.GETLENGTH (v_bfile));
 33    DBMS_LOB.FILECLOSE (v_bfile);
 34  
 35    SELECT item
 36    INTO   v_blob
 37    FROM   items
 38    WHERE  id = 40
 39    FOR UPDATE;
 40    v_bfile := BFILENAME ('MY_DIR', 'word_doc4.doc');
 41    DBMS_LOB.FILEOPEN (v_bfile, DBMS_LOB.FILE_READONLY);
 42    DBMS_LOB.LOADFROMFILE (v_blob, v_bfile, DBMS_LOB.GETLENGTH (v_bfile));
 43    DBMS_LOB.FILECLOSE (v_bfile);
 44  END;
 45  / 
 
PL/SQL procedure successfully completed.
 
scott@ORA92> -- create index using preference with printjoins:
scott@ORA92> EXEC CTX_DDL.CREATE_PREFERENCE ('my_lexer', 'BASIC_LEXER')
 
PL/SQL procedure successfully completed.
 
scott@ORA92> EXEC CTX_DDL.SET_ATTRIBUTE ('my_lexer', 'PRINTJOINS', '@.')
 
PL/SQL procedure successfully completed.
 
scott@ORA92> CREATE INDEX items_idx
  2  ON items (item)
  3  INDEXTYPE IS CTXSYS.CONTEXT
  4  PARAMETERS ('LEXER my_lexer')
  5  / 
 
Index created.
 
scott@ORA92> -- additional data that Oracle context stores:
scott@ORA92> COLUMN token_text FORMAT A30
scott@ORA92> SELECT token_text, token_type, token_first, token_last, token_count
  2  FROM   dr$items_idx$i
  3  / 
 
TOKEN_TEXT                     TOKEN_TYPE TOKEN_FIRST TOKEN_LAST TOKEN_COUNT
------------------------------ ---------- ----------- ---------- -----------
ANYBODY@WHATEVER.ORG                    0           2          2           1
CONTENT                                 0           1          4           4
EVERYBODY@YAHOO.COM                     0           4          4           1
NOBODY@HOTMAIL.COM                      0           3          3           1
SOMEBODY@HOTMAIL.COM                    0           1          4           3
 
scott@ORA92> DESC dr$items_idx$k
 Name                                                  Null?    Type
 ----------------------------------------------------- -------- ------------------------------------
 DOCID                                                          NUMBER(38)
 TEXTKEY                                               NOT NULL ROWID
 
scott@ORA92> SELECT * FROM dr$items_idx$k
  2  / 
 
     DOCID TEXTKEY
---------- ------------------
         1 AAAdx4AABAAANQ6AAA
         2 AAAdx4AABAAANQ6AAB
         3 AAAdx4AABAAANQ6AAC
         4 AAAdx4AABAAANQ6AAD
 
scott@ORA92> -- queries:
scott@ORA92> SELECT id, ROWID
  2  FROM   items
  3  WHERE  CONTAINS (item, '%content%') > 0
  4  ORDER  BY id
  5  / 
 
        ID ROWID
---------- ------------------
        10 AAAdx4AABAAANQ6AAA
        20 AAAdx4AABAAANQ6AAB
        30 AAAdx4AABAAANQ6AAC
        40 AAAdx4AABAAANQ6AAD
 
scott@ORA92> COLUMN email FORMAT A30
scott@ORA92> SELECT items.id,
  2  	    dr$items_idx$i.token_text AS email
  3  FROM   items,
  4  	    dr$items_idx$k,
  5  	    dr$items_idx$i
  6  WHERE  CONTAINS (items.item, '%content%') > 0
  7  AND    dr$items_idx$i.token_text LIKE '%@%.%'
  8  AND    items.ROWID = dr$items_idx$k.textkey
  9  -- this last join condition misses some correct results:
 10  AND    (dr$items_idx$k.docid = dr$items_idx$i.token_first
 11  	     OR dr$items_idx$k.docid = dr$items_idx$i.token_last)
 12  ORDER  BY id, email
 13  / 
 
        ID EMAIL
---------- ------------------------------
        10 SOMEBODY@HOTMAIL.COM
        20 ANYBODY@WHATEVER.ORG
        30 NOBODY@HOTMAIL.COM
        40 EVERYBODY@YAHOO.COM
        40 SOMEBODY@HOTMAIL.COM
 
scott@ORA92> SELECT items.id,
  2  	    dr$items_idx$i.token_text AS email
  3  FROM   items,
  4  	    dr$items_idx$k,
  5  	    dr$items_idx$i
  6  WHERE  CONTAINS (items.item, '%content%') > 0
  7  AND    dr$items_idx$i.token_text LIKE '%@%.%'
  8  AND    items.ROWID = dr$items_idx$k.textkey
  9  -- this last join condition includes incorrect results:
 10  AND    dr$items_idx$k.docid
 11  	    BETWEEN dr$items_idx$i.token_first
 12  	    AND     dr$items_idx$i.token_last
 13  ORDER  BY id, email
 14  / 
 
        ID EMAIL
---------- ------------------------------
        10 SOMEBODY@HOTMAIL.COM
        20 ANYBODY@WHATEVER.ORG
        20 SOMEBODY@HOTMAIL.COM
        30 NOBODY@HOTMAIL.COM
        30 SOMEBODY@HOTMAIL.COM
        40 EVERYBODY@YAHOO.COM
        40 SOMEBODY@HOTMAIL.COM
 
7 rows selected.
Barbara Boehmer

Posts: 2,840
Registered: 04/28/00
Re: Including/parsing out text from within document in query..
Posted: Jan 18, 2006 11:48 AM   in response to: iamnoel in response to: iamnoel
Click to report abuse...   Click to reply to this thread Reply
If your data was stored as varchar2 or clob, instead of blob, you could do something like this:

scott@ORA92> CREATE TABLE items
  2    (id NUMBER,
  3  	item VARCHAR2(100))
  4  / 
 
Table created.
 
scott@ORA92> INSERT ALL
  2  INTO items VALUES (1, 'some content from somebody@hotmail.com')
  3  INTO items VALUES (2, 'some content from anybody@whatever.org and somebody@hotmail.com')
  4  INTO items VALUES (3, 'some content from nobody@hotmail.com')
  5  INTO items VALUES (4, 'Some content from somebody@hotmail.com and everybody@yahoo.com')
  6  SELECT * FROM DUAL
  7  / 
 
4 rows created.
 
scott@ORA92> CREATE INDEX items_idx
  2  ON items (item)
  3  INDEXTYPE IS CTXSYS.CONTEXT
  4  / 
 
Index created.
 
scott@ORA92> CREATE OR REPLACE FUNCTION list_element
  2  	(p_string    VARCHAR2,
  3  	 p_element   INTEGER,
  4  	 p_separator VARCHAR2 DEFAULT ' ')
  5  	RETURN	     VARCHAR2
  6  AS
  7    v_string      VARCHAR2(32767);
  8  BEGIN
  9    v_string := p_string || p_separator;
 10    FOR i IN 1 .. p_element - 1 LOOP
 11  	 v_string := SUBSTR (v_string,
 12  			     INSTR (v_string, p_separator)
 13  			     + LENGTH (p_separator));
 14    END LOOP;
 15    RETURN SUBSTR (v_string, 1, INSTR (v_string, p_separator) - 1);
 16  END list_element;
 17  / 
 
Function created.
 
scott@ORA92> COLUMN email FORMAT A30
scott@ORA92> SELECT id,
  2  	    list_element (item, element) AS email
  3  FROM   items,
  4  	    (SELECT ROWNUM AS element
  5  	     FROM   DUAL CONNECT BY LEVEL < 100) -- checks first 100 words
  6  WHERE  CONTAINS (item, '%content%') > 0
  7  AND    list_element (item, element) LIKE '%@%.%'
  8  ORDER  BY id, email
  9  / 
 
        ID EMAIL
---------- ------------------------------
         1 somebody@hotmail.com
         2 anybody@whatever.org
         2 somebody@hotmail.com
         3 nobody@hotmail.com
         4 everybody@yahoo.com
         4 somebody@hotmail.com
 
6 rows selected.
iamnoel

Posts: 62
Registered: 05/02/01
Re: Including/parsing out text from within document in query..
Posted: Jan 19, 2006 9:16 AM   in response to: Barbara Boehmer in response to: Barbara Boehmer
Click to report abuse...   Click to reply to this thread Reply
Intriguing..thank you for the very informative input on this. I can't come up with a way to improve on it, either..I imagine what I'll do then is execute the 1st query (which yields perhaps fewer than it ought to), provide these to the client as 'certain', then execute the 2nd query and include those not yielded by the 1st as 'not confirmed'. I can include $i.token_count to perhaps improve on this..

I'm just wondering why there is no $i.token_key = $k.textkey. I'd just be interested what the basis is..
Barbara Boehmer

Posts: 2,840
Registered: 04/28/00
Re: Including/parsing out text from within document in query..
Posted: Jan 19, 2006 9:37 AM   in response to: iamnoel in response to: iamnoel
Click to report abuse...   Click to reply to this thread Reply
I believe the following should solve the problem and provide all of the correct information, without any omissions or incorrect data.

scott@ORA92> SELECT t1.id,
  2  	    t2.email
  3  FROM   items t1,
  4  	    (SELECT items.id,
  5  		    dr$items_idx$i.token_text AS email
  6  	     FROM   items,
  7  		    dr$items_idx$k,
  8  		    dr$items_idx$i
  9  	     WHERE  CONTAINS (items.item, '%content%') > 0
 10  	     AND    dr$items_idx$i.token_text LIKE '%@%.%'
 11  	     AND    items.ROWID = dr$items_idx$k.textkey
 12  	     AND    dr$items_idx$k.docid
 13  		    BETWEEN dr$items_idx$i.token_first
 14  		    AND     dr$items_idx$i.token_last) t2
 15  WHERE  t1.id = t2.id
 16  AND    CONTAINS (t1.item, t2.email) > 0
 17  ORDER  BY id, email
 18  / 
 
        ID EMAIL
---------- ------------------------------
        10 SOMEBODY@HOTMAIL.COM
        20 ANYBODY@WHATEVER.ORG
        20 SOMEBODY@HOTMAIL.COM
        30 NOBODY@HOTMAIL.COM
        40 EVERYBODY@YAHOO.COM
        40 SOMEBODY@HOTMAIL.COM
 
6 rows selected.
Barbara Boehmer

Posts: 2,840
Registered: 04/28/00
Re: Including/parsing out text from within document in query..
Posted: Jan 19, 2006 9:51 AM   in response to: iamnoel in response to: iamnoel
Click to report abuse...   Click to reply to this thread Reply
Now that I have thought about it a bit more, I see that the following also works, and more simply and efficiently.

scott@ORA92> SELECT items.id,
  2  	    dr$items_idx$i.token_text AS email
  3  FROM   items,
  4  	    dr$items_idx$i
  5  WHERE  CONTAINS (items.item, '%content%') > 0
  6  AND    dr$items_idx$i.token_text LIKE '%@%.%'
  7  AND    CONTAINS (items.item, dr$items_idx$i.token_text) > 0
  8  ORDER  BY id, email
  9  / 
 
        ID EMAIL
---------- ------------------------------
        10 SOMEBODY@HOTMAIL.COM
        20 ANYBODY@WHATEVER.ORG
        20 SOMEBODY@HOTMAIL.COM
        30 NOBODY@HOTMAIL.COM
        40 EVERYBODY@YAHOO.COM
        40 SOMEBODY@HOTMAIL.COM
 
6 rows selected.
 
scott@ORA92>
iamnoel

Posts: 62
Registered: 05/02/01
Re: Including/parsing out text from within document in query..
Posted: Jan 19, 2006 4:34 PM   in response to: Barbara Boehmer in response to: Barbara Boehmer
Click to report abuse...   Click to reply to this thread Reply
Very nice..this works exactly as intended. I'm indebted to you, certainly. I've written this into a view I can query in code instead, in case the internal index tables change..:

CREATE VIEW idx_email (id, email, count) AS
SELECT i.id, idx_i.token_text AS email, idx_i.token_count AS count
FROM items i, dr$items_idx$i idx_i
WHERE idx_i.token_text LIKE '%@%.%'
AND CONTAINS(i.item, idx_i.token_text) > 0;

SELECT ie.id, ie.email, ie.count
FROM items i, idx_email ie
WHERE CONTAINS(i.item, '%content%') > 0
AND i.id = ie.id
ORDER BY ie.count DESC, ie.email, ie.id;
Legend
Guru Guru : 2500 - 1000000 pts
Expert Expert : 1000 - 2499 pts
Pro Pro : 500 - 999 pts
Journeyman Journeyman : 200 - 499 pts
Newbie Newbie : 0 - 199 pts
Oracle ACE Director
Oracle ACE Member
Oracle Employee ACE
Helpful Answer (5 pts)
Correct Answer (10 pts)

Point your RSS reader here for a feed of the latest messages in all forums