|
Replies:
7
-
Pages:
1
-
Last Post:
Jan 19, 2006 8:24 PM
Last Post By: iamnoel
|
|
|
Posts:
62
Registered:
05/02/01
|
|
|
|
Including/parsing out text from within document in query..
Posted:
Jan 13, 2006 4:22 PM
|
|
|
|
I would like to know if there is a way to include text from inside a document in the data yielded by an Oracle Text CONTAINS query. I've got a table with the fields: id, item (in which 'item' is a blob including word docs and pdf files), in which case I could execute this query:
select id from items
where contains(item, '%content%', 1) > 0;
I want to then, in this case, collect the e-mail address(es) in each document yielded by the query to provide back to the client. I'd imagine I could write an expression to filter these out, illustrated not very well perhaps in this [pseudo]query:
select id, (item ~ '^.@. ..+$') into email from items
where contains(item, '%content%', 1) > 0;
In this case, 'email' would include each instance of text within the doc which conformed to the expression in the query. I'm not certain if this is even close to what I'd need to do; but is there a way to do this?
Currently installed: Oracle Database 10g Express Edition 10.2.0.1.0 - Beta
Thank you.
|
|
|
Posts:
62
Registered:
05/02/01
|
|
|
|
Re: Including/parsing out text from within document in query..
Posted:
Jan 18, 2006 7:09 AM
in response to: iamnoel
|
|
|
|
If this can't be done; I'd be interested in why - conceptually, the content's been indexed and is therefore identifiable within the database, in which case it's odd to me that I couldn't parse out certain parts of it.
Is there documentation on this which could provide insight on why I'd be incorrect on that? Thanks.
In either case, for now what I'm thinking I'll do is parse out the content of interest with xpdf/pdftotext prior to importing, than include this with the blob; but it just feels inefficient/wrong..
|
|
|
Posts:
2,840
Registered:
04/28/00
|
|
|
|
Re: Including/parsing out text from within document in query..
Posted:
Jan 18, 2006 11:21 AM
in response to: iamnoel
|
|
|
If you specify printjoins for "@" and ".", then your emails will be stored in the token_text column of dr$your_index_name$i. Otherwise an email address of person@company.org would be parsed and stored as separate tokens of person and company and org.
The biggest part of the problem is identifying which row of dr$your_index_name$i is associated with which row of your table, so that you can join them and extract the email from the correct token_text. There is a dr$your_index_name$k that contains the rowid in the testkey column that corresponds to the rowid in your table and also contains a docid that seems to correspond to the range identified by the token_first and token_last columns in the dr$your_index_name$i. The problem is that, since it is a range, if you only use the first and last for joining, then you may miss valid values inbetween and if you use the whole range, then you may get additional invalid values. So, although these things must be readily available to Oracle internally somehow, they are not readily available to the user. Please see the demonstration below that shows the problem.
For this demonstration, the contents of the word documents are as follows:
-- contents of c:\oracle\word_do1.doc:
some content from somebody@hotmail.com
--- contents of c:\oracle\word_do2.doc:
some content from anybody@whatever.org and somebody@hotmail.com
-- contents of c:\oracle\word_do3.doc:
some content from nobody@hotmail.com
-- contents of c:\oracle\word_do4.doc:
Some content from somebody@hotmail.com and everybody@yahoo.com
scott@ORA92> -- set up test data:
scott@ORA92> CREATE TABLE items
2 (id NUMBER,
3 item BLOB)
4 /
Table created.
scott@ORA92> INSERT ALL
2 INTO items (id, item) VALUES (10, EMPTY_BLOB())
3 INTO items (id, item) VALUES (20, EMPTY_BLOB())
4 INTO items (id, item) VALUES (30, EMPTY_BLOB())
5 INTO items (id, item) VALUES (40, EMPTY_BLOB())
6 SELECT * FROM DUAL
7 /
4 rows created.
scott@ORA92> CREATE OR REPLACE DIRECTORY my_dir AS 'c:\oracle'
2 /
Directory created.
scott@ORA92> DECLARE
2 v_blob BLOB;
3 v_bfile BFILE;
4 BEGIN
5 SELECT item
6 INTO v_blob
7 FROM items
8 WHERE id = 10
9 FOR UPDATE;
10 v_bfile := BFILENAME ('MY_DIR', 'word_doc1.doc');
11 DBMS_LOB.FILEOPEN (v_bfile, DBMS_LOB.FILE_READONLY);
12 DBMS_LOB.LOADFROMFILE (v_blob, v_bfile, DBMS_LOB.GETLENGTH (v_bfile));
13 DBMS_LOB.FILECLOSE (v_bfile);
14
15 SELECT item
16 INTO v_blob
17 FROM items
18 WHERE id = 20
19 FOR UPDATE;
20 v_bfile := BFILENAME ('MY_DIR', 'word_doc2.doc');
21 DBMS_LOB.FILEOPEN (v_bfile, DBMS_LOB.FILE_READONLY);
22 DBMS_LOB.LOADFROMFILE (v_blob, v_bfile, DBMS_LOB.GETLENGTH (v_bfile));
23 DBMS_LOB.FILECLOSE (v_bfile);
24
25 SELECT item
26 INTO v_blob
27 FROM items
28 WHERE id = 30
29 FOR UPDATE;
30 v_bfile := BFILENAME ('MY_DIR', 'word_doc3.doc');
31 DBMS_LOB.FILEOPEN (v_bfile, DBMS_LOB.FILE_READONLY);
32 DBMS_LOB.LOADFROMFILE (v_blob, v_bfile, DBMS_LOB.GETLENGTH (v_bfile));
33 DBMS_LOB.FILECLOSE (v_bfile);
34
35 SELECT item
36 INTO v_blob
37 FROM items
38 WHERE id = 40
39 FOR UPDATE;
40 v_bfile := BFILENAME ('MY_DIR', 'word_doc4.doc');
41 DBMS_LOB.FILEOPEN (v_bfile, DBMS_LOB.FILE_READONLY);
42 DBMS_LOB.LOADFROMFILE (v_blob, v_bfile, DBMS_LOB.GETLENGTH (v_bfile));
43 DBMS_LOB.FILECLOSE (v_bfile);
44 END;
45 /
PL/SQL procedure successfully completed.
scott@ORA92> -- create index using preference with printjoins:
scott@ORA92> EXEC CTX_DDL.CREATE_PREFERENCE ('my_lexer', 'BASIC_LEXER')
PL/SQL procedure successfully completed.
scott@ORA92> EXEC CTX_DDL.SET_ATTRIBUTE ('my_lexer', 'PRINTJOINS', '@.')
PL/SQL procedure successfully completed.
scott@ORA92> CREATE INDEX items_idx
2 ON items (item)
3 INDEXTYPE IS CTXSYS.CONTEXT
4 PARAMETERS ('LEXER my_lexer')
5 /
Index created.
scott@ORA92> -- additional data that Oracle context stores:
scott@ORA92> COLUMN token_text FORMAT A30
scott@ORA92> SELECT token_text, token_type, token_first, token_last, token_count
2 FROM dr$items_idx$i
3 /
TOKEN_TEXT TOKEN_TYPE TOKEN_FIRST TOKEN_LAST TOKEN_COUNT
------------------------------ ---------- ----------- ---------- -----------
ANYBODY@WHATEVER.ORG 0 2 2 1
CONTENT 0 1 4 4
EVERYBODY@YAHOO.COM 0 4 4 1
NOBODY@HOTMAIL.COM 0 3 3 1
SOMEBODY@HOTMAIL.COM 0 1 4 3
scott@ORA92> DESC dr$items_idx$k
Name Null? Type
----------------------------------------------------- -------- ------------------------------------
DOCID NUMBER(38)
TEXTKEY NOT NULL ROWID
scott@ORA92> SELECT * FROM dr$items_idx$k
2 /
DOCID TEXTKEY
---------- ------------------
1 AAAdx4AABAAANQ6AAA
2 AAAdx4AABAAANQ6AAB
3 AAAdx4AABAAANQ6AAC
4 AAAdx4AABAAANQ6AAD
scott@ORA92> -- queries:
scott@ORA92> SELECT id, ROWID
2 FROM items
3 WHERE CONTAINS (item, '%content%') > 0
4 ORDER BY id
5 /
ID ROWID
---------- ------------------
10 AAAdx4AABAAANQ6AAA
20 AAAdx4AABAAANQ6AAB
30 AAAdx4AABAAANQ6AAC
40 AAAdx4AABAAANQ6AAD
scott@ORA92> COLUMN email FORMAT A30
scott@ORA92> SELECT items.id,
2 dr$items_idx$i.token_text AS email
3 FROM items,
4 dr$items_idx$k,
5 dr$items_idx$i
6 WHERE CONTAINS (items.item, '%content%') > 0
7 AND dr$items_idx$i.token_text LIKE '%@%.%'
8 AND items.ROWID = dr$items_idx$k.textkey
9 -- this last join condition misses some correct results:
10 AND (dr$items_idx$k.docid = dr$items_idx$i.token_first
11 OR dr$items_idx$k.docid = dr$items_idx$i.token_last)
12 ORDER BY id, email
13 /
ID EMAIL
---------- ------------------------------
10 SOMEBODY@HOTMAIL.COM
20 ANYBODY@WHATEVER.ORG
30 NOBODY@HOTMAIL.COM
40 EVERYBODY@YAHOO.COM
40 SOMEBODY@HOTMAIL.COM
scott@ORA92> SELECT items.id,
2 dr$items_idx$i.token_text AS email
3 FROM items,
4 dr$items_idx$k,
5 dr$items_idx$i
6 WHERE CONTAINS (items.item, '%content%') > 0
7 AND dr$items_idx$i.token_text LIKE '%@%.%'
8 AND items.ROWID = dr$items_idx$k.textkey
9 -- this last join condition includes incorrect results:
10 AND dr$items_idx$k.docid
11 BETWEEN dr$items_idx$i.token_first
12 AND dr$items_idx$i.token_last
13 ORDER BY id, email
14 /
ID EMAIL
---------- ------------------------------
10 SOMEBODY@HOTMAIL.COM
20 ANYBODY@WHATEVER.ORG
20 SOMEBODY@HOTMAIL.COM
30 NOBODY@HOTMAIL.COM
30 SOMEBODY@HOTMAIL.COM
40 EVERYBODY@YAHOO.COM
40 SOMEBODY@HOTMAIL.COM
7 rows selected.
|
|
|
Posts:
2,840
Registered:
04/28/00
|
|
|
|
Re: Including/parsing out text from within document in query..
Posted:
Jan 18, 2006 11:48 AM
in response to: iamnoel
|
|
|
If your data was stored as varchar2 or clob, instead of blob, you could do something like this:
scott@ORA92> CREATE TABLE items
2 (id NUMBER,
3 item VARCHAR2(100))
4 /
Table created.
scott@ORA92> INSERT ALL
2 INTO items VALUES (1, 'some content from somebody@hotmail.com')
3 INTO items VALUES (2, 'some content from anybody@whatever.org and somebody@hotmail.com')
4 INTO items VALUES (3, 'some content from nobody@hotmail.com')
5 INTO items VALUES (4, 'Some content from somebody@hotmail.com and everybody@yahoo.com')
6 SELECT * FROM DUAL
7 /
4 rows created.
scott@ORA92> CREATE INDEX items_idx
2 ON items (item)
3 INDEXTYPE IS CTXSYS.CONTEXT
4 /
Index created.
scott@ORA92> CREATE OR REPLACE FUNCTION list_element
2 (p_string VARCHAR2,
3 p_element INTEGER,
4 p_separator VARCHAR2 DEFAULT ' ')
5 RETURN VARCHAR2
6 AS
7 v_string VARCHAR2(32767);
8 BEGIN
9 v_string := p_string || p_separator;
10 FOR i IN 1 .. p_element - 1 LOOP
11 v_string := SUBSTR (v_string,
12 INSTR (v_string, p_separator)
13 + LENGTH (p_separator));
14 END LOOP;
15 RETURN SUBSTR (v_string, 1, INSTR (v_string, p_separator) - 1);
16 END list_element;
17 /
Function created.
scott@ORA92> COLUMN email FORMAT A30
scott@ORA92> SELECT id,
2 list_element (item, element) AS email
3 FROM items,
4 (SELECT ROWNUM AS element
5 FROM DUAL CONNECT BY LEVEL < 100) -- checks first 100 words
6 WHERE CONTAINS (item, '%content%') > 0
7 AND list_element (item, element) LIKE '%@%.%'
8 ORDER BY id, email
9 /
ID EMAIL
---------- ------------------------------
1 somebody@hotmail.com
2 anybody@whatever.org
2 somebody@hotmail.com
3 nobody@hotmail.com
4 everybody@yahoo.com
4 somebody@hotmail.com
6 rows selected.
|
|
|
Posts:
62
Registered:
05/02/01
|
|
|
|
Re: Including/parsing out text from within document in query..
Posted:
Jan 19, 2006 9:16 AM
in response to: Barbara Boehmer
|
|
|
|
Intriguing..thank you for the very informative input on this. I can't come up with a way to improve on it, either..I imagine what I'll do then is execute the 1st query (which yields perhaps fewer than it ought to), provide these to the client as 'certain', then execute the 2nd query and include those not yielded by the 1st as 'not confirmed'. I can include $i.token_count to perhaps improve on this..
I'm just wondering why there is no $i.token_key = $k.textkey. I'd just be interested what the basis is..
|
|
|
Posts:
2,840
Registered:
04/28/00
|
|
|
|
Re: Including/parsing out text from within document in query..
Posted:
Jan 19, 2006 9:37 AM
in response to: iamnoel
|
|
|
I believe the following should solve the problem and provide all of the correct information, without any omissions or incorrect data.
scott@ORA92> SELECT t1.id,
2 t2.email
3 FROM items t1,
4 (SELECT items.id,
5 dr$items_idx$i.token_text AS email
6 FROM items,
7 dr$items_idx$k,
8 dr$items_idx$i
9 WHERE CONTAINS (items.item, '%content%') > 0
10 AND dr$items_idx$i.token_text LIKE '%@%.%'
11 AND items.ROWID = dr$items_idx$k.textkey
12 AND dr$items_idx$k.docid
13 BETWEEN dr$items_idx$i.token_first
14 AND dr$items_idx$i.token_last) t2
15 WHERE t1.id = t2.id
16 AND CONTAINS (t1.item, t2.email) > 0
17 ORDER BY id, email
18 /
ID EMAIL
---------- ------------------------------
10 SOMEBODY@HOTMAIL.COM
20 ANYBODY@WHATEVER.ORG
20 SOMEBODY@HOTMAIL.COM
30 NOBODY@HOTMAIL.COM
40 EVERYBODY@YAHOO.COM
40 SOMEBODY@HOTMAIL.COM
6 rows selected.
|
|
|
Posts:
2,840
Registered:
04/28/00
|
|
|
|
Re: Including/parsing out text from within document in query..
Posted:
Jan 19, 2006 9:51 AM
in response to: iamnoel
|
|
|
Now that I have thought about it a bit more, I see that the following also works, and more simply and efficiently.
scott@ORA92> SELECT items.id,
2 dr$items_idx$i.token_text AS email
3 FROM items,
4 dr$items_idx$i
5 WHERE CONTAINS (items.item, '%content%') > 0
6 AND dr$items_idx$i.token_text LIKE '%@%.%'
7 AND CONTAINS (items.item, dr$items_idx$i.token_text) > 0
8 ORDER BY id, email
9 /
ID EMAIL
---------- ------------------------------
10 SOMEBODY@HOTMAIL.COM
20 ANYBODY@WHATEVER.ORG
20 SOMEBODY@HOTMAIL.COM
30 NOBODY@HOTMAIL.COM
40 EVERYBODY@YAHOO.COM
40 SOMEBODY@HOTMAIL.COM
6 rows selected.
scott@ORA92>
|
|
|
Posts:
62
Registered:
05/02/01
|
|
|
|
Re: Including/parsing out text from within document in query..
Posted:
Jan 19, 2006 4:34 PM
in response to: Barbara Boehmer
|
|
|
|
Very nice..this works exactly as intended. I'm indebted to you, certainly. I've written this into a view I can query in code instead, in case the internal index tables change..:
CREATE VIEW idx_email (id, email, count) AS
SELECT i.id, idx_i.token_text AS email, idx_i.token_count AS count
FROM items i, dr$items_idx$i idx_i
WHERE idx_i.token_text LIKE '%@%.%'
AND CONTAINS(i.item, idx_i.token_text) > 0;
SELECT ie.id, ie.email, ie.count
FROM items i, idx_email ie
WHERE CONTAINS(i.item, '%content%') > 0
AND i.id = ie.id
ORDER BY ie.count DESC, ie.email, ie.id;
|
|
|
|
Legend
|
|
Guru : 2500
- 1000000
pts
|
|
Expert : 1000
- 2499
pts
|
|
Pro : 500
- 999
pts
|
|
Journeyman : 200
- 499
pts
|
|
Newbie : 0
- 199
pts
|
|
Oracle ACE Director
|
|
Oracle ACE Member
|
|
Oracle Employee ACE
|
|
Helpful Answer
(5 pts)
|
|
Correct Answer
(10 pts)
|
|