|
Replies:
19
-
Pages:
2
[
1
2
| Next
]
-
Last Post:
Jul 16, 2007 2:57 PM
Last Post By: Vadim Tropashko
|
|
|
Posts:
4,070
Registered:
10/24/01
|
|
|
|
REGEXP_ compare two tokenized strings
Posted:
Jul 12, 2007 1:39 AM
|
|
|
|
Hi,
I need to compare two fields (from two tables), to see if any of the tokenized contents match.
I have two tables with a "name" field, but the names might be incomplete or in random order. To get a match, I just need 1 part of both to really match, for example "John Doe" should match to "Doe J.".
Something like:
SELECT names1.name, names2.name
FROM names1, names2
WHERE INSTR(names1.name, REGEXP_SUBSTR(names2.name, '[[:alpha:]]+')) > 0
;
But the statement should consider all tokens, not just the first one as in the example.
Any suggestion on how to achieve this?
Thanks,
K.
|
|
|
Posts:
4,882
Registered:
09/08/98
|
|
|
|
Re: REGEXP_ compare two tokenized strings
Posted:
Jul 12, 2007 2:04 AM
in response to: -K-
|
|
|
|
For that you would need some sort of "loop" inside that sql statement.
C.
|
|
|
Posts:
4,070
Registered:
10/24/01
|
|
|
|
Re: REGEXP_ compare two tokenized strings
Posted:
Jul 12, 2007 2:19 AM
in response to: cd_2
|
|
|
|
Doe that mean the only option is to revert to PL/SQL?
|
|
|
Posts:
4,882
Registered:
09/08/98
|
|
|
|
Re: REGEXP_ compare two tokenized strings
Posted:
Jul 12, 2007 2:27 AM
in response to: -K-
|
|
|
No. I meant you have to use a CONNECT BY, MODEL or even XML-query construct, here's an example:
WITH t AS (SELECT 'John Doe' name1, 'Doe J.' name2
FROM dual
)
SELECT name1
, name2
, pos
FROM t
MODEL
DIMENSION BY (0 dim)
MEASURES(name1, name2, 0 pos)
RULES ITERATE (9) UNTIL (pos[0] > 0)
(pos[0] = INSTR(name1[0], REGEXP_SUBSTR(name2[0], '[[:alpha:]]+')))
;
NAME1 NAME2 POS
-------- ------ ----------
John Doe Doe J. 6
C.
|
|
|
Posts:
5,378
Registered:
08/17/06
|
|
|
|
Re: REGEXP_ compare two tokenized strings
Posted:
Jul 12, 2007 3:37 AM
in response to: -K-
|
|
|
An example using model and two tables:
SQL> create table names1
2 as
3 select 1 id, 'John Doe' name from dual union all
4 select 2, 'Larry Ellison' from dual union all
5 select 3, 'Catharina-Amalia Beatrix Carmen Victoria van Oranje' from dual
6 /
Tabel is aangemaakt.
SQL> create table names2
2 as
3 select 1001 id, 'Doe J.' name from dual union all
4 select 1002, 'Gates Bill' from dual union all
5 select 1003, 'Beatrix Wilhelmina Armgard van Oranje' from dual
6 /
Tabel is aangemaakt.
SQL> select distinct n1.original_name name1
2 , n2.original_name name2
3 from ( select id
4 , name
5 , original_name
6 from names1
7 model
8 return updated rows
9 partition by (id)
10 dimension by (0 i)
11 measures (' ' || name || ' ' name, cast(null as varchar2(100)) original_name)
12 rules iterate (100) until (iteration_number = length(regexp_replace(name[0],'[^ ]')) - 2)
13 ( name[iteration_number+1] = regexp_substr(name[0],'[^ ]+',1,iteration_number+1)
14 , original_name[iteration_number+1] = trim(name[0])
15 )
16 ) n1
17 , ( select id
18 , name
19 , original_name
20 from names2
21 model
22 return updated rows
23 partition by (id)
24 dimension by (0 i)
25 measures (' ' || name || ' ' name, cast(null as varchar2(100)) original_name)
26 rules iterate (100) until (iteration_number = length(regexp_replace(name[0],'[^ ]')) - 2)
27 ( name[iteration_number+1] = regexp_substr(name[0],'[^ ]+',1,iteration_number+1)
28 , original_name[iteration_number+1] = trim(name[0])
29 )
30 ) n2
31 where n1.name = n2.name
32 /
NAME1 NAME2
---------------------------------------------------
Catharina-Amalia Beatrix Carmen Victoria van Oranje Beatrix Wilhelmina Armgard van Oranje
John Doe Doe J.
2 rijen zijn geselecteerd.
Regards,
Rob.
|
|
|
Posts:
4,070
Registered:
10/24/01
|
|
|
|
Re: REGEXP_ compare two tokenized strings
Posted:
Jul 12, 2007 8:37 AM
in response to: Rob van Wijk
|
|
|
|
Thanks both for your examples of the model clause. Never used this before and took me some time to understand...
Rob, your solution works just fine. Just hope the resulting query won't be too stressful on the database (still have to condition matches on more than part of the name, of course). Hartelijk bedankt!
Regards,
K.
|
|
|
Posts:
227
Registered:
02/27/06
|
|
|
|
Re: REGEXP_ compare two tokenized strings
Posted:
Jul 12, 2007 11:00 AM
in response to: -K-
|
|
|
Perhaps I missed something, but isn't this all you need?
SQL*Plus: Release 10.2.0.1.0 - Production on Thu Jul 12 13:56:41 2007
create table n1
as
select 1 id, 'John Doe' name from dual union all
select 2, 'Larry Ellison' from dual union all
select 3, 'Catharina-Amalia Beatrix Carmen Victoria van Oranje' from dual union all
select 4, null from dual
/
Table created.
create table n2
as
select 1001 id, 'Doe J.' name from dual union all
select 1002, 'Gates Bill' from dual union all
select 1003, 'Beatrix Wilhelmina Armgard van Oranje' from dual
/
Table created.
column name wrap format a20
select n1.id, n1.name, n2.id, n2.name
from n1, n2
where
regexp_instr
( ' ' || n1.name || ' ',
' ' || replace( n2.name, ' ', ' | ' ) || ' '
) > 0
;
ID NAME ID NAME
--------------------
--------------------
1 John Doe 1001 Doe J.
3 Catharina-Amalia Bea 1003 Beatrix Wilhelmina A
trix Carmen Victoria rmgard van Oranje
van Oranje
2 rows selected.
--
Joe Fuda
SQL Snippets
|
|
|
Posts:
5,378
Registered:
08/17/06
|
|
|
|
Re: REGEXP_ compare two tokenized strings
Posted:
Jul 12, 2007 11:49 AM
in response to: Joe Fuda
|
|
|
Perhaps I missed something, but isn't this all you
need?
...
regexp_instr
( ' ' || n1.name || ' ',
' ' || replace( n2.name, ' ', ' | ' ) || ' '
0
...
Joe,
No, I think you are not missing anything.
This looks much better, although I'm not sure about performance.
To be continued ...
Regards,
Rob.
|
|
|
Posts:
4,882
Registered:
09/08/98
|
|
|
|
Re: REGEXP_ compare two tokenized strings
Posted:
Jul 12, 2007 2:16 PM
in response to: Joe Fuda
|
|
|
erased Looks ok to me, after a second read. Further testing when I'm awake.
C.
|
|
|
Posts:
5,378
Registered:
08/17/06
|
|
|
|
Re: REGEXP_ compare two tokenized strings
Posted:
Jul 12, 2007 2:22 PM
in response to: Joe Fuda
|
|
|
The performance of the regexp_instr in the where clause is terrible ... I stopped working out other scenarios, because the conclusion will be the same:
SQL> create table names1
2 as
3 select 1 id, 'John Doe' name from dual union all
4 select 2, 'Larry Ellison' from dual union all
5 select 3, 'Catharina-Amalia Beatrix Carmen Victoria van Oranje' from dual union all
6 select object_id, o
7 from ( select object_id
8 , owner || object_name o
9 , ntile(200) over (order by object_id) nt
10 from all_objects
11 )
12 where nt = 11
13 /
Tabel is aangemaakt.
SQL> create table names2
2 as
3 select 1001 id, 'Doe J.' name from dual union all
4 select 1002, 'Gates Bill' from dual union all
5 select 1003, 'Beatrix Wilhelmina Armgard van Oranje' from dual union all
6 select object_id, o
7 from ( select object_id
8 , owner || object_name o
9 , ntile(200) over (order by object_id) nt
10 from all_objects
11 )
12 where nt = 22
13 /
Tabel is aangemaakt.
SQL> select count(*) from names1
2 /
COUNT(*)
241
1 rij is geselecteerd.
SQL> select count(*) from names2
2 /
COUNT(*)
241
1 rij is geselecteerd.
SQL> exec dbms_stats.gather_table_stats(user,'names1')
PL/SQL-procedure is geslaagd.
SQL> exec dbms_stats.gather_table_stats(user,'names2')
PL/SQL-procedure is geslaagd.
SQL> set timing on
SQL> set autotrace on
SQL> select distinct n1.original_name name1
2 , n2.original_name name2
3 from ( select id
4 , name
5 , original_name
6 from names1
7 model
8 return updated rows
9 partition by (id)
10 dimension by (0 i)
11 measures (' ' || name || ' ' name, cast(null as varchar2(100)) original_name)
12 rules iterate (100) until (iteration_number = length(regexp_replace(name[0],'[^ ]')) - 2)
13 ( name[iteration_number+1] = regexp_substr(name[0],'[^ ]+',1,iteration_number+1)
14 , original_name[iteration_number+1] = trim(name[0])
15 )
16 ) n1
17 , ( select id
18 , name
19 , original_name
20 from names2
21 model
22 return updated rows
23 partition by (id)
24 dimension by (0 i)
25 measures (' ' || name || ' ' name, cast(null as varchar2(100)) original_name)
26 rules iterate (100) until (iteration_number = length(regexp_replace(name[0],'[^ ]')) - 2)
27 ( name[iteration_number+1] = regexp_substr(name[0],'[^ ]+',1,iteration_number+1)
28 , original_name[iteration_number+1] = trim(name[0])
29 )
30 ) n2
31 where n1.name = n2.name
32 /
NAME1
NAME2
Catharina-Amalia Beatrix Carmen Victoria van Oranje
Beatrix Wilhelmina Armgard van Oranje
John Doe
Doe J.
2 rijen zijn geselecteerd.
Verstreken: 00:00:00.07
Uitvoeringspan
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=8 Card=581 Bytes=987
70)
1 0 SORT (UNIQUE) (Cost=8 Card=581 Bytes=98770)
2 1 HASH JOIN (Cost=7 Card=581 Bytes=98770)
3 2 VIEW (Cost=3 Card=241 Bytes=20485)
4 3 BUFFER (SORT)
5 4 SQL MODEL (ORDERED FAST)
6 5 TABLE ACCESS (FULL) OF 'NAMES1' (TABLE) (Cost=3
Card=241 Bytes=5784)
7 2 VIEW (Cost=3 Card=241 Bytes=20485)
8 7 BUFFER (SORT)
9 8 SQL MODEL (ORDERED FAST)
10 9 TABLE ACCESS (FULL) OF 'NAMES2' (TABLE) (Cost=3
Card=241 Bytes=5784)
Statistics
387 recursive calls
0 db block gets
52 consistent gets
0 physical reads
0 redo size
590 bytes sent via SQL*Net to client
508 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
11 sorts (memory)
0 sorts (disk)
2 rows processed
SQL> select n1.name name1
2 , n2.name name2
3 from names1 n1
4 , names2 n2
5 where regexp_instr ( ' ' || n1.name || ' ', ' ' || replace( n2.name, ' ', ' | ' ) || ' ' ) > 0
6 /
NAME1 NAME2
------------------------------------------------------------
John Doe Doe J.
Catharina-Amalia Beatrix Carmen Victoria van Oranje Beatrix Wilhelmina Armgard van Oranje
2 rijen zijn geselecteerd.
Verstreken: 00:00:46.54
Uitvoeringspan
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=218 Card=2904 Bytes=
121968)
1 0 NESTED LOOPS (Cost=218 Card=2904 Bytes=121968)
2 1 TABLE ACCESS (FULL) OF 'NAMES1' (TABLE) (Cost=3 Card=241
Bytes=5061)
3 1 TABLE ACCESS (FULL) OF 'NAMES2' (TABLE) (Cost=1 Card=12
Bytes=252)
Statistics
1 recursive calls
0 db block gets
728 consistent gets
0 physical reads
0 redo size
590 bytes sent via SQL*Net to client
508 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
2 rows processed
Regards,
Rob.
|
|
|
Posts:
4,882
Registered:
09/08/98
|
|
|
|
Re: REGEXP_ compare two tokenized strings
Posted:
Jul 12, 2007 2:30 PM
in response to: Rob van Wijk
|
|
|
|
Interesting, I wouldn't have thought that MODEL is really that fast.
C.
|
|
|
Posts:
505
Registered:
01/10/01
|
|
|
|
Re: REGEXP_ compare two tokenized strings
Posted:
Jul 12, 2007 2:40 PM
in response to: -K-
|
|
|
select n1.id, n1.name, n2.id, n2.name
from n1, n2
where
regexp_instr
( ' ' || n1.name || ' ',
' ' || replace( n2.name, ' ', ' | ' ) || ' '
) > 0
;
Very nice, Joe!
The fact that you can do such impressive things with RegExp is not surprising at all, because there is rich Language Parsing theory that underlines it. Anyway, here is continuation of the theme:
http://vadimtropashko.wordpress.com/2007/07/12/string-decomposition-with-regexp/
|
|
|
Posts:
4,882
Registered:
09/08/98
|
|
|
|
Re: REGEXP_ compare two tokenized strings
Posted:
Jul 12, 2007 2:45 PM
in response to: Vadim Tropashko
|
|
|
Regarding Model clause, could anybody please suggest
a single problem that can be solved with it (I
mean where it is not merely "yet another" ugly
solution)?
The documentation shows some examples. Ugly? I think
in some cases MODEL offers better readability than
alternative solutions, but that's just me.
C.
|
|
|
Posts:
4,070
Registered:
10/24/01
|
|
|
|
Re: REGEXP_ compare two tokenized strings
Posted:
Jul 13, 2007 2:14 AM
in response to: Joe Fuda
|
|
|
Aaah, this is what I was looking for!
In the end it was so simple, I can't believe I didn't came up with this myself. A combination of keeping your head clear, focused and open for different perspectives.
Thanks for switching on the light 
K.
|
|
|
Posts:
4,070
Registered:
10/24/01
|
|
|
|
Re: REGEXP_ compare two tokenized strings
Posted:
Jul 13, 2007 2:26 AM
in response to: Rob van Wijk
|
|
|
|
It's obvious the model clause performs much better, in spite of the regexp_ rules vs. just the one regexp_like.
Nevertheless, 47 seconds seem just impossible. Something else must be interfering here. I get similar return times between the two solutions, not even near the abyss you got... weird.
K.
|
|
|
|
Legend
|
|
Guru : 2500
- 1000000
pts
|
|
Expert : 1000
- 2499
pts
|
|
Pro : 500
- 999
pts
|
|
Journeyman : 200
- 499
pts
|
|
Newbie : 0
- 199
pts
|
|
Oracle ACE Director
|
|
Oracle ACE Member
|
|
Oracle Employee ACE
|
|
Helpful Answer
(5 pts)
|
|
Correct Answer
(10 pts)
|
|