Thread: REGEXP_ compare two tokenized strings


Permlink Replies: 19 - Pages: 2 [ 1 2 | Next ] - Last Post: Jul 16, 2007 2:57 PM Last Post By: Vadim Tropashko
-K-

Posts: 4,070
Registered: 10/24/01
REGEXP_ compare two tokenized strings
Posted: Jul 12, 2007 1:39 AM
Click to report abuse...   Click to reply to this thread Reply
Hi,

I need to compare two fields (from two tables), to see if any of the tokenized contents match.
I have two tables with a "name" field, but the names might be incomplete or in random order. To get a match, I just need 1 part of both to really match, for example "John Doe" should match to "Doe J.".
Something like:
SELECT names1.name, names2.name
FROM names1, names2
WHERE INSTR(names1.name, REGEXP_SUBSTR(names2.name, '[[:alpha:]]+')) > 0
;
But the statement should consider all tokens, not just the first one as in the example.
Any suggestion on how to achieve this?

Thanks,
K.
cd_2

Posts: 4,882
Registered: 09/08/98
Re: REGEXP_ compare two tokenized strings
Posted: Jul 12, 2007 2:04 AM   in response to: -K- in response to: -K-
Click to report abuse...   Click to reply to this thread Reply
For that you would need some sort of "loop" inside that sql statement.

C.
-K-

Posts: 4,070
Registered: 10/24/01
Re: REGEXP_ compare two tokenized strings
Posted: Jul 12, 2007 2:19 AM   in response to: cd_2 in response to: cd_2
Click to report abuse...   Click to reply to this thread Reply
Doe that mean the only option is to revert to PL/SQL?
cd_2

Posts: 4,882
Registered: 09/08/98
Re: REGEXP_ compare two tokenized strings
Posted: Jul 12, 2007 2:27 AM   in response to: -K- in response to: -K-
Click to report abuse...   Click to reply to this thread Reply
No. I meant you have to use a CONNECT BY, MODEL or even XML-query construct, here's an example:

WITH t AS (SELECT 'John Doe' name1, 'Doe J.' name2
             FROM dual
          )
SELECT name1
     , name2
     , pos
  FROM t
 MODEL
 DIMENSION BY (0 dim)
 MEASURES(name1, name2, 0 pos)
 RULES ITERATE (9) UNTIL (pos[0] > 0)
 (pos[0] = INSTR(name1[0], REGEXP_SUBSTR(name2[0], '[[:alpha:]]+')))
;    
 
NAME1    NAME2         POS
-------- ------ ----------
John Doe Doe J.          6


C.
Rob van Wijk

Posts: 5,378
Registered: 08/17/06
Re: REGEXP_ compare two tokenized strings
Posted: Jul 12, 2007 3:37 AM   in response to: -K- in response to: -K-
Click to report abuse...   Click to reply to this thread Reply
An example using model and two tables:

SQL> create table names1
2 as
3 select 1 id, 'John Doe' name from dual union all
4 select 2, 'Larry Ellison' from dual union all
5 select 3, 'Catharina-Amalia Beatrix Carmen Victoria van Oranje' from dual
6 /

Tabel is aangemaakt.

SQL> create table names2
2 as
3 select 1001 id, 'Doe J.' name from dual union all
4 select 1002, 'Gates Bill' from dual union all
5 select 1003, 'Beatrix Wilhelmina Armgard van Oranje' from dual
6 /

Tabel is aangemaakt.

SQL> select distinct n1.original_name name1
2 , n2.original_name name2
3 from ( select id
4 , name
5 , original_name
6 from names1
7 model
8 return updated rows
9 partition by (id)
10 dimension by (0 i)
11 measures (' ' || name || ' ' name, cast(null as varchar2(100)) original_name)
12 rules iterate (100) until (iteration_number = length(regexp_replace(name[0],'[^ ]')) - 2)
13 ( name[iteration_number+1] = regexp_substr(name[0],'[^ ]+',1,iteration_number+1)
14 , original_name[iteration_number+1] = trim(name[0])
15 )
16 ) n1
17 , ( select id
18 , name
19 , original_name
20 from names2
21 model
22 return updated rows
23 partition by (id)
24 dimension by (0 i)
25 measures (' ' || name || ' ' name, cast(null as varchar2(100)) original_name)
26 rules iterate (100) until (iteration_number = length(regexp_replace(name[0],'[^ ]')) - 2)
27 ( name[iteration_number+1] = regexp_substr(name[0],'[^ ]+',1,iteration_number+1)
28 , original_name[iteration_number+1] = trim(name[0])
29 )
30 ) n2
31 where n1.name = n2.name
32 /

NAME1 NAME2

---------------------------------------------------
Catharina-Amalia Beatrix Carmen Victoria van Oranje Beatrix Wilhelmina Armgard van Oranje
John Doe Doe J.

2 rijen zijn geselecteerd.

Regards,
Rob.
-K-

Posts: 4,070
Registered: 10/24/01
Re: REGEXP_ compare two tokenized strings
Posted: Jul 12, 2007 8:37 AM   in response to: Rob van Wijk in response to: Rob van Wijk
Click to report abuse...   Click to reply to this thread Reply
Thanks both for your examples of the model clause. Never used this before and took me some time to understand...

Rob, your solution works just fine. Just hope the resulting query won't be too stressful on the database (still have to condition matches on more than part of the name, of course). Hartelijk bedankt!

Regards,
K.
Joe Fuda

Posts: 227
Registered: 02/27/06
Re: REGEXP_ compare two tokenized strings
Posted: Jul 12, 2007 11:00 AM   in response to: -K- in response to: -K-
Click to report abuse...   Click to reply to this thread Reply
Perhaps I missed something, but isn't this all you need?

SQL*Plus: Release 10.2.0.1.0 - Production on Thu Jul 12 13:56:41 2007

create table n1
as
select 1 id, 'John Doe' name from dual union all
select 2, 'Larry Ellison' from dual union all
select 3, 'Catharina-Amalia Beatrix Carmen Victoria van Oranje' from dual union all
select 4, null from dual
/

Table created.

create table n2
as
select 1001 id, 'Doe J.' name from dual union all
select 1002, 'Gates Bill' from dual union all
select 1003, 'Beatrix Wilhelmina Armgard van Oranje' from dual
/

Table created.

column name wrap format a20

select n1.id, n1.name, n2.id, n2.name
from n1, n2
where
regexp_instr
( ' ' || n1.name || ' ',
' ' || replace( n2.name, ' ', ' | ' ) || ' '
) > 0
;

ID NAME ID NAME


--------------------
--------------------
1 John Doe 1001 Doe J.
3 Catharina-Amalia Bea 1003 Beatrix Wilhelmina A
trix Carmen Victoria rmgard van Oranje
van Oranje

2 rows selected.


--
Joe Fuda
SQL Snippets
Rob van Wijk

Posts: 5,378
Registered: 08/17/06
Re: REGEXP_ compare two tokenized strings
Posted: Jul 12, 2007 11:49 AM   in response to: Joe Fuda in response to: Joe Fuda
Click to report abuse...   Click to reply to this thread Reply
Perhaps I missed something, but isn't this all you
need?
...
regexp_instr
( ' ' || n1.name || ' ',
' ' || replace( n2.name, ' ', ' | ' ) || ' '
0
...

Joe,

No, I think you are not missing anything.
This looks much better, although I'm not sure about performance.
To be continued ... ;-)

Regards,
Rob.
cd_2

Posts: 4,882
Registered: 09/08/98
Re: REGEXP_ compare two tokenized strings
Posted: Jul 12, 2007 2:16 PM   in response to: Joe Fuda in response to: Joe Fuda
Click to report abuse...   Click to reply to this thread Reply
erased Looks ok to me, after a second read. Further testing when I'm awake. ;-)

C.
Rob van Wijk

Posts: 5,378
Registered: 08/17/06
Re: REGEXP_ compare two tokenized strings
Posted: Jul 12, 2007 2:22 PM   in response to: Joe Fuda in response to: Joe Fuda
Click to report abuse...   Click to reply to this thread Reply
The performance of the regexp_instr in the where clause is terrible ... I stopped working out other scenarios, because the conclusion will be the same:

SQL> create table names1
2 as
3 select 1 id, 'John Doe' name from dual union all
4 select 2, 'Larry Ellison' from dual union all
5 select 3, 'Catharina-Amalia Beatrix Carmen Victoria van Oranje' from dual union all
6 select object_id, o
7 from ( select object_id
8 , owner || object_name o
9 , ntile(200) over (order by object_id) nt
10 from all_objects
11 )
12 where nt = 11
13 /

Tabel is aangemaakt.

SQL> create table names2
2 as
3 select 1001 id, 'Doe J.' name from dual union all
4 select 1002, 'Gates Bill' from dual union all
5 select 1003, 'Beatrix Wilhelmina Armgard van Oranje' from dual union all
6 select object_id, o
7 from ( select object_id
8 , owner || object_name o
9 , ntile(200) over (order by object_id) nt
10 from all_objects
11 )
12 where nt = 22
13 /

Tabel is aangemaakt.

SQL> select count(*) from names1
2 /

COUNT(*)

241

1 rij is geselecteerd.

SQL> select count(*) from names2
2 /

COUNT(*)

241

1 rij is geselecteerd.

SQL> exec dbms_stats.gather_table_stats(user,'names1')

PL/SQL-procedure is geslaagd.

SQL> exec dbms_stats.gather_table_stats(user,'names2')

PL/SQL-procedure is geslaagd.

SQL> set timing on
SQL> set autotrace on
SQL> select distinct n1.original_name name1
2 , n2.original_name name2
3 from ( select id
4 , name
5 , original_name
6 from names1
7 model
8 return updated rows
9 partition by (id)
10 dimension by (0 i)
11 measures (' ' || name || ' ' name, cast(null as varchar2(100)) original_name)
12 rules iterate (100) until (iteration_number = length(regexp_replace(name[0],'[^ ]')) - 2)
13 ( name[iteration_number+1] = regexp_substr(name[0],'[^ ]+',1,iteration_number+1)
14 , original_name[iteration_number+1] = trim(name[0])
15 )
16 ) n1
17 , ( select id
18 , name
19 , original_name
20 from names2
21 model
22 return updated rows
23 partition by (id)
24 dimension by (0 i)
25 measures (' ' || name || ' ' name, cast(null as varchar2(100)) original_name)
26 rules iterate (100) until (iteration_number = length(regexp_replace(name[0],'[^ ]')) - 2)
27 ( name[iteration_number+1] = regexp_substr(name[0],'[^ ]+',1,iteration_number+1)
28 , original_name[iteration_number+1] = trim(name[0])
29 )
30 ) n2
31 where n1.name = n2.name
32 /

NAME1

NAME2

Catharina-Amalia Beatrix Carmen Victoria van Oranje
Beatrix Wilhelmina Armgard van Oranje

John Doe
Doe J.

2 rijen zijn geselecteerd.

Verstreken: 00:00:00.07
Uitvoeringspan


0 SELECT STATEMENT Optimizer=CHOOSE (Cost=8 Card=581 Bytes=987
70)

1 0 SORT (UNIQUE) (Cost=8 Card=581 Bytes=98770)
2 1 HASH JOIN (Cost=7 Card=581 Bytes=98770)
3 2 VIEW (Cost=3 Card=241 Bytes=20485)
4 3 BUFFER (SORT)
5 4 SQL MODEL (ORDERED FAST)
6 5 TABLE ACCESS (FULL) OF 'NAMES1' (TABLE) (Cost=3
Card=241 Bytes=5784)

7 2 VIEW (Cost=3 Card=241 Bytes=20485)
8 7 BUFFER (SORT)
9 8 SQL MODEL (ORDERED FAST)
10 9 TABLE ACCESS (FULL) OF 'NAMES2' (TABLE) (Cost=3
Card=241 Bytes=5784)


Statistics


387 recursive calls
0 db block gets
52 consistent gets
0 physical reads
0 redo size
590 bytes sent via SQL*Net to client
508 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
11 sorts (memory)
0 sorts (disk)
2 rows processed

SQL> select n1.name name1
2 , n2.name name2
3 from names1 n1
4 , names2 n2
5 where regexp_instr ( ' ' || n1.name || ' ', ' ' || replace( n2.name, ' ', ' | ' ) || ' ' ) > 0
6 /

NAME1 NAME2

------------------------------------------------------------
John Doe Doe J.
Catharina-Amalia Beatrix Carmen Victoria van Oranje Beatrix Wilhelmina Armgard van Oranje

2 rijen zijn geselecteerd.

Verstreken: 00:00:46.54
Uitvoeringspan

0 SELECT STATEMENT Optimizer=CHOOSE (Cost=218 Card=2904 Bytes=
121968)

1 0 NESTED LOOPS (Cost=218 Card=2904 Bytes=121968)
2 1 TABLE ACCESS (FULL) OF 'NAMES1' (TABLE) (Cost=3 Card=241
Bytes=5061)

3 1 TABLE ACCESS (FULL) OF 'NAMES2' (TABLE) (Cost=1 Card=12
Bytes=252)


Statistics


1 recursive calls
0 db block gets
728 consistent gets
0 physical reads
0 redo size
590 bytes sent via SQL*Net to client
508 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
2 rows processed

Regards,
Rob.
cd_2

Posts: 4,882
Registered: 09/08/98
Re: REGEXP_ compare two tokenized strings
Posted: Jul 12, 2007 2:30 PM   in response to: Rob van Wijk in response to: Rob van Wijk
Click to report abuse...   Click to reply to this thread Reply
Interesting, I wouldn't have thought that MODEL is really that fast.

C.
Vadim Tropashko

Posts: 505
Registered: 01/10/01
Re: REGEXP_ compare two tokenized strings
Posted: Jul 12, 2007 2:40 PM   in response to: -K- in response to: -K-
Click to report abuse...   Click to reply to this thread Reply
select n1.id, n1.name, n2.id, n2.name
from n1, n2
where
regexp_instr
( ' ' || n1.name || ' ',
' ' || replace( n2.name, ' ', ' | ' ) || ' '
) > 0
;

Very nice, Joe!

The fact that you can do such impressive things with RegExp is not surprising at all, because there is rich Language Parsing theory that underlines it. Anyway, here is continuation of the theme:
http://vadimtropashko.wordpress.com/2007/07/12/string-decomposition-with-regexp/
cd_2

Posts: 4,882
Registered: 09/08/98
Re: REGEXP_ compare two tokenized strings
Posted: Jul 12, 2007 2:45 PM   in response to: Vadim Tropashko in response to: Vadim Tropashko
Click to report abuse...   Click to reply to this thread Reply
Regarding Model clause, could anybody please suggest
a single problem that can be solved with it (I
mean where it is not merely "yet another" ugly
solution)?

The documentation shows some examples. Ugly? I think
in some cases MODEL offers better readability than
alternative solutions, but that's just me.

C.
-K-

Posts: 4,070
Registered: 10/24/01
Re: REGEXP_ compare two tokenized strings
Posted: Jul 13, 2007 2:14 AM   in response to: Joe Fuda in response to: Joe Fuda
Click to report abuse...   Click to reply to this thread Reply
Aaah, this is what I was looking for!
In the end it was so simple, I can't believe I didn't came up with this myself. A combination of keeping your head clear, focused and open for different perspectives.

Thanks for switching on the light ;-)
K.
-K-

Posts: 4,070
Registered: 10/24/01
Re: REGEXP_ compare two tokenized strings
Posted: Jul 13, 2007 2:26 AM   in response to: Rob van Wijk in response to: Rob van Wijk
Click to report abuse...   Click to reply to this thread Reply
It's obvious the model clause performs much better, in spite of the regexp_ rules vs. just the one regexp_like.
Nevertheless, 47 seconds seem just impossible. Something else must be interfering here. I get similar return times between the two solutions, not even near the abyss you got... weird.

K.
Legend
Guru Guru : 2500 - 1000000 pts
Expert Expert : 1000 - 2499 pts
Pro Pro : 500 - 999 pts
Journeyman Journeyman : 200 - 499 pts
Newbie Newbie : 0 - 199 pts
Oracle ACE Director
Oracle ACE Member
Oracle Employee ACE
Helpful Answer (5 pts)
Correct Answer (10 pts)

Point your RSS reader here for a feed of the latest messages in all forums