Skip to Main Content

SQL & PL/SQL

Announcement

For appeals, questions and feedback about Oracle Forums, please email oracle-forums-moderators_us@oracle.com. Technical questions should be asked in the appropriate category. Thank you!

Performance comparison between SUBSTR/INSTR and REGEXP equivalent

mathguyDec 15 2016 — edited Dec 17 2016

This came from a question posted on Stack Overflow. The OP there had the following problem. Suppose we are given a large table, table_z, with two columns, data1 and data2. Both are varchar2(50). Currently the second column is all NULL. The assignment is to take the data from the data1 column, process it, and insert the result in the data2 column. I don't know WHY the OP had that question, but to me that is irrelevant.

The processing that was required was as follows:

  • FIRST, check the string data1 for the first occurrence of the substring 'PLE', starting the search at the SECOND character of data1. If it is found, remove 'PLE' and everything that follows it. Return what's left from the original data1.
  • THEN, if the substring 'PLE' wasn't found anywhere in data1 (except perhaps as the first three characters), search data1 again, this time for the substring '#' (a single "number sign"), again starting the search from the second character. If found, remove it and everything that follows it from the string, and return the "truncated" string from data1.
  • THEN, if neither 'PLE' nor '#' were found (except perhaps at the very beginning of data1), do the same with the substring 'ALL'
  • FINALLY, if none of the substrings above was found (starting from the second character), don't remove anything; return data1 as is.

This is pretty easy to do, either using SUBSTR/INSTR or using regular expressions. Two solutions were offered, and I decided to test them both for performance. I have heard many times in the past that SUBSTR/INSTR should have better performance, but I didn't expect what I saw. The outperformance was by a factor of 40. Or - scratch that; as I was creating this post, I figured out already one way to improve the REGEXP solution and the outperformance is now only by a factor of 15.

Asking fellow participants to this "Community" - is this in line with similar performance comparisons you have seen, or are aware of? Or is this a huge outlier for whatever reason (and what might that reason be)? It seems, at a logical level, that the two string manipulations are quite similar (see them below). Or - am I missing something in that regard, and the REGEXP solution can be made to perform much better?

Setup: I created a table with 1.5 million random strings, all exactly eight characters long, all upper-case letters. Then I modified 10% of the strings to add the substring 'PLE', another 10% to add a '#' and another 10% to add 'ALL'. I did this by splitting an original string at position  mod(rownum, 9) - that is a number between 0 and 8 - and concatenating 'PLE' or '#' or 'ALL' at that position. Granted, not the most efficient or elegant way to get the kind of test data we needed, but that is irrelevant - the point is just to create the test data and use it in our tests.

So: we now have a table with just one column, data1, with some random strings in 1.5 million rows. 10% each have the substring PLE or # or ALL in them. The test consists in creating the new string data2 as in the original post. I am not inserting the result back in the table; regardless of how data2 is calculated, the time to insert it back in the table should be the same. Instead, I put the main query inside an outer one that computes the sum of the lengths of the resulting data2 values. This way I guarantee the optimizer can't take shortcuts: all data2 values must be generated, their lengths must be measured, and then summed together.

Table creation:

create table table_z as
  select dbms_random.string('U', 8) as data1 from dual
  connect by level <= 1500000;

update table_z
set data1 = case
    when rownum between      1 and 150000 then substr(data1, 1, mod(rownum, 9)) || 'PLE' || substr(data1, mod(rownum, 9) + 1)
    when rownum between 150001 and 300000 then substr(data1, 1, mod(rownum, 9)) || '#'   || substr(data1, mod(rownum, 9) + 1)
    when rownum between 300001 and 450000 then substr(data1, 1, mod(rownum, 9)) || 'ALL' || substr(data1, mod(rownum, 9) + 1)
   
end
where rownum <= 450000;

commit;

SUBSTR/INSTR solution:

select sum(length(data2))
from (
select data1,
      
case when instr(data1, 'PLE', 2) > 0 then substr(data1, 1, instr(data1, 'PLE', 2) - 1)
           
when instr(data1, '#'  , 2) > 0 then substr(data1, 1, instr(data1, '#'  , 2) - 1)
           
when instr(data1, 'ALL', 2) > 0 then substr(data1, 1, instr(data1, 'ALL', 2) - 1)
           
else data1 end
      
as data2
from   table_z
);

SUM
(LENGTH(DATA2))
------------------
         
10713352

1 row selected.


Elapsed
: 00:00:00.73

REGEXP solution:

select sum(length(data2))
from (
select data1,
      
COALESCE(REGEXP_SUBSTR(DATA1, '^(.+?)PLE',1,1,null,1)  

                                      --  the solution offered on SO was missing the ^ anchor and ran in 30-35 seconds
                           
,REGEXP_SUBSTR(DATA1, '^(.+?)#',1,1,null,1)
                           
,REGEXP_SUBSTR(DATA1, '^(.+?)ALL',1,1,null,1)
                           
,DATA1)
      
as data2
from   table_z
);

SUM
(LENGTH(DATA2))
------------------
         
10713352

1 row selected.

Elapsed
: 00:00:10:29

This post has been answered by Paulzip on Dec 16 2016
Jump to Answer
Comments
Locked Post
New comments cannot be posted to this locked post.
Post Details
Locked on Jan 14 2017
Added on Dec 15 2016
21 comments
11,052 views