As promised in another thread, I performed some tests on the three methods in the title (actually four, as you will see shortly). Sorry it took so long - I had to read up on match_recognize first, since most of the time I work with Oracle 11.
"Gaps and Islands" is a very common type of problem. In its simplest form, values in a numeric column are positive integers, with no duplicates. The problem (or an important part of it) is to group the rows based on "consecutive values" in this column. That is, a group consists of rows with consecutive values in this column, and each group has the maximum possible number of rows consistent with this condition. Example: values 2, 3, 4, 7, 9, 11, 12, 13, 15. Groups: 2-3-4, 7, 9, 11-12-13, 15.
There are at least three "standard" ways to identify the groups. All of them start by loading all the rows from disk and ordering them by the value in this column (let's call the column "id"). Then:
- In the Start-of-Group method, a new column "flag" is created. Using the lag() function, the flag is set to 1 whenever lag(id) != id - 1, that is, whenever the current row does not continue the previous island; otherwise "flag" is null. In a second pass over the rows, a cumulative sum(flag) is calculated, and this sum(flag) is used to identify the groups (the "islands" in the "gaps and islands" in the id column).
- In the Tabibitosan method, row_number() over (order by id) is subtracted from id; this difference is constant within each island and changes from one island to the next, so a single pass over the rows (instead of two passes) already identifies/distinguishes the groups. (A small illustration after this list shows both this difference and the start-of-group flag on the example above.)
- In the Match_Recognize method, maximal "groupings" of rows corresponding exactly to the "islands" are identified, and the system-provided sequential value MATCH_NUMBER() can be used for the grouping.
- However, if this is all we use match_recognize for, it turns out to be an inefficient way to generate the groupings - almost as slow as the Start-of-Group method. Happily, match_recognize can do more work in that single pass than just identifying the pattern-matching groupings.
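To make the first two group markers concrete, here is a quick sketch that can be run against the nine-value example above. The name sample_ids and the use of sys.odcinumberlist are just my way of building the sample inline for illustration; they play no role in the actual test.

with sample_ids (id) as (
  select column_value from table(sys.odcinumberlist(2,3,4,7,9,11,12,13,15))
)
select id,
       sog                                  as sog_flag,
       sum(sog) over (order by id)          as sog_grp,
       id - row_number() over (order by id) as tabibitosan_grp
from   ( select id,
                case when lag(id) over (order by id) != id - 1 then 1 end as sog
         from   sample_ids
       )
order  by id;

For these nine ids, sog_grp comes out as null, null, null, 1, 2, 3, 3, 3, 4 and tabibitosan_grp as 1, 1, 1, 3, 4, 5, 5, 5, 6: both markers are constant within an island and differ between islands, which is all the later GROUP BY needs (the null marker for the first island is not a problem, since GROUP BY puts all nulls in one group).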
I will describe the test below; but let me state the conclusion right away: the tabibitosan method proved to be the fastest, completing in well under half the time required by either the start-of-group method or the "simplistic" application of match_recognize (used just to generate the groups). Even with the best algorithm I was able to come up with for match_recognize, it was still about 14% slower on average than the tabibitosan method.
It is possible that I didn't use the best approach for one or more of the methods; if you see an improvement, please share it and I will test it.
The problem I used for testing:
Suppose we have a table "a" with 12 million rows and two numeric columns, "id" and "val", neither of which contains any nulls. The id's are distinct integers between 1 and 20,000,000; id could be the PK, but I did NOT define any keys (primary or unique), constraints, or indexes on the table. (I don't think that makes a difference, since the entire table is read anyway.) The table is stored on disk. The values in "val" are randomly generated decimal numbers between 0 and 1.

I created the table as follows: first I generated a set of 20 million rows, with id produced in the usual manner through a cross join of two smaller tables of numbers created with "connect by level ...", and val produced with dbms_random.value(). Then I ordered by val and, in an outer query, selected "where rownum <= 12000000", keeping the 12 million rows with the smallest val. Of the original 20 million rows, roughly half (about 10 million) should have val < 0.5, and all of those should end up in the 12-million-row table, so the remaining roughly 2 million kept rows should have val > 0.5. As a sanity check I ran "select count(*) from a where val > 0.5"; the answer should be about 2 million (it turned out to be 2,002,363), and it ran in 0.288 seconds, showing that reading the data from disk takes almost no time compared to the other times shared below, even though all 12 million rows must be read.
Then I set out to do the following: identify the "islands" (group the rows so that within each group the id's are consecutive, and the groups are maximal with this property), compute the sum of "val" within each group, and count how many groups have this sum greater than 4. Don't ask about the 4 - it is completely arbitrary; for the random numbers I generated, the correct answer to this problem (all four queries shown below return the same answer) was 10,709.
The first three solutions identify the groups in a CTE; then they group by the group marker, compute sum(val) within each group, and keep only the groups whose sum is greater than 4; finally, the outermost query counts how many groups remain - 10,709 in all cases.
In the better application of MATCH_RECOGNIZE, once the "maximal patterns" (and the corresponding groups) are identified, sum(val) can be calculated at the same time, in the MEASURES clause. Actually I compute a "flag" (a CASE expression) which is 1 when the sum is > 4 and null otherwise, and in the SELECT clause I simply select count(flag). So this approach should go over the 12 million rows just once and do all the processing in memory. It still takes longer than the tabibitosan method - I don't know why, but it does. Perhaps I didn't use the best approach?
Below is the script I used to generate the base table "a", then the code of the four solutions, and finally the results. Cheers, mathguy
drop table a purge;
create table a ( id number, val number);
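-- The cross join below generates id = 4000*x + y for x in 0..4999 and y in 1..4000,
-- i.e. the distinct integers 1 .. 20,000,000, each paired with a random val;
-- ordering by val and keeping rownum <= 12,000,000 retains the 12 million rows
-- with the smallest val.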
insert into a
select id, val
from ( select 4000 * t1.x + t2.y as id, dbms_random.value() as val
from (select level - 1 as x from dual connect by level <= 5000) t1
cross join
(select level as y from dual connect by level <= 4000) t2
order by val
)
where rownum <= 12000000 order by id;
commit;
select count(*) from a where val > 0.5; -- 2002363 in 0.288 seconds
-- Additional sanity check (output not shown for brevity, but it looked OK):
select id, val from a where id between 7440390 and 7440450 order by id;
-- METHOD: Start_of_Group
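-- "sog" is a start-of-group flag: 1 on the first row of an island (where the previous
-- id is not id - 1), null otherwise. For the very first row lag(id) is null, so that
-- row's flag is also null; the running sum of sog is therefore null for the whole first
-- island and increases by 1 at each later island, which is good enough as a group marker.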
with
prep ( id, val, grp ) as (
select id, val, sum(sog) over (order by id)
from ( select id, val, case when lag(id) over (order by id) != id - 1 then 1 end sog
from a
)
)
select count(*) as ct
from ( select grp from prep group by grp having sum(val) > 4 )
;
-- METHOD: Tabibitosan
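-- Within an island both id and row_number() grow by 1 from row to row, so their
-- difference is constant; at each gap, id jumps by more than 1 while row_number()
-- grows by 1, so the difference strictly increases - one "grp" value per island.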
with
prep ( id, val, grp ) as (
select id, val, id - row_number() over (order by id)
from a
)
select count(*) as ct
from ( select grp from prep group by grp having sum(val) > 4 )
;
-- METHOD: MR_w_Group_By
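-- "u" is deliberately left undefined, so it matches any row and starts each match;
-- "v*" then greedily extends the match while id = prev(id) + 1. Each match is
-- therefore exactly one island, and match_number() provides the group marker.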
with
prep ( id, val, grp ) as (
select id, v, grp
from a
match_recognize (
order by id
measures match_number() as grp, val as v
all rows per match
pattern (u v*)
define v as id = prev(id) + 1
)
)
select count(*) as ct
from ( select grp from prep group by grp having sum(val) > 4 )
;
-- METHOD: Match_Recognize
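-- Same pattern as above, but with ONE ROW PER MATCH the aggregate sum(val) in the
-- MEASURES clause is computed over all the rows of the match (the whole island),
-- so the flag can be produced per island without a separate GROUP BY pass.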
select count(flag) as ct
from a
match_recognize (
order by id
measures case when sum(val) > 4 then 1 end as flag
one row per match
pattern (u v*)
define v as id = prev(id) + 1
)
;
Results (all times in seconds):

Test_no    Start_of_Group    Tabibitosan    MR_w_Group_By    Match_Recognize
      1              70.6           25.2             67.8               30.3
      2              70.0           26.1             64.9               31.2
      3              70.0           28.6             65.1               30.1
      4              73.6           28.4             63.8               31.9
      5              71.9           28.3             67.4               31.0
      6              71.1           27.1             66.8               32.9
      7              71.8           28.3             68.2               31.6
      8              71.6           27.1             67.9               30.4
AVERAGE              71.3           27.4             66.5               31.2