Skip to Main Content

SQL & PL/SQL

Announcement

For appeals, questions and feedback about Oracle Forums, please email oracle-forums-moderators_us@oracle.com. Technical questions should be asked in the appropriate category. Thank you!

Non-Greedy Regular Expression (.*?) Seems Slightly Greedy

Frank KulashJan 2 2018 — edited Jan 9 2018

Hi,

Oracle documentation, such as

https://docs.oracle.com/cd/B19306_01/B14251_01/adfns_regexp.htm

says that   *?  "matches the empty string whenever possible", but it doesn't seem to match the empty string in this case:

CREATE TABLE  table_x
(  x_id   NUMBER (6)      PRIMARY KEY
,  str    VARCHAR2 (50)  NOT NULL
);

INSERT INTO  table_x (x_id, str) VALUES (1, 'b');
INSERT INTO  table_x (x_id, str) VALUES (2, 'a b');
INSERT INTO  table_x (x_id, str) VALUES (3, 'c,dd,,ef g h');

SELECT       *
FROM         table_x  x
CROSS APPLY  (  
                 SELECT  LEVEL            AS n
                 ,       REGEXP_SUBSTR ( x.str 
                                       , '(.*?)(,|$)'
                                       , 1
                                       , LEVEL
                                       , NULL
                                       , 1
                                       )  AS item
                 ,       REGEXP_INSTR ( x.str
                                      , '(.*?)(,|$)'
                                      , 1
                                      , LEVEL
                                      )  AS pos
                 FROM    dual
                 CONNECT BY  LEVEL  <= REGEXP_COUNT (x.str, '(.*?)(,|$)')  -- clause 1
--               CONNECT BY  LEVEL  <= REGEXP_COUNT (x.str, '(,|$)')       -- clause 2
             )        c
ORDER BY     x_id, c.n
;

The output I get is:

X_ID STR                      N ITEM         POS

----- -------------------- ----- ---------- -----

    1 b                        1 b              1

    1 b                        2                2

    2 a b                      1 a b            1

    2 a b                      2                4

    3 c,dd,,ef g h             1 c              1

    3 c,dd,,ef g h             2 dd             3

    3 c,dd,,ef g h             3                6

    3 c,dd,,ef g h             4 ef g h         7

    3 c,dd,,ef g h             5               13

I expected item to be NULL in all cases, since nothing (that is, an empty string) right before a comma (or end-of-string) is a match (as confirmed by the output above when x_id=3 and n=3).  The matches aren't as greedy as they could be; that is, I see (and understand) different results when I use   *   instead of   *?   , but I don't see why, for example, in x_id=2, the 3 character sub-string 'a b' is considered a match, when the 1-character substring 'b' would be less greedy, and the empty string would be less greedy still.

Also, REGEXP_COUNT is returning N+1 when I expect N.  For example, in the row with x_id=1, there are 0 commas and 1 end-of-string, so I expected REGEXP_COUNT to return 1, not 2.  I expected both CONNECT BY clauses above to produce the same results, but clause 1 returns a value 1 greater than I expected, while clause 2 does just what I expected.

I'm using Oracle 12.2.0.1.0.  I get the same results in version 12.1.0.1.0.

Can someone explain this behavior?  Thanks!

This post has been answered by BluShadow on Jan 4 2018
Jump to Answer
Comments
Locked Post
New comments cannot be posted to this locked post.
Post Details
Locked on Feb 6 2018
Added on Jan 2 2018
38 comments
5,136 views