Hi,
Oracle documentation, such as
https://docs.oracle.com/cd/B19306_01/B14251_01/adfns_regexp.htm
says that *? "matches the empty string whenever possible", but it doesn't seem to match the empty string in this case:
CREATE TABLE table_x
( x_id NUMBER (6) PRIMARY KEY
, str VARCHAR2 (50) NOT NULL
);
INSERT INTO table_x (x_id, str) VALUES (1, 'b');
INSERT INTO table_x (x_id, str) VALUES (2, 'a b');
INSERT INTO table_x (x_id, str) VALUES (3, 'c,dd,,ef g h');
SELECT *
FROM table_x x
CROSS APPLY (
SELECT LEVEL AS n
, REGEXP_SUBSTR ( x.str
, '(.*?)(,|$)'
, 1
, LEVEL
, NULL
, 1
) AS item
, REGEXP_INSTR ( x.str
, '(.*?)(,|$)'
, 1
, LEVEL
) AS pos
FROM dual
CONNECT BY LEVEL <= REGEXP_COUNT (x.str, '(.*?)(,|$)') -- clause 1
-- CONNECT BY LEVEL <= REGEXP_COUNT (x.str, '(,|$)') -- clause 2
) c
ORDER BY x_id, c.n
;
The output I get is:
X_ID STR N ITEM POS
----- -------------------- ----- ---------- -----
1 b 1 b 1
1 b 2 2
2 a b 1 a b 1
2 a b 2 4
3 c,dd,,ef g h 1 c 1
3 c,dd,,ef g h 2 dd 3
3 c,dd,,ef g h 3 6
3 c,dd,,ef g h 4 ef g h 7
3 c,dd,,ef g h 5 13
I expected item to be NULL in all cases, since nothing (that is, an empty string) right before a comma (or end-of-string) is a match (as confirmed by the output above when x_id=3 and n=3). The matches aren't as greedy as they could be; that is, I see (and understand) different results when I use * instead of *? , but I don't see why, for example, in x_id=2, the 3 character sub-string 'a b' is considered a match, when the 1-character substring 'b' would be less greedy, and the empty string would be less greedy still.
Also, REGEXP_COUNT is returning N+1 when I expect N. For example, in the row with x_id=1, there are 0 commas and 1 end-of-string, so I expected REGEXP_COUNT to return 1, not 2. I expected both CONNECT BY clauses above to produce the same results, but clause 1 returns a value 1 greater than I expected, while clause 2 does just what I expected.
I'm using Oracle 12.2.0.1.0. I get the same results in version 12.1.0.1.0.
Can someone explain this behavior? Thanks!