I noticed this some time ago, and I thought I posted about it either here or on stackoverflow... but I just searched for it and I wasn't able to find anything. Perhaps my recollection is wrong (about posting about it), or my searching skills are getting poorer.
It seems to me - see a few examples below - that Oracle has an implementation bug when it comes to mixing non-greedy and greedy quantifiers.
Questions:
1. Am I interpreting the regular expressions incorrectly? (Meaning, is the observed behavior correct?)
2. If the behavior I describe below is incorrect, is this a known bug or issue? Does it depend on version? (I get the same results on 12.1.0.2 and 12.2.0.1)
3. If this is a bug: Does Oracle know about it? Are they planning to fix it?
with
inputs ( str ) as (
select 'axaybzb' from dual
)
select str,
regexp_substr(str, 'a.*?a.*b') as res_1,
regexp_substr(str, 'ax*?a.*b') as res_2
from inputs;
STR RES_1 RES_2
------- ------- -------
axaybzb axayb axaybzb
In this example, RES_1 is incorrect; it should be the same as RES_2. In the computation of the first result, Oracle behaves as if the second quantifier is non-greedy, like the first. WORSE, Oracle is inconsistent about this: if the first quantifier is still non-greedy, but it quantifies a literal letter x rather than the dot, the behavior changes: the greedy quantifier following the dot behaves correctly.
I won't post the other tests (you can easily replicate them yourselves, or you can take my word for it): If you change the non-greedy quantifier to +? instead of *? in both calculations, the result will be correct in both cases. HOWEVER, if you do that and you also change the greedy quantifier from * to + in both calculations, then RES_1 will again be incorrect.
Similarly: If in the original test (shown in full above), in RES_1, we replace the dot with [^s] in one place but we keep the dot in the other, we get the correct answer. But if we replace the dot with [^s] in both places, the result is wrong again.
I thought in some tests in the past I was able to find cases when Oracle calculates incorrectly an expression where .* is followed by .*? , greedy followed by non-greedy, but I wasn't able to replicate that right now.
MATCH_RECOGNIZE
Oracle introduced the MATCH_RECOGNIZE clause in 12.1, which has the PATTERN clause that uses some of the concepts and notation from regular expressions. The following test translates essentially the same case from regular expressions to MATCH_RECOGNIZE - testing only the RES_1 expression which gives the wrong answer for regular expression string functions. The result for MATCH_RECOGNIZE is the right one.
with
inputs ( id, str ) as (
select 1, 'a' from dual union all
select 2, 'x' from dual union all
select 3, 'a' from dual union all
select 4, 'y' from dual union all
select 5, 'b' from dual union all
select 6, 'z' from dual union all
select 7, 'b' from dual
)
select id, str, cls
from inputs
match_recognize(
order by id
measures classifier() as cls
all rows per match
pattern ( a f*? a f* b )
define a as str = 'a'
);
ID STR CLS
-- --- ---
1 a A
2 x F
3 a A
4 y F
5 b F
6 z F
7 b B