Perhaps this has always been crystal clear to all but me. On the other hand, it is also possible that I am not alone in this boat.
The problem is to find a short(er) string as a substring of a longer one. More specifically, the problem is to find ALL occurrences of a short(er) string as a substring of a longer one. Even more specifically, the question is - what does “ALL” mean? Are the “matches” required to be non-overlapping, or are we really asking for ALL occurrences?
The most trivial example is to find all occurrences of AA within the string AAA. How many matches are there?
The documentation for INSTR is explicit. Quote:
Each consecutive compared substring of string
begins one character to the right (for forward searches) or one character to the left (for backward searches) from the first character of the previous compared substring.
And the documentation doesn't lie. The first match (shown in bold) is AAA, and there is a second match, AAA. And indeed:
select instr('aaa', 'aa', 1, 2) as position_of_second_match from dual;
POSITION_OF_SECOND_MATCH
------------------------
2
There is a second match, starting at the second character of the “searched” string.
Regarding REGEXP_INSTR, the documentation says
REGEXP_INSTR
extends the functionality of the INSTR
function […]
That is simply false - specifically as regards the n-th occurrence where n > 1. This simple statement in the documentation has caused me, for almost eight years, to believe that INSTR, like REGEXP_INSTR, starts the “next” search at the first character after a match, if a match has been found. That is, “occurrences” are defined to be non-overlapping. This is, in fact, the behavior of REGEXP_INSTR - and it is in direct contradiction to the INSTR behavior, it does not “extend the functionality of” …
select regexp_instr('aaa', 'aa', 1, 2) as position_of_second_match from dual;
POSITION_OF_SECOND_MATCH
------------------------
0
Motivation:
My interest in INSTR was renewed today when a question appeared on this forum, but then disappeared after an hour or so. Too bad; the problem, while pretty trivial, was interesting in one respect. It proposed two tables - one with 1 million rows, each storing a long(er) string of digits, and a second table with 10,000 rows storing shorter strings. The task was to find all “matches” where a longer string (from the first table) contains a shorter one (from the smaller second table). I just happened to be reading through the different algorithms for matching strings, over the past few days, when this question appeared. What is interesting here is performance - and it seems to me that it is a perfect example of a problem where plain SQL stands no chance against a PL/SQL solution (or one written in C or Java and wrapped within PL/SQL). The best algorithms build a finite automaton, or a “partial match” table, etc., for the shorter string. Even if INSTR is implemented using one of these algorithms, I don't see how it would save one of these machines (one for each “searched-for” substring) just once, to be used against all “searched” strings. Calling INSTR for every pair in a SQL query means that this prep work will be done 1 million times for each substring, instead of just once.
That's moot for now, but for one thing, as I was reading through the various algorithms, I made a note that they all search for ALL matches, including overlapping ones. Which, after testing right now, I know that INSTR does. Which motivated me to write this post.