Oracle 11.2.0.4 running on an Exadata box.
This is not really a tuning question, I have already re-written the posted query to return in a reasonable amount of time. I realize that the query as written is not particularly efficient, but I would expect it to return results, which it doesn't, at least not in finite time.
There are three tables involved:
Table1 holds information about things. The data comes from multiple sources and each source has its own ID for that thing (sourceid), in many cases these things represent the same entity. There is a, theoretically, unique common identifier that can identify the entity across all of the sources (commonid), but this information is not always available. When it is, we use that to link together the "same" entity across the various sources and assign the same id. When it is not present, we use a combination of other attributes to link the entities together.
There was a bug in the linking code which resulted in many entities being incorrectly linked together. Prior to unlinking the inappropriately linked entities we created the two other tables which have the ID, source and sourceid of the affected entities at the time of unlinking.
The query is intended to find all of the source/sourceid combinations where there are multiple commonid values in the same linked set, that we have not already identified (hence the not in table1/table2 construct).
The query created by the developer was:
select * from table1
where id in (select r1.id
from table1 r1, table1 r2
where r1.id = r2.id and
r1.commonid != r2.commonid and
r1.commonid is not null and
r2.commonid is not null and
r1.id not in (select r.id
from table1 r, table2 u
where r.source = u.source and
r.sourceid = u.sourceid) and
r1.id not in (select r.id
from table1 r, table3 u
where r.source = u.source and
r.sourceid = u.sourceid))
which resulted in the following plan:
----------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 10M| 2058M| 949M (1)|999:59:59 |
|* 1 | FILTER | | | | | |
| 2 | TABLE ACCESS STORAGE FULL | TABLE1 | 10M| 2058M| 71307 (1)| 00:14:16 |
| 3 | NESTED LOOPS ANTI | | 1 | 46 | 15 (0)| 00:00:01 |
| 4 | NESTED LOOPS | | 1 | 44 | 6 (0)| 00:00:01 |
|* 5 | TABLE ACCESS BY INDEX ROWID | TABLE1 | 1 | 22 | 3 (0)| 00:00:01 |
|* 6 | INDEX RANGE SCAN | ID_INDEX | 1 | | 2 (0)| 00:00:01 |
|* 7 | HASH JOIN | | 1 | 37 | 77 (2)| 00:00:01 |
| 8 | TABLE ACCESS BY INDEX ROWID | TABLE1 | 2 | 44 | 5 (0)| 00:00:01 |
|* 9 | INDEX RANGE SCAN | ID_INDEX | 2 | | 3 (0)| 00:00:01 |
| 10 | TABLE ACCESS STORAGE FULL FIRST ROWS| TABLE2 | 57379 | 840K| 71 (0)| 00:00:01 |
|* 11 | TABLE ACCESS BY INDEX ROWID | TABLE1 | 1 | 22 | 3 (0)| 00:00:01 |
|* 12 | INDEX RANGE SCAN | ID_INDEX | 1 | | 2 (0)| 00:00:01 |
| 13 | VIEW PUSHED PREDICATE | VW_NSO_1 | 1 | 2 | 9 (0)| 00:00:01 |
|* 14 | HASH JOIN | | 1 | 36 | 9 (0)| 00:00:01 |
| 15 | TABLE ACCESS BY INDEX ROWID | TABLE1 | 2 | 44 | 5 (0)| 00:00:01 |
|* 16 | INDEX RANGE SCAN | ID_INDEX | 2 | | 3 (0)| 00:00:01 |
| 17 | TABLE ACCESS STORAGE FULL FIRST ROWS | TABLE3 | 818 | 11452 | 4 (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter( EXISTS (SELECT /*+ <not feasible>)
5 - filter("R1"."COMMONID" IS NOT NULL)
6 - access("R1"."ID"=:B1)
filter( NOT EXISTS (SELECT 0 FROM "TABLE2" "U","TABLE1" "R" WHERE
"R"."ID"=:B1 AND "R"."SOURCEID"="U"."SOTRCEID" AND "R"."SOURCE"="U"."SOURCE"))
7 - access("R"."SOURCE"="U"."SOURCE" AND "R"."SOURCEID"="U"."SOURCEID")
9 - access("R"."ID"=:B1)
11 - filter("R2"."COMMONID" IS NOT NULL AND "R1"."COMMONID"<>"R2"."COMMONID")
12 - access("R2"."ID"=:B1)
filter("R1"."ID"="R2"."ID")
14 - access("R"."SOURCE"="U"."SOURCE" AND "R"."SOURCEID"="U"."SOURCEID")
16 - access("R"."ID"="R1"."ID")
The row count estimates for all of the table scans are bang on, I am a little dubious about the expected rows from the index accesses on table1, but they are not hugely off, I would expect something more like in the tens of rows on average.
What does puzzle me is the massive increase in time and cost between line 2 and line 0. I am also puzzled by the presence of a filter operation at line 1, and particularly by the predicate shown against it "filter( EXISTS (SELECT /*+ <not feasible>)". Have we made the optimizer give up?
Any thoughts would be appreciated, and if more information would help, I can try to get it.
Thanks
John