Oracle SE2 19.18.0
OS: Linux x86 64-bit
We have a performance issue for a query that occurred after regularly gathered statistics. The query went from subsecond to minutes. The issue was tracked to a part in the execution plan that shows an incorrect hash join cardinality estimate of 1 which resulted in subsequent NL joins in the next steps.
Specifically line 73 here has an estimate of 1 when actual is not. When the next CARTESIAN step also shows estimate of 1, this became highly suspect the selectivity is <<1.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows |E-Bytes|E-Temp | Cost (%CPU)| E-Time | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
...
| 33 | VIEW | VW_CALENDAR_FORMS | 1 | 237K| 222M| | 3710 (1)| 00:00:01 | 147K|00:04:29.06 | 142M| | | |
| 34 | TEMP TABLE TRANSFORMATION | | 1 | | | | | | 147K|00:04:29.03 | 142M| | | |
| 35 | LOAD AS SELECT (CURSOR DURATION MEMORY) | SYS_TEMP_0FDA5BA5C_A4082E5C | 1 | | | | | | 0 |00:00:00.40 | 2616 | 1024 | 1024 | |
|* 36 | HASH JOIN | | 1 | 142K| 21M| | 1358 (1)| 00:00:01 | 147K|00:00:00.32 | 2615 | 1068K| 1068K| 1349K (0)|
| 37 | VIEW | index$_join$_073 | 1 | 399 | 13566 | | 5 (0)| 00:00:01 | 399 |00:00:00.01 | 20 | | | |
|* 38 | HASH JOIN | | 1 | | | | | | 399 |00:00:00.01 | 20 | 1106K| 1106K| 1683K (0)|
| 39 | INDEX FAST FULL SCAN | UK_EDC_FORM_NAME_VERSION | 1 | 399 | 13566 | | 5 (0)| 00:00:01 | 399 |00:00:00.01 | 16 | | | |
| 40 | INDEX FAST FULL SCAN | PK_EDC_FORM | 1 | 399 | 13566 | | 1 (0)| 00:00:01 | 399 |00:00:00.01 | 4 | | | |
|* 41 | HASH JOIN | | 1 | 142K| 16M| | 1352 (1)| 00:00:01 | 147K|00:00:00.26 | 2595 | 1995K| 1995K| 1691K (0)|
| 42 | TABLE ACCESS FULL | CALENDAR | 1 | 146 | 1022 | | 6 (0)| 00:00:01 | 151 |00:00:00.01 | 14 | | | |
|* 43 | HASH JOIN | | 1 | 146K| 16M| | 1346 (1)| 00:00:01 | 147K|00:00:00.23 | 2581 | 1300K| 1300K| 1521K (0)|
| 44 | TABLE ACCESS FULL | CALENDAR_FORM | 1 | 4616 | 180K| | 19 (0)| 00:00:01 | 4616 |00:00:00.01 | 60 | | | |
|* 45 | HASH JOIN | | 1 | 146K| 10M| | 1326 (1)| 00:00:01 | 147K|00:00:00.17 | 2521 | 1922K| 1922K| 1737K (0)|
| 46 | TABLE ACCESS FULL | EDC_FORM_USAGE | 1 | 4618 | 41562 | | 14 (0)| 00:00:01 | 4616 |00:00:00.01 | 45 | | | |
|* 47 | HASH JOIN | | 1 | 146K| 9611K| 3304K| 1312 (1)| 00:00:01 | 147K|00:00:00.12 | 2476 | 9122K| 2395K| 8130K (0)|
| 48 | TABLE ACCESS FULL | EDC_FORM_INSTANCE | 1 | 146K| 1578K| | 262 (1)| 00:00:01 | 147K|00:00:00.01 | 961 | | | |
| 49 | TABLE ACCESS FULL | CALENDAR_FORM_INSTANCE | 1 | 147K| 8062K| | 414 (1)| 00:00:01 | 147K|00:00:00.03 | 1515 | | | |
...
| 69 | NESTED LOOPS | | 1 | 1 | 604 | | 949 (1)| 00:00:01 | 4335 |00:04:28.27 | 142M| | | |
| 70 | NESTED LOOPS | | 1 | 1 | 604 | | 949 (1)| 00:00:01 | 4335 |00:04:28.24 | 142M| | | |
| 71 | NESTED LOOPS | | 1 | 1 | 567 | | 948 (1)| 00:00:01 | 4335 |00:04:28.22 | 142M| | | |
| 72 | MERGE JOIN CARTESIAN | | 1 | 1 | 553 | | 947 (1)| 00:00:01 | 142M|00:00:32.92 | 14 | | | |
|* 73 | HASH JOIN | | 1 | 1 | 540 | | 829 (1)| 00:00:01 | 4335 |00:00:00.08 | 14 | 1856K| 1856K| 1692K (0)|
| 74 | TABLE ACCESS FULL | TIMEPOINT_FORM | 1 | 1158 | 10422 | | 6 (0)| 00:00:01 | 1177 |00:00:00.01 | 14 | | | |
|* 75 | VIEW | | 1 | 141K| 71M| | 823 (1)| 00:00:01 | 4335 |00:00:00.06 | 0 | | | |
| 76 | TABLE ACCESS FULL | SYS_TEMP_0FDA5BA5C_A4082E5C | 1 | 141K| 16M| | 823 (1)| 00:00:01 | 147K|00:00:00.04 | 0 | | | |
| 77 | BUFFER SORT | | 4335 | 32779 | 416K| | 947 (1)| 00:00:01 | 142M|00:00:19.53 | 0 | 1186K| 567K| 1054K (0)|
| 78 | VIEW | | 1 | 32779 | 416K| | 118 (1)| 00:00:01 | 32883 |00:00:00.01 | 0 | | | |
| 79 | TABLE ACCESS FULL | SYS_TEMP_0FDA5BA5D_A4082E5C | 1 | 32779 | 2784K| | 118 (1)| 00:00:01 | 32883 |00:00:00.01 | 0 | | | |
|* 80 | TABLE ACCESS BY INDEX ROWID | TIMEPOINT_INSTANCE | 142M| 1 | 14 | | 1 (0)| 00:00:01 | 4335 |00:03:30.00 | 142M| | | |
|* 81 | INDEX UNIQUE SCAN | PK_TIMEPOINT_INSTANCE_ID | 142M| 1 | | | 0 (0)| | 142M|00:01:11.05 | 1718 | | | |
|* 82 | INDEX UNIQUE SCAN | PK_TIMEPOINTS_ID | 4335 | 1 | | | 0 (0)| | 4335 |00:00:00.02 | 990 | | | |
| 83 | TABLE ACCESS BY INDEX ROWID | TIMEPOINT | 4335 | 1 | 37 | | 1 (0)| 00:00:01 | 4335 |00:00:00.02 | 4335 | | | |
...
32 - access("ESI"."SECTION_ELEMENT_ID"="FE"."FORM_ELEMENT_ID")
36 - access("EFU"."FORM_ID"="EF"."FORM_ID")
38 - access(ROWID=ROWID)
41 - access("CF"."CALENDAR_ID"="CAL"."CALENDAR_ID")
43 - access("CF"."EDC_FORM_USAGE_ID"="EFU"."FORM_USAGE_ID")
45 - access("EFI"."FORM_USAGE_ID"="EFU"."FORM_USAGE_ID")
47 - access("CFI"."FORM_INSTANCE_ID"="EFI"."FORM_INSTANCE_ID")
51 - access("VI"."VISIT_ID"="V"."VISIT_ID")
...
60 - access("CFI"."VISIT_FORM_ID"="VF"."VISIT_FORM_ID")
62 - filter("CFI"."TYPE"='VISIT_SCHEDULED')
64 - access("CVI"."VISIT_INSTANCE_ID"="CFI"."VISIT_INSTANCE_ID")
67 - filter("CFI"."TYPE"='VISIT_ADDITIONAL')
73 - access("CFI"."TIMEPOINT_FORM_ID"="TF"."TIMEPOINT_FORM_ID")
75 - filter("CFI"."TYPE"='TIMEPOINT')
80 - filter("CVI"."VISIT_INSTANCE_ID"="TI"."VISIT_INSTANCE_ID")
81 - access("TI"."TIMEPOINT_INSTANCE_ID"="CFI"."TIMEPOINT_INSTANCE_ID")
82 - access("TI"."TIMEPOINT_ID"="T"."TIMEPOINT_ID")
84 - access("CFI"."ONGOING_FORM_ID"="ONG"."ONGOING_FORM_ID")
Since this was an effect of statistic changes the tables involved were scrutinized for what could possibly have caused the plan change. Everything was considered and tried: histograms, low-hi values, etc.
We were able to replicate the issue on a TEST system with different data. So this does not seem to be anything specific to this schema and data. We saved the stats in a stattab on the TEST system. Gathering new stats was able to change back to the better execution plan. We saved the new stats in a statstab. Importing the old stats reproduces the problem plan. So it's clearly a stats problem. The question is, what stat is causing the problem?
After gathering stats on all the tables in the problematic part of the plan individually and retesting, the table causing the issue was found. CALENDAR.
Each column stat for the table was imported individually and retested. None of the column stats or histograms changed anything.
Lastly we tried to import the table stats from the statstab. (Note: dbms_stats.import_table_stats(…cascade=>false) did not work to import just the table stats. Column stats were also imported. So instead we resulted to using dbms_stats.set_table_stats.)
BEGIN
DBMS_STATS.set_table_stats (
ownname => 'schema_test'
, tabname => 'calendar'
, numrows => 54
, no_invalidate => false
);
END;
/
The original numrows value was 50. While the number was not exact, it was not that far off that it should make that much difference.
This is where the real question is.
Merely setting the table stat “numrows” to a slightly larger value changes the selectivity Oracle uses from 0 to a non-zero value resulting in drastically different execution plans.
I realize I have not provided the entire SQL and full execution plan. But has anyone ever encountered this before? This was entirely a process of elimination. I would never has suspected the table statistic numrows changed from 50 to 54 (actually even 52) would result in such a dramatic change. It seems like a bug.
The 10053 traces of both scenarios shows the different selectivities being used.
Even more interesting is that the table identified with the numrows change is CALENDAR which is used in the materialized CTE composing the SRC and the cardinality for SRC in the trace is correct. Just the join selectivity goes awry when changing that numrows value. Also, the join column TIMEPOINT_FORM_ID is not on the CALENDAR table.
***********************
Join order[2]: SRC[SRC]#1 TIMEPOINT_FORM[TF]#0
***************
Now joining: TIMEPOINT_FORM[TF]#0
***************
NL Join
Outer table: Card: 20524.000000 Cost: 112.423710 Resp: 112.423710 Degree: 1 Bytes:
Access path analysis for TIMEPOINT_FORM
Scan IO Cost (Disk) = 4.333415
Scan CPU Cost (Disk) = 150243.040000
Cost of predicates:
io = NOCOST, cpu = 50.000000, sel = 0.000000 flag = 2048 ("SRC"."TIMEPOINT_FORM_ID"="TF"."TIMEPOINT_FORM_ID")
...
Join Card: 0.000000 = outer (20524.000000) * inner (110.000000) * sel (0.000000)
Join Card - Rounded: 1 Computed: 0.000000
***************
Now joining: TIMEPOINT_FORM[TF]#0
***************
NL Join
Outer table: Card: 21755.000000 Cost: 118.449024 Resp: 118.449024 Degree: 1 Bytes:
Access path analysis for TIMEPOINT_FORM
Scan IO Cost (Disk) = 4.333395
Scan CPU Cost (Disk) = 150243.040000
Cost of predicates:
io = NOCOST, cpu = 50.000000, sel = 0.000227 flag = 2048 ("SRC"."TIMEPOINT_FORM_ID"="TF"."TIMEPOINT_FORM_ID")
...
Join Card: 543.000000 = outer (21755.000000) * inner (110.000000) * sel (2.2691e-04)
Join Card - Rounded: 543 Computed: 543.000000
I can try to provide more information if needed.
Tuning the SQL is not the issue. But merely changing the numrows stat from 50 to 54 does not seem like it should affect the optimizer in this way. It is as if that data is at an inflection point causing the CBO to choose a different calculation.
Wondering if anyone has experienced this before. My internet and MOS searches have not resulted in anything.
I realize this will probably wind up with an Oracle SR but any ideas are appreciated.
Thanks
Andrew