Hi, I am using Oracle version 11.2.0.4.0.
We have a reporting query that normally completes within an hour, but on some specific days it runs for well over an hour in the production DB. Recently, whenever it exceeded an hour the DBAs killed the query, flushed the shared pool and asked us to trigger the report again, and it then completed within an hour.
But when I ran the query below I see only one plan hash value and one associated plan in AWR, and the DBAs also confirmed that the execution plan stays the same even after flushing the shared pool, after which the query completes smoothly. So I don't think there is any justification for flushing the shared pool; it is simply something a DBA once suggested and we have kept doing it that way.
select plan_table_output from table(dbms_xplan.display_awr('123c6wxrdufjs'));
-------------------------------------------------------------------------------------------------------------------
| Id  | Operation                              | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
-------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |          |       |       |   927K(100)|          |       |       |
|   1 |  TABLE ACCESS BY LOCAL INDEX ROWID     | TAB1     |     1 |    63 |   687   (1)| 00:00:09 |       |       |
|   2 |   NESTED LOOPS                         |          |     1 |   237 |   927K  (1)| 03:05:28 |       |       |
|   3 |    NESTED LOOPS                        |          |     1 |   174 |   926K  (1)| 03:05:20 |       |       |
|   4 |     MERGE JOIN CARTESIAN               |          |   918K|   124M|  7627   (1)| 00:01:32 |       |       |
|   5 |      TABLE ACCESS BY INDEX ROWID       | TAB2     |     1 |    14 |     1   (0)| 00:00:01 |       |       |
|   6 |       INDEX UNIQUE SCAN                | TAB2_U01 |     1 |       |     0   (0)|          |       |       |
|   7 |      PARTITION LIST ALL                |          |   918K|   112M|  7626   (1)| 00:01:32 |     1 |     6 |
|   8 |       TABLE ACCESS FULL                | TAB3     |   918K|   112M|  7626   (1)| 00:01:32 |     1 |     6 |
|   9 |     PARTITION LIST SINGLE              |          |     1 |    32 |     1   (0)| 00:00:01 |     1 |     1 |
|  10 |      TABLE ACCESS BY LOCAL INDEX ROWID | TAB4     |     1 |    32 |     1   (0)| 00:00:01 |     1 |     1 |
|  11 |       INDEX UNIQUE SCAN                | TAB4_U01 |     1 |       |     1   (0)| 00:00:01 |     1 |     1 |
|  12 |    PARTITION RANGE ITERATOR            |          |     1 |       |   687   (1)| 00:00:09 |   KEY |   KEY |
|  13 |     INDEX RANGE SCAN                   | TAB1_IX1 |     1 |       |   687   (1)| 00:00:09 |   KEY |   KEY |
-------------------------------------------------------------------------------------------------------------------
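For completeness, the plan history can also be cross-checked straight from DBA_HIST_SQLSTAT with something along these lines (only a sketch; it lists every plan_hash_value AWR has ever captured for this SQL_ID and when it was seen, which should show just the single plan above):

-- sketch: every plan AWR has recorded for this sql_id, with first/last snapshot times
select   st.plan_hash_value,
         min(sn.begin_interval_time)  first_seen,
         max(sn.end_interval_time)    last_seen,
         sum(st.executions_delta)     execs
from     dba_hist_sqlstat  st
join     dba_hist_snapshot sn
         on  sn.snap_id         = st.snap_id
         and sn.dbid            = st.dbid
         and sn.instance_number = st.instance_number
where    st.sql_id = '123c6wxrdufjs'
group by st.plan_hash_value
order by first_seen;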
I didn't see any unusual wait events during that period.
I dug into the AWR data in DBA_HIST_SQLSTAT and DBA_HIST_ACTIVE_SESS_HISTORY to find the cause and to see where the query actually spends most of its time. I did this for the last 4 days on which it ran smoothly and found that it spends most of its time on plan line id 13 (the INDEX RANGE SCAN of TAB1_IX1), followed by plan line id 1 (the TABLE ACCESS BY LOCAL INDEX ROWID of TAB1).
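The per-line profile came from counting ASH samples, roughly like the sketch below (each row in DBA_HIST_ACTIVE_SESS_HISTORY stands for about 10 seconds of DB time, so the counts are only an approximation of where the time went):

-- sketch: approximate DB-time profile per plan line from AWR ASH samples
select   sql_plan_line_id,
         sql_plan_operation,
         sql_plan_options,
         count(*)                                           samples,
         round(100 * ratio_to_report(count(*)) over (), 1)  pct
from     dba_hist_active_sess_history
where    sql_id  = '123c6wxrdufjs'
and      snap_id between :strt_snap and :end_snap
group by sql_plan_line_id, sql_plan_operation, sql_plan_options
order by samples desc;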
Then I looked at the physical I/O figures and found that the query did a considerably larger amount of physical reads on 2nd May in its first attempt, while the second attempt ran with much less I/O. I ran the query below for the last 4 days:
select *
from  ( select   sql_id, sql_plan_hash_value, event, sql_exec_id, sql_exec_start,
                 current_obj#, sql_plan_line_id, sql_plan_operation, sql_plan_options,
                 sum(delta_read_io_requests)  read_io_requests,
                 sum(delta_read_io_bytes)     read_io_bytes,
                 count(*)                     sample_count
        from     dba_hist_active_sess_history
        where    sql_id   = '123c6wxrdufjs'
        and      snap_id >= :strt_snap
        and      snap_id <= :end_snap
        group by sql_id, sql_plan_hash_value, event, sql_exec_id, sql_exec_start,
                 current_obj#, sql_plan_line_id, sql_plan_operation, sql_plan_options
      )
order by sample_count desc
Using the ASH query above, I got the physical I/O statistics below for plan line id 13, where the query spent most of its time and did the most physical I/O:
date                   physical_read_bytes_line_id_13  report_query_started       report_query_ended
30-Apr-2016                                 2312749056  04/30/2016 2:24:29 AM      04/30/2016 3:15:16.617 AM
29-Apr-2016                                 1260756992  04/29/2016 2:21:26 AM      04/29/2016 3:00:51.074 AM
28-Apr-2016                                 1050910720  04/28/2016 2:06:50 AM      04/28/2016 2:45:31.520 AM
01-May-2016                                 2848907264  05/01/2016 1:58:13.450 AM  05/01/2016 3:00:14.442 AM
02-May-2016 (1st run)                      40593858560  05/02/2016 1:21:12.000 AM  05/02/2016 2:45:30.964 AM (query killed)
02-May-2016 (2nd run)                       2542837760  05/02/2016 4:34:12.000 AM  05/02/2016 5:30:54.540 AM
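As a statement-level cross-check of these figures, the per-day deltas can also be summed from DBA_HIST_SQLSTAT, roughly like this sketch (I am assuming PHYSICAL_READ_BYTES_DELTA is populated on 11.2; DISK_READS_DELTA could be used instead):

-- sketch: day-wise executions, elapsed time and physical read volume for the statement
select   trunc(sn.begin_interval_time)            snap_day,
         sum(st.executions_delta)                 execs,
         round(sum(st.elapsed_time_delta) / 1e6)  elapsed_sec,
         sum(st.disk_reads_delta)                 disk_reads,
         sum(st.physical_read_bytes_delta)        phys_read_bytes
from     dba_hist_sqlstat  st
join     dba_hist_snapshot sn
         on  sn.snap_id         = st.snap_id
         and sn.dbid            = st.dbid
         and sn.instance_number = st.instance_number
where    st.sql_id = '123c6wxrdufjs'
and      st.snap_id between :strt_snap and :end_snap
group by trunc(sn.begin_interval_time)
order by snap_day;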
Now what I am unable to understand is: is this because the index blocks had been aged out of the buffer cache, so that the INDEX RANGE SCAN at plan line 13 had to do far more physical reads and took so much longer in the first attempt? If my assumption is correct, how should I cater for this scenario?
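To test that assumption, I am thinking of counting how many blocks of TAB1_IX1 are sitting in the buffer cache just before each run, with something like the sketch below (V$BH.OBJD is the data object number, so it is joined to DBA_OBJECTS.DATA_OBJECT_ID; an owner filter would be needed if the index name is not unique):

-- sketch: blocks of TAB1_IX1 (per index partition) currently in the buffer cache
select   o.object_name,
         o.subobject_name  partition_name,
         count(*)          cached_blocks
from     v$bh        b
join     dba_objects o
         on o.data_object_id = b.objd
where    o.object_name = 'TAB1_IX1'
and      b.status     <> 'free'
group by o.object_name, o.subobject_name
order by cached_blocks desc;

If the cached block count is low just before the first overnight run but high before the re-run, that would back up the theory that the first execution has to read most of the index from disk while the second one finds it already cached.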