Oracle 12.1 on AIX, single instance
This is large (30 TB) and busy database running mix of OLPT and batch. Commit rate is low (60 per second), avg wait on log file sync is 20 - 30 ms. Normally waits on LFS are less than 5% of total waits. Today for a brief period LFS waits became very slow:
- Statspack repor for 20 min interval shows avg wait on LFS = 303 ms. No other abnormal waits during this interval
- Waits on log file parallel write remained normal at 11 ms
- Other metrics were normal: transactions per second = 60, redo per transaction = 50K
- No significant changes to the logical reads (1 million per sec), physical reads (100K per sec), etc
Looking at active session history I can see that normally we get less than 5 sessions waiting on LFS. Today during 1 min interval the number of sessions waiting on LFS exceeded 600. It happened out of the blue, there were no ORA-00600 or ORA-07445. At the LPAR level CPU and memory utilization remained normal. There is a trace file for log writer in /trace, but it hasn't been updated for 20 hours.
Histogram distribution:
Total ----------------- % of Waits ------------------
Event Waits <1ms <2ms <4ms <8ms <16ms <32ms <=1s >1s
-------------------------- ----- ----- ----- ----- ----- ----- ----- ----- -----
log file sync 82K .3 1.0 4.0 10.2 22.4 35.6 25.2 1.3
This database has two underscore parameters set to non-default values:
_use_adaptive_log_file_sync = FALSE
-use_single_log_writer = TRUE
I had a look in ASH trying to determine the event LGWR was waiting for: it didn't get populated for the interval
when LFS waits were slow.
Looking at v$session_event for LGWR session: seems healthy
EVENT TOTAL_WAITS TIME_WAITED_MICRO
---------------------------------------- ----------- -----------------
direct path write 3 12969
control file single write 2332 11430662
Disk file operations I/O 3337 20695575
latch: redo writing 28868 27397622
log file sequential read 55915 198817936
log file single write 55915 199806989
control file parallel write 4083621 1.7887E+10
events in waitclass Other 49319169 1.8559E+10
control file sequential read 13262044 3.1336E+10
rdbms ipc message 773063332 3.5595E+12
log file parallel write 1002356527 4.2436E+12
What could have possible caused it? what should we do if it happens again?