Hi folks,
I spent the last few days troubleshooting an issue in a production environment where Oracle Kernel was taking all the CPU available causing some connections to fail (ORA - 03113).
Querying v$session and v$sesstat I traced the issue down to a few (15) remote services polling an events table every 5-10 seconds.
Each service uses a simple select statement, something like:
select next_event from events where event_type = ‘something’ and status = 0 order by event_id desc;
Only one index on event_id, event_type has 15 different values evenly distributed, status has 3.
Table has under 2M rows and explain plan shows a full table scan.
Running the same query on sqlplus it is blazing fast and that seems to be the case with the services as well, it is just that Oracle Kernel turns into a hog to serve it.
After adding a composite index with event_type and status Oracle.exe chilled out, CPU now at <3%.
Customer happy, issue closed.
But, I still don't know why that was necessary.
Another thing is we have a staging system which is identical to production in terms of database server version, schema, size and services connecting to it (same codebase, Devart version, etc) and we don't see Oracle Kernel going wild like in production. In fact the staging system runs on a laptop while production runs on a $100K hyper-v cluster with (allegedly) plenty of resources share with some other VMs.
I checked table fragmentation on production and looks ok to me, same as in the staging setup.
I also checked redo files switching finding nothing concerning (mostly 1-3 switches per hour).
There are no backup tools or other software running in production, besides antivirus.
show cpu_count displays the correct number of cores assigned to the VM (8).
I am afraid that adding that index and calling it a day may be hiding a more serious problem, so I am trying to understand what's going on and before I reach out to the IT team barking about their virtualization cluster I would like know if anyone here has any advise on what else I could check on the Oracle side.
And I know it could be a million things and the information I am providing is not enough for a good assessment, but perhaps there are some other basic good metrics out there I could check.
Our setup:
Oracle SDE2 19c on Windows 2019 (virtualized), microservices accessing the database run on the same cluster, different VM, codebase .NET 4.7 with Devart dotConnect for Oracle 9.15.
Pretty basic stuff used mostly for reporting, Oracle.exe gets hot (100% CPU) only with the one-liner query shown above on the events table, with nothing else hitting the database.
Again staging version of the system with all services running and other applications hitting the database, on a laptop, barely moves Oracle.exe.
Thanks,