Aborted command 'check for resource' results in ASM failures
If anyone has any ideas, please let me know. I've been struggling with this issue for some time and haven't found anything that has helped. I'll admit upfront that my RAC and Clusterware experience is limited, so if I state anything incorrectly, please keep that in mind.
We are transitioning off a standalone RDBMS OpenVMS system to a Linux RAC/Grid Infrastructure configuration (a small, five-node cluster with two RDBMS servers, two job servers, and a grid control node). Since upgrading from Grid Infrastructure 11.2.0.1 to 11.2.0.2 we have been plagued by periods where the nodes encounter errors like these:
2011-04-08 00:50:44.600
[ora01/11.2.0.2/gridi/bin/orarootagent.bin(11426)]CRS-5818:Aborted command 'check for resource: ora.drivers.acfs 1 1' for resource 'ora.drivers.acfs'. Details at (:CRSAGF00113:) {0:0:2} in /ora01/11.2.0.2/gridi/log/gridc/agent/ohasd/orarootagent_root/orarootagent_root.log.
2011-04-08 00:50:44.702
[ora01/11.2.0.2/gridi/bin/orarootagent.bin(11426)]CRS-5014:Agent "/ora01/11.2.0.2/gridi/bin/orarootagent.bin" timed out starting process "/ora01/11.2.0.2/gridi/bin/acfsload" for action "check": details at "(:CLSN00009:)" in "/ora01/11.2.0.2/gridi/log/gridc/agent/ohasd/orarootagent_root/orarootagent_root.log"
2011-04-08 00:51:24.904
[ora01/11.2.0.2/gridi/bin/orarootagent.bin(11426)]CRS-5832:Agent '/ora01/11.2.0.2/gridi/bin/orarootagent_root' was unable to process commands. Details at (:CRSAGF00128:) {0:0:2} in /ora01/11.2.0.2/gridi/log/gridc/agent/ohasd/orarootagent_root/orarootagent_root.log.
2011-04-08 00:52:33.340
[ora01/11.2.0.2/gridi/bin/oraagent.bin(11381)]CRS-5818:Aborted command 'check for resource: ora.asm 1 1' for resource 'ora.asm'. Details at (:CRSAGF00113:) {0:0:2} in /ora01/11.2.0.2/gridi/log/gridc/agent/ohasd/oraagent_oracle/oraagent_oracle.log.
These continue for some time, until eventually we see the following (from /var/log/messages):
Apr 8 01:00:39 gridc kernel: [Oracle ADVM] The ASM instance terminated unexpectedly. All ADVM volumes will be taken offline. You must close all applications using these volumes and unmount the file systems. After restarting the instance, you may need to re-enable the volumes for use.
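For what it's worth, this is roughly what we end up doing to get things back after one of these events; the diskgroup, volume, and device names below are placeholders, not our actual ones.

    # check the ADVM volume state once ASM is back up (names are placeholders)
    asmcmd volinfo -G DATA acfsvol01
    # re-enable the volume if it came back disabled
    asmcmd volenable -G DATA acfsvol01
    # remount the CRS-managed ACFS file system through Clusterware
    srvctl start filesystem -d /dev/asm/acfsvol01-123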
Since it's ASM that fails, the impact goes beyond the one node: all database services are affected. That seems contrary to how a properly configured high-availability system should behave.
We originally thought the problem was that we didn't understand the difference between a CRS-managed ACFS file system and a general-purpose one, and that we were mounting the CRS-managed resource incorrectly. But the problem has persisted even since we switched to mounting it with srvctl.
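In case the details matter, this is essentially how we register and mount the file system now; the diskgroup, volume, device, and mount point names are placeholders for our real ones.

    # register the ACFS file system as a CRS-managed resource (done once)
    srvctl add filesystem -d /dev/asm/acfsvol01-123 -v ACFSVOL01 -g DATA -m /u02/app/shared -u oracle
    # mount it through Clusterware rather than with a manual 'mount'
    srvctl start filesystem -d /dev/asm/acfsvol01-123
    # confirm the resource is online
    srvctl status filesystem -d /dev/asm/acfsvol01-123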
Since the logs indicated that a device timeout was being exceeded, and the value in question was the one used for disk access, we next looked at our RMAN backups to disk. We thought that not specifying a RATE parameter on the channels might be the problem, but the errors persist even after throttling RMAN back. We also made sure no disk-to-tape backups were running at the times the errors occur.
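For reference, the throttling we tried looks roughly like this; the 50M figure is just an example value, not a recommendation.

    rman target / <<EOF
    run {
      # cap channel throughput so the backup cannot saturate the disks
      allocate channel d1 device type disk rate 50M;
      backup database plus archivelog;
      release channel d1;
    }
    EOF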
The errors occurred again this weekend, during the day. There was no scheduled activity underway and no users on the system, so I'm not sure why we are getting these 'check for resource' errors. We've applied all the 11.2.0.2 patches we can, but the problem persists.
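In case it helps someone spot something, this is how I've been looking at the check settings and the agent activity for the resources that time out; the grep pattern is just what I happened to use.

    # dump the attributes of the OHASD-managed resources that keep timing out
    crsctl stat res ora.drivers.acfs -init -p | grep -iE 'interval|timeout'
    crsctl stat res ora.asm -init -p | grep -iE 'interval|timeout'
    # watch the orarootagent log for the CRS-5818 / CLSN00009 messages as they happen
    tail -f /ora01/11.2.0.2/gridi/log/gridc/agent/ohasd/orarootagent_root/orarootagent_root.log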
I searched the forums and found one post similar to this, but with no follow-ups. I haven't found much help through Google either, and Oracle Support hasn't been helpful in the two months I've been trying to puzzle this out. So if anyone has any insight or help they can share, I'd be ecstatically appreciative.