Archives Gone Wild! Or: when is an archive destination mandatory?
raindog · Oct 19 2010 (edited Oct 19 2010)

We had a situation where our DBs were configured with log_archive_dest_1 set to a local filesystem destination, and log_archive_dest_2 pointed at a Data Guard standby. dest_2 was deferred and the broker was turned off because of some DR-site work in progress.
Everything was humming along until, due to a gross miscommunication/misunderstanding, a DBA added log_archive_dest_3, specified like this:
ALTER SYSTEM SET log_archive_dest_3='LOCATION=/nas/DBNAME' SCOPE=BOTH;
log_archive_dest_state_3 was already set to ENABLE, so the DB immediately started trying to archive to /nas/DBNAME, which was on network-attached storage. We never architected for that, and worse, the same command was issued on other, very busy databases on the same server. We ended up with four busy production servers all trying to archive to the NAS. Very quickly (30-45 minutes), archiving couldn't keep up, redo filled up, and the databases stopped accepting transactions. I don't have solid evidence (yet) that NAS performance was the cause, but the chain of events is clear: each DB seized up 30-45 minutes after the dest_3 change was made, once it had exhausted its online redo.
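For anyone who hits something similar: deferring the destination takes effect immediately and stops the ARCn processes from retrying it. This is standard syntax from the Oracle docs (adjust the destination number to match your case):

ALTER SYSTEM SET log_archive_dest_state_3='DEFER' SCOPE=BOTH;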
The DBs involved were three at 10.1.0.4 and one at 10.2.0.4, plus a 9.2.0.8 database on a different server that hit the same problem.
In the post-mortem analysis, a question came up: given the configuration above, why wasn't dest_3 simply treated as OPTIONAL and skipped when it got slow? What I saw in the logs was a lot of inter-archiver contention, with one ARCn process waiting on another. Each DB has 10 archivers. log_archive_min_succeed_dest is set to 1 on all databases. I don't think even dest_1 is MANDATORY, though I assume min_succeed would effectively make it so.
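You can confirm how each destination is actually bound (OPTIONAL vs MANDATORY) and what state it was in from v$archive_dest. Something like the following (column names are from the 10g reference, so double-check them on 9i):

SELECT dest_id, status, binding, target, destination FROM v$archive_dest WHERE status <> 'INACTIVE';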
So here are my questions:
(1) If you set log_archive_min_succeed_dest = 1 and have two local destinations (not Data Guard), does Oracle pick dest_1 as the one that must succeed? Or is it "as long as either of the two succeeds"?
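Related to this: if you want a specific destination to be the one that must succeed, you can make that explicit with the MANDATORY attribute rather than relying on min_succeed_dest. Standard LOG_ARCHIVE_DEST_n attribute syntax (the path here is just an example):

ALTER SYSTEM SET log_archive_dest_1='LOCATION=/u01/arch MANDATORY' SCOPE=BOTH;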
(2) My theory is that each ARCn process started writing to the NAS and got bottlenecked there. In the logs I see them start "Creating..." but never any "Completed" messages. Hence the DB ran out of free ARCs. Does that sound plausible? I know I'd need network or NAS data to really make a solid case.
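One way to have seen whether all the ARCn slots were tied up at the time would be v$archive_processes (available in 9.2 and 10g; the exact STATE/STATUS values may vary slightly by release):

SELECT process, status, log_sequence, state FROM v$archive_processes WHERE status = 'ACTIVE';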
(3) The default binding for destinations is OPTIONAL, right? So why would the DB wait on the NAS? Or is it the case that it simply ran out of ARCs?