media failure scenarios and ASM
Two years ago, we set up a 10gR1 cluster on Solaris 10, AMD64. We did so because at the time, 10gR2 was not yet available for our platform. In evaluating ASM for storage redundancy, we ran into some problems.
We had set up several ASM disk groups using normal (two-way) redundancy. Each disk group contained two disks, each in a separate failure group. The disks were connected to each of our four RAC nodes via a Brocade FC4 switch.
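Roughly, each disk group was created along these lines (just a sketch; the disk group name and device paths below are placeholders, not our actual ones):

    -- one FAILGROUP per physical array, so ASM mirrors extents across them
    CREATE DISKGROUP data NORMAL REDUNDANCY
      FAILGROUP fg1 DISK '/dev/rdsk/c2t1d0s6'
      FAILGROUP fg2 DISK '/dev/rdsk/c3t1d0s6';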
We found that with this configuration, pulling the FC connection to one of the disks in a disk group caused the OS to hang on the I/O. That, of course, is not so good for high availability.
Based on those results, we migrated to Solaris Volume Manager (SVM) to handle the redundancy between our disks, and just created 'external redundancy' disk groups in ASM. Now, when we unplug an FC cable, we get about a five-second pause in I/O while SVM figures things out, and the database keeps on truckin', which we like.
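So ASM now just sees one SVM metadevice per disk group, something like this (again a sketch; d100 stands in for whatever mirror metadevice you'd actually build):

    -- d100 is an SVM mirror metadevice built across the two arrays
    CREATE DISKGROUP data EXTERNAL REDUNDANCY
      DISK '/dev/md/rdsk/d100';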
Lately, though, we have been seeing LOTS of SVM issues. It is finicky, and the metadbs (state database replicas) are something of a single point of failure in themselves. We've never lost data, but several times we've had long stretches of downtime because this type of storage is difficult and complex to administer.
So... my question to the forum is: what sort of storage configuration or solution would let us pull the connection to half of a normal-redundancy disk group and keep running? Has anyone tried this using ASM's own redundancy mechanisms with better success than we had?
I'd love to get away from SVM, but not at the expense of having my availability tied to a single FC cable.
Thanks for any tips.