The scenario I'm curious about: if you have a file system set up with one server as the primary metadata server, and that server has a problem, how do we fail over to the secondary metadata server with no loss of data to the end users? We run SAMFSDUMPs nightly and move them to another server so we have extra copies. I know you can run SAMFSRESTORE, point it at the dump file, and it will basically add the pointers to the data on the new metadata server in a different file system. My biggest question is: can you have an active/passive connection between two metadata servers, or do you just take the dump and restore it to another file system?
From Tim Whalen
In order to have any automated fail-over, Sun Cluster 3.2 must be installed with SAM-QFS 4.6 (current patch).
A manual fail-over process can be used, but the guidelines are very strict: the primary system needs to come down first, and then the secondary (passive) system is brought up. Other environment-specific issues also need to be taken into account for this manual fail-over process.
From Mike Selway
Reflecting on some of your points/concerns: when you set up a Shared SAM-QFS system (as implied by your referring to one server as the MDS), active/passive is a safe way to ensure a fail-over strategy. In this architecture there are fewer things to worry about during the fail-over, and the likelihood of "split brain" can be managed procedurally more simply. As Tim says, Sun Cluster can be used to automate the fail-over from one MDS to the other designated MDS, but Sun Cluster brings with it its own overhead and management requirements/complexities.
If your overall Shared SAM-QFS environment is very straightforward, you can follow the manual on how the fail-over should occur, culminating in the SAMSHAREFS command being used to shift management from Server 1 to Server 2. The number and type of clients are not much of a concern, as they generally don't "know" who the MDS really is, just an address to somewhere; when it changes, the clients pick up on that and continue on. The largest concern in a fail-over is the termination of active leases. These are given out for read, write, and append actions to specific clients/apps doing something to a specific file. A controlled fail-over is ideal, as the leases will be allowed to complete/expire until none are left; this in a sense assures no data loss in the turnover. Catastrophic failure is a different problem, and as with any catastrophe it will normally be necessary to "see" what had or had not completed at the time of failure in order to understand what needs to be redone. Partially created files are usually the issue, although in-progress updates can be more of a problem. Going into all the what-ifs is more than we want in this mail, but there are seasoned customers out there who can offer insights.
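For reference, the manual fail-over described above typically comes down to a single SAMSHAREFS invocation on the server taking over. This is only a sketch: the file system name (sharefs1) and backup host (mds2) are placeholders I've invented for the example, and you should check the Shared QFS documentation for your release before relying on it.

```shell
# Declare mds2 the active metadata server for the shared file system
# "sharefs1". Run on the host that should take over (names illustrative).
samsharefs -s mds2 sharefs1

# If the old MDS is down and the file system cannot be mounted, the -R
# option applies the change to the raw devices instead of the mounted FS:
samsharefs -R -s mds2 sharefs1
```

In a controlled fail-over you would run the first form after leases have drained; the raw-device form is for the case where the failed MDS cannot release the file system cleanly.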
As for the SAMFSDUMPs, these are your snapshots of the state of the file system, and doing them nightly effectively gives you a daily restore point. If you find they don't take long to complete, you might look at increasing the frequency to hourly or so. SAMFSRESTORE is normally used to fully recover the file system. More surgical approaches can be taken if you do more surgical SAMFSDUMPs, but that too is another email. Generally speaking, though, in the event of a server fail-over you don't need to do a SAMFSRESTORE: you haven't lost the file system, just the server managing it.
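As an illustration of the nightly-dump routine mentioned in the question, here is one way it is commonly scripted. The mount point, dump directory, and backup host below are assumptions for the example, not details from this thread.

```shell
#!/bin/sh
# Nightly metadata snapshot for a SAM-QFS file system mounted at /sam1.
# All paths and host names here are illustrative.
FS_MOUNT=/sam1
DUMP_DIR=/var/samdumps
STAMP=`date +%Y%m%d`

# samfsdump records metadata relative to the current directory,
# so run it from the file system's mount point.
cd "$FS_MOUNT" || exit 1
samfsdump -f "$DUMP_DIR/sam1.$STAMP.dump"

# Keep an off-host copy so the dump survives loss of this server.
scp "$DUMP_DIR/sam1.$STAMP.dump" backuphost:/var/samdumps/

# Disaster recovery only (not needed for a plain MDS fail-over):
#   cd /sam1 && samfsrestore -f /var/samdumps/sam1.YYYYMMDD.dump
```

Note that the restore step at the bottom is commented out deliberately: as discussed above, a fail-over alone does not call for SAMFSRESTORE.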