
Failover didn't fail over

807567 · Apr 14 2009 (edited Apr 20 2009)
We have a two-node cluster on Solaris 10u6 with Sun Cluster 3.2u2. It has two resource groups, each containing HAStoragePlus (ZFS), HA-SMB, and HA-NFS resources. Last night there was some sort of SCSI timeout, which led to the failure of the resource groups. Instead of failing over to node2, however, they stayed in an error-stopped state on node1. After I was unable to fix anything manually this morning, I rebooted node1, and failover finally initiated. Everything now seems to be working fine on the secondary node, with no intervention on my part other than shutting down node1. Why would the cluster put the RGs into an error-stopped state instead of failing them over to the second node?
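For reference, the usual manual recovery for a group stuck in ERROR_STOP_FAILED looks roughly like the sketch below. This is a hedged outline using the Sun Cluster 3.2 command set; the resource and group names (spuckler-rg, spuckler-winbind-rs, spuckler-nfs-rs, brandine, cletus) are taken from the logs further down, and the exact set of resources to clear depends on what `clresource status` reports.

```shell
# Hedged sketch: clearing a resource group out of ERROR_STOP_FAILED
# (Sun Cluster 3.2 CLI; names assumed from the logs in this post).

# 1. See which resources are stuck with a STOP_FAILED error flag.
clresourcegroup status spuckler-rg
clresource status -g spuckler-rg

# 2. Clear the STOP_FAILED flag on each affected resource. The RGM
#    deliberately refuses to fail the group over while this flag is set,
#    to avoid running the service on two nodes at once.
clresource clear -f STOP_FAILED -n brandine spuckler-winbind-rs
clresource clear -f STOP_FAILED -n brandine spuckler-nfs-rs

# 3. Take the group offline cleanly, then bring it up on the other node.
clresourcegroup offline spuckler-rg
clresourcegroup switch -n cletus spuckler-rg
```

If the underlying storage is still wedged (for example, a ZFS pool hung on a failed SCSI path), the stop methods may keep timing out and the flag will reappear, which is consistent with a reboot being the only thing that finally freed things up here.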



This is the log from node2 for the time of the RG changes:

cletus=node2
brandine=node1
Apr 13 22:15:57 cletus Cluster.RGM.global.rgmd: [ID 784560 daemon.notice] resource spuckler-smbd-rs status on node brandine change to R_FM_DEGRADED
Apr 13 22:15:57 cletus Cluster.RGM.global.rgmd: [ID 922363 daemon.notice] resource spuckler-smbd-rs status msg on node brandine change to <Service is degraded.>
Apr 13 22:17:27 cletus Cluster.RGM.global.rgmd: [ID 529407 daemon.notice] resource group spuckler-rg state on node brandine change to RG_ON_PENDING_R_RESTART
Apr 13 22:17:28 cletus Cluster.RGM.global.rgmd: [ID 443746 daemon.notice] resource spuckler-smbd-rs state on node brandine change to R_ONLINE_UNMON
Apr 13 22:17:28 cletus Cluster.RGM.global.rgmd: [ID 784560 daemon.notice] resource spuckler-smbd-rs status on node brandine change to R_FM_UNKNOWN
Apr 13 22:17:28 cletus Cluster.RGM.global.rgmd: [ID 922363 daemon.notice] resource spuckler-smbd-rs status msg on node brandine change to <Stopping>
Apr 13 22:17:28 cletus Cluster.RGM.global.rgmd: [ID 443746 daemon.notice] resource spuckler-smbd-rs state on node brandine change to R_STOPPING
Apr 13 22:17:58 cletus Cluster.RGM.global.rgmd: [ID 784560 daemon.notice] resource delroy-smbd-rs status on node brandine change to R_FM_DEGRADED
Apr 13 22:17:58 cletus Cluster.RGM.global.rgmd: [ID 922363 daemon.notice] resource delroy-smbd-rs status msg on node brandine change to <Service is degraded.>
Here's the log from when I powered off node1 (brandine):
Apr 14 09:06:13 cletus Cluster.RGM.global.rgmd: [ID 529407 daemon.notice] resource group spuckler-rg state on node brandine change to RG_PENDING_OFFLINE
Apr 14 09:06:13 cletus Cluster.RGM.global.rgmd: [ID 529407 daemon.notice] resource group spuckler-rg state on node brandine change to RG_PENDING_OFF_STOP_FAILED
Apr 14 09:06:13 cletus Cluster.RGM.global.rgmd: [ID 424774 daemon.error] Resource group <spuckler-rg> requires operator attention due to STOP failure
Apr 14 09:06:13 cletus Cluster.RGM.global.rgmd: [ID 443746 daemon.notice] resource spuckler-winbind-rs state on node brandine change to R_ONLINE_UNMON
Apr 14 09:06:13 cletus Cluster.RGM.global.rgmd: [ID 784560 daemon.notice] resource spuckler-winbind-rs status on node brandine change to R_FM_UNKNOWN
Apr 14 09:06:13 cletus Cluster.RGM.global.rgmd: [ID 922363 daemon.notice] resource spuckler-winbind-rs status msg on node brandine change to <Stopping>
Apr 14 09:06:13 cletus Cluster.RGM.global.rgmd: [ID 443746 daemon.notice] resource spuckler-winbind-rs state on node brandine change to R_STOPPING
Apr 14 09:06:13 cletus Cluster.RGM.global.rgmd: [ID 443746 daemon.notice] resource spuckler_pool-hastorageplus-rs state on node brandine change to R_ONLINE_UNMON
Apr 14 09:06:13 cletus Cluster.RGM.global.rgmd: [ID 443746 daemon.notice] resource spuckler-lh-rs state on node brandine change to R_ONLINE_UNMON
Apr 14 09:06:13 cletus Cluster.RGM.global.rgmd: [ID 443746 daemon.notice] resource san-spuckler-lh-rs state on node brandine change to R_ONLINE_UNMON
Apr 14 09:09:09 cletus savd: [ID 624942 daemon.crit] update.check: Successfully updated Sophos Anti-Virus from \\Skynet\InterChk\EESAVUNIX
Apr 14 09:10:55 cletus Cluster.RGM.global.rgmd: [ID 443746 daemon.notice] resource spuckler-winbind-rs state on node brandine change to R_STOP_FAILED
Apr 14 09:10:55 cletus Cluster.RGM.global.rgmd: [ID 784560 daemon.notice] resource spuckler-winbind-rs status on node brandine change to R_FM_FAULTED
Apr 14 09:10:55 cletus Cluster.RGM.global.rgmd: [ID 922363 daemon.notice] resource spuckler-winbind-rs status msg on node brandine change to <>
Apr 14 09:11:16 cletus Cluster.RGM.global.rgmd: [ID 443746 daemon.notice] resource spuckler-nfs-rs state on node brandine change to R_STOP_FAILED
Apr 14 09:11:16 cletus Cluster.RGM.global.rgmd: [ID 784560 daemon.notice] resource spuckler-nfs-rs status on node brandine change to R_FM_FAULTED
After that, resources all went offline on brandine, and came online on cletus.
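After a failover like this, a quick sanity pass on the surviving node is worth doing. A hedged sketch; the pool name spuckler_pool is an assumption inferred from the resource name spuckler_pool-hastorageplus-rs in the logs above:

```shell
# Hedged post-failover checks on the surviving node (cletus).
clresourcegroup status        # groups should now be Online on cletus
clresource status             # no resource should remain Faulted/Stop failed
zpool status spuckler_pool    # pool name assumed from the HAStoragePlus
                              # resource name; verify it imported cleanly
                              # after the SCSI timeouts on brandine
```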
Locked on May 18 2009 · Added on Apr 14 2009 · 3 comments · 661 views