Hi,
We are running OVS versions 3.1.1 and 3.1.2 on Dell servers, with EMC CLARiiON and HP 3PAR arrays providing the shared storage. We frequently see OVS servers rebooting with messages like the one below. When we look at the logs on the SAN side or on the storage arrays, we see nothing unusual that would explain the heartbeat disk being unavailable for 60 seconds or more. Interestingly, not all OVS servers in a cluster reboot; some do and some do not. This happens randomly, regardless of whether the storage is 3PAR or CLARiiON. When it happens, the OVM server log always reports that all paths have failed for every shared-storage LUN, including the heartbeat disks.
Jan 3 00:03:26 ovslx505 o2hbmonitor: Last ping 46838 msecs ago on /dev/dm-0, 0004FB0000050000E018378E1B9D075D
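To see how often a node drifts toward the dead threshold before it actually fences, we have been pulling the "Last ping" gaps out of syslog. A small sketch (the helper name and the file-argument convention are mine; the field position assumes the log format shown above):

```shell
#!/bin/sh
# Sketch: extract o2hbmonitor "Last ping" gaps (in msecs) from a syslog
# file given as $1 (falls back to /var/log/messages). In the sample line
# above, the gap is the 8th whitespace-separated field.
ping_gaps() {
    awk '/o2hbmonitor: Last ping/ { print $8 }' "${1:-/var/log/messages}"
}
```

Sorting that output numerically shows whether the node stalls near the threshold regularly or only on the nights it reboots.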
Currently, the heartbeat dead threshold is set to 60 seconds (a threshold value of 31). We have opened many cases with Oracle, but they go nowhere. The most recent recommendation was to raise the heartbeat dead threshold to 120 seconds (61).
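For anyone wondering about the numbers in parentheses: as I understand it, o2cb derives the dead time from O2CB_HEARTBEAT_THRESHOLD in /etc/sysconfig/o2cb as dead_time = (threshold - 1) * 2 seconds, which is why 31 corresponds to 60 seconds and 61 to 120. A quick sketch of the conversion (the helper name is mine):

```shell
#!/bin/sh
# Convert a desired heartbeat dead time (seconds) into the corresponding
# O2CB_HEARTBEAT_THRESHOLD value, using dead_time = (threshold - 1) * 2.
threshold_for() {
    echo $(( $1 / 2 + 1 ))
}

threshold_for 60    # 31, our current setting
threshold_for 120   # 61, Oracle's recommended setting
```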
But it is strange that all paths would go down at once through both HBAs. It makes no sense: the HBAs connect via separate fabrics, and there is nothing on the Brocade SAN switches that points to a problem. For some reason, certain OVS servers decide the heartbeat disk is unavailable over all four paths and fence themselves, resulting in the reboots. Again, not all OVS servers in a cluster reboot, just a few, and it happens randomly.
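One server-side setting we have not fully ruled out is multipath queueing: if no_path_retry is left queueing indefinitely, heartbeat writes can hang silently while paths flap, which would look like "Last ping N msecs ago" with nothing logged on the array side. Our CLARiiON stanza in multipath.conf looks something like the following (values here are illustrative, not our exact config):

```
devices {
    device {
        vendor          "DGC"            # EMC CLARiiON vendor string
        product         ".*"
        path_checker    emc_clariion
        no_path_retry   12               # fail I/O after 12 retries
                                         # rather than queueing forever
    }
}
```

With the default 5-second polling interval, 12 retries works out to roughly 60 seconds before I/O fails, which interacts directly with the heartbeat dead threshold above.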
Has anyone seen this kind of issue? We are trying to determine what could be causing these reboots; on the storage side, everything looks OK.
Thanks,
Sameer