VM Server rebooting for unknown reason
Hi folks,
I have a test system with two OVM servers (3.0.3) hosting four RACs, which means four RAC VMs run on each VM server. This configuration worked well for three months. Although nothing has changed, we are now facing the problem that the second VM server reboots. I checked the messages file and it seems there are some problems with the SAN storage (NetApp via FC). But only the second node is affected, and on the SAN side (storage and FC switches) there is nothing unusual. What I have found out so far: if I don't start any of the VMs, the server survives, but when I start just one VM, the server reboots after some time.
Here is the message log:
...
Jan 8 18:30:29 bdtzlp02 multipathd: 360a98000646e4f4e674a6b662f4d2f7a: sdev - directio checker reports path is down
Jan 8 18:30:29 bdtzlp02 multipathd: checker failed path 129:112 in map 360a98000646e4f4e674a6b662f4d2f7a
Jan 8 18:30:29 bdtzlp02 multipathd: 360a98000646e4f4e674a6b662f4d2f7a: remaining active paths: 5
Jan 8 18:30:29 bdtzlp02 kernel: [14753.710463] device-mapper: multipath: Failing path 129:112.
Jan 8 18:30:29 bdtzlp02 multipathd: 360a98000646e4f4e674a6b665a78314a: sdeo - directio checker reports path is down
Jan 8 18:30:29 bdtzlp02 multipathd: checker failed path 129:0 in map 360a98000646e4f4e674a6b665a78314a
Jan 8 18:30:29 bdtzlp02 multipathd: 360a98000646e4f4e674a6b665a78314a: remaining active paths: 5
Jan 8 18:30:29 bdtzlp02 kernel: [14753.714442] device-mapper: multipath: Failing path 129:0.
Jan 8 18:30:29 bdtzlp02 multipathd: 360a98000646e4f4e674a6b662f47346e: sdfy - directio checker reports path is down
Jan 8 18:30:29 bdtzlp02 multipathd: checker failed path 131:64 in map 360a98000646e4f4e674a6b662f47346e
Jan 8 18:30:29 bdtzlp02 kernel: [14753.722444] device-mapper: multipath: Failing path 131:64.
Jan 8 18:30:29 bdtzlp02 multipathd: 360a98000646e4f4e674a6b662f47346e: remaining active paths: 4
Jan 8 18:30:29 bdtzlp02 multipathd: dm-28: add map (uevent)
...
Jan 8 18:30:50 bdtzlp02 o2hbmonitor: Last ping 55812 msecs ago on /dev/dm-1, 0004FB0000050000A514E628F76D9293
Jan 8 18:30:52 bdtzlp02 o2hbmonitor: Last ping 57816 msecs ago on /dev/dm-1, 0004FB0000050000A514E628F76D9293
Jan 8 18:30:53 bdtzlp02 multipathd: 360a98000646e4f4e674a6b346578555a: sdec - directio checker reports path is down
Jan 8 18:30:53 bdtzlp02 multipathd: 360a98000646e4f4e674a6b34665a6d6d: sdef - directio checker reports path is down
Jan 8 18:30:53 bdtzlp02 multipathd: 360a98000646e4f4e674a6b346666484a: sdeg - directio checker reports path is down
Jan 8 18:30:53 bdtzlp02 multipathd: 360a98000646e4f4e674a6b662f417434: sder - directio checker reports path is down
Jan 8 18:30:53 bdtzlp02 multipathd: 360a98000646e4f4e674a6c7977423655: sdge - directio checker reports path is down
Jan 8 18:30:54 bdtzlp02 o2hbmonitor: Last ping 59820 msecs ago on /dev/dm-1, 0004FB0000050000A514E628F76D9293
Jan 8 18:30:54 bdtzlp02 multipathd: 360a98000646e4f4e674a6b665a685774: sdek - directio checker reports path is down
Jan 8 18:30:54 bdtzlp02 multipathd: 360a98000646e4f4e674a6b665a59506e: sdei - directio checker reports path is down
Jan 8 18:30:54 bdtzlp02 multipathd: 360a98000646e4f4e674a6b662f47346e: sdet - directio checker reports path is down
Jan 8 19:38:33 bdtzlp02 syslogd 1.4.1: restart.
Note: the system time differs from NTP time, hence the 'time jump' in the messages.
It seems to me that the server runs into problems when I start one VM and SAN access increases.
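To make the pattern in the log above easier to see, here is a minimal sketch that tallies "path is down" events per multipath map and the largest o2hbmonitor "Last ping" delay per device. The regular expressions are an assumption based only on the line format shown in this post, and the sample lines are copied from it; on the server the input would be the real messages file.

```python
import re
from collections import Counter

# Sample lines copied from the log in this post; on the server you would
# read the real messages file instead (its path is an assumption).
SAMPLE = """\
Jan 8 18:30:29 bdtzlp02 multipathd: 360a98000646e4f4e674a6b662f4d2f7a: sdev - directio checker reports path is down
Jan 8 18:30:29 bdtzlp02 multipathd: 360a98000646e4f4e674a6b665a78314a: sdeo - directio checker reports path is down
Jan 8 18:30:50 bdtzlp02 o2hbmonitor: Last ping 55812 msecs ago on /dev/dm-1, 0004FB0000050000A514E628F76D9293
Jan 8 18:30:54 bdtzlp02 o2hbmonitor: Last ping 59820 msecs ago on /dev/dm-1, 0004FB0000050000A514E628F76D9293
"""

# Patterns inferred from the lines above, not from any official format spec.
PATH_DOWN = re.compile(
    r"multipathd: (\w+): (sd\w+) - directio checker reports path is down"
)
LAST_PING = re.compile(r"o2hbmonitor: Last ping (\d+) msecs ago on (\S+),")


def summarize(lines):
    """Return (path-down count per map WWID, max heartbeat delay per device)."""
    down_per_map = Counter()
    max_ping_ms = {}
    for line in lines:
        match = PATH_DOWN.search(line)
        if match:
            down_per_map[match.group(1)] += 1
            continue
        match = LAST_PING.search(line)
        if match:
            device = match.group(2)
            delay = int(match.group(1))
            max_ping_ms[device] = max(max_ping_ms.get(device, 0), delay)
    return down_per_map, max_ping_ms


if __name__ == "__main__":
    down, ping = summarize(SAMPLE.splitlines())
    for wwid, count in down.items():
        print(f"{wwid}: {count} path(s) reported down")
    for device, delay in ping.items():
        print(f"{device}: last heartbeat ping up to {delay} ms ago")
```

Running this over the whole file for both nodes would show whether the path failures cluster on particular maps or devices, and how far the heartbeat delay climbs before the reboot.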
Which files should I check? Is there a command to check the status of the FC HBA? Any other ideas?
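For the HBA question, one possible starting point is the standard Linux fc_host sysfs class. The sketch below assumes the HBA driver on the OVM server exposes /sys/class/fc_host/hostN with the usual port_name, port_state and speed attributes, and that a Python 3 interpreter is available; neither has been verified on this system.

```python
import glob
import os


def read_attr(host_dir, name):
    """Read one sysfs attribute; return None if it is missing or unreadable."""
    try:
        with open(os.path.join(host_dir, name)) as handle:
            return handle.read().strip()
    except OSError:
        return None


def fc_host_status(base="/sys/class/fc_host"):
    """Collect port_name, port_state and speed for every FC host under base."""
    status = {}
    for host_dir in sorted(glob.glob(os.path.join(base, "host*"))):
        status[os.path.basename(host_dir)] = {
            attr: read_attr(host_dir, attr)
            for attr in ("port_name", "port_state", "speed")
        }
    return status


if __name__ == "__main__":
    for host, attrs in fc_host_status().items():
        print(host, attrs)
```

A port_state other than "Online", or a speed that differs between the two servers, would be worth comparing against the healthy first node.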
Thanks for your help & regards
Axel D.