A troublesome RAC node eviction problem
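lhl1212, Nov 1 2012 (edited Nov 6 2012)
We have a two-node 10.2.0.4 RAC. Intermittently, one node (node 1) fails to be evicted and does not reboot automatically, which leaves the whole RAC hung. While the failure is in progress, it is not possible to log in to the console of the node being evicted to check system resource usage, and there are no database logs from that node, only from the other one. Do I need to bring a little lion back from Mount Wutai to ward this off as well? Failing that, could Liu please help analyze the cause? The relevant logs from node 2 have been sent to your mailbox.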
OS: REDHAT 4
Relevant logs:
System log on node 1:
Oct 31 15:30:28 bf01 sshd(pam_unix)[3356]: session closed for user mocha
Oct 31 15:30:45 bf01 kernel: Losing some ticks... checking if CPU frequency changed.
Oct 31 15:31:08 bf01 kernel: warning: many lost ticks.
Oct 31 15:31:08 bf01 kernel: Your time source seems to be instable or some driver is hogging interupts
Oct 31 15:31:08 bf01 kernel: rip __do_softirq+0x4d/0xd0 <---------- no response after this point; the node was cold-booted
Oct 31 15:58:56 bf01 syslogd 1.4.1: restart.
Oct 31 15:58:56 bf01 syslog: syslogd startup succeeded
Oct 31 15:58:56 bf01 kernel: klogd 1.4.1, log source = /proc/kmsg started.
......
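To put a number on how long node 1 was silent, the syslog timestamps can be scanned for the longest gap. The following is only a minimal sketch: the sample is a few lines copied from the excerpt above, the year 2012 is assumed because syslog lines do not carry one, and the helper names (`parse_ts`, `largest_gap`, `tick_warnings`) are made up for illustration.

```python
from datetime import datetime

# A few lines copied from the node 1 syslog excerpt (annotation removed).
SYSLOG = """\
Oct 31 15:30:28 bf01 sshd(pam_unix)[3356]: session closed for user mocha
Oct 31 15:30:45 bf01 kernel: Losing some ticks... checking if CPU frequency changed.
Oct 31 15:31:08 bf01 kernel: warning: many lost ticks.
Oct 31 15:31:08 bf01 kernel: rip __do_softirq+0x4d/0xd0
Oct 31 15:58:56 bf01 syslogd 1.4.1: restart.
"""

def parse_ts(line, year=2012):
    # The first 15 characters of a classic syslog line are "Mon DD HH:MM:SS".
    # The year is an assumption; syslog does not record it.
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")

def largest_gap(text):
    """Return (gap_seconds, last_line_before_gap) for the longest silence."""
    lines = [l for l in text.splitlines() if l.strip()]
    stamps = [parse_ts(l) for l in lines]
    best = (0.0, None)
    # Pair each timestamp with its successor and remember the line before the gap.
    for prev, cur, line in zip(stamps, stamps[1:], lines):
        gap = (cur - prev).total_seconds()
        if gap > best[0]:
            best = (gap, line)
    return best

def tick_warnings(text):
    """Return the kernel lines that mention lost timer ticks."""
    keys = ("lost ticks", "losing some ticks")
    return [l for l in text.splitlines() if any(k in l.lower() for k in keys)]

gap, last_line = largest_gap(SYSLOG)
print(f"longest silence: {gap:.0f}s, after: {last_line}")
print(f"lost-tick warnings: {len(tick_warnings(SYSLOG))}")
```

Run against the full /var/log/messages rather than this sample, the same scan would show whether earlier hangs were also preceded by lost-tick warnings, which is a guess worth checking rather than a conclusion.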
Database log on BF02:
Wed Oct 31 15:17:15 2012
Thread 2 advanced to log sequence 104758 (LGWR switch)
Current log# 6 seq# 104758 mem# 0: /dev/raw/raw23
Wed Oct 31 15:36:30 2012
IPC Send timeout detected.Sender: ospid 20539
Receiver: inst 1 binc 429417533 ospid 20263
Wed Oct 31 15:36:31 2012
IPC Send timeout detected.Sender: ospid 20982
Receiver: inst 1 binc 429417533 ospid 20263
Wed Oct 31 15:36:50 2012
IPC Send timeout detected.Sender: ospid 10334
Receiver: inst 1 binc 429417533 ospid 20263
Wed Oct 31 15:36:58 2012
IPC Send timeout to 0.0 inc 4 for msg type 12 from opid 60
Wed Oct 31 15:36:58 2012
Communications reconfiguration: instance_number 1
Wed Oct 31 15:37:02 2012
IPC Send timeout to 0.0 inc 4 for msg type 8 from opid 14
Wed Oct 31 15:37:22 2012
IPC Send timeout to 0.0 inc 4 for msg type 16 from opid 160
Wed Oct 31 15:37:26 2012
IPC Send timeout detected.Sender: ospid 31899
Receiver: inst 1 binc 429417533 ospid 20263
Wed Oct 31 15:37:32 2012
Trace dumping is performing id=[cdmp_20121031153701]
Wed Oct 31 15:37:35 2012
IPC Send timeout detected.Sender: ospid 20549
Receiver: inst 1 binc 429417533 ospid 20263
Wed Oct 31 15:37:58 2012
IPC Send timeout to 0.0 inc 4 for msg type 16 from opid 507
Wed Oct 31 15:38:00 2012
IPC Send timeout to 0.0 inc 4 for msg type 8 from opid 19
Wed Oct 31 15:38:11 2012
IPC Send timeout detected.Sender: ospid 20543
Receiver: inst 1 binc 429417533 ospid 20263
Wed Oct 31 15:38:18 2012
Trace dumping is performing id=[cdmp_20121031153701]
Wed Oct 31 15:38:41 2012
IPC Send timeout detected.Sender: ospid 12157
Receiver: inst 1 binc 429417533 ospid 20263
Wed Oct 31 15:38:43 2012
IPC Send timeout to 0.0 inc 4 for msg type 16 from opid 16
Wed Oct 31 15:38:48 2012
Evicting instance 1 from cluster
Wed Oct 31 15:39:12 2012
IPC Send timeout to 0.0 inc 4 for msg type 16 from opid 145
Wed Oct 31 15:39:29 2012
Waiting for instances to leave:
1
Wed Oct 31 15:39:38 2012
IPC Send timeout detected.Sender: ospid 15417
Receiver: inst 1 binc 429417533 ospid 20263
Wed Oct 31 15:39:49 2012
Waiting for instances to leave:
1
......
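After that, the instance just kept waiting for node 1 to be evicted, for ten minutes or more. crs_stat -t did not respond either, and the RAC appeared to be hung, so I manually restarted node 1, after which everything returned to normal.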
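For anyone who wants to line up the surviving node's timeline, here is a rough sketch that pulls the key moments out of the BF02 alert log: first IPC send timeout, the eviction decision, and the last "Waiting for instances to leave" entry. It assumes the 10g alert log layout shown above, where a timestamp line precedes the messages it belongs to. The sample is a shortened copy of the excerpt, and the function names are hypothetical.

```python
from datetime import datetime

# Shortened copy of the BF02 alert log excerpt.
ALERT = """\
Wed Oct 31 15:36:30 2012
IPC Send timeout detected.Sender: ospid 20539
Receiver: inst 1 binc 429417533 ospid 20263
Wed Oct 31 15:36:58 2012
Communications reconfiguration: instance_number 1
Wed Oct 31 15:38:48 2012
Evicting instance 1 from cluster
Wed Oct 31 15:39:29 2012
Waiting for instances to leave:
1
Wed Oct 31 15:39:49 2012
Waiting for instances to leave:
1
"""

def parse_events(text):
    """Yield (timestamp, message), attaching each message to the latest timestamp line."""
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        try:
            # Timestamp lines look like "Wed Oct 31 15:36:30 2012".
            current = datetime.strptime(stripped, "%a %b %d %H:%M:%S %Y")
        except ValueError:
            # Anything else is a message belonging to the current timestamp.
            if current is not None and stripped:
                yield current, stripped

def first_time(events, needle):
    """Return the timestamp of the first message containing needle, or None."""
    return next((ts for ts, msg in events if needle in msg), None)

events = list(parse_events(ALERT))
t_timeout = first_time(events, "IPC Send timeout")
t_evict = first_time(events, "Evicting instance")
t_last_wait = max(ts for ts, msg in events if msg.startswith("Waiting for instances"))

print("first IPC timeout -> eviction decision:", t_evict - t_timeout)
print("eviction decision -> last 'Waiting' entry:", t_last_wait - t_evict)
```

On the excerpt, the eviction decision comes 2 minutes 18 seconds after the first IPC send timeout. Pointed at the complete alert log, the second figure would show how long instance 2 actually sat waiting for instance 1 to leave.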