System crash on a two-node 11g R2 RAC
858557, Apr 29 2011 (edited May 2 2011)
Hi,
I have a puzzling problem: the RAC system crashes every 5 or 6 days.
There is no useful information on rac1, but ocssd.log has some errors:
2011-08-26 16:15:15.592: [ CSSD][1137830208]clssgmExecuteClientRequest: Node name request from client ((nil))
2011-08-26 16:15:15.594: [ CSSD][1137830208]clssgmExecuteClientRequest: Node name request from client ((nil))
2011-08-26 16:15:15.595: [ CSSD][1137830208]clssgmExecuteClientRequest: NODELIST request from client ((nil))
2011-08-26 16:15:15.595: [ CSSD][1137830208]clssgmNodeList: proc(0x2aaab4203350), client((nil)) with option 4
2011-08-26 16:15:15.595: [ SKGFD][1232238912]BigInit
2011-08-26 16:15:15.596: [ SKGFD][1211259200]BigInit
2011-08-26 16:15:15.596: [ SKGFD][1179789632]BigInit
2011-08-26 16:15:15.596: [ SKGFD][1232238912]kgfkrq (0x2aaab016af10) of status 0 dump:
2011-08-26 16:15:15.596: [ SKGFD][1211259200]kgfkrq (0x2aaaac7a3210) of status 0 dump:
2011-08-26 16:15:15.596: [ SKGFD][1179789632]kgfkrq (0x2aaab41c4a00) of status 0 dump:
2011-08-26 16:15:15.596: [ SKGFD][1232238912]0x2aaab4010410 524304
2011-08-26 16:15:15.596: [ SKGFD][1211259200]0x2aaaac3647a0 524304
2011-08-26 16:15:15.596: [ SKGFD][1179789632]0x2aaab00482b0 524304
2011-08-26 16:15:15.596: [ SKGFD][1232238912]0x2aaaab407400 512
2011-08-26 16:15:15.596: [ SKGFD][1211259200]0x2aaaab489400 512
2011-08-26 16:15:15.596: [ SKGFD][1179789632]0x2aaaab50b400 512
2011-08-26 16:15:15.596: [ SKGFD][1211259200] BigInit
2011-08-26 16:15:15.596: [ SKGFD][1232238912] BigInit
2011-08-26 16:15:15.596: [ SKGFD][1211259200] kgfkrq (0x2aaaac458de0) of status 1 dump:
2011-08-26 16:15:15.596: [ SKGFD][1179789632] BigInit
2011-08-26 16:15:15.596: [ SKGFD][1232238912] kgfkrq (0x2aaab0149810) of status 1 dump:
2011-08-26 16:15:15.596: [ SKGFD][1211259200] 0x2aaaac3647a0 524304
2011-08-26 16:15:15.596: [ SKGFD][1179789632] kgfkrq (0x2aaab42ad700) of status 1 dump:
2011-08-26 16:15:15.596: [ SKGFD][1232238912] 0x2aaab4010410 524304
2011-08-26 16:15:15.596: [ SKGFD][1211259200] 0x2aaaab489400 256
2011-08-26 16:15:15.596: [ SKGFD][1179789632] 0x2aaab00482b0 524304
2011-08-26 16:15:15.596: [ SKGFD][1211259200] BigInit
2011-08-26 16:15:15.596: [ SKGFD][1232238912] 0x2aaaab407400 256
2011-08-26 16:15:15.596: [ SKGFD][1211259200] kgfkrq (0x2aaaac458f48) of status 1 dump:
2011-08-26 16:15:15.596: [ SKGFD][1179789632] 0x2aaaab50b400 256
2011-08-26 16:15:15.596: [ SKGFD][1232238912] BigInit
2011-08-26 16:15:15.596: [ SKGFD][1179789632] BigInit
2011-08-26 16:15:15.596: [ SKGFD][1232238912] kgfkrq (0x2aaab0149978) of status 1 dump:
2011-08-26 16:15:15.596: [ SKGFD][1211259200] 0x2aaaac3647a0 524560
2011-08-26 16:15:15.596: [ SKGFD][1179789632] kgfkrq (0x2aaab42ad868) of status 1 dump:
2011-08-26 16:15:15.596: [ SKGFD][1211259200] 0x2aaaab4a9400 256
2011-08-26 16:15:15.596: [ SKGFD][1232238912] 0x2aaab4010410 524560
2011-08-26 16:15:15.596: [ SKGFD][1179789632] 0x2aaab00482b0 524560
2011-08-26 16:15:15.596: [ SKGFD][1232238912] 0x2aaaab427400 256
2011-08-26 16:15:15.596: [ SKGFD][1179789632] 0x2aaaab52b400 256
....
2011-08-26 16:15:22.877: [ CSSD][1263708480]clssgmTagize: version(1), type(13), tagizer(0x494dfe)
2011-08-26 16:15:22.878: [ CSSD][1263708480]clssgmHandleDataInvalid: grock HB+ASM, member 2 node 2, birth 7
2011-08-26 16:15:23.612: [ CSSD][1284688192]clssnmSendingThread: sending status msg to all nodes
2011-08-26 16:15:23.612: [ CSSD][1284688192]clssnmSendingThread: sent 5 status msgs to all nodes
2011-08-26 16:15:24.967: [ CSSD][1263708480]clssgmTagize: version(1), type(13), tagizer(0x494dfe)
2011-08-26 16:15:24.967: [ CSSD][1263708480]clssgmHandleDataInvalid: grock HB+ASM, member 2 node 2, birth 7
2011-08-26 16:15:27.067: [ CSSD][1263708480]clssgmTagize: version(1), type(13), tagizer(0x494dfe)
2011-08-26 16:15:27.068: [ CSSD][1263708480]clssgmHandleDataInvalid: grock HB+ASM, member 2 node 2, birth 7
2011-08-26 16:15:28.044: [ CSSD][1263708480]clssgmTagize: version(1), type(3), tagizer(0x4929ba)
2011-08-26 16:15:28.044: [ CSSD][1263708480]clssgmHandleMasterMemberAdd: [s(2) d(1)]
2011-08-26 16:15:28.044: [ CSSD][1263708480]clssgmGrockOpTagProcess: clssgmCommonAddMember failed, member(-1/CLSN.FAN.racdb.FANPROC[3]) on node(2)
2011-08-26 16:15:28.044: [ CSSD][1263708480]clssgmGrockOpTagProcess: Operation(3) unsuccessful grock(CLSN.FAN.racdb.FANPROC[3])
2011-08-26 16:15:28.044: [ CSSD][1263708480]clssgmHandleMasterJoin: clssgmProcessJoinUpdate failed with status(-10)
2011-08-26 16:15:28.046: [ CSSD][1263708480]clssgmTagize: version(1), type(3), tagizer(0x4929ba)
2011-08-26 16:15:28.046: [ CSSD][1263708480]clssgmHandleMasterMemberAdd: [s(2) d(1)]
The alert log on rac2 shows that rac1 was removed from the cluster, but there is no eviction information:
2011-08-26 16:24:14.953
[cssd(8381)]CRS-1612:Network communication with node rac1 (1) missing for 50% of timeout interval. Removal of this node from cluster in 14.740 seconds
2011-08-26 16:24:22.969
[cssd(8381)]CRS-1611:Network communication with node rac1 (1) missing for 75% of timeout interval. Removal of this node from cluster in 6.720 seconds
2011-08-26 16:24:26.977
[cssd(8381)]CRS-1610:Network communication with node rac1 (1) missing for 90% of timeout interval. Removal of this node from cluster in 2.720 seconds
2011-08-26 16:24:29.695
[cssd(8381)]CRS-1632:Node rac1 is being removed from the cluster in cluster incarnation 206618768
2011-08-26 16:24:29.713
[cssd(8381)]CRS-1601:CSSD Reconfiguration complete. Active nodes are rac2 .
2011-08-26 16:24:29.739
[ctssd(8550)]CRS-2407:The new Cluster Time Synchronization Service reference node is host rac2.
2011-08-26 16:24:31.487
[crsd(8829)]CRS-5504:Node down event reported for node 'rac1'.
2011-08-26 16:24:36.671
[crsd(8829)]CRS-2773:Server 'rac1' has been removed from pool 'Generic'.
2011-08-26 16:24:36.672
[crsd(8829)]CRS-2773:Server 'rac1' has been removed from pool 'ora.racdb'.
2011-08-26 16:39:28.152
[ctssd(8550)]CRS-2406:The Cluster Time Synchronization Service timed out on host rac2. Details in /oracle/app/grid/product/11.2.0/log/rac2/ctssd/octssd.log.
2011-08-27 10:08:29.454
[ohasd(7688)]CRS-2112:The OLR service started on node rac2.
2011-08-27 10:08:29.762
[ohasd(7688)]CRS-8017:location: /etc/oracle/lastgasp has 26 reboot advisory log files, 0 were announced and 0 errors occurred
2011-08-27 10:08:34.860
[ohasd(7688)]CRS-2772:Server 'rac2' has been assigned to pool 'Free'.
2011-08-27 10:08:39.040
[cssd(8183)]CRS-1713:CSSD daemon is started in clustered mode
2011-08-27 10:08:58.811
[cssd(8183)]CRS-1707:Lease acquisition for node rac2 number 2 completed
What is the problem?
Environment:
two servers + SAN
OS: Red Hat Enterprise Linux 5.4
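In case it helps to line up the timing, here is a rough sketch of how the rac2 alert log entries above could be turned into a timeline. It assumes the layout shown in this post (a timestamp on one line, the `[daemon(pid)]CRS-nnnn:` message on the next). The function names `parse_alert_log` and `heartbeat_events` are my own, not part of any Oracle tool, and the sample text is copied from the excerpt above.

```python
import re
from datetime import datetime

# Layout assumed from the excerpt in this post: a timestamp line,
# then a "[daemon(pid)]CRS-nnnn:message" line.
TS_RE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}$")
MSG_RE = re.compile(r"^\[(\w+)\((\d+)\)\](CRS-\d+):(.*)$")

# Message codes that appear in the node-removal sequence quoted above.
HEARTBEAT_CODES = {"CRS-1612", "CRS-1611", "CRS-1610", "CRS-1632", "CRS-1601"}


def parse_alert_log(text):
    """Return a list of (timestamp, daemon, code, message) tuples."""
    events = []
    ts = None
    for raw in text.splitlines():
        line = raw.strip()
        if TS_RE.match(line):
            ts = datetime.strptime(line, "%Y-%m-%d %H:%M:%S.%f")
            continue
        m = MSG_RE.match(line)
        if m and ts is not None:
            events.append((ts, m.group(1), m.group(3), m.group(4).strip()))
            ts = None  # each timestamp belongs to one message
    return events


def heartbeat_events(events):
    """Keep only the missed-heartbeat / reconfiguration messages."""
    return [e for e in events if e[2] in HEARTBEAT_CODES]


SAMPLE = """\
2011-08-26 16:24:14.953
[cssd(8381)]CRS-1612:Network communication with node rac1 (1) missing for 50% of timeout interval. Removal of this node from cluster in 14.740 seconds
2011-08-26 16:24:22.969
[cssd(8381)]CRS-1611:Network communication with node rac1 (1) missing for 75% of timeout interval. Removal of this node from cluster in 6.720 seconds
2011-08-26 16:24:26.977
[cssd(8381)]CRS-1610:Network communication with node rac1 (1) missing for 90% of timeout interval. Removal of this node from cluster in 2.720 seconds
2011-08-26 16:24:29.695
[cssd(8381)]CRS-1632:Node rac1 is being removed from the cluster in cluster incarnation 206618768
2011-08-26 16:24:29.713
[cssd(8381)]CRS-1601:CSSD Reconfiguration complete. Active nodes are rac2 .
"""

if __name__ == "__main__":
    timeline = heartbeat_events(parse_alert_log(SAMPLE))
    first = timeline[0][0]
    for ts, daemon, code, msg in timeline:
        # Seconds elapsed since the first missed-heartbeat warning.
        offset = (ts - first).total_seconds()
        print(f"+{offset:7.3f}s {daemon} {code} {msg[:60]}")
```

Running the same functions over the full alert log and the ocssd.log timestamps on both nodes would at least show what happened on rac1 just before rac2 stopped receiving its network heartbeat.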
Post edited by 855554