Cluster Node Crashes
668481, Nov 3 2008 (edited Nov 7 2008)
I'm not sure this is the proper forum for this post; if it's not, please feel free to move it.
The situation I'm facing is this:
My company has clusters set up across North America running our software, which uses the Oracle database. At roughly 90% of sites everything functions exactly as it is supposed to. It is the other 10% of sites that I am here to ask about.
Our clusters are set up in a dual-server environment that basically acts as a single server. The application runs on one server and the database runs on the other, and in case of problems either can be failed over so that both sets of services run on a single server (basic, I realize). At certain sites we are unable to run services on one of the nodes. When the services are run as they are supposed to, every so often (at some sites a matter of minutes or hours, at others a couple of weeks) a node will BSOD.
I fully understand what the blue screen is. The minidump shows that it's the orafencedrv.sys stop, where the Oracle database shuts down a node after a loss of communications in order to prevent corruption of the database. This is a great feature and I'm grateful for it; however, it has caused us many headaches in diagnosing what is actually causing the drop in communications.
The interconnect and the public IP are both hooked up over a single switch but they operate on different subnets. Could operating on a single switch be part of the problem?
Could the problem be that the switches are being overloaded with traffic causing temporary packet losses between the two nodes, which I know is enough to have Oracle BSOD a node?
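One way to test the packet-loss theory would be to run a simple probe across the interconnect and record which probes go unanswered. The following is only a minimal sketch in Python, not Oracle tooling: the function names `probe_interconnect` and `summarize` are hypothetical, and it assumes a UDP echo listener is already running on the peer node's private address. Loss seen by a probe like this does not necessarily match what the CSS heartbeat sees, so it would be a hint at best.

```python
import socket
import time


def probe_interconnect(peer_addr, count=10, interval=0.1, timeout=0.5):
    """Send sequence-numbered UDP datagrams to an echo listener on the peer.

    peer_addr is a (host, port) tuple; the echo listener is an assumption,
    it is not something Oracle provides.
    Returns a list of (seq, rtt_seconds) tuples, with rtt None for lost probes.
    """
    results = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        for seq in range(count):
            payload = str(seq).encode("ascii")
            sent_at = time.monotonic()
            rtt = None
            try:
                sock.sendto(payload, peer_addr)
                while True:
                    data, _ = sock.recvfrom(64)
                    if data == payload:
                        rtt = time.monotonic() - sent_at
                        break
                    # Otherwise it is a late reply to an earlier probe; keep waiting.
            except OSError:
                # Timeout or a reset from the peer: count this probe as lost.
                pass
            results.append((seq, rtt))
            time.sleep(interval)
    return results


def summarize(results):
    """Reduce probe results to loss and worst-case round-trip figures."""
    lost = [seq for seq, rtt in results if rtt is None]
    rtts = [rtt for _, rtt in results if rtt is not None]
    return {
        "sent": len(results),
        "lost": lost,
        "loss_pct": 100.0 * len(lost) / len(results) if results else 0.0,
        "max_rtt": max(rtts) if rtts else None,
    }
```

Running something like `summarize(probe_interconnect((peer_ip, port), count=600, interval=1.0))` on both the private and the public subnet at a problem site, and comparing the timestamps of any lost probes against the BSOD times, might show whether the shared switch is dropping traffic. Here `peer_ip` and `port` are placeholders for the other node's address and the echo listener's port.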
Below I'm posting one of the dumps listed in the CSSD log when the node crashes, hopefully this will provide some sort of information as to what is happening.
If any other information is needed, please feel free to let me know. Thanks for your help in advance.
[ CSSD]2008-10-29 13:30:06.211 [2732] >ERROR: clssnmvDiskKillCheck: Aborting, evicted by node 1, sync 13, stamp 99832890,
[ CSSD]2008-10-29 13:30:06.211 [2732] >ERROR: ###################################
[ CSSD]2008-10-29 13:30:06.211 [2732] >ERROR: clssscExit: CSSD aborting
[ CSSD]2008-10-29 13:30:06.211 [2732] >ERROR: ###################################
[ CSSD]--- DUMP GROCK STATE DB ---
[ CSSD]----------
[ CSSD] type 2, Id 3, Name = (crs_version)
[ CSSD] flags: 0x0
[ CSSD] grant: count=0, type 0, wait 0
[ CSSD] Member Count =2, master 0
[ CSSD] . . . . .
[ CSSD] memberNo =0, seq 5
[ CSSD] flags = 0x0, granted 0
[ CSSD] refCnt = 1
[ CSSD] nodeNum = 2, nodeBirth 6
[ CSSD] privateDataSize = 0
[ CSSD] publicDataSize = 0
[ CSSD] . . . . .
[ CSSD] memberNo =1, seq 11
[ CSSD] flags = 0x0, granted 0
[ CSSD] refCnt = 1
[ CSSD] nodeNum = 1, nodeBirth 12
[ CSSD] privateDataSize = 0
[ CSSD] publicDataSize = 0
[ CSSD]----------
[ CSSD]----------
[ CSSD] type 2, Id 2, Name = (ocr_STLRZOPRCL)
[ CSSD] flags: 0x0
[ CSSD] grant: count=0, type 0, wait 0
[ CSSD] Member Count =2, master 2
[ CSSD] . . . . .
[ CSSD] memberNo =2, seq 5
[ CSSD] flags = 0x0, granted 0
[ CSSD] refCnt = 1
[ CSSD] nodeNum = 2, nodeBirth 6
[ CSSD] privateDataSize = 0
[ CSSD] publicDataSize = 32
[ CSSD] . . . . .
[ CSSD] memberNo =1, seq 11
[ CSSD] flags = 0x0, granted 0
[ CSSD] refCnt = 1
[ CSSD] nodeNum = 1, nodeBirth 12
[ CSSD] privateDataSize = 0
[ CSSD] publicDataSize = 32
[ CSSD]----------
[ CSSD]----------
[ CSSD] type 3, Id 15, Name = (_ORA_CRS_MEMBER_stlrzoprcl1)
[ CSSD] flags: 0x0
[ CSSD] grant: count=1, type 3, wait 1
[ CSSD] Member Count =1, master -3
[ CSSD] . . . . .
[ CSSD] memberNo =0, seq 0
[ CSSD] flags = 0x12, granted 0
[ CSSD] refCnt = 1
[ CSSD] nodeNum = 1, nodeBirth 12
[ CSSD] privateDataSize = 0
[ CSSD] publicDataSize = 0
[ CSSD]----------
[ CSSD]----------
[ CSSD] type 3, Id 15, Name = (_ORA_CRS_MEMBER_stlrzoprcl2)
[ CSSD] flags: 0x0
[ CSSD] grant: count=1, type 3, wait 1
[ CSSD] Member Count =1, master -3
[ CSSD] . . . . .
[ CSSD] memberNo =0, seq 0
[ CSSD] flags = 0x12, granted 1
[ CSSD] refCnt = 1
[ CSSD] nodeNum = 2, nodeBirth 6
[ CSSD] privateDataSize = 0
[ CSSD] publicDataSize = 0
[ CSSD]----------
[ CSSD]----------
[ CSSD] type 2, Id 4, Name = (CRSDMAIN)
[ CSSD] flags: 0x0
[ CSSD] grant: count=0, type 0, wait 0
[ CSSD] Member Count =2, master 2
[ CSSD] . . . . .
[ CSSD] memberNo =2, seq 5
[ CSSD] flags = 0x0, granted 0
[ CSSD] refCnt = 1
[ CSSD] nodeNum = 2, nodeBirth 6
[ CSSD] privateDataSize = 128
[ CSSD] publicDataSize = 128
[ CSSD] . . . . .
[ CSSD] memberNo =1, seq 11
[ CSSD] flags = 0x0, granted 0
[ CSSD] refCnt = 1
[ CSSD] nodeNum = 1, nodeBirth 12
[ CSSD] privateDataSize = 128
[ CSSD] publicDataSize = 128
[ CSSD]----------
[ CSSD]----------
[ CSSD] type 2, Id 1, Name = (EVMDMAIN)
[ CSSD] flags: 0x0
[ CSSD] grant: count=0, type 0, wait 0
[ CSSD] Member Count =2, master 2
[ CSSD] . . . . .
[ CSSD] memberNo =2, seq 5
[ CSSD] flags = 0x0, granted 0
[ CSSD] refCnt = 1
[ CSSD] nodeNum = 2, nodeBirth 6
[ CSSD] privateDataSize = 508
[ CSSD] publicDataSize = 504
[ CSSD] . . . . .
[ CSSD] memberNo =1, seq 11
[ CSSD] flags = 0x0, granted 0
[ CSSD] refCnt = 1
[ CSSD] nodeNum = 1, nodeBirth 12
[ CSSD] privateDataSize = 508
[ CSSD] publicDataSize = 504
[ CSSD]----------
[ CSSD]--- END OF GROCK STATE DUMP ---
[ CSSD]------- End Dump -------
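Since we have these logs from many sites, it may help to pull the eviction lines and the group membership out of each dump automatically so the crashes can be compared. Below is a rough sketch in Python that assumes the log lines look exactly like the ones pasted above; the function name `parse_cssd_dump` and the dictionary keys are my own and not part of any Oracle tool.

```python
import re
from datetime import datetime

# Patterns modelled on the lines in the dump above (an assumption: other
# Clusterware versions may format these lines differently).
EVICT_RE = re.compile(
    r"\[\s*CSSD\](?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s*"
    r"\[(?P<tid>\d+)\]\s*>ERROR:\s*clssnmvDiskKillCheck: Aborting, "
    r"evicted by node (?P<node>\d+), sync (?P<sync>\d+), stamp (?P<stamp>\d+)"
)
GROUP_RE = re.compile(r"\[\s*CSSD\]\s*type (\d+), Id (\d+), Name = \((.+?)\)")
MEMBER_RE = re.compile(r"\[\s*CSSD\]\s*memberNo\s*=\s*(-?\d+), seq (\d+)")
NODE_RE = re.compile(r"\[\s*CSSD\]\s*nodeNum\s*=\s*(\d+), nodeBirth (\d+)")


def parse_cssd_dump(text):
    """Return (evictions, groups) extracted from CSSD log text."""
    evictions = []
    groups = []
    for line in text.splitlines():
        m = EVICT_RE.search(line)
        if m:
            evictions.append({
                "time": datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S.%f"),
                "thread": int(m["tid"]),
                "evicted_by": int(m["node"]),
                "sync": int(m["sync"]),
                "stamp": int(m["stamp"]),
            })
            continue
        m = GROUP_RE.search(line)
        if m:
            # A "type N, Id N, Name = (...)" line starts a new group.
            groups.append({"type": int(m[1]), "id": int(m[2]),
                           "name": m[3], "members": []})
            continue
        m = MEMBER_RE.search(line)
        if m and groups:
            groups[-1]["members"].append(
                {"member_no": int(m[1]), "seq": int(m[2])})
            continue
        m = NODE_RE.search(line)
        if m and groups and groups[-1]["members"]:
            # nodeNum/nodeBirth belong to the most recent member entry.
            groups[-1]["members"][-1].update(
                node_num=int(m[1]), node_birth=int(m[2]))
    return evictions, groups
```

Run against the dump above, this should report a single eviction at 2008-10-29 13:30:06.211 with node 1 as the evicting node, plus the groups (crs_version, CRSDMAIN, EVMDMAIN and so on) with the node numbers registered in each. Collecting the eviction times from every site would make it easier to line them up with switch logs or traffic peaks.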