Hi,
I'm facing a problem that I can't understand: the ASM instances on both nodes keep restarting for the same reason, and all the databases go down with them after each ASM crash (obviously).
This is a Grid Infrastructure installation (11.2.0.4) on 2 Windows Server 2012 nodes. Here is the ASM alert log:
Unexpected return code (6) from the Cluster Synchronization Service (LCK0)
Please check the CSS log file for more detail
Errors in file D:\APP\ORACLE\diag\asm\+asm\+asm1\trace\+asm1_lck0_8628.trc:
ORA-29701: unable to connect to Cluster Synchronization Service
LCK0 (ospid: 8628): terminating the instance due to error 29701
Tue Jun 17 23:42:07 2014
Errors in file D:\APP\ORACLE\diag\asm\+asm\+asm1\trace\+asm1_diag_3772.trc:
ORA-27508: IPC error sending a message
ORA-27300: OS system dependent operation:IPCSOCK_Send failed with status: 10054
ORA-27301: OS failure message: An existing connection was forcibly closed by the remote host.
ORA-27302: failure occurred at: send_3
System state dump requested by (instance=1, osid=8628 (LCK0)), summary=[abnormal instance termination].
System State dumped to trace file D:\APP\ORACLE\diag\asm\+asm\+asm1\trace\+asm1_diag_3772_20140617234207.trc
Dumping diagnostic data in directory=[cdmp_20140617234207], requested by (instance=1, osid=8628 (LCK0)), summary=[abnormal instance termination].
Tue Jun 17 23:42:08 2014
opiodr aborting process unknown ospid (3824) as a result of ORA-1092
Tue Jun 17 23:42:08 2014
ORA-1092 : opitsk aborting process
Tue Jun 17 23:42:12 2014
NOTE: ASMB process exiting, either shutdown is in progress
NOTE: or foreground connected to ASMB was killed.
Tue Jun 17 23:42:14 2014
ORA-1092 : opitsk aborting process
Tue Jun 17 23:42:14 2014
License high water mark = 6
Tue Jun 17 23:42:19 2014
Instance terminated by LCK0, pid = 8628
Tue Jun 17 23:42:48 2014
MEMORY_TARGET defaulting to 1128267776.
* instance_number obtained from CSS = 1, checking for the existence of node 0...
* node 0 does not exist. instance_number = 1
Then I inspected the 2 trace files.
Here is an extract of +asm1_lck0_8628.trc:
KSXPCINI: kjxgnpub KSXP 4.1.0LCK0 10, 0
2014-06-17 23:42:07.020: [ CSSCLNT]clsssRecvMsg: got a disconnect from the server while waiting for message type 1
*** 2014-06-17 23:42:07.020
2014-06-17 23:42:07.020: [ CSSCLNT]clssgsGroupGetStatus: communications failed (0/3/-1)
2014-06-17 23:42:07.020: [ CSSCLNT]clssgsGroupGetStatus: returning 8
kgxgnpstat: received ABORT event from CLSS
ksimpoll: kgxgnpstat returns ABORT for group (IG+ASMSYS$BACKGROUND)
error 29701 detected in background process
ORA-29701: unable to connect to Cluster Synchronization Service
kjzduptcctx: Notifying DIAG for crash event
----- Abridged Call Stack Trace -----
ksedsts()+585<-kjzdssdmp()+329<-kjzduptcctx()+288<-kjzdicrshnfy()+99<-ksuitm()+1525<-ksbrdp()+4578<-opirip()+853<-opidrv()+906<-sou2o()+98<-opimai_real()+280<-opimai()+191<-BackgroundThreadStart()+646<-000007FDD92C1832<-000007FDDBE5D609
----- End of Abridged Call Stack Trace -----
*** 2014-06-17 23:42:07.138
LCK0 (ospid: 8628): terminating the instance due to error 29701
Dump file d:\app\oracle\diag\asm\+asm\+asm1\trace\+asm1_lck0_8628.trc
Tue Jun 17 23:42:12 2014
ORACLE V11.2.0.4.0 - 64bit Production vsnsta=0
vsnsql=16 vsnxtr=3
*** 2014-06-17 23:42:12.359 ssthreadkillsafe: about to forcefully terminate tid=8540, dumping call stack:
Here is an extract of +asm1_diag_3772.trc:
*** ACTION NAME:() 2014-06-17 19:42:51.874
Node id: 0
List of nodes: 0, 1,
*** 2014-06-17 19:42:51.876
Reconfiguration starts [incarn=0]
*** 2014-06-17 19:42:51.876
I'm the master node
Group reconfiguration cleanup
*** 2014-06-17 19:42:51.878
A rcfg proposal from node 1 is received
vote_map :3000000000000000000000000000000000000000000000000000000000000000
expect_map :3000000000000000000000000000000000000000000000000000000000000000
reply_map :2000000000000000000000000000000000000000000000000000000000000000
vote->incar_num 8, ctx->incar_num 0, commit_incar 1, prop_incar 1
*** 2014-06-17 19:42:51.878
Reconfiguration completes [incarn=8]
kjzdattdlm: Can not attach to DLM (LMON up=[FALSE], DB mounted=[FALSE]).
*** 2014-06-17 19:42:51.878
kjzdm_rac: enabling heartbeat after reconfig
kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).
kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).
kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).
kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).
*** 2014-06-17 19:42:52.879
kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).
kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).
kjzdattdlm:A rcfg proposal from node 1 is received Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).
kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).
kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).
*** 2014-06-17 19:42:53.905
DIAG attached to DLM
2014-06-17 23:42:05.840: [ CSSCLNT]clsssRecvMsg: got a disconnect from the server while waiting for message type 1
*** 2014-06-17 23:42:05.840
2014-06-17 23:42:05.840: [ CSSCLNT]clssgsGroupGetStatus: communications failed (0/3/-1)
2014-06-17 23:42:05.840: [ CSSCLNT]clssgsGroupGetStatus: returning 8
kgxgnpstat: received ABORT event from CLSS
CM problem, please abort
*** 2014-06-17 23:42:05.840
Node monitor becomes unavailable for service
2014-06-17 23:42:06.040: [ CSSCLNT]clsssRecvMsg: got a disconnect from the server while waiting for message type 1
2014-06-17 23:42:06.040: [ CSSCLNT]clssgsGroupGetStatus: communications failed (0/3/-1)
2014-06-17 23:42:06.040: [ CSSCLNT]clssgsGroupGetStatus: returning 8
kgxgnpstat: received ABORT event from CLSS
CM problem, please abort
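To line these disconnects up with entries in other logs (the CSS log, the Windows event log), it may help to pull out just their timestamps. This is only a minimal sketch: the function name `css_disconnect_times` is made up for illustration, and it assumes the trace lines look exactly like the `clsssRecvMsg` lines above.

```python
import re
from datetime import datetime

# Matches CSS client lines such as:
# 2014-06-17 23:42:05.840: [ CSSCLNT]clsssRecvMsg: got a disconnect from the server ...
DISCONNECT_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}): "
    r"\[ CSSCLNT\]clsssRecvMsg: got a disconnect"
)


def css_disconnect_times(trace_text):
    """Return the timestamp of every CSS disconnect message found in a trace.

    Hypothetical helper: it only recognises the line format shown in the
    extracts above and ignores everything else.
    """
    times = []
    for line in trace_text.splitlines():
        match = DISCONNECT_RE.match(line.strip())
        if match:
            times.append(
                datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
            )
    return times
```

Feeding it the contents of a trace file such as +asm1_diag_3772.trc gives a list of `datetime` values that can be compared against the event log times.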
Any ideas?
Thanks in advance!
EDIT:
In the Windows event log I found this entry, which matches the time when the ASM instances crashed:
TCP/IP failed to establish an outgoing connection because the selected local endpoint was recently used to connect to the same remote endpoint. This error typically occurs when outgoing connections are opened and closed at a high rate, causing all available local ports to be used and forcing TCP/IP to reuse a local port for an outgoing connection. To minimize the risk of data corruption, the TCP/IP standard requires a minimum time period to elapse between successive connections from a given local endpoint to a given remote endpoint.
For now I've disabled all other processes not related to Oracle.
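To see whether local ports really are being used up, one option is to tally the TCP connection states reported by `netstat -ano`. The following is a rough sketch, not a diagnosis tool: the function name `tcp_state_counts` is made up, and it assumes the usual English-locale column layout (Proto, Local Address, Foreign Address, State, PID).

```python
from collections import Counter


def tcp_state_counts(netstat_output):
    """Count TCP connections per state in the text output of `netstat -ano`.

    Assumption: each TCP row has the columns
    Proto, Local Address, Foreign Address, State, PID.
    UDP rows and header lines are skipped.
    """
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[0] == "TCP":
            counts[fields[3]] += 1
    return counts
```

The output of `netstat -ano` can be saved to a file and passed in as a string. A very large TIME_WAIT count around the crash time would be consistent with the event log message above, though it would not by itself prove that this is what makes CSS disconnect.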