ASM instances restarting, all databases restart too

2819320, Jun 18 2014 (edited Jun 20 2014)

Hi,

I'm facing a problem that I can't understand: the ASM instances on both nodes are restarting for the same reason, and all the databases restart too after the ASM crash (obviously). This is a Grid Infrastructure installation (11.2.0.4) on two Windows Server 2012 nodes.

Here is the ASM alert log:

Unexpected return code (6) from the Cluster Synchronization Service (LCK0)

Please check the CSS log file for more detail

Errors in file D:\APP\ORACLE\diag\asm\+asm\+asm1\trace\+asm1_lck0_8628.trc:

ORA-29701: unable to connect to Cluster Synchronization Service

LCK0 (ospid: 8628): terminating the instance due to error 29701

Tue Jun 17 23:42:07 2014

Errors in file D:\APP\ORACLE\diag\asm\+asm\+asm1\trace\+asm1_diag_3772.trc:

ORA-27508: IPC error sending a message

ORA-27300: OS system dependent operation:IPCSOCK_Send failed with status: 10054

ORA-27301: OS failure message: An existing connection was forcibly closed by the remote host.

ORA-27302: failure occurred at: send_3

System state dump requested by (instance=1, osid=8628 (LCK0)), summary=[abnormal instance termination].

System State dumped to trace file D:\APP\ORACLE\diag\asm\+asm\+asm1\trace\+asm1_diag_3772_20140617234207.trc

Dumping diagnostic data in directory=[cdmp_20140617234207], requested by (instance=1, osid=8628 (LCK0)), summary=[abnormal instance termination].

Tue Jun 17 23:42:08 2014

opiodr aborting process unknown ospid (3824) as a result of ORA-1092

Tue Jun 17 23:42:08 2014

ORA-1092 : opitsk aborting process

Tue Jun 17 23:42:12 2014

NOTE: ASMB process exiting, either shutdown is in progress

NOTE: or foreground connected to ASMB was killed.

Tue Jun 17 23:42:14 2014

ORA-1092 : opitsk aborting process

Tue Jun 17 23:42:14 2014

License high water mark = 6

Tue Jun 17 23:42:19 2014

Instance terminated by LCK0, pid = 8628

Tue Jun 17 23:42:48 2014

MEMORY_TARGET defaulting to 1128267776.

* instance_number obtained from CSS = 1, checking for the existence of node 0...

* node 0 does not exist. instance_number = 1

I then inspected the two traces.

Here is an extract of +asm1_lck0_8628.trc:

KSXPCINI: kjxgnpub KSXP 4.1.0LCK0 10, 0

2014-06-17 23:42:07.020: [ CSSCLNT]clsssRecvMsg: got a disconnect from the server while waiting for message type 1

*** 2014-06-17 23:42:07.020

2014-06-17 23:42:07.020: [ CSSCLNT]clssgsGroupGetStatus: communications failed (0/3/-1)

2014-06-17 23:42:07.020: [ CSSCLNT]clssgsGroupGetStatus: returning 8

kgxgnpstat: received ABORT event from CLSS

ksimpoll: kgxgnpstat returns ABORT for group (IG+ASMSYS$BACKGROUND)

error 29701 detected in background process

ORA-29701: unable to connect to Cluster Synchronization Service

kjzduptcctx: Notifying DIAG for crash event

----- Abridged Call Stack Trace -----

ksedsts()+585<-kjzdssdmp()+329<-kjzduptcctx()+288<-kjzdicrshnfy()+99<-ksuitm()+1525<-ksbrdp()+4578<-opirip()+853<-opidrv()+906<-sou2o()+98<-opimai_real()+280<-opimai()+191<-BackgroundThreadStart()+646<-000007FDD92C1832<-000007FDDBE5D609

----- End of Abridged Call Stack Trace -----

*** 2014-06-17 23:42:07.138

LCK0 (ospid: 8628): terminating the instance due to error 29701

Dump file d:\app\oracle\diag\asm\+asm\+asm1\trace\+asm1_lck0_8628.trc

Tue Jun 17 23:42:12 2014

ORACLE V11.2.0.4.0 - 64bit Production vsnsta=0

vsnsql=16 vsnxtr=3

*** 2014-06-17 23:42:12.359 ssthreadkillsafe: about to forcefully terminate tid=8540, dumping call stack:

Here is an extract of +asm1_diag_3772.trc:

*** ACTION NAME:() 2014-06-17 19:42:51.874

Node id: 0

List of nodes: 0, 1,

*** 2014-06-17 19:42:51.876

Reconfiguration starts [incarn=0]

*** 2014-06-17 19:42:51.876

I'm the master node

Group reconfiguration cleanup

*** 2014-06-17 19:42:51.878

A rcfg proposal from node 1 is received

vote_map :3000000000000000000000000000000000000000000000000000000000000000

expect_map :3000000000000000000000000000000000000000000000000000000000000000

reply_map :2000000000000000000000000000000000000000000000000000000000000000

vote->incar_num 8, ctx->incar_num 0, commit_incar 1, prop_incar 1

*** 2014-06-17 19:42:51.878

Reconfiguration completes [incarn=8]

kjzdattdlm: Can not attach to DLM (LMON up=[FALSE], DB mounted=[FALSE]).

*** 2014-06-17 19:42:51.878

kjzdm_rac: enabling heartbeat after reconfig

kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).

*** 2014-06-17 19:42:52.879

kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).

kjzdattdlm:A rcfg proposal from node 1 is received Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).

kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).

*** 2014-06-17 19:42:53.905

DIAG attached to DLM

2014-06-17 23:42:05.840: [ CSSCLNT]clsssRecvMsg: got a disconnect from the server while waiting for message type 1

*** 2014-06-17 23:42:05.840

2014-06-17 23:42:05.840: [ CSSCLNT]clssgsGroupGetStatus: communications failed (0/3/-1)

2014-06-17 23:42:05.840: [ CSSCLNT]clssgsGroupGetStatus: returning 8

kgxgnpstat: received ABORT event from CLSS

CM problem, please abort

*** 2014-06-17 23:42:05.840

Node monitor becomes unavailable for service

2014-06-17 23:42:06.040: [ CSSCLNT]clsssRecvMsg: got a disconnect from the server while waiting for message type 1

2014-06-17 23:42:06.040: [ CSSCLNT]clssgsGroupGetStatus: communications failed (0/3/-1)

2014-06-17 23:42:06.040: [ CSSCLNT]clssgsGroupGetStatus: returning 8

kgxgnpstat: received ABORT event from CLSS

CM problem, please abort
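
Both traces show the same CSSCLNT disconnect, and the DIAG trace records it slightly earlier (23:42:05.840) than LCK0 does (23:42:07.020). To line these timestamps up against other logs, a small parser along these lines might help. It is only a sketch: the names `css_events` and `first_disconnect` are made up for illustration, and it assumes the CSSCLNT lines always have the `YYYY-MM-DD HH:MM:SS.mmm: [ CSSCLNT]function: message` shape seen in the extracts above.

```python
import re
from datetime import datetime

# Assumed line shape, taken from the trace extracts in this post:
# 2014-06-17 23:42:07.020: [ CSSCLNT]clsssRecvMsg: got a disconnect ...
CSS_LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}): \[ CSSCLNT\](\w+): (.*)$"
)


def css_events(trace_text):
    """Return (timestamp, function, message) for every CSSCLNT line."""
    events = []
    for line in trace_text.splitlines():
        match = CSS_LINE.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
            events.append((stamp, match.group(2), match.group(3)))
    return events


def first_disconnect(trace_text):
    """Return the timestamp of the first CSSCLNT disconnect, or None."""
    for stamp, _function, message in css_events(trace_text):
        if "disconnect" in message:
            return stamp
    return None
```

Running `first_disconnect` over the text of each trace file gives one timestamp per process, which can then be compared with the Windows event log and the CSS log on each node.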

Any ideas?

Thanks in advance!

EDIT:

In the Windows event log I found this entry, matching the time when the ASM instances crashed:

TCP/IP failed to establish an outgoing connection because the selected local endpoint was recently used to connect to the same remote endpoint. This error typically occurs when outgoing connections are opened and closed at a high rate, causing all available local ports to be used and forcing TCP/IP to reuse a local port for an outgoing connection. To minimize the risk of data corruption, the TCP/IP standard requires a minimum time period to elapse between successive connections from a given local endpoint to a given remote endpoint.

For now I've disabled all other processes not related to Oracle.
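
Since the event log message points at local ports being reused after many short-lived outgoing connections, one way to check for that is to count TCP connections per state from saved `netstat -ano -p tcp` output. The following is a minimal sketch, not a definitive diagnostic: the function name `count_tcp_states` is hypothetical, and it assumes the usual Windows column layout (protocol, local address, foreign address, state, PID).

```python
from collections import Counter


def count_tcp_states(netstat_output):
    """Count TCP connections per state in the text printed by netstat."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed data row layout: TCP <local> <remote> <state> [<pid>]
        if len(fields) >= 4 and fields[0] == "TCP":
            states[fields[3]] += 1
    return states
```

A large and growing `TIME_WAIT` count around the time of a crash would be consistent with the port exhaustion that the event log describes. The size of the dynamic port range can be compared against that count with `netsh int ipv4 show dynamicport tcp`.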


#11gr2, #asm, #grid, #performance-availability, #rac-asm-clusterware-installation, #windows
