Skip to Main Content

Database Software

Announcement

For appeals, questions and feedback about Oracle Forums, please email oracle-forums-moderators_us@oracle.com. Technical questions should be asked in the appropriate category. Thank you!

Failed to restart the CSSD during the interconnect failure

855041Apr 13 2011 — edited Apr 15 2011
Hi all,

I run a small ATP on my LAB where i have

- 2x nodes RAC 11.2.0.2 & ASM (my OCR & Voting files are stored on ASM)
- 1 public interface <> eth0
- 1 private interface <> eth1
- 1 SCAN IP defined in the /etc/hosts file (i'm not using DNS or GNS)

The test i run was to shutdown the private interface (eth1) on node 1 and i saw that

1) all cluster services and cluster daemons on node 2 were killed and node 2 was evicted from the cluster by node 1
2) all new connections were redirected to the survived node
3) Oracle OHASD daemon was restarted on node 2 and tried to start the cluster services without success because private network between cluster nodes was down

Up to here everything worked as expected but once i turn on eth1 it took ~ 9 minutes for the CSSD to startup and bring all the components up & running.

The node2 alert logs showes
 
[ctssd(12949)]CRS-2402:The Cluster Time Synchronization Service aborted on host node2. Details at (:ctss_css_init1:) in /u01/oracle/installed/oracle_cluster-11.2.0.2-1/log/node2/ctssd/octssd.log. 
2011-04-13 08:09:40.978 
[ohasd(5058)]CRS-2765:Resource 'ora.cssd' has failed on server 'node2'. 
2011-04-13 08:09:40.985 
[/u01/oracle/installed/oracle_cluster-11.2.0.2-1/bin/oraagent.bin(5764)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/u01/oracle/installed/oracle_cluster-11.2.0.2-1/log/node2/agent/ohasd/oraagent_oracle/oraagent_oracle.log"; 
2011-04-13 08:09:41.169 
[ohasd(5058)]CRS-2765:Resource 'ora.asm' has failed on server 'node2'. 
2011-04-13 08:09:50.337 
[cssd(13103)]CRS-1713:CSSD daemon is started in clustered mode 
2011-04-13 08:10:05.833 
[cssd(13103)]CRS-1707:Lease acquisition for node node2 number 2 completed 
2011-04-13 08:10:07.119 
[cssd(13103)]CRS-1605:CSSD voting file is online: ORCL:CRS_DISK1_2G; details in /u01/oracle/installed/oracle_cluster-11.2.0.2-1/log/node2/cssd/ocssd.log. 
2011-04-13 08:10:07.121 
[cssd(13103)]CRS-1605:CSSD voting file is online: ORCL:CRS_DISK2_2G; details in /u01/oracle/installed/oracle_cluster-11.2.0.2-1/log/node2/cssd/ocssd.log. 
2011-04-13 08:10:07.143 
[cssd(13103)]CRS-1605:CSSD voting file is online: ORCL:CRS_DISK1_2G; details in /u01/oracle/installed/oracle_cluster-11.2.0.2-1/log/node2/cssd/ocssd.log. 
2011-04-13 08:19:49.386 
[/u01/oracle/installed/oracle_cluster-11.2.0.2-1/bin/cssdagent(13091)]CRS-5818:Aborted command 'start for resource: ora.cssd 1 1' for resource 'ora.cssd'. Details at (:CRSAGF00113:) {0:6:7} in /u01/oracle/installed/oracle_cluster-11.2.0.2-1/log/node2/agent/ohasd/oracssdagent_root/oracssdagent_root.log. 
2011-04-13 08:19:49.387 
[cssd(13103)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/oracle/installed/oracle_cluster-11.2.0.2-1/log/node2/cssd/ocssd.log 
2011-04-13 08:19:49.387 
[cssd(13103)]CRS-1603:CSSD on node node2 shutdown by user. 
2011-04-13 08:19:54.501 
[ohasd(5058)]CRS-2765:Resource 'ora.cssdmonitor' has failed on server 'node2'. 
2011-04-13 08:19:57.723 
[cssd(17068)]CRS-1713:CSSD daemon is started in clustered mode 
2011-04-13 08:20:01.177 
[ohasd(5058)]CRS-2765:Resource 'ora.diskmon' has failed on server 'node2'. 
2011-04-13 08:20:13.167 
[cssd(17068)]CRS-1707:Lease acquisition for node node2 number 2 completed 
pay attention at the timestamp 08:10:07.143 & 08:19:49.386

The error in the oracssdagent_root.log is
 
2011-04-13 08:09:49.286: [CLSFRAME][3014212592] New Framework state: 2 
2011-04-13 08:09:49.286: [CLSFRAME][3014212592] M2M is starting... 
2011-04-13 08:09:49.288: [ CRSCOMM][3014212592] Ipc: Starting send thread 
2011-04-13 08:09:49.288: [ CRSCOMM][1092061504] Ipc: sendWork thread started. 
2011-04-13 08:09:49.289: [ CRSCOMM][1105643840] IpcC: IPC Client thread started listening 
2011-04-13 08:09:49.289: [ CRSCOMM][1105643840] IpcC: Received member number of 10 
2011-04-13 08:09:49.290: [CLSFRAME][3014212592] New IPC Member:{Relative|Node:0|Process:0|Type:2}:OHASD:node2 
2011-04-13 08:09:49.290: [CLSFRAME][3014212592] New process connected to us ID:{Relative|Node:0|Process:0|Type:2} Info:OHASD:node2 
2011-04-13 08:09:49.291: [CLSFRAME][3014212592] Tints initialized with nodeId: 0 procId: 10 
2011-04-13 08:09:49.291: [CLSFRAME][3014212592] Starting thread model named: MultiThread 
2011-04-13 08:09:49.292: [CLSFRAME][3014212592] Starting thread model named: TimerSharedTM 
2011-04-13 08:09:49.293: [CLSFRAME][3014212592] New Framework state: 3 
2011-04-13 08:09:49.293: [    AGFW][3014212592] Agent Framework started successfully 
2011-04-13 08:09:49.293: [    AGFW][1116150080] {0:10:2} Agfw engine module has enabled... 
2011-04-13 08:09:49.293: [CLSFRAME][1116150080] {0:10:2} Module Enabling is complete 
2011-04-13 08:09:49.293: [CLSFRAME][1116150080] {0:10:2} New Framework state: 6 
2011-04-13 08:09:49.294: [CLSFRAME][3014212592] M2M is now powered by a doWork() thread. 
2011-04-13 08:09:49.294: [    AGFW][1116150080] {0:10:2} Agent is started with userid: root , expected user: root 
2011-04-13 08:09:49.294: [   AGENT][1116150080] {0:10:2} Static Version 11.2.0.2.0 
2011-04-13 08:09:49.294: [    AGFW][1116150080] {0:10:2} Agent sending message to PE: AGENT_HANDSHAKE[Proxy] ID 20484:11 
2011-04-13 08:09:49.302: [    AGFW][1116150080] {0:10:2} Agent received the message: RESTYPE_ADD[ora.cssd.type] ID 8196:12358 
2011-04-13 08:09:49.302: [    AGFW][1116150080] {0:10:2} Added new restype: ora.cssd.type 
2011-04-13 08:09:49.303: [    AGFW][1116150080] {0:10:2} Agent sending last reply for: RESTYPE_ADD[ora.cssd.type] ID 8196:12358 
2011-04-13 08:09:49.305: [    AGFW][1116150080] {0:10:2} Agent received the message: RESOURCE_ADD[ora.cssd 1 1] ID 4356:12359 
2011-04-13 08:09:49.305: [    AGFW][1116150080] {0:10:2} Added new resource: ora.cssd 1 1 to the agfw 
2011-04-13 08:09:49.306: [    AGFW][1116150080] {0:10:2} Agent sending last reply for: RESOURCE_ADD[ora.cssd 1 1] ID 4356:12359 
2011-04-13 08:09:49.308: [    AGFW][1116150080] {0:6:7} Agent received the message: RESOURCE_START[ora.cssd 1 1] ID 4098:12360 
2011-04-13 08:09:49.308: [    AGFW][1116150080] {0:6:7} Preparing START command for: ora.cssd 1 1 
2011-04-13 08:09:49.308: [    AGFW][1116150080] {0:6:7} ora.cssd 1 1 state changed from: UNKNOWN to: STARTING 
2011-04-13 08:09:49.309: [ora.cssd][1114048832] {0:6:7} [start] clsncssd_cssdstart: Start action called 
2011-04-13 08:09:49.309: [ora.cssd][1114048832] {0:6:7} [start] clsncssd_getattr: attr OMON_INITRATE, value 1000 
2011-04-13 08:09:49.309: [ora.cssd][1114048832] {0:6:7} [start] clsncssd_getattr: attr OMON_POLLRATE, value 500 
2011-04-13 08:09:49.309: [ora.cssd][1114048832] {0:6:7} [start] clsncssd_getattr: attr ORA_OPROCD_MODE, value 
2011-04-13 08:09:49.310: [ora.cssd][1114048832] {0:6:7} [start] clsncssd_getattr: attr PROCD_TIMEOUT, value 1000 
2011-04-13 08:09:49.310: [ora.cssd][1114048832] {0:6:7} [start] clsncssd_getattr: attr LOGGING_LEVEL, value 1 
2011-04-13 08:09:49.310: [ora.cssd][1114048832] {0:6:7} [start] clsncssd_cssdstart: loglevels CSSD=2,GIPCNM=2,GIPCGM=2,GIPCCM=2,CLSF=0,SKGFD=0,GPNP=1,OLR=0 
2011-04-13 08:09:49.313: [ora.cssd][1114048832] {0:6:7} [start] clsncssd_cssdstart: START action for resource /u01/oracle/installed/oracle_cluster-11.2.0.2-1/bin/ocssd: SUCCESS 
2011-04-13 08:09:49.313: [ora.cssd][1114048832] {0:6:7} [start] clsncssd_waitomon: start waiting 
2011-04-13 08:09:49.313: [ CSSCLNT][1098377536]clsssInitNative: Init for agent 
2011-04-13 08:09:50.317: [ CSSCLNT][1098377536]clsssInitNative: Init for agent 
2011-04-13 08:09:51.319: [ CSSCLNT][1098377536]clsssInitNative: Init for agent 
2011-04-13 08:09:51.322: [ CSSCLNT][1098377536]clssnsqueryfatal: css is fatal = 0 
2011-04-13 08:09:51.322: [ USRTHRD][1098377536] clsncssd_thrdspawn: spawn OPROCD succ 
2011-04-13 08:09:51.322: [ USRTHRD][1098377536] clsncssd_thrdspawn: spawn POLLMSG succ 
2011-04-13 08:09:51.323: [ USRTHRD][1099954496] clsnpollmsg_main: starting pollmsg thread 
2011-04-13 08:09:51.323: [ USRTHRD][1107745088] clsnproc_main: timeout of procd cannot be 0, now we set to default 1000. 
2011-04-13 08:09:51.323: [ USRTHRD][1117727040] clsnwork_main: starting worker thread 
2011-04-13 08:09:51.323: [ USRTHRD][1098377536] clsncssd_thrdspawn: spawn WORKER succ 
2011-04-13 08:09:51.323: [ USRTHRD][1107745088] clsnproc_main: starting oprocd 
2011-04-13 08:09:51.323: [ USRTHRD][1098377536] clsncssd_thrdspawn: spawn KILL succ 
2011-04-13 08:10:07.151: [ USRTHRD][1098377536] clsnomon_init: css init done, nodenum 2 
2011-04-13 08:10:07.151: [ USRTHRD][1098377536] clsnomon_WaitToRegister: waiting for first reconfiguration and kgzf initialization 
2011-04-13 08:19:49.385: [CLSFRAME][3014212592] TM [MultiThread] is changing desired thread # to 3. Current # is 2 
2011-04-13 08:19:49.387: [    AGFW][1111947584] {0:6:7} Created alert : (:CRSAGF00113:) :  Aborting the command: start for resource: ora.cssd 1 1 
2011-04-13 08:19:49.387: [ora.cssd][1111947584] {0:6:7} [start] clsncssd_cssdabort: sending shutdown abort to CSS with new ctx 
2011-04-13 08:19:49.387: [ CSSCLNT][1098377536]clsssRecvMsg: wrong type request (0) on 0xc9 ret 0 
2011-04-13 08:19:49.387: [ CSSCLNT][1098377536]clssnskgzfdone: RPC failed rc 1 

2011-04-13 08:19:49.387: [ USRTHRD][1098377536] clsnomon_WaitToRegister: exadata initialization completed with rc=1 
2011-04-13 08:19:49.387: [ USRTHRD][1098377536] clsnomon_init: problems in the CSS to allow OMON registration 2 
2011-04-13 08:19:49.387: [ USRTHRD][1098377536] clsnomon_cleanup: to exit status = 2 
2011-04-13 08:19:49.387: [ USRTHRD][1098377536] clsnomon_cleanup: failure, sending shutdown immediate to CSS 
2011-04-13 08:19:49.387: [ USRTHRD][1098377536] CHECK action is in progress, Rejecting the check action requested by entry point for ora.cssd 
2011-04-13 08:19:49.426: [    AGFW][2008402928] Starting the agent: /u01/oracle/installed/oracle_cluster-11.2.0.2-1/log/node2/agent/ohasd/oracssdagent_root/ 
2011-04-13 08:19:49.426: [   AGENT][2008402928] Agent framework initialized, Process Id = 17013 
2011-04-13 08:19:49.426: [ USRTHRD][2008402928] to enter agent main 
2011-04-13 08:19:49.426: [ USRTHRD][2008402928] clsscssd_main: New soft limit for stack size is 1572864, hard limit is 4294967295 
2011-04-13 08:19:49.434: [ USRTHRD][2008402928] clsncssd_main: setting priority to 4 
2011-04-13 08:19:49.434: [ USRTHRD][2008402928]  *** Agent Framework Started *** 
Do you have any idea why it took so long to bring all the components up & running?

Thanks a lot!!

G
Comments
Locked Post
New comments cannot be posted to this locked post.
Post Details
Locked on May 13 2011
Added on Apr 13 2011
11 comments
4,635 views