How long can a node be offline and successfully rejoin the cluster after a reboot?
Hi All,
I am looking for input here. We have a 2-node RAC, database version 10.2.0.4, using ASM, on Linux on IBM boxes. We have a situation where one of our nodes has been non-functional for 2 months. By non-functional I mean the server no longer recognizes its own local disks; it is frozen, hung, non-functioning.
When we log onto the non-functional server, it reports that the cluster processes are running, and on the functional node the CRS log shows both nodes as active:
2011-04-19 10:05:34.611
[crsd(17985)]CRS-1204:Recovering CRS resources for node xxx-xxxxx-2.
[cssd(18546)]CRS-1601:CSSD Reconfiguration complete. Active nodes are xxx-xxxxx-1 xxx-xxxxx-2 .
Nothing after this.
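For reference, this is roughly how we have been checking the stack on each node (just a sketch; it assumes the 10.2 Clusterware binaries live under $CRS_HOME, so adjust the path for your install):

# Overall Clusterware stack health on a node
$CRS_HOME/bin/crsctl check crs

# Individual 10.2 daemons
$CRS_HOME/bin/crsctl check cssd
$CRS_HOME/bin/crsctl check crsd
$CRS_HOME/bin/crsctl check evmd

# Nodes the cluster knows about, with their node numbers
$CRS_HOME/bin/olsnodes -n

# Cluster resource status (10g syntax)
$CRS_HOME/bin/crs_stat -t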
However, on the working node, the ASM alert log shows that node as having been evicted back in August:
List of nodes:
0
Global Resource Directory frozen
* dead instance detected - domain 1 invalid = TRUE
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Mon Aug 15 19:55:03 2011
LMS 0: 0 GCS shadows cancelled, 0 closed
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Mon Aug 15 19:55:03 2011
NOTE: SMON starting instance recovery for group 1 (mounted)
Mon Aug 15 19:55:03 2011
LMS 0: 5446 GCS shadows traversed, 0 replayed
Mon Aug 15 19:55:03 2011
Submitted all GCS remote-cache requests
Fix write in gcs resources
Reconfiguration complete
Mon Aug 15 19:55:03 2011
NOTE: F1X0 found on disk 0 fcn 0.1332669
NOTE: starting recovery of thread=1 ckpt=63.8866 group=1
Mon Aug 15 19:55:03 2011
NOTE: waiting for instance recovery of group 1
Mon Aug 15 19:55:03 2011
NOTE: advancing ckpt for thread=1 ckpt=63.8866
NOTE: smon did instance recovery for domain 1
Mon Aug 15 19:55:06 2011
NOTE: recovering COD for group 1/0xeaec03f1 (DGROUP1)
SUCCESS: completed COD recovery for group 1/0xeaec03f1 (DGROUP1)
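To cross-check what ASM on the surviving node actually sees, we run something like the following (again a sketch; it assumes the local ASM instance is +ASM1 and that ORACLE_HOME points at the ASM home):

# Ask the surviving ASM instance which instances are still registered
# and how the disk groups look (assumptions: SID is +ASM1; on 10.2 we
# connect as sysdba rather than sysasm)
export ORACLE_SID=+ASM1
$ORACLE_HOME/bin/sqlplus -s "/ as sysdba" <<'EOF'
select inst_id, instance_name, status from gv$instance;
select group_number, name, state from v$asm_diskgroup;
EOF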
My question: if the non-functional node comes back up after a reboot, is it possible that the cluster will recognize it and it will resume its work?
There have been no errors in the working node's alert logs.
I understand each case is different; I just want to hear what the experts think will happen.
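In case it helps frame an answer, this is roughly the sequence I plan to run once the box is rebooted (just a sketch; $CRS_HOME, $ORACLE_HOME and the database name are placeholders for our environment):

# See whether Clusterware restarted on its own after the reboot
$CRS_HOME/bin/crsctl check crs

# If the stack is down, start it (run as root in 10.2)
$CRS_HOME/bin/crsctl start crs

# Confirm both nodes are visible and resources are coming online
$CRS_HOME/bin/olsnodes -n
$CRS_HOME/bin/crs_stat -t

# Check the database instances from either node
DB_NAME=orcl   # placeholder -- substitute our real database name
$ORACLE_HOME/bin/srvctl status database -d $DB_NAME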