How long can a node be offline and successfully rejoin the cluster after a reboot?
Hi All,
I am looking for input here. We have a 2-node RAC, database version 10.2.0.4, using ASM, on Linux on IBM boxes. We have a situation where one of our nodes has been non-functional for 2 months. By non-functional I mean the server no longer recognizes its own local disks; it is frozen, hung, non-functioning.
When we log onto the non-functional server, it reports that the cluster processes are running, and on the functional node the CRS log shows both nodes as active:
2011-04-19 10:05:34.611
[crsd(17985)]CRS-1204:Recovering CRS resources for node xxx-xxxxx-2.
[cssd(18546)]CRS-1601:CSSD Reconfiguration complete. Active nodes are xxx-xxxxx-1 xxx-xxxxx-2 .
Nothing after this.
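For reference, this is roughly how we have been checking the stack on each node (just a sketch; it assumes the 10.2 Clusterware binaries live under $CRS_HOME, so adjust the path for your install):

# Overall Clusterware stack health on a node
$CRS_HOME/bin/crsctl check crs

# Individual 10.2 daemons
$CRS_HOME/bin/crsctl check cssd
$CRS_HOME/bin/crsctl check crsd
$CRS_HOME/bin/crsctl check evmd

# Nodes the cluster knows about, with their node numbers
$CRS_HOME/bin/olsnodes -n

# Cluster resource status (10g syntax)
$CRS_HOME/bin/crs_stat -t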
However, on the working node, the ASM alert log shows that node as having been evicted back in August:
List of nodes:
0
Global Resource Directory frozen
* dead instance detected - domain 1 invalid = TRUE
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Mon Aug 15 19:55:03 2011
LMS 0: 0 GCS shadows cancelled, 0 closed
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Mon Aug 15 19:55:03 2011
NOTE: SMON starting instance recovery for group 1 (mounted)
Mon Aug 15 19:55:03 2011
LMS 0: 5446 GCS shadows traversed, 0 replayed
Mon Aug 15 19:55:03 2011
Submitted all GCS remote-cache requests
Fix write in gcs resources
Reconfiguration complete
Mon Aug 15 19:55:03 2011
NOTE: F1X0 found on disk 0 fcn 0.1332669
NOTE: starting recovery of thread=1 ckpt=63.8866 group=1
Mon Aug 15 19:55:03 2011
NOTE: waiting for instance recovery of group 1
Mon Aug 15 19:55:03 2011
NOTE: advancing ckpt for thread=1 ckpt=63.8866
NOTE: smon did instance recovery for domain 1
Mon Aug 15 19:55:06 2011
NOTE: recovering COD for group 1/0xeaec03f1 (DGROUP1)
SUCCESS: completed COD recovery for group 1/0xeaec03f1 (DGROUP1)
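To cross-check what ASM on the surviving node actually sees, we run something like the following (again a sketch; it assumes the local ASM instance is +ASM1 and that ORACLE_HOME points at the ASM home):

# Ask the surviving ASM instance which instances are still registered
# and how the disk groups look (assumptions: SID is +ASM1; on 10.2 we
# connect as sysdba rather than sysasm)
export ORACLE_SID=+ASM1
$ORACLE_HOME/bin/sqlplus -s "/ as sysdba" <<'EOF'
select inst_id, instance_name, status from gv$instance;
select group_number, name, state from v$asm_diskgroup;
EOF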
My question: if the non-functional node comes back up after a reboot, is it possible that the cluster will recognize it and it will resume its work?
There have been no errors in the working node's alert logs.
I understand each case is different; I just want to hear what the experts think will happen.
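In case it helps frame an answer, this is roughly the sequence I plan to run once the box is rebooted (just a sketch; $CRS_HOME, $ORACLE_HOME and the database name are placeholders for our environment):

# See whether Clusterware restarted on its own after the reboot
$CRS_HOME/bin/crsctl check crs

# If the stack is down, start it (run as root in 10.2)
$CRS_HOME/bin/crsctl start crs

# Confirm both nodes are visible and resources are coming online
$CRS_HOME/bin/olsnodes -n
$CRS_HOME/bin/crs_stat -t

# Check the database instances from either node
DB_NAME=orcl   # placeholder -- substitute our real database name
$ORACLE_HOME/bin/srvctl status database -d $DB_NAME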