Hello,
I'm doing some tests regarding HA behavior in case of failures.
I have a 2-node RAC, 12.1.0.1.0 SE, on Oracle Linux 6.5 (running kernel 2.6.39-400.215.3.el6uek.x86_64);
the hostnames are racsvi1 and racsvi2.
The public network is on bond0 (2 interfaces) and the private one on bond1 (2 other interfaces).
I have also configured ACFS, used for archived redo logs and DIRECTORY objects,
and fencing via IPMI on the iDRAC interfaces (the two nodes are Dell M610 blades).
I have verified that fencing and failover behave correctly when one node loses storage connectivity: that node is evicted and the other one continues to work after thread recovery.
Now I'm verifying what happens in case of a private network partition (i.e. split brain, in my 2-node case).
Where can I find the officially expected behavior for 12c? Any link is appreciated.
I found some test cases on the internet for 11gR2 RAC, but they don't seem to match what I'm experiencing in 12c.
My test case is this:
- on the master node racsvi2 (as reported by "oclumon manage -get master") I run ifdown on both slave interfaces of bond1, one at a time
(I know this is not the most correct workflow, but at the moment I can only proceed this way; next week the network guys should have some time for me)
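For reference, this is the sequence I follow, sketched as a dry-run script: the slave names eth2/eth3 are placeholders for my real NICs, and the wrapper only echoes each command instead of executing it, so the sketch is safe to paste.

```shell
#!/bin/sh
# Dry-run sketch of the partition test (eth2/eth3 are placeholder slave names).
run() { echo "+ $*"; }            # echoes instead of executing; drop it to run for real

run oclumon manage -get master    # confirm which node is the current master
run cat /proc/net/bonding/bond1   # note which slave is currently active
run ifdown eth2                   # active slave: bond fails over, no impact expected
run ifdown eth3                   # backup slave: private interconnect fully down
```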
The first ifdown (on the interface that was active in the active-backup bond configuration) causes no problems, as expected.
The second one triggers fencing/eviction of the other node: is this expected? Does the master always win in this scenario?
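One thing I still have to check is whether the survivor correlates with the CSS node number rather than with the master role: from what I've read about pre-12.2 behavior, when the two sub-clusters are the same size, the one containing the lowest node number should survive. A quick check (echoed, not executed):

```shell
run() { echo "+ $*"; }   # echo only; drop the wrapper to actually run it
run olsnodes -n -s       # node name, CSS node number and status for each node
```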
While the other node reboots, I reactivate the bond1 interfaces so that the private network is OK again for cluster rejoin.
With crs_stat I see that some crucial resources are OFFLINE on the surviving node (my db name is aul):
ora.aul.db
ora.oradg.acfsaul.acfs
ora.proxy_advm
ora.ORADG.ACFSAUL.advm (state detail is "Unable to connect to ASM")
ora.asm is in INTERMEDIATE
ora.racsvi1.vip is in INTERMEDIATE (state detail is "FAILED OVER")
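(As a side note, crs_stat is deprecated in 12c, so I'm cross-checking the same states with crsctl; a sketch, echoed rather than executed:)

```shell
run() { echo "+ $*"; }          # echo only; drop the wrapper to actually run it
run crsctl stat res -t          # tabular state of all cluster resources
run crsctl stat res ora.asm -v  # verbose state of one resource, including STATE_DETAILS
```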
So the db has gone down and stays in this state even after the fenced node rejoins the cluster.
The cluster VIP and one of the SCAN IPs come back to racsvi1, but I still have
OFFLINE on both nodes for
ora.aul.db
ora.oradg.acfsaul.acfs
ora.proxy_advm
ora.ORADG.ACFSAUL.advm
and OFFLINE for racsvi1:
ora.ORADG.dg
ora.OCR.dg
In this first test I recovered with
shutdown -r racsvi1
shutdown -h racsvi2
shutdown -r racsvi1
and after about 6-8 minutes everything came back up automatically.
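Next time, before resorting to reboots, I'll try restarting the stack by hand; a hypothetical sequence (echoed, not executed; the ACFS volume device path is elided because I don't have it at hand):

```shell
run() { echo "+ $*"; }                                     # echo only; drop the wrapper to actually run
run srvctl start asm -node racsvi1                         # get ASM out of INTERMEDIATE
run srvctl start diskgroup -diskgroup ORADG -node racsvi1  # mount the data diskgroup
run srvctl start filesystem -device /dev/asm/...           # real ACFS volume device goes here
run srvctl start database -db aul
```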
I suppose there is some resiliency problem in the ACFS component, because when the private network went completely down I got this in the alert.log of the surviving node:
Tue Sep 09 16:58:44 2014
SKGXP: ospid 7827: network interface with IP address 169.254.170.106 no longer running (check cable)
Tue Sep 09 16:59:17 2014
Reconfiguration started (old inc 12, new inc 14)
List of instances:
2 (myinst: 2)
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Tue Sep 09 16:59:17 2014
LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Tue Sep 09 16:59:17 2014
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Tue Sep 09 16:59:17 2014
minact-scn: master found reconf/inst-rec before recscn scan old-inc#:12 new-inc#:12
Tue Sep 09 16:59:17 2014
Post SMON to start 1st pass IR
Tue Sep 09 16:59:17 2014
Instance recovery: looking for dead threads
Tue Sep 09 16:59:17 2014
Submitted all GCS remote-cache requests
Fix write in gcs resources
Tue Sep 09 16:59:18 2014
Beginning instance recovery of 1 threads
Tue Sep 09 16:59:18 2014
Started redo scan
Tue Sep 09 16:59:18 2014
Reconfiguration complete (total time 0.1 secs)
Tue Sep 09 16:59:19 2014
Completed redo scan
read 2678 KB redo, 322 data blocks need recovery
Tue Sep 09 16:59:19 2014
Started redo application at
Thread 1: logseq 379, block 4704
Tue Sep 09 16:59:19 2014
Recovery of Online Redo Log: Thread 1 Group 1 Seq 379 Reading mem 0
Mem# 0: +ORADG/AUL/ONLINELOG/group_1.262.852290443
Tue Sep 09 16:59:19 2014
Completed redo application of 1.38MB
Tue Sep 09 16:59:19 2014
Completed instance recovery at
Thread 1: logseq 379, block 10061, scn 9205866376286
290 data blocks read, 345 data blocks written, 2678 redo k-bytes read
Tue Sep 09 16:59:19 2014
Thread 1 advanced to log sequence 380 (thread recovery)
Tue Sep 09 16:59:19 2014
Redo thread 1 internally disabled at seq 380 (SMON)
Tue Sep 09 16:59:19 2014
Errors in file /databases/app/oracle12/diag/rdbms/aul/AUL2/trace/AUL2_arc1_7920.trc:
ORA-07286: sksagdi: cannot obtain device information.
Linux-x86_64 Error: 5: Input/output error
Tue Sep 09 16:59:19 2014
Cannot translate archive destination string 'LOG_ARCHIVE_DEST_1'
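The ORA-07286 together with the "Cannot translate archive destination" message makes me think LOG_ARCHIVE_DEST_1, which points at the ACFS filesystem, became unreachable once ASM lost the interconnect. This is what I'd check (echoed, not executed; the SQL lines are what I'd run in sqlplus as sysdba):

```shell
run() { echo "+ $*"; }    # echo only; drop the wrapper to actually run it
run mount -t acfs         # is the ACFS filesystem still mounted?
# in sqlplus as sysdba:
#   show parameter log_archive_dest_1
#   select dest_name, status, error from v$archive_dest_status where dest_id = 1;
```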
Thanks in advance for pointers and hints.
Gianluca