Hello,
I'm doing some tests regarding HA behavior in case of failures.
I have a 2-node RAC, 12.1.0.1.0 SE, on Oracle Linux 6.5 (running kernel 2.6.39-400.215.3.el6uek.x86_64);
the hostnames are racsvi1 and racsvi2.
The public network is on bond0 (2 interfaces) and the private one on bond1 (2 other interfaces).
I have also configured ACFS, used for archived redo logs and DIRECTORY objects,
and fencing via IPMI on the iDRAC interfaces (the two nodes are Dell M610 blades).
I have verified that fencing and failover behave correctly when one node loses storage connectivity: that node is evicted and the other one continues to work after thread recovery.
Now I'm verifying what happens in case of a private network partition (i.e. split brain, in my 2-node case).
Where can I find the officially expected behavior for 12c? Any link is appreciated.
I found some test cases on the internet for 11gR2 RAC, but they don't seem to match what I'm experiencing in 12c.
My test case is this:
- on the master node racsvi2 (as reported by "oclumon manage -get master") I run ifdown on both slave interfaces of bond1, one at a time
(I know this is not the most correct workflow, but at the moment I can only proceed this way; next week the network guys should have some time for me)
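For reference, this is the sequence I follow, sketched as a dry-run script: the slave names eth2/eth3 are placeholders for my real NICs, and the wrapper only echoes each command instead of executing it, so the sketch is safe to paste.

```shell
#!/bin/sh
# Dry-run sketch of the partition test (eth2/eth3 are placeholder slave names).
run() { echo "+ $*"; }            # echoes instead of executing; drop it to run for real

run oclumon manage -get master    # confirm which node is the current master
run cat /proc/net/bonding/bond1   # note which slave is currently active
run ifdown eth2                   # active slave: bond fails over, no impact expected
run ifdown eth3                   # backup slave: private interconnect fully down
```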
The first ifdown (on the interface that was active in the active-backup bond configuration) causes no problems, as expected.
The second one triggers fencing/eviction of the other node: is this expected? Does the master always win in this scenario?
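One thing I still have to check is whether the survivor correlates with the CSS node number rather than with the master role: from what I've read about pre-12.2 behavior, when the two sub-clusters are the same size, the one containing the lowest node number should survive. A quick check (echoed, not executed):

```shell
run() { echo "+ $*"; }   # echo only; drop the wrapper to actually run it
run olsnodes -n -s       # node name, CSS node number and status for each node
```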
While the other node reboots, I reactivate the bond1 interfaces so that the private network is OK again for cluster rejoin.
With crs_stat I see that some crucial resources are OFFLINE on the surviving node (my db name is aul):
ora.aul.db
ora.oradg.acfsaul.acfs
ora.proxy_advm
ora.ORADG.ACFSAUL.advm (state detail is "Unable to connect to ASM")
ora.asm is in INTERMEDIATE
ora.racsvi1.vip is in INTERMEDIATE (state detail is "FAILED OVER")
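(As a side note, crs_stat is deprecated in 12c, so I'm cross-checking the same states with crsctl; a sketch, echoed rather than executed:)

```shell
run() { echo "+ $*"; }          # echo only; drop the wrapper to actually run it
run crsctl stat res -t          # tabular state of all cluster resources
run crsctl stat res ora.asm -v  # verbose state of one resource, including STATE_DETAILS
```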
So the db has gone down and stays in this state even after the fenced node rejoins the cluster.
The cluster VIP and one of the SCAN IPs come back to racsvi1, but I still have
OFFLINE on both nodes for
ora.aul.db
ora.oradg.acfsaul.acfs
ora.proxy_advm
ora.ORADG.ACFSAUL.advm
and OFFLINE for racsvi1:
ora.ORADG.dg
ora.OCR.dg
In this first test I recovered with
shutdown -r racsvi1
shutdown -h racsvi2
shutdown -r racsvi1
and after about 6-8 minutes everything came back up automatically.
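Next time, before resorting to reboots, I'll try restarting the stack by hand; a hypothetical sequence (echoed, not executed; the ACFS volume device path is elided because I don't have it at hand):

```shell
run() { echo "+ $*"; }                                     # echo only; drop the wrapper to actually run
run srvctl start asm -node racsvi1                         # get ASM out of INTERMEDIATE
run srvctl start diskgroup -diskgroup ORADG -node racsvi1  # mount the data diskgroup
run srvctl start filesystem -device /dev/asm/...           # real ACFS volume device goes here
run srvctl start database -db aul
```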
I suppose there is some resiliency problem in the ACFS component, because when the private network went completely down I got this in the alert.log of the surviving node:
Tue Sep 09 16:58:44 2014
SKGXP: ospid 7827: network interface with IP address 169.254.170.106 no longer running (check cable)
Tue Sep 09 16:59:17 2014
Reconfiguration started (old inc 12, new inc 14)
List of instances:
2 (myinst: 2)
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Tue Sep 09 16:59:17 2014
LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Tue Sep 09 16:59:17 2014
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Tue Sep 09 16:59:17 2014
minact-scn: master found reconf/inst-rec before recscn scan old-inc#:12 new-inc#:12
Tue Sep 09 16:59:17 2014
Post SMON to start 1st pass IR
Tue Sep 09 16:59:17 2014
Instance recovery: looking for dead threads
Tue Sep 09 16:59:17 2014
Submitted all GCS remote-cache requests
Fix write in gcs resources
Tue Sep 09 16:59:18 2014
Beginning instance recovery of 1 threads
Tue Sep 09 16:59:18 2014
Started redo scan
Tue Sep 09 16:59:18 2014
Reconfiguration complete (total time 0.1 secs)
Tue Sep 09 16:59:19 2014
Completed redo scan
read 2678 KB redo, 322 data blocks need recovery
Tue Sep 09 16:59:19 2014
Started redo application at
Thread 1: logseq 379, block 4704
Tue Sep 09 16:59:19 2014
Recovery of Online Redo Log: Thread 1 Group 1 Seq 379 Reading mem 0
Mem# 0: +ORADG/AUL/ONLINELOG/group_1.262.852290443
Tue Sep 09 16:59:19 2014
Completed redo application of 1.38MB
Tue Sep 09 16:59:19 2014
Completed instance recovery at
Thread 1: logseq 379, block 10061, scn 9205866376286
290 data blocks read, 345 data blocks written, 2678 redo k-bytes read
Tue Sep 09 16:59:19 2014
Thread 1 advanced to log sequence 380 (thread recovery)
Tue Sep 09 16:59:19 2014
Redo thread 1 internally disabled at seq 380 (SMON)
Tue Sep 09 16:59:19 2014
Errors in file /databases/app/oracle12/diag/rdbms/aul/AUL2/trace/AUL2_arc1_7920.trc:
ORA-07286: sksagdi: cannot obtain device information.
Linux-x86_64 Error: 5: Input/output error
Tue Sep 09 16:59:19 2014
Cannot translate archive destination string 'LOG_ARCHIVE_DEST_1'
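The ORA-07286 together with the "Cannot translate archive destination" message makes me think LOG_ARCHIVE_DEST_1, which points at the ACFS filesystem, became unreachable once ASM lost the interconnect. This is what I'd check (echoed, not executed; the SQL lines are what I'd run in sqlplus as sysdba):

```shell
run() { echo "+ $*"; }    # echo only; drop the wrapper to actually run it
run mount -t acfs         # is the ACFS filesystem still mounted?
# in sqlplus as sysdba:
#   show parameter log_archive_dest_1
#   select dest_name, status, error from v$archive_dest_status where dest_id = 1;
```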
Thanks in advance for pointers and hints.
Gianluca