Member death detection (Coh-3.5.3)

Jonathan.Knight, Sep 22 2010 (edited Sep 23 2010)
Hi,

Can someone explain death detection in 3.5.3, please? I have a reasonable idea of how it works:

A member suspects another member has departed, so it asks two other members for confirmation.
If these other members confirm the departure, the original member informs the rest of the cluster that the member has departed.
Wherever possible, the members asked to confirm the departure will have different roles from the member asking for confirmation.
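To make my understanding concrete, here is a minimal Java sketch of the flow as I have described it. It is not Coherence code: the `Member` class, `chooseWitnesses`, `shouldAnnounceDeparture` and the `requiredConfirmations` parameter are all hypothetical names I made up, and the number of confirmations actually required is exactly what I am unsure about.

```java
import java.util.ArrayList;
import java.util.List;

public class DepartureConfirmationSketch {

    /** Hypothetical stand-in for a cluster member; not a Coherence class. */
    static final class Member {
        final int id;
        final String role;

        Member(int id, String role) {
            this.id = id;
            this.role = role;
        }
    }

    /**
     * Picks up to 'count' witnesses, preferring members whose role differs
     * from the asker's role. This ordering is an assumption based on my
     * description above, not on documented behaviour.
     */
    static List<Member> chooseWitnesses(Member asker, Member suspect,
                                        List<Member> cluster, int count) {
        List<Member> preferred = new ArrayList<>();
        List<Member> fallback = new ArrayList<>();
        for (Member m : cluster) {
            if (m.id == asker.id || m.id == suspect.id) {
                continue; // neither the asker nor the suspect can be a witness
            }
            if (m.role.equals(asker.role)) {
                fallback.add(m);
            } else {
                preferred.add(m);
            }
        }
        preferred.addAll(fallback);
        return new ArrayList<>(preferred.subList(0, Math.min(count, preferred.size())));
    }

    /** The threshold is a parameter because the real rule is unknown to me. */
    static boolean shouldAnnounceDeparture(int confirmations, int requiredConfirmations) {
        return confirmations >= requiredConfirmations;
    }

    public static void main(String[] args) {
        Member asker = new Member(27, "RbsOdcCoreDaoODCCacheServer");
        Member suspect = new Member(16, "RbsOdcCoreDaoODCCacheServer");
        List<Member> cluster = new ArrayList<>();
        cluster.add(suspect);
        cluster.add(asker);
        cluster.add(new Member(83, "RbsOdcCoreDaoODCCacheServer"));
        cluster.add(new Member(85, "RbsOdcVestCoreVestMain"));

        for (Member w : chooseWitnesses(asker, suspect, cluster, 2)) {
            System.out.println("witness: Member " + w.id + " (" + w.role + ")");
        }
        System.out.println("remove with 1 of 2 confirmations, threshold 2: "
                + shouldAnnounceDeparture(1, 2));
        System.out.println("remove with 1 of 2 confirmations, threshold 1: "
                + shouldAnnounceDeparture(1, 1));
    }
}
```

If both witnesses had to confirm, I would expect the threshold-2 case to apply and the member to stay in the cluster after a single rejection.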

We occasionally lose storage nodes in this way, but I have some questions about what I am seeing in the logs below.
The scenario is this:

* Member 27 has a timeout sending a packet to Member 16
* Member 27 asks Member 83 and Member 85 to confirm departure of Member 16
* Member 83 rejects the confirmation request (@ 2010-09-22 05:21:43.411)
* Member 85 accepts the confirmation request (I assume it does as it has no rejection in its log)
* Member 27 informs the rest of the cluster that Member 16 has departed
* Member 1 (the senior member) heartbeats Member 16 causing it to re-initialise itself - it then rejoins as Member 127.
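For anyone wanting to line up the same sequence from their own logs, a small Java sketch along these lines could pull the timestamp, level, thread and member id out of each Coherence log line. The class name `CoherenceLogLine` and the regular expression are my own assumptions, based only on the line format visible in the logs in this post; other versions or log configurations may need a different pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CoherenceLogLine {

    // Pattern inferred from the log lines in this post, for example:
    // 2010-09-22 05:21:43.411/648845.780 Oracle Coherence GE 3.5.3/465p2 <Info> (thread=Cluster, member=83): ...
    private static final Pattern HEADER = Pattern.compile(
            "(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\.\\d{3})/(\\d+\\.\\d+) "
            + "Oracle Coherence \\S+ \\S+ <(\\w+)> "
            + "\\(thread=([^,]+), member=(\\d+)\\): (.*)");

    final String timestamp;
    final String level;
    final String thread;
    final int member;
    final String message;

    private CoherenceLogLine(String timestamp, String level, String thread,
                             int member, String message) {
        this.timestamp = timestamp;
        this.level = level;
        this.thread = thread;
        this.member = member;
        this.message = message;
    }

    /**
     * Returns the parsed line, or null if the line is not a Coherence log
     * line (GC output, for instance). find() is used rather than matches()
     * so that lines wrapped by another logger's prefix are still recognised.
     */
    static CoherenceLogLine parse(String line) {
        Matcher m = HEADER.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new CoherenceLogLine(m.group(1), m.group(3), m.group(4),
                Integer.parseInt(m.group(5)), m.group(6));
    }

    public static void main(String[] args) {
        String line = "2010-09-22 05:21:43.411/648845.780 Oracle Coherence GE 3.5.3/465p2 "
                + "<Info> (thread=Cluster, member=83): Rejecting the departure confirmation request";
        CoherenceLogLine parsed = parse(line);
        System.out.println(parsed.timestamp + " member=" + parsed.member
                + " level=" + parsed.level + " " + parsed.message);
    }
}
```

Sorting the parsed lines from all members by timestamp gives a single timeline, though clock skew between machines means events within the same millisecond or two cannot be ordered reliably.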

My question is this: given that Member 83 rejected the confirmation request, I assume it could still see Member 16. What exactly are the rules for forcing a member to depart the cluster when this happens?
The nodes asked to confirm departure were another storage node (which rejected the request) and a storage-disabled worker node (which accepted the request).
These storage-disabled nodes can sometimes be under reasonable load, so they might not be the best ones to ask to confirm departure.

What happens when only one of the two members confirms departure?
Can we choose which roles get asked to confirm departure?

Member 27
2010-09-22 05:21:43.410/648886.462 Oracle Coherence GE 3.5.3/465p2 <Warning> (thread=PacketPublisher, member=27): Timeout while delivering a packet Directed{PacketType=0x0DDF00D5, ToId=16, FromId=27, Direction=Outgoing, SentCount=79, SentMillis=05:21:43.111, ToMemberSet=null, ServiceId=7, MessageType=16, FromMessageId=32360401, ToMessageId=1730276, MessagePartCount=1, MessagePartIndex=0, NackInProgress=false, ResendScheduled=05:21:43.311, Timeout=05:21:43.15, PendingResendSkips=0, DeliveryState=outstanding, Body=0x0034D45C01001B012B110B545A001B012B110B5459004C021564BEA9FC8FE2CA80014C230D992515A16200A501843100004E084744532047424C4F40A6014E063834353235374000004CA90215A06200A401945F00A201BE2000A4014219A501A16200A501843100004E084744532047424C4F40A6014E063834353235374040..., Body.length=1445}; requesting the departure confirmation for Member(Id=16, Timestamp=2010-09-14 17:06:52.289, Address=xx.xxx.34.93:8088, MachineId=35165, Location=machine:xxxxx06428,process:8645,member:xxxxx06428:Data-2, Role=RbsOdcCoreDaoODCCacheServer)
by MemberSet(Size=2, BitSetCount=4
  Member(Id=83, Timestamp=2010-09-14 17:07:39.704, Address=xx.xxx.34.97:8091, MachineId=35169, Location=machine:xxxxx06432,process:25212,member:xxxxx06432:Data-6, Role=RbsOdcCoreDaoODCCacheServer)
   Member(Id=85, Timestamp=2010-09-21 00:43:23.208, Address=xx.xxx.34.88:8090, MachineId=35160, Location=machine:xxxxx06441,process:10718,member:xxxxx06441:VEST-6, Role=RbsOdcVestCoreVestMain)
   )
 2010-09-22 05:21:43,412 [Logger@9227652 3.5.3/465p2] INFO  Coherence - 2010-09-22 05:21:43.411/648886.463 Oracle Coherence GE 3.5.3/465p2 <Info> (thread=Cluster, member=27): Member departure confirmed by MemberSet(Size=1, BitSetCount=4
   Member(Id=85, Timestamp=2010-09-21 00:43:23.208, Address=xx.xxx.34.88:8090, MachineId=35160, Location=machine:xxxxx06441,process:10718,member:xxxxx06441:VEST-6, Role=RbsOdcVestCoreVestMain)
  ); removing Member(Id=16, Timestamp=2010-09-14 17:06:52.289, Address=xx.xxx.34.93:8088, MachineId=35165, Location=machine:xxxxx06428,process:8645,member:xxxxx06428:Data-2, Role=RbsOdcCoreDaoODCCacheServer)
2010-09-22 05:21:43.412/648886.464 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=27): Member 16 left service Management with senior member 1
Member 83
2010-09-22 05:19:06.003/648688.372 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=83): Member 100 joined Service PutAllInvocationService with senior member 1
648726.432: [GC 648726.432: [ParNew: 186837K->14089K(191744K), 0.0153790 secs] 2272524K->2100231K(2538752K), 0.0155370 secs] [Times: user=0.19 sys=0.00, real=0.01 secs] 
648782.121: [GC 648782.121: [ParNew: 184585K->8246K(191744K), 0.0077860 secs] 2270727K->2094904K(2538752K), 0.0079440 secs] [Times: user=0.09 sys=0.00, real=0.01 secs] 
648784.254: [GC 648784.254: [ParNew: 178742K->11283K(191744K), 0.0231890 secs] 2265400K->2097941K(2538752K), 0.0232940 secs] [Times: user=0.18 sys=0.00, real=0.02 secs] 
648840.470: [GC 648840.470: [ParNew: 180452K->8909K(191744K), 0.0078950 secs] 2267110K->2095568K(2538752K), 0.0080540 secs] [Times: user=0.06 sys=0.01, real=0.01 secs] 
648842.869: [GC 648842.869: [ParNew: 179405K->9775K(191744K), 0.0189500 secs] 2266064K->2096433K(2538752K), 0.0190970 secs] [Times: user=0.21 sys=0.00, real=0.02 secs] 
2010-09-22 05:21:43.411/648845.780 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=83): MemberLeft request for Member 16 received from Member(Id=27, Timestamp=2010-09-14 17:06:58.895, Address=xx.xxx.34.95:8090, MachineId=35167, Location=machine:xxxxx06430,process:8662,member:xxxxx06430:Data-1, Role=RbsOdcCoreDaoODCCacheServer)
2010-09-22 05:21:43.411/648845.780 Oracle Coherence GE 3.5.3/465p2 <Info> (thread=Cluster, member=83): Rejecting the departure confirmation request by Member(Id=27, Timestamp=2010-09-14 17:06:58.895, Address=xx.xxx.34.95:8090, MachineId=35167, Location=machine:xxxxx06430,process:8662,member:xxxxx06430:Data-1, Role=RbsOdcCoreDaoODCCacheServer) regarding Member(Id=16, Timestamp=2010-09-14 17:06:52.289, Address=xx.xxx.34.93:8088, MachineId=35165, Location=machine:xxxxx06428,process:8645,member:xxxxx06428:Data-2, Role=RbsOdcCoreDaoODCCacheServer)
2010-09-22 05:21:43.413/648845.782 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=83): MemberLeft notification for Member 16 received from Member(Id=27, Timestamp=2010-09-14 17:06:58.895, Address=xx.xxx.34.95:8090, MachineId=35167, Location=machine:xxxxx06430,process:8662,member:xxxxx06430:Data-1, Role=RbsOdcCoreDaoODCCacheServer)
2010-09-22 05:21:43.413/648845.782 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=83): Member 16 left service Management with senior member 1
Member 85
2010-09-22 05:19:05.894/102952.247 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=85): Member 100 joined Service distributed-pof-service with senior member 1
2010-09-22 05:19:06.003/102952.356 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=85): Member 100 joined Service PutAllInvocationService with senior member 1
09/22/10 05:21:07.667 INFO: [ProcessWrapper] [STDOUT] 103074.362: [GC 103074.362: [ParNew: 225038K->3371K(249216K), 0.0025490 secs] 782627K->560965K(2069504K), 0.0026190 secs] [Times: user=0.02 sys=0.00, real=0.00 secs] 
2010-09-22 05:21:43.411/103109.764 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=85): MemberLeft request for Member 16 received from Member(Id=27, Timestamp=2010-09-14 17:06:58.895, Address=xx.xxx.34.95:8090, MachineId=35167, Location=machine:xxxxx06430,process:8662,member:xxxxx06430:Data-1, Role=RbsOdcCoreDaoODCCacheServer)
2010-09-22 05:21:43.412/103109.765 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=85): MemberLeft notification for Member 16 received from Member(Id=27, Timestamp=2010-09-14 17:06:58.895, Address=xx.xxx.34.95:8090, MachineId=35167, Location=machine:xxxxx06430,process:8662,member:xxxxx06430:Data-1, Role=RbsOdcCoreDaoODCCacheServer)
2010-09-22 05:21:43.412/103109.765 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=85): Member 16 left service Management with senior member 1
Member 16
09/22/10 05:11:03.478 INFO: [ProcessWrapper] [STDOUT] 648253.507: [GC 648253.507: [ParNew: 173464K->2443K(191744K), 0.0128920 secs] 1401433K->1230671K(2538752K), 0.0130370 secs] [Times: user=0.06 sys=0.00, real=0.01 secs] 
09/22/10 05:12:18.307 INFO: [ProcessWrapper] [STDOUT] 648328.337: [GC 648328.337: [ParNew: 172939K->2972K(191744K), 0.0108240 secs] 1401167K->1231401K(2538752K), 0.0109550 secs] [Times: user=0.05 sys=0.01, real=0.01 secs] 
09/22/10 05:13:30.532 INFO: [ProcessWrapper] [STDOUT] 648400.564: [GC 648400.564: [ParNew: 173468K->2582K(191744K), 0.0095490 secs] 1401897K->1231266K(2538752K), 0.0097180 secs] [Times: user=0.07 sys=0.00, real=0.01 secs] 
09/22/10 05:14:51.958 INFO: [ProcessWrapper] [STDOUT] 648481.990: [GC 648481.990: [ParNew: 173078K->2969K(191744K), 0.0087620 secs] 1401762K->1231877K(2538752K), 0.0088810 secs] [Times: user=0.06 sys=0.00, real=0.01 secs] 
09/22/10 05:16:12.607 INFO: [ProcessWrapper] [STDOUT] 648562.641: [GC 648562.641: [ParNew: 173465K->2798K(191744K), 0.0067770 secs] 1402373K->1231945K(2538752K), 0.0069300 secs] [Times: user=0.07 sys=0.01, real=0.01 secs] 
09/22/10 05:17:23.249 INFO: [ProcessWrapper] [STDOUT] 648633.284: [GC 648633.284: [ParNew: 173294K->2824K(191744K), 0.0064570 secs] 1402441K->1232187K(2538752K), 0.0065950 secs] [Times: user=0.06 sys=0.00, real=0.01 secs] 
09/22/10 05:18:43.913 INFO: [ProcessWrapper] [STDOUT] 648713.948: [GC 648713.948: [ParNew: 173320K->2812K(191744K), 0.0065200 secs] 1402683K->1232354K(2538752K), 0.0066450 secs] [Times: user=0.05 sys=0.01, real=0.00 secs] 
2010-09-22 05:19:05.894/648735.426 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=16): Member 100 joined Service distributed-pof-service with senior member 1
2010-09-22 05:19:06.003/648735.535 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=16): Member 100 joined Service PutAllInvocationService with senior member 1
09/22/10 05:21:30.919 INFO: [ProcessWrapper] [STDOUT] 648880.948: [GC 648880.948: [ParNew: 173272K->3255K(191744K), 0.0126250 secs] 1403096K->1233261K(2538752K), 0.0127910 secs] [Times: user=0.07 sys=0.00, real=0.01 secs] 
2010-09-22 05:21:44.005/648893.537 Oracle Coherence GE 3.5.3/465p2 <Error> (thread=Cluster, member=16): Received cluster heartbeat from the senior Member(Id=1, Timestamp=2010-09-14 17:06:38.751, Address=xx.xxx.34.98:8088, MachineId=35170, Location=machine:xxxxx06433,process:20795,member:xxxxx06433:Data-1, Role=RbsOdcCoreDaoODCCacheServer) that does not contain this Member(Id=16, Timestamp=2010-09-14 17:06:52.289, Address=xx.xxx.34.93:8088, MachineId=35165, Location=machine:xxxxx06428,process:8645,member:xxxxx06428:Data-2, Role=RbsOdcCoreDaoODCCacheServer); stopping cluster service.
2010-09-22 05:21:44.005/648893.537 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=16): Service Cluster left the cluster
Any more information would be appreciated (or any settings I can tweak).

Cheers,
JK
Post Details
Locked on Oct 21 2010
Added on Sep 22 2010