We encountered a problem that a TCP extend node restart it's cluster server but failed to restart the extend service.
1. We saw lot's of "Scheduled senior member heartbeat is overdue; rejoining multicast group" message. This cluster has been
running with multicast all the time and it ran fine even with those message logged from time to time. Is that something we
need to worry about?
2. From the log attached below, around 11:15:04, the Cluster service got stopped because of a socket failure. What's possible
root cause? This server is dedicated to run Coherence cluster, there are no other applications coexist on same server. This
extend node has been running a few days before this exception occured.
3. Around 11:15:05, the Cluster service has been restarted. However, not the TCP extend proxy service. Is this expected
behavior? If it is, how do we ensure that when Coherence restart Cluster service itself, it will also startup tcp extend service
specified in the cache config (the autostart flag is true)?
Regards,
Chen
2010-09-04 10:59:45,822|INFO|Coherence|2010-09-04 10:59:45.817/70285.741 Oracle Coherence EE 3.5.3/465p6 <Info> (thread=PacketListenerN, member=1): Scheduled senior member heartbeat is overdue; rejoining multicast group.
2010-09-04 10:59:45,843|WARN|Coherence|2010-09-04 10:59:45.843/70285.767 Oracle Coherence EE 3.5.3/465p6 <Warning> (thread=PacketPublisher, member=1): Experienced a 43761 ms communication delay (probable remote GC) with Member(Id=2, Timestamp=2010-09-04 09:18:54.931, Address=101.221.31.172:8095, MachineId=54344, Location=site:wlan.com,machine:c86,process:16852, Role=Node.102462030030000072); 67 packets rescheduled, PauseRate=0.075, Threshold=1681
2010-09-04 11:01:15,386|WARN|Coherence|2010-09-04 11:01:15.385/70375.309 Oracle Coherence EE 3.5.3/465p6 <Warning> (thread=PacketPublisher, member=1): Experienced a 3991 ms communication delay (probable remote GC) with Member(Id=9, Timestamp=2010-09-04 10:35:58.051, Address=101.221.31.172:8089, MachineId=54344, Location=site:wlan.com,machine:c86,process:18521, Role=Node.102462030030000075); 36 packets rescheduled, PauseRate=0.0114, Threshold=1406
2010-09-04 11:02:06,147|WARN|Coherence|2010-09-04 11:02:06.147/70426.071 Oracle Coherence EE 3.5.3/465p6 <Warning> (thread=PacketPublisher, member=1): Experienced a 1570 ms communication delay (probable remote GC) with Member(Id=8, Timestamp=2010-09-04 10:30:00.226, Address=101.221.31.172:8091, MachineId=54344, Location=site:wlan.com,machine:c86,process:16501, Role=JPY); 24 packets rescheduled, PauseRate=8.0E-4, Threshold=1188
2010-09-04 11:02:13,699|WARN|Coherence|2010-09-04 11:02:13.698/70433.622 Oracle Coherence EE 3.5.3/465p6 <Warning> (thread=PacketPublisher, member=1): Experienced a 1702 ms communication delay (probable remote GC) with Member(Id=5, Timestamp=2010-09-03 15:31:02.765, Address=101.221.31.172:8092, MachineId=54344, Location=site:wlan.com,machine:c86,process:16501, Role=EL); 25 packets rescheduled, PauseRate=0.0097, Threshold=190
2010-09-04 11:04:46,106|INFO|Coherence|2010-09-04 11:04:46.104/70586.028 Oracle Coherence EE 3.5.3/465p6 <Info> (thread=PacketListenerN, member=1): Scheduled senior member heartbeat is overdue; rejoining multicast group.
2010-09-04 11:04:46,131|WARN|Coherence|2010-09-04 11:04:46.131/70586.055 Oracle Coherence EE 3.5.3/465p6 <Warning> (thread=PacketPublisher, member=1): Experienced a 43200 ms communication delay (probable remote GC) with Member(Id=2, Timestamp=2010-09-04 09:18:54.931, Address=101.221.31.172:8095, MachineId=54344, Location=site:wlan.com,machine:c86,process:16852, Role=Node.102462030030000072); 85 packets rescheduled, PauseRate=0.0782, Threshold=1681
2010-09-04 11:06:15,525|WARN|Coherence|2010-09-04 11:06:15.524/70675.448 Oracle Coherence EE 3.5.3/465p6 <Warning> (thread=PacketPublisher, member=1): Experienced a 11352 ms communication delay (probable remote GC) with Member(Id=9, Timestamp=2010-09-04 10:35:58.051, Address=101.221.31.172:8089, MachineId=54344, Location=site:wlan.com,machine:c86,process:18521, Role=Node.102462030030000075); 81 packets rescheduled, PauseRate=0.0158, Threshold=1336
2010-09-04 11:09:46,004|INFO|Coherence|2010-09-04 11:09:46.002/70885.926 Oracle Coherence EE 3.5.3/465p6 <Info> (thread=PacketListenerN, member=1): Scheduled senior member heartbeat is overdue; rejoining multicast group.
2010-09-04 11:11:15,524|WARN|Coherence|2010-09-04 11:11:15.523/70975.447 Oracle Coherence EE 3.5.3/465p6 <Warning> (thread=PacketPublisher, member=1): Experienced a 3993 ms communication delay (probable remote GC) with Member(Id=9, Timestamp=2010-09-04 10:35:58.051, Address=101.221.31.172:8089, MachineId=54344, Location=site:wlan.com,machine:c86,process:18521, Role=Node.102462030030000075); 36 packets rescheduled, PauseRate=0.0154, Threshold=1270
2010-09-04 11:15:04,227|ERROR|Coherence|2010-09-04 11:15:04.226/71204.150 Oracle Coherence EE 3.5.3/465p6 <Error> (thread=TcpRingListener, member=1): Stopping cluster due to unhandled exception: com.tangosol.net.messaging.ConnectionException: TcpSocketAccepter.accept: unable to reopen socket; State=STATE_CLOSED
at com.tangosol.coherence.component.net.socket.TcpSocketAccepter.accept(TcpSocketAccepter.CDB:47)
at com.tangosol.coherence.component.util.daemon.TcpRingListener.acceptConnection(TcpRingListener.CDB:10)
at com.tangosol.coherence.component.util.daemon.TcpRingListener.onNotify(TcpRingListener.CDB:9)
at com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:42)
at java.lang.Thread.run(Unknown Source)
2010-09-04 11:15:04,228|ERROR|Coherence|2010-09-04 11:15:04.227/71204.151 Oracle Coherence EE 3.5.3/465p6 <Error> (thread=PacketPublisher, member=1): Failed to send diagnostic packet DiagnosticPacket{PacketType=0x0DDF00D0, ToId=2, FromId=1, Direction=Outgoing, SentCount=0, SentMillis=11:15:04.226, TTL=1} due to exception com.tangosol.net.messaging.ConnectionException: UdpSocket.send: unable to reopen socket; State=STATE_CLOSED
at com.tangosol.coherence.component.net.socket.UdpSocket.send(UdpSocket.CDB:64)
at com.tangosol.coherence.component.net.udpPacket.OutgoingUdpPacket.send(OutgoingUdpPacket.CDB:10)
at com.tangosol.coherence.component.util.daemon.queueProcessor.packetProcessor.PacketPublisher.sendDiagnosticPacket(PacketPublisher.CDB:34)
at com.tangosol.coherence.component.util.daemon.queueProcessor.packetProcessor.PacketPublisher$ResendQueue.onPacketLost(PacketPublisher.CDB:51)
at com.tangosol.coherence.component.util.daemon.queueProcessor.packetProcessor.PacketPublisher$ResendQueue.onPacketLost(PacketPublisher.CDB:27)
at com.tangosol.coherence.component.net.Cluster$PacketPublisher$ResendQueue.onPacketLost(Cluster.CDB:1)
at com.tangosol.coherence.component.util.daemon.queueProcessor.packetProcessor.PacketPublisher$ResendQueue.removeNoWait(PacketPublisher.CDB:61)
at com.tangosol.coherence.component.util.daemon.queueProcessor.packetProcessor.PacketPublisher.onNotify(PacketPublisher.CDB:40)
at com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:42)
at java.lang.Thread.run(Unknown Source)
2010-09-04 11:15:04,242|ERROR|Coherence|2010-09-04 11:15:04.228/71204.152 Oracle Coherence EE 3.5.3/465p6 <Error> (thread=Proxy:ExtendTcpPofProxyService:TcpAcceptor, member=1): This thread was interrupted while waiting for the results of a request:
Poll
{
PollId=8962, active
InitTimeMillis=1283616840014
Service=DistributedCachePofStaging (12)
RespondedMemberSet=[]
LeftMemberSet=[]
RemainingMemberSet=[2]
},
2010-09-04 11:15:04,244|INFO|Coherence|2010-09-04 11:15:04.228/71204.152 Oracle Coherence EE 3.5.3/465p6 <Info> (thread=WrapperSimpleAppMain, member=1): Restarting Service: Management
2010-09-04 11:15:04,245|ERROR|Coherence|2010-09-04 11:15:04.232/71204.156 Oracle Coherence EE 3.5.3/465p6 <Error> (thread=PacketPublisher, member=1): Failed to send diagnostic packet DiagnosticPacket{PacketType=0x0DDF00D0, ToId=2, FromId=1, Direction=Outgoing, SentCount=0, SentMillis=11:15:04.227, TTL=1} due to exception com.tangosol.net.messaging.ConnectionException: UdpSocket.send: unable to reopen socket; State=STATE_CLOSED
at com.tangosol.coherence.component.net.socket.UdpSocket.send(UdpSocket.CDB:64)
at com.tangosol.coherence.component.net.udpPacket.OutgoingUdpPacket.send(OutgoingUdpPacket.CDB:10)
at com.tangosol.coherence.component.util.daemon.queueProcessor.packetProcessor.PacketPublisher.sendDiagnosticPacket(PacketPublisher.CDB:34)
at com.tangosol.coherence.component.util.daemon.queueProcessor.packetProcessor.PacketPublisher$ResendQueue.onPacketLost(PacketPublisher.CDB:51)
at com.tangosol.coherence.component.util.daemon.queueProcessor.packetProcessor.PacketPublisher$ResendQueue.onPacketLost(PacketPublisher.CDB:27)
at com.tangosol.coherence.component.net.Cluster$PacketPublisher$ResendQueue.onPacketLost(Cluster.CDB:1)
at com.tangosol.coherence.component.util.daemon.queueProcessor.packetProcessor.PacketPublisher$ResendQueue.removeNoWait(PacketPublisher.CDB:61)
at com.tangosol.coherence.component.util.daemon.queueProcessor.packetProcessor.PacketPublisher.onNotify(PacketPublisher.CDB:40)
at com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:42)
at java.lang.Thread.run(Unknown Source)
2010-09-04 11:15:04,245|ERROR|Coherence|2010-09-04 11:15:04.242/71204.166 Oracle Coherence EE 3.5.3/465p6 <Error> (thread=DistributedCache:DistributedCachePof, member=1): validatePolls: This service timed-out due to unanswered handshake request. Manual intervention is required to stop the members that have not responded to this Poll
{
PollId=1182724, active
InitTimeMillis=1283616845038
Service=DistributedCachePof (7)
RespondedMemberSet=[]
LeftMemberSet=[]
RemainingMemberSet=[2]
}
Request=Message "SizeRequest"
{
FromMember=Member(Id=1, Timestamp=2010-09-03 15:28:28.071, Address=101.221.31.172:8088, MachineId=54344, Location=site:wlan.com,machine:c86,process:16471, Role=CMP.c86.3006075453)
FromMessageId=5695925
Internal=false
MessagePartCount=1
PendingCount=1
MessageType=22
ToPollId=0
Poll=null
Packets
{
[000]=Directed{PacketType=0x0DDF00D5, ToId=2, FromId=1, Direction=Outgoing, SentCount=5, SentMillis=11:14:46.297, ToMemberSet=null, ServiceId=7, MessageType=22, FromMessageId=5695925, ToMessageId=125055, MessagePartCount=1, MessagePartIndex=0, NackInProgress=false, ResendScheduled=11:14:46.497, Timeout=11:15:05.39, PendingResendSkips=0, DeliveryState=deferred, Body=0x00120C04009CE3CD94DBCA800101019102FFFFFFFF9FFFFFF8000002C06210DA1FF00000000000000000000003FFFFFFFF33FFFC00020000000000000000010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000..., Body.length=1445}
}
Service=DistributedCache{Name=DistributedCachePof, State=(SERVICE_STOPPED), Not initialized}
ToMemberSet=MemberSet(Size=1, BitSetCount=1
Member(Id=2, Timestamp=2010-09-04 09:18:54.931, Address=101.221.31.172:8095, MachineId=54344, Location=site:wlan.com,machine:c86,process:16852, Role=Node.102462030030000072)
)
NotifySent=false
}
2010-09-04 11:15:05,352|INFO|Coherence|2010-09-04 11:15:05.351/71205.275 Oracle Coherence EE 3.5.3/465p6 <Info> (thread=WrapperSimpleAppMain, member=n/a): Restarting cluster
2010-09-04 11:15:06,298|INFO|Coherence|2010-09-04 11:15:06.297/71206.221 Oracle Coherence EE 3.5.3/465p6 <Info> (thread=Cluster, member=n/a): Failed to satisfy the variance: allowed=16, actual=69
2010-09-04 11:15:06,298|INFO|Coherence|2010-09-04 11:15:06.297/71206.221 Oracle Coherence EE 3.5.3/465p6 <Info> (thread=Cluster, member=n/a): Increasing allowable variance to 22
2010-09-04 11:15:07,569|INFO|Coherence|2010-09-04 11:15:07.569/71207.493 Oracle Coherence EE 3.5.3/465p6 <Info> (thread=Cluster, member=n/a): This Member(Id=4, Timestamp=2010-09-04 11:15:07.322, Address=101.221.31.172:8090, MachineId=54344, Location=site:wlan.com,machine:c86,process:16471, Role=CMP.c86.3006075453, Edition=Enterprise Edition, Mode=Production, CpuCount=64, SocketCount=64, PublisherSuccessRate=1.0, ReceiverSuccessRate=1.0, PauseRate=0.0, Threshold=2080, Paused=false, Deferring=false, OutstandingPackets=0, DeferredPackets=0, ReadyPackets=0, LastIn=n/a, LastOut=n/a, LastSlow=n/a) joined cluster "EOP" with senior Member(Id=5, Timestamp=2010-09-03 15:31:02.765, Address=101.221.31.172:8092, MachineId=54344, Location=site:wlan.com,machine:c86,process:16501, Role=EL, Edition=Enterprise Edition, Mode=Production, CpuCount=64, SocketCount=64, PublisherSuccessRate=1.0, ReceiverSuccessRate=1.0, PauseRate=0.0, Threshold=2080, Paused=false, Deferring=false, OutstandingPackets=1, DeferredPackets=0, ReadyPackets=0, LastIn=n/a, LastOut=33ms, LastSlow=n/a)
2010-09-04 11:15:09,329|INFO|Coherence|2010-09-04 11:15:09.328/71209.252 Oracle Coherence EE 3.5.3/465p6 <Info> (thread=WrapperSimpleAppMain, member=4): Restarting Service: Management
2010-09-04 11:16:07,858|ERROR|Coherence|2010-09-04 11:16:07.858/71267.782 Oracle Coherence EE 3.5.3/465p6 <Error> (thread=Invocation:Management, member=4): Attempting recovery (due to soft timeout) of Guard{Daemon=Invocation:Management:EventDispatcher}
2010-09-04 11:16:12,116|WARN|ExecOSCommand|Node.102462030030000072 : Coherence <Error>: Halting this cluster node due to unrecoverable service failure
2010-09-04 11:16:14,370|ERROR|Coherence|2010-09-04 11:16:14.364/71274.288 Oracle Coherence EE 3.5.3/465p6 <Error> (thread=Invocation:Management, member=4): Terminating guarded execution (due to hard timeout) of Guard{Daemon=Invocation:Management:EventDispatcher}
2010-09-04 11:17:12,841|WARN|Coherence|2010-09-04 11:17:12.840/71332.764 Oracle Coherence EE 3.5.3/465p6 <Warning> (thread=PacketPublisher, member=4): Timeout while delivering a packet; requesting the departure confirmation for Member(Id=2, Timestamp=2010-09-04 09:18:54.931, Address=101.221.31.172:8095, MachineId=54344, Location=site:wlan.com,machine:c86,process:16852, Role=Node.102462030030000072, PublisherSuccessRate=0.0454, ReceiverSuccessRate=1.0, PauseRate=0.4507, Threshold=2080, Paused=true, Deferring=false, OutstandingPackets=1, DeferredPackets=0, ReadyPackets=0, LastIn=69072ms, LastOut=120ms, LastSlow=n/a)
by MemberSet(Size=2, BitSetCount=2
Member(Id=5, Timestamp=2010-09-03 15:31:02.765, Address=101.221.31.172:8092, MachineId=54344, Location=site:wlan.com,machine:c86,process:16501, Role=EL)
Member(Id=6, Timestamp=2010-09-03 15:31:05.89, Address=101.221.31.172:8093, MachineId=54344, Location=site:wlan.com,machine:c86,process:16501, Role=ECP)
)
2010-09-04 11:17:12,844|INFO|Coherence|2010-09-04 11:17:12.844/71332.768 Oracle Coherence EE 3.5.3/465p6 <Info> (thread=Cluster, member=4): Member departure confirmed by MemberSet(Size=2, BitSetCount=2
Member(Id=5, Timestamp=2010-09-03 15:31:02.765, Address=101.221.31.172:8092, MachineId=54344, Location=site:wlan.com,machine:c86,process:16501, Role=EL)
Member(Id=6, Timestamp=2010-09-03 15:31:05.89, Address=101.221.31.172:8093, MachineId=54344, Location=site:wlan.com,machine:c86,process:16501, Role=ECP)
); removing Member(Id=2, Timestamp=2010-09-04 09:18:54.931, Address=101.221.31.172:8095, MachineId=54344, Location=site:wlan.com,machine:c86,process:16852, Role=Node.102462030030000072)
2010-09-04 11:21:26,283|WARN|Coherence|2010-09-04 11:21:26.283/71586.207 Oracle Coherence EE 3.5.3/465p6 <Warning> (thread=PacketPublisher, member=4): Experienced a 12587 ms communication delay (probable remote GC) with Member(Id=9, Timestamp=2010-09-04 10:35:58.051, Address=101.221.31.172:8089, MachineId=54344, Location=site:wlan.com,machine:c86,process:18521, Role=Node.102462030030000075); 76 packets rescheduled, PauseRate=0.0332, Threshold=2080
2010-09-04 11:26:25,275|WARN|Coherence|2010-09-04 11:26:25.275/71885.199 Oracle Coherence EE 3.5.3/465p6 <Warning> (thread=PacketPublisher, member=4): Experienced a 11241 ms communication delay (probable remote GC) with Member(Id=9, Timestamp=2010-09-04 10:35:58.051, Address=101.221.31.172:8089, MachineId=54344, Location=site:wlan.com,machine:c86,process:18521, Role=Node.102462030030000075); 71 packets rescheduled, PauseRate=0.0351, Threshold=1976
2010-09-04 11:31:25,288|WARN|Coherence|2010-09-04 11:31:25.288/72185.212 Oracle Coherence EE 3.5.3/465p6 <Warning> (thread=PacketPublisher, member=4): Experienced a 15727 ms communication delay (probable remote GC) with Member(Id=9, Timestamp=2010-09-04 10:35:58.051, Address=101.221.31.172:8089, MachineId=54344, Location=site:wlan.com,machine:c86,process:18521, Role=Node.102462030030000075); 92 packets rescheduled, PauseRate=0.0404, Threshold=1878
2010-09-04 11:36:22,840|WARN|Coherence|2010-09-04 11:36:22.840/72482.764 Oracle Coherence EE 3.5.3/465p6 <Warning> (thread=PacketPublisher, member=4): Experienced a 13000 ms communication delay (probable remote GC) with Member(Id=9, Timestamp=2010-09-04 10:35:58.051, Address=101.221.31.172:8089, MachineId=54344, Location=site:wlan.com,machine:c86,process:18521, Role=Node.102462030030000075); 79 packets rescheduled, PauseRate=0.0412, Threshold=1785
2010-09-04 11:41:25,648|WARN|Coherence|2010-09-04 11:41:25.648/72785.572 Oracle Coherence EE 3.5.3/465p6 <Warning> (thread=PacketPublisher, member=4): Experienced a 11631 ms communication delay (probable remote GC) with Member(Id=9, Timestamp=2010-09-04 10:35:58.051, Address=101.221.31.172:8089, MachineId=54344, Location=site:wlan.com,machine:c86,process:18521, Role=Node.102462030030000075); 72 packets rescheduled, PauseRate=0.0406, Threshold=1696
2010-09-04 11:46:21,036|WARN|Coherence|2010-09-04 11:46:21.035/73080.959 Oracle Coherence EE 3.5.3/465p6 <Warning> (thread=PacketPublisher, member=4): Experienced a 4411 ms communication delay (probable remote GC) with Member(Id=9, Timestamp=2010-09-04 10:35:58.051, Address=101.221.31.172:8089, MachineId=54344, Location=site:wlan.com,machine:c86,process:18521, Role=Node.102462030030000075); 38 packets rescheduled, PauseRate=0.0366, Threshold=1612
2010-09-04 11:51:22,423|WARN|Coherence|2010-09-04 11:51:22.423/73382.347 Oracle Coherence EE 3.5.3/465p6 <Warning> (thread=PacketPublisher, member=4): Experienced a 8841 ms communication delay (probable remote GC) with Member(Id=9, Timestamp=2010-09-04 10:35:58.051, Address=101.221.31.172:8089, MachineId=54344, Location=site:wlan.com,machine:c86,process:18521, Role=Node.102462030030000075); 59 packets rescheduled, PauseRate=0.0356, Threshold=1532
2010-09-04 12:01:21,139|WARN|Coherence|2010-09-04 12:01:21.139/73981.063 Oracle Coherence EE 3.5.3/465p6 <Warning> (thread=PacketPublisher, member=4): Experienced a 3130 ms communication delay (probable remote GC) with Member(Id=9, Timestamp=2010-09-04 10:35:58.051, Address=101.221.31.172:8089, MachineId=54344, Location=site:wlan.com,machine:c86,process:18521, Role=Node.102462030030000075); 31 packets rescheduled, PauseRate=0.0291, Threshold=1384