Hi, I'm investigating the cause of, and the subsequent behavior after, a hard timeout of a guarded service. In my experience, the grid members are unable to recover properly. I am trying to figure out whether something in our configuration may be aggravating the situation, and also whether I might be able to improve the behavior of our client code.
What I have done is deliberately lower the service guardian's timeout to about 25 seconds, so that a certain EntryProcessor will always time out. The behavior I see after the timeout is very similar to what we see when the problem crops up in the real world. I know that I can raise the timeout or use the PriorityTask API, but even if I do that, we may still run into timeouts due to deadlock. If that happens, I want to know what I can expect the grid to do.
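(For reference, the PriorityTask route I mention is roughly the sketch below, which wraps the EntryProcessor in a PriorityProcessor so that a single task can opt out of the default timeout. I haven't adopted it, since it wouldn't help the deadlock case; requestCache and processor are the same objects as in my snippet further down, and TIMEOUT_NONE is just for illustration.)

import com.tangosol.net.PriorityTask;
import com.tangosol.util.filter.AlwaysFilter;
import com.tangosol.util.processor.PriorityProcessor;

// Wrap the processor so this particular invocation opts out of the
// default task timeout instead of raising the timeout cluster-wide.
PriorityProcessor wrapped = new PriorityProcessor(processor);
wrapped.setExecutionTimeoutMillis(PriorityTask.TIMEOUT_NONE);
requestCache.invokeAll(AlwaysFilter.INSTANCE, wrapped);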
Here's what I'm seeing:
I am running 2 storage-enabled members with backup-count=0. I am running my "client" as a non-storage-enabled member.
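(For completeness: the client disables local storage in the usual way, along the lines of the sketch below; the cache name is a placeholder, and backup-count=0 is set in the distributed scheme of our cache configuration.)

import com.tangosol.net.CacheFactory;
import com.tangosol.net.NamedCache;

// Join the cluster as a storage-disabled member, then obtain the cache.
System.setProperty("tangosol.coherence.distributed.localstorage", "false");
NamedCache requestCache = CacheFactory.getCache("jdw-read-request");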
First, the client kicks off the EntryProcessor:
Map<Object, Integer> readProcessorResult = requestCache.invokeAll(AlwaysFilter.INSTANCE, processor);
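(The real processor does useful work, but any process() implementation that blocks past the guardian timeout reproduces what follows; a minimal, purely hypothetical stand-in would be:)

import com.tangosol.util.InvocableMap;
import com.tangosol.util.processor.AbstractProcessor;

// Hypothetical stand-in: blocking past the ~25-second guardian timeout
// triggers the soft/hard timeout sequence shown in the logs below.
public class SlowReadProcessor extends AbstractProcessor
{
    public Object process(InvocableMap.Entry entry)
    {
        try
        {
            Thread.sleep(30000L); // longer than the guardian timeout
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt(); // guardian recovery interrupts the worker
        }
        return Integer.valueOf(0);
    }
}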
The two members begin logging successful executions of the EntryProcessor. Then one of the members happens to run it against an entry whose processing takes longer than the service guardian's timeout, and we get a soft timeout followed by a hard timeout:
[ERROR] Coherence: 2012-01-26 10:56:20.103/333.297 Oracle Coherence GE 3.7.1.2 <Error> (thread=Cluster, member=1): Detected soft timeout) of {WrapperGuardable Guard{Daemon=DistributedCache:jdw-read-request-service} Service=PartitionedCache{Name=jdw-read-request-service, State=(SERVICE_STARTED), LocalStorage=enabled, PartitionCount=257, BackupCount=0, AssignedPartitions=128}}
[WARN ] Coherence: 2012-01-26 10:56:20.121/333.315 Oracle Coherence GE 3.7.1.2 <Warning> (thread=Recovery Thread, member=1): Attempting recovery of Guard{Daemon=DistributedCache:jdw-read-request-service}
[ERROR] Coherence: 2012-01-26 10:56:22.604/335.798 Oracle Coherence GE 3.7.1.2 <Error> (thread=Cluster, member=1): Detected hard timeout) of {WrapperGuardable Guard{Daemon=DistributedCache:jdw-read-request-service} Service=PartitionedCache{Name=jdw-read-request-service, State=(SERVICE_STARTED), LocalStorage=enabled, PartitionCount=257, BackupCount=0, AssignedPartitions=128}}
Now, for some reason (why?), Coherence determines that the service is unrecoverable and halts the cluster node:
[WARN ] Coherence: 2012-01-26 10:56:22.605/335.799 Oracle Coherence GE 3.7.1.2 <Warning> (thread=Termination Thread, member=1): Terminating Guard{Daemon=DistributedCache:jdw-read-request-service}
Coherence <Error>: Halting this cluster node due to unrecoverable service failure
[ERROR] Coherence: 2012-01-26 10:56:23.613/336.807 Oracle Coherence GE 3.7.1.2 <Error> (thread=Cluster, member=1): StopRunning ClusterService{Name=Cluster, State=(SERVICE_STARTED, STATE_JOINED), Id=0, Version=3.7.1, OldestMemberId=1} due to unhandled exception:
[ERROR] Coherence: 2012-01-26 10:56:23.613/336.807 Oracle Coherence GE 3.7.1.2 <Error> (thread=PacketListener1P, member=1): Stopping cluster due to unhandled exception: com.tangosol.net.messaging.ConnectionException: UdpSocket.receive: unable to reopen socket; State=STATE_CLOSED
at com.tangosol.coherence.component.net.socket.UdpSocket.receive(UdpSocket.CDB:58)
at com.tangosol.coherence.component.net.UdpPacket.receive(UdpPacket.CDB:1)
at com.tangosol.coherence.component.util.daemon.queueProcessor.packetProcessor.PacketListener.onNotify(PacketListener.CDB:20)
at com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:42)
at java.lang.Thread.run(Unknown Source)
[ERROR] Coherence: 2012-01-26 10:56:23.641/336.835 Oracle Coherence GE 3.7.1.2 <Error> (thread=Cluster, member=n/a):
java.lang.NullPointerException: null
at com.tangosol.coherence.component.net.Cluster$ClusterService$TcpRing.onAcceptException(Cluster.CDB:13) ~[coherence-3.7.1.2.jar:3.7.1.2]
at com.tangosol.coherence.component.net.TcpRing.onAccept(TcpRing.CDB:25) ~[coherence-3.7.1.2.jar:3.7.1.2]
at com.tangosol.coherence.component.net.TcpRing.onSelect(TcpRing.CDB:27) ~[coherence-3.7.1.2.jar:3.7.1.2]
at com.tangosol.coherence.component.net.TcpRing.select(TcpRing.CDB:14) ~[coherence-3.7.1.2.jar:3.7.1.2]
at com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.ClusterService.onWait(ClusterService.CDB:6) ~[coherence-3.7.1.2.jar:3.7.1.2]
at com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:39) ~[coherence-3.7.1.2.jar:3.7.1.2]
at java.lang.Thread.run(Unknown Source) [na:1.6.0_30]
[WARN ] Coherence: 2012-01-26 10:56:24.976/338.170 Oracle Coherence GE 3.7.1.2 <Warning> (thread=Invocation:jdw-invocation-service, member=n/a): failed to stop 95 worker threads; abandoning
Coherence <Error>: Halted the cluster:
Cluster is not running: State=5
Exception in thread "Cluster|SERVICE_STOPPED|Member(Id=1, Timestamp=2012-01-26 10:50:58.044, Address=192.168.1.67:9001, MachineId=52295, Location=site:,machine:DEN12956L,process:10012, Role=CoherenceServer)" java.nio.channels.ClosedSelectorException
at sun.nio.ch.SelectorImpl.keys(Unknown Source)
at com.tangosol.coherence.component.net.TcpRing.disconnectAll(TcpRing.CDB:6)
at com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.ClusterService$TcpRing.onLeft(ClusterService.CDB:4)
at com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.ClusterService.onStopRunning(ClusterService.CDB:7)
at com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.ClusterService.onException(ClusterService.CDB:28)
at com.tangosol.coherence.component.net.Cluster$ClusterService.onException(Cluster.CDB:7)
at com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:85)
at java.lang.Thread.run(Unknown Source)
[ERROR] Coherence: 2012-01-26 10:56:25.616/338.810 Oracle Coherence GE 3.7.1.2 <Error> (thread=Cluster, member=n/a): StopRunning ClusterService{Name=Cluster, State=(SERVICE_STOPPED, STATE_JOINED), Id=0, Version=3.7.1} due to unhandled exception:
[ERROR] Coherence: 2012-01-26 10:56:25.616/338.810 Oracle Coherence GE 3.7.1.2 <Error> (thread=Cluster, member=n/a):
java.nio.channels.ClosedSelectorException: null
at sun.nio.ch.SelectorImpl.keys(Unknown Source) ~[na:1.6.0_30]
at com.tangosol.coherence.component.net.TcpRing.close(TcpRing.CDB:11) ~[coherence-3.7.1.2.jar:3.7.1.2]
at com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.ClusterService.onExit(ClusterService.CDB:1) ~[coherence-3.7.1.2.jar:3.7.1.2]
at com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:68) ~[coherence-3.7.1.2.jar:3.7.1.2]
at java.lang.Thread.run(Unknown Source) [na:1.6.0_30]
Around the same time that member 1 decides to stop its cluster, member 2, which had been happily executing entry processors, begins taking over responsibility for member 1's partitions. That's what I'd expect.
[WARN ] Coherence: 2012-01-26 10:56:23.643/336.784 Oracle Coherence GE 3.7.1.2 <Warning> (thread=DistributedCache:sys-id-dist-service, member=2): Assigned 128 orphaned primary partitions
[WARN ] Coherence: 2012-01-26 10:56:23.646/336.787 Oracle Coherence GE 3.7.1.2 <Warning> (thread=DistributedCache:sourceMetadataReviewCache-service, member=2): Assigned 128 orphaned primary partitions
......
Member 1 now restarts its cluster, re-joins as member 4, and begins restarting its services:
[INFO ] Coherence: 2012-01-26 10:56:26.008/339.202 Oracle Coherence GE 3.7.1.2 <Info> (thread=main, member=n/a): Restarting cluster
[INFO ] Coherence: 2012-01-26 10:56:26.327/339.521 Oracle Coherence GE 3.7.1.2 <Info> (thread=Cluster, member=n/a): This Member(Id=4, Timestamp=2012-01-26 10:56:26.126, Address=192.168.1.67:9001, MachineId=52295, Location=site:,machine:DEN12956L,process:10012, Role=CoherenceServer, Edition=Grid Edition, Mode=Development, CpuCount=2, SocketCount=2) joined cluster "NIR_GRID_DEV" with senior Member(Id=2, Timestamp=2012-01-26 10:51:03.593, Address=192.168.1.67:9003, MachineId=52295, Location=site:,machine:DEN12956L,process:10024, Role=CoherenceServer, Edition=Grid Edition, Mode=Development, CpuCount=2, SocketCount=2)
[INFO ] Coherence: 2012-01-26 10:56:26.364/339.558 Oracle Coherence GE 3.7.1.2 <Info> (thread=main, member=n/a): Started cluster Name=NIR_GRID_DEV
Group{Address=224.3.7.0, Port=60000, TTL=0}
MasterMemberSet(
  ThisMember=Member(Id=4, Timestamp=2012-01-26 10:56:26.126, Address=192.168.1.67:9001, MachineId=52295, Location=site:,machine:DEN12956L,process:10012, Role=CoherenceServer)
  OldestMember=Member(Id=2, Timestamp=2012-01-26 10:51:03.593, Address=192.168.1.67:9003, MachineId=52295, Location=site:,machine:DEN12956L,process:10024, Role=CoherenceServer)
  ActualMemberSet=MemberSet(Size=3
    Member(Id=2, Timestamp=2012-01-26 10:51:03.593, Address=192.168.1.67:9003, MachineId=52295, Location=site:,machine:DEN12956L,process:10024, Role=CoherenceServer)
    Member(Id=3, Timestamp=2012-01-26 10:55:05.522, Address=192.168.1.67:9005, MachineId=52295, Location=site:,machine:DEN12956L,process:13268, Role=IntellijRtExecutionAppMain)
    Member(Id=4, Timestamp=2012-01-26 10:56:26.126, Address=192.168.1.67:9001, MachineId=52295, Location=site:,machine:DEN12956L,process:10012, Role=CoherenceServer)
    )
  MemberId|ServiceVersion|ServiceJoined|MemberState
    2|3.7.1|2012-01-26 10:51:03.593|JOINED,
    3|3.7.1|2012-01-26 10:55:05.522|JOINED,
    4|3.7.1|2012-01-26 10:56:26.337|JOINED
  RecycleMillis=1200000
  RecycleSet=MemberSet(Size=0
    )
  )
TcpRing{Connections=[2, 3]}
IpMonitor{AddressListSize=0}
[INFO ] Coherence: 2012-01-26 10:56:26.365/339.559 Oracle Coherence GE 3.7.1.2 <Info> (thread=main, member=4): Restarting Service:Management
[INFO ] Coherence: 2012-01-26 10:56:26.417/339.611 Oracle Coherence GE 3.7.1.2 <Info> (thread=main, member=4): Restarting Service:jdwSourceManagerCache-service
......
While that is happening, member 2 also encounters a hard timeout on the same service and begins going through the same process member 1 just went through. I am not sure why member 2 hits this timeout; perhaps it ran another entry processor that took too long. It is difficult to tell. This time, though, we start having problems when member 2 tries to stop the cluster, and I'm not sure exactly why. Perhaps because member 4 is not yet in a good state?
[ERROR] Coherence: 2012-01-26 10:56:35.282/348.423 Oracle Coherence GE 3.7.1.2 <Error> (thread=PacketListener1, member=2): Stopping cluster due to unhandled exception:
com.tangosol.net.messaging.ConnectionException: UdpSocket.receive: unable to reopen socket; State=STATE_CLOSED
at com.tangosol.coherence.component.net.socket.UdpSocket.receive(UdpSocket.CDB:58)
at com.tangosol.coherence.component.net.UdpPacket.receive(UdpPacket.CDB:1)
at com.tangosol.coherence.component.util.daemon.queueProcessor.packetProcessor.PacketListener.onNotify(PacketListener.CDB:20)
at com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:42)
at java.lang.Thread.run(Unknown Source)
Coherence <Error>: Halted the cluster:
Cluster is not running: State=5
[ERROR] Coherence: 2012-01-26 10:56:42.147/355.288 Oracle Coherence GE 3.7.1.2 <Error> (thread=main, member=n/a): Failed to restart services: com.tangosol.net.RequestTimeoutException: Timeout while waiting for cluster to stop.
[ERROR] Coherence: 2012-01-26 10:56:51.324/364.465 Oracle Coherence GE 3.7.1.2 <Error> (thread=main, member=n/a): Failed to restart services: com.tangosol.net.RequestTimeoutException: Timeout while waiting for cluster to stop.
...... (error repeats)
And the newly re-joined member 4, which had been restarting services and claiming partitions from the now-departed member 2, appears to resume executing the entry processor, but then complains that its thread was interrupted. I am not sure why this happens or what it means.
[INFO ] Coherence: 2012-01-26 10:56:52.685/365.879 Oracle Coherence GE 3.7.1.2 <Info> (thread=DistributedCache:jdw-read-request-service, member=4): Restarting NamedCache: coherence.common.sequencegenerators
[ERROR] Coherence: 2012-01-26 10:56:52.686/365.880 Oracle Coherence GE 3.7.1.2 <Error> (thread=DistributedCache:jdw-read-request-service, member=4): This thread was interrupted while waiting for the results of a request:
Poll
  {
  PollId=2, active
  InitTimeMillis=1327597012685
  Service=DistributedCacheForSequenceGenerators (60)
  RespondedMemberSet=[]
  LeftMemberSet=[]
  RemainingMemberSet=[4]
  },
Meanwhile, my client, which ran the invokeAll() with the EntryProcessor, has received a message stating "no storage enabled members exist," which, although it may be true at that point, doesn't seem like the real cause of the problem.
Could I be having problems because all my cluster members are executing the same kind of long-running EntryProcessor, and all of them hit service guardian timeouts at around the same time? How do I deal with that? For example, what if a database is running very slowly one day and my entry processors all start to execute slowly? I don't want my whole grid having problems. Should I, for example, limit the number of concurrently executing EntryProcessors to fewer than the number of grid members by using a LimitFilter, as in the sketch below?
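(What I have in mind is roughly the following; PAGE_SIZE is a placeholder, and I realize LimitFilter paging isn't stable if the cache is being mutated while we page.)

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

import com.tangosol.util.filter.AlwaysFilter;
import com.tangosol.util.filter.LimitFilter;

// Page through the matching keys and invoke the processor one page at a
// time, so only PAGE_SIZE entries are being processed at any moment.
final int PAGE_SIZE = 100; // placeholder
LimitFilter page = new LimitFilter(AlwaysFilter.INSTANCE, PAGE_SIZE);
Map results = new HashMap();
Set keys;
while (!(keys = requestCache.keySet(page)).isEmpty())
{
    results.putAll(requestCache.invokeAll(keys, processor));
    page.nextPage();
}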
What does it mean that there was a timeout waiting for the cluster to stop?
And what about the client? If I get an exception after running a grid command, is it valid to wait a few seconds for things to stabilize and then try again? What's my best bet there?
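(For instance, would a bounded retry along these lines be reasonable, assuming the EntryProcessor is idempotent? MAX_ATTEMPTS and RETRY_DELAY_MILLIS are numbers I made up.)

import java.util.Map;

import com.tangosol.net.NamedCache;
import com.tangosol.util.InvocableMap;
import com.tangosol.util.filter.AlwaysFilter;

// Re-issue the same idempotent invokeAll after a short, growing delay,
// giving the grid a few seconds to stabilize between attempts.
Map invokeWithRetry(NamedCache cache, InvocableMap.EntryProcessor processor)
        throws InterruptedException
{
    final int MAX_ATTEMPTS = 3;            // placeholder
    final long RETRY_DELAY_MILLIS = 5000L; // placeholder
    RuntimeException last = null;
    for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++)
    {
        try
        {
            return cache.invokeAll(AlwaysFilter.INSTANCE, processor);
        }
        catch (RuntimeException e) // e.g. RequestTimeoutException
        {
            last = e;
            Thread.sleep(RETRY_DELAY_MILLIS * attempt); // linear backoff
        }
    }
    throw last;
}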
Thanks! And let me know if I should file a support request instead.