Hi,
We had this error:
Exception in thread "main" com.tangosol.net.RequestTimeoutException: Timeout during service start: ServiceInfo(Id=1, Name=Management, Type=Invocation...
I have read the knowledge base on MetaLink that covers this problem but there is something I don't understand.
Below is the log snippet from node 161 which is attempting to start up and join the cluster (apologies for the length)
08/11/10 04:49:56.368 INFO: [STDOUT] 2010-08-11 04:49:56,367 [Logger@9241638 3.5.3/465p2] DEBUG Coherence - 2010-08-11 04:49:56.329/7.789 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=n/a): Member 111 joined Service distributed-pof-service with senior member 1
08/11/10 04:49:56.368 INFO: [STDOUT] 2010-08-11 04:49:56,367 [Logger@9241638 3.5.3/465p2] DEBUG Coherence - 2010-08-11 04:49:56.329/7.789 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=n/a): Member 111 joined Service PartitionedFilterInvocationService with senior member 1
08/11/10 04:49:56.368 INFO: [STDOUT] 2010-08-11 04:49:56,367 [Logger@9241638 3.5.3/465p2] DEBUG Coherence - 2010-08-11 04:49:56.330/7.790 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=n/a): Member 35 joined Service Management with senior member 1
08/11/10 04:49:56.368 INFO: [STDOUT] 2010-08-11 04:49:56,367 [Logger@9241638 3.5.3/465p2] DEBUG Coherence - 2010-08-11 04:49:56.330/7.790 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=n/a): Member 35 joined Service Replicated-cache-service with senior member 1
08/11/10 04:49:56.368 INFO: [STDOUT] 2010-08-11 04:49:56,367 [Logger@9241638 3.5.3/465p2] DEBUG Coherence - 2010-08-11 04:49:56.330/7.790 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=n/a): Member 35 joined Service DistributedCache with senior member 1
08/11/10 04:49:56.368 INFO: [STDOUT] 2010-08-11 04:49:56,367 [Logger@9241638 3.5.3/465p2] DEBUG Coherence - 2010-08-11 04:49:56.330/7.790 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=n/a): Member 35 joined Service DistributedCacheWithBackingMapListener with senior member 1
08/11/10 04:49:56.368 INFO: [STDOUT] 2010-08-11 04:49:56,367 [Logger@9241638 3.5.3/465p2] DEBUG Coherence - 2010-08-11 04:49:56.330/7.790 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=n/a): Member 35 joined Service DistributedCacheForSequenceGenerators with senior member 1
08/11/10 04:49:56.368 INFO: [STDOUT] 2010-08-11 04:49:56,367 [Logger@9241638 3.5.3/465p2] DEBUG Coherence - 2010-08-11 04:49:56.330/7.790 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=n/a): Member 35 joined Service CommandPatternInvocationService with senior member 1
08/11/10 04:49:56.368 INFO: [STDOUT] 2010-08-11 04:49:56,367 [Logger@9241638 3.5.3/465p2] DEBUG Coherence - 2010-08-11 04:49:56.330/7.790 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=n/a): Member 35 joined Service latest-cache-service with senior member 1
08/11/10 04:49:56.368 INFO: [STDOUT] 2010-08-11 04:49:56,367 [Logger@9241638 3.5.3/465p2] DEBUG Coherence - 2010-08-11 04:49:56.330/7.790 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=n/a): Member 35 joined Service cptysystem-status-cache-service with senior member 1
08/11/10 04:49:56.368 INFO: [STDOUT] 2010-08-11 04:49:56,368 [Logger@9241638 3.5.3/465p2] DEBUG Coherence - 2010-08-11 04:49:56.330/7.790 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=n/a): Member 35 joined Service distributed-non-versioned-service with senior member 1
08/11/10 04:49:56.369 INFO: [STDOUT] 2010-08-11 04:49:56,368 [Logger@9241638 3.5.3/465p2] DEBUG Coherence - 2010-08-11 04:49:56.330/7.790 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=n/a): Member 35 joined Service distributed-pof-service with senior member 1
08/11/10 04:49:56.369 INFO: [STDOUT] 2010-08-11 04:49:56,368 [Logger@9241638 3.5.3/465p2] DEBUG Coherence - 2010-08-11 04:49:56.330/7.790 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=n/a): Member 35 joined Service PartitionedFilterInvocationService with senior member 1
08/11/10 04:49:56.373 INFO: [STDOUT] 2010-08-11 04:49:56,373 [Logger@9241638 3.5.3/465p2] DEBUG Coherence - 2010-08-11 04:49:56.373/7.833 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=161): TcpRing: connecting to member 24 using TcpSocket{State=STATE_OPEN, Socket=Socket[addr=/11.160.34.93,port=8091,localport=56611]}
08/11/10 04:49:56.384 INFO: [STDOUT] 2010-08-11 04:49:56,384 [Logger@9241638 3.5.3/465p2] DEBUG Coherence - 2010-08-11 04:49:56.384/7.844 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Invocation:Management, member=161): Service Management joined the cluster with senior service member 1
08/11/10 04:49:57.210 INFO: [CoherenceContainer] Started monitoring container "ODC VEST Oracle Coherence 3.5.3.465" for unexpected crashes...
08/11/10 04:49:57.211 INFO: [GridLibDomainManager] Started container "ODC VEST Oracle Coherence 3.5.3.465"
08/11/10 04:49:57.211 INFO: [CoherenceContainer] ODC VEST Oracle Coherence 3.5.3.465: installing "vest-ice-gblo-1.0.0"
08/11/10 04:49:57.214 INFO: [DomainManagerPlugin] Sending component activated message for "vest-ice-gblo-1.0.0"
08/11/10 04:49:57.216 INFO: [Statistics] Starting statistics collection for "vest-ice-gblo-1.0.0"
08/11/10 04:50:04.169 INFO: [STDERR] 8/11/10 04:50:04 AM BST [INFO] [IntroscopeAgent.HPC GC LastInfo Plugin] Last GC information not yet available for ParNew as the collector has not run
08/11/10 04:50:26.461 INFO: [STDERR] Exception in thread "main" com.tangosol.net.RequestTimeoutException: Timeout during service start: ServiceInfo(Id=1, Name=Management, Type=Invocation
08/11/10 04:50:26.461 INFO: [STDERR] MemberSet=ServiceMemberSet(
08/11/10 04:50:26.462 INFO: [STDERR] OldestMember=Member(Id=1, Timestamp=2010-08-01 12:01:52.122, Address=11.160.34.101:8088, MachineId=35173, Location=machine:lonrs06436,process:26160,member:lonrs06436:Data-4, Role=RbsOdcCoreDaoODCCacheServer)
08/11/10 04:50:26.462 INFO: [STDERR] ActualMemberSet=MemberSet(Size=162, BitSetCount=6
08/11/10 04:50:26.462 INFO: [STDERR] Member(Id=1, Timestamp=2010-08-01 12:01:52.122, Address=11.160.34.101:8088, MachineId=35173, Location=machine:lonrs06436,process:26160,member:lonrs06436:Data-4, Role=RbsOdcCoreDaoODCCacheServer)
08/11/10 04:50:26.462 INFO: [STDERR] Member(Id=2, Timestamp=2010-08-01 12:01:58.792, Address=11.160.34.92:8088, MachineId=35164, Location=machine:lonrs06427,process:1463,member:lonrs06427:Data-1, Role=RbsOdcCoreDaoODCCacheServer)
08/11/10 04:50:26.462 INFO: [STDERR] Member(Id=3, Timestamp=2010-08-01 12:02:10.598, Address=11.160.34.98:8088, MachineId=35170, Location=machine:lonrs06433,process:31728,member:lonrs06433:Extend-10000-0, Role=CoherenceServer)
08/11/10 04:50:26.462 INFO: [STDERR] Member(Id=4, Timestamp=2010-08-01 12:02:12.626, Address=11.160.34.93:8088, MachineId=35165, Location=machine:lonrs06428,process:32371,member:lonrs06428:Data-3, Role=RbsOdcCoreDaoODCCacheServer)
08/11/10 04:50:26.462 INFO: [STDERR] Member(Id=5, Timestamp=2010-08-01 12:02:14.38, Address=11.160.34.100:8088, MachineId=35172, Location=machine:lonrs06435,process:29337,member:lonrs06435:Data-1, Role=RbsOdcCoreDaoODCCacheServer)
08/11/10 04:50:26.462 INFO: [STDERR] Member(Id=6, Timestamp=2010-08-01 12:02:17.456, Address=11.160.34.101:8089, MachineId=35173, Location=machine:lonrs06436,process:26952,member:lonrs06436:Data-2, Role=RbsOdcCoreDaoODCCacheServer)
08/11/10 04:50:26.462 INFO: [STDERR] Member(Id=7, Timestamp=2010-08-10 22:30:04.559, Address=11.160.34.104:8089, MachineId=35176, Location=machine:lonrs06442,process:15400,member:lonrs06442:VEST-10)
08/11/10 04:50:26.462 INFO: [STDOUT] 2010-08-11 04:50:26,462 [Logger@9241638 3.5.3/465p2] ERROR Coherence - 2010-08-11 04:50:26.461/37.921 Oracle Coherence GE 3.5.3/465p2 <Error> (thread=main, member=161): Error while starting service "Management": com.tangosol.net.RequestTimeoutException: Timeout during service start: ServiceInfo(Id=1, Name=Management, Type=Invocation
08/11/10 04:50:26.462 INFO: [STDERR] Member(Id=8, Timestamp=2010-08-01 12:02:20.077, Address=11.160.34.102:8088, MachineId=35174, Location=machine:lonrs06437,process:17021,member:lonrs06437:Data-5, Role=RbsOdcCoreDaoODCCacheServer)
08/11/10 04:50:26.463 INFO: [STDOUT] MemberSet=ServiceMemberSet(
08/11/10 04:50:26.463 INFO: [STDERR] Member(Id=9, Timestamp=2010-08-01 12:02:20.973, Address=11.160.34.98:8089, MachineId=35170, Location=machine:lonrs06433,process:32440,member:lonrs06433:Data-2, Role=RbsOdcCoreDaoODCCacheServer)
08/11/10 04:50:26.463 INFO: [STDOUT] OldestMember=Member(Id=1, Timestamp=2010-08-01 12:01:52.122, Address=11.160.34.101:8088, MachineId=35173, Location=machine:lonrs06436,process:26160,member:lonrs06436:Data-4, Role=RbsOdcCoreDaoODCCacheServer)
08/11/10 04:50:26.463 INFO: [STDERR] Member(Id=10, Timestamp=2010-08-01 12:02:21.81, Address=11.160.34.93:8089, MachineId=35165, Location=machine:lonrs06428,process:32632,member:lonrs06428:Extend-10000-0, Role=CoherenceServer)
08/11/10 04:50:26.463 INFO: [STDOUT] ActualMemberSet=MemberSet(Size=162, BitSetCount=6
08/11/10 04:50:26.463 INFO: [STDERR] Member(Id=11, Timestamp=2010-08-01 12:02:28.017, Address=11.160.34.90:8088, MachineId=35162, Location=machine:lonrs06425,process:30069,member:lonrs06425:Data-6, Role=RbsOdcCoreDaoODCCacheServer)
... snipped a big list of members...
08/11/10 04:50:26.539 INFO: [STDOUT] Member(Id=171, Timestamp=2010-08-09 15:15:02.412, Address=11.160.34.104:8100, MachineId=35176, Location=machine:lonrs06442,process:9018,member:lonrs06442:VEST-15, Role=RbsOdcVestCoreVestMain)
08/11/10 04:50:26.539 INFO: [STDERR] Member(Id=178, Timestamp=2010-08-09 15:15:20.524, Address=11.160.34.104:8101, MachineId=35176, Location=machine:lonrs06442,process:10216,member:lonrs06442:VEST-0, Role=RbsOdcVestCoreVestMain)
08/11/10 04:50:26.539 INFO: [STDERR] )
08/11/10 04:50:26.539 INFO: [STDERR] MemberId/ServiceVersion/ServiceJoined/ServiceLeaving
08/11/10 04:50:26.539 INFO: [STDERR] 1/3.1/Sun Aug 01 12:01:55 BST 2010/false,
08/11/10 04:50:26.540 INFO: [STDERR] 2/3.1/Sun Aug 01 12:01:59 BST 2010/false,
... snipped a big list of members...
08/11/10 04:50:26.545 INFO: [STDERR] 34/3.1/Sun Aug 01 12:03:11 BST 2010/false,
08/11/10 04:50:26.545 INFO: [STDERR] 35/3.1/Sun Aug 01 12:03:12 BST 2010/false,
... snipped a big list of members...
08/11/10 04:50:26.566 INFO: [STDERR] )
08/11/10 04:50:26.566 INFO: [STDERR] )
08/11/10 04:50:26.566 INFO: [STDERR] at com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.onStartupTimeout(Grid.CDB:6)
08/11/10 04:50:26.567 INFO: [STDERR] at com.tangosol.coherence.component.util.daemon.queueProcessor.Service.start(Service.CDB:28)
08/11/10 04:50:26.567 INFO: [STDERR] at com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.start(Grid.CDB:38)
08/11/10 04:50:26.567 INFO: [STDERR] at com.tangosol.coherence.component.util.SafeService.startService(SafeService.CDB:28)
08/11/10 04:50:26.567 INFO: [STDERR] at com.tangosol.coherence.component.util.SafeService.ensureRunningService(SafeService.CDB:27)
08/11/10 04:50:26.567 INFO: [STDERR] at com.tangosol.coherence.component.util.SafeService.start(SafeService.CDB:14)
08/11/10 04:50:26.567 INFO: [STDERR] at com.tangosol.coherence.component.net.management.Connector.startService(Connector.CDB:52)
08/11/10 04:50:26.568 INFO: [STDERR] at com.tangosol.coherence.component.net.management.gateway.Remote.registerLocalModel(Remote.CDB:8)
08/11/10 04:50:26.568 INFO: [STDERR] at com.tangosol.coherence.component.net.management.gateway.Local.registerLocalModel(Local.CDB:8)
08/11/10 04:50:26.568 INFO: [STDERR] at com.tangosol.coherence.component.net.management.Gateway.register(Gateway.CDB:1)
08/11/10 04:50:26.568 INFO: [STDERR] at com.tangosol.coherence.component.util.SafeCluster.ensureRunningCluster(SafeCluster.CDB:50)
08/11/10 04:50:26.568 INFO: [STDERR] at com.tangosol.coherence.component.util.SafeCluster.start(SafeCluster.CDB:2)
08/11/10 04:50:26.568 INFO: [STDERR] at com.tangosol.net.CacheFactory.ensureCluster(CacheFactory.java:998)
08/11/10 04:50:26.569 INFO: [STDERR] at com.tangosol.net.DefaultConfigurableCacheFactory.ensureService(DefaultConfigurableCacheFactory.java:915)
08/11/10 04:50:26.569 INFO: [STDERR] at com.tangosol.net.DefaultConfigurableCacheFactory.ensureCache(DefaultConfigurableCacheFactory.java:877)
08/11/10 04:50:26.569 INFO: [STDERR] at com.tangosol.net.DefaultConfigurableCacheFactory.configureCache(DefaultConfigurableCacheFactory.java:1088)
08/11/10 04:50:26.569 INFO: [STDERR] at com.tangosol.net.DefaultConfigurableCacheFactory.ensureCache(DefaultConfigurableCacheFactory.java:304)
08/11/10 04:50:26.569 INFO: [STDERR] at com.tangosol.net.CacheFactory.getCache(CacheFactory.java:735)
08/11/10 04:50:26.569 INFO: [STDERR] at com.tangosol.net.CacheFactory.getCache(CacheFactory.java:712)
08/11/10 04:50:26.570 INFO: [STDERR] at com.rbs.odc.core.dao.coherence.util.CacheFactoryWrapper.getCache(CacheFactoryWrapper.java:18)
08/11/10 04:50:26.572 INFO: [STDERR] at com.rbs.odc.core.access.ODCAccessLayerFactoryImpl.getInstance(ODCAccessLayerFactoryImpl.java:36)
08/11/10 04:50:26.572 INFO: [STDOUT] 2010-08-11 04:50:26,539 [Logger@9241638 3.5.3/465p2] ERROR Coherence - 2010-08-11 04:50:26.462/37.922 Oracle Coherence GE 3.5.3/465p2 <Error> (thread=Invocation:Management, member=161): validatePolls: This service timed-out due to unanswered handshake request. Manual intervention is required to stop the members that have not responded to this Poll
08/11/10 04:50:26.572 INFO: [STDERR] at com.rbs.odc.core.ODCFactoryImpl.getAccessLayerInstance(ODCFactoryImpl.java:40)
08/11/10 04:50:26.572 INFO: [STDOUT] {
08/11/10 04:50:26.572 INFO: [STDOUT] PollId=1, active
08/11/10 04:50:26.572 INFO: [STDERR] at com.rbs.odc.vest.core.VestMain.create(VestMain.java:111)
08/11/10 04:50:26.572 INFO: [STDOUT] InitTimeMillis=1281498596384
08/11/10 04:50:26.572 INFO: [STDERR] at com.rbs.odc.vest.core.VestMain.main(VestMain.java:146)
08/11/10 04:50:26.572 INFO: [STDOUT] Service=Management (1)
08/11/10 04:50:26.573 INFO: [STDOUT] RespondedMemberSet=[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,145,146,147,148,149,150,152,153,154,155,156,157,158,159,160,162,171,178]
08/11/10 04:50:26.573 INFO: [STDOUT] LeftMemberSet=[]
08/11/10 04:50:26.573 INFO: [STDOUT] RemainingMemberSet=[35]
08/11/10 04:50:26.573 INFO: [STDOUT] }
This message would appear to suggest that node 161 did not get a response from node 35 when it was trying to join the Management service. This could be for a number of reasons suggested on MetaLink. But...
At the top of the log above you can see this message:
Member 35 joined Service Management with senior member 1 does that mean this node has connected to node 35?
In the log for node 35 I have the following:
At the start you can see a message saying node 161 has joined the Management service and at the end a message where node 161 leaves the Management service. At which point would node 35 have sent a response message back to node 161? Has it sent a message or has the message somehow become lost?
08/11/10 04:49:56.380 INFO: [STDOUT] 2010-08-11 04:49:56,380 [Logger@9219882 3.5.3/465p2] DEBUG Coherence - 2010-08-11 04:49:56.380/838006.988 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=35): Member 161 joined Service Management with senior member 1
08/11/10 04:49:56.772 INFO: [STDOUT] 838007.417: [GC 838007.418: [ParNew: 191744K->21248K(191744K), 0.0523130 secs] 2309371K->2155431K(2538752K), 0.0524460 secs] [Times: user=0.51 sys=0.01, real=0.06 secs]
08/11/10 04:49:57.216 INFO: [STDOUT] 838007.850: [GC 838007.850: [ParNew: 191744K->21248K(191744K), 0.0636960 secs] 2325927K->2174150K(2538752K), 0.0638240 secs] [Times: user=0.61 sys=0.01, real=0.06 secs]
08/11/10 04:49:57.670 INFO: [STDOUT] 838008.307: [GC 838008.307: [ParNew: 191744K->21248K(191744K), 0.0612950 secs] 2344646K->2191843K(2538752K), 0.0614260 secs] [Times: user=0.57 sys=0.00, real=0.07 secs]
08/11/10 04:49:58.137 INFO: [STDOUT] 838008.771: [GC 838008.771: [ParNew: 191744K->21248K(191744K), 0.0638180 secs] 2362339K->2210207K(2538752K), 0.0639470 secs] [Times: user=0.58 sys=0.01, real=0.06 secs]
08/11/10 04:49:58.281 INFO: [STDOUT] 838008.979: [CMS-concurrent-mark: 1.755/2.001 secs] [Times: user=12.83 sys=0.21, real=2.00 secs]
08/11/10 04:49:58.282 INFO: [STDOUT] 838008.979: [CMS-concurrent-preclean-start]
08/11/10 04:49:58.509 INFO: [STDOUT] 838009.207: [CMS-concurrent-preclean: 0.214/0.228 secs] [Times: user=0.67 sys=0.02, real=0.22 secs]
08/11/10 04:49:58.509 INFO: [STDOUT] 838009.207: [CMS-concurrent-abortable-preclean-start]
08/11/10 04:49:58.580 INFO: [STDOUT] 838009.235: [GC 838009.235: [ParNew: 191744K->21248K(191744K), 0.0421810 secs] 2380703K->2221020K(2538752K), 0.0423110 secs] [Times: user=0.44 sys=0.00, real=0.04 secs]
08/11/10 04:49:58.777 INFO: [STDOUT] 838009.474: [CMS-concurrent-abortable-preclean: 0.217/0.267 secs] [Times: user=1.14 sys=0.03, real=0.27 secs]
08/11/10 04:49:58.848 INFO: [STDOUT] 838009.476: [GC[YG occupancy: 114884 K (191744 K)]838009.476: [Rescan (parallel) , 0.0706030 secs]838009.546: [weak refs processing, 0.0000170 secs] [1 CMS-remark: 2199772K(2347008K)] 2314657K(2538752K), 0.0707150 secs] [Times: user=0.31 sys=0.01, real=0.07 secs]
08/11/10 04:49:58.849 INFO: [STDOUT] 838009.547: [CMS-concurrent-sweep-start]
08/11/10 04:49:59.080 INFO: [STDOUT] 838009.724: [GC 838009.724: [ParNew: 191742K->21248K(191744K), 0.0542140 secs] 2391515K->2230929K(2538752K), 0.0543510 secs] [Times: user=0.62 sys=0.00, real=0.06 secs]
08/11/10 04:49:59.485 INFO: [STDOUT] 838010.129: [GC 838010.129: [ParNew: 191744K->21248K(191744K), 0.0540100 secs] 2394178K->2242667K(2538752K), 0.0541370 secs] [Times: user=0.51 sys=0.00, real=0.05 secs]
08/11/10 04:49:59.972 INFO: [STDOUT] 838010.617: [GC 838010.617: [ParNew: 191744K->21248K(191744K), 0.0526810 secs] 2274960K->2117834K(2538752K), 0.0528130 secs] [Times: user=0.51 sys=0.00, real=0.06 secs]
08/11/10 04:50:00.457 INFO: [STDOUT] 838011.095: [GC 838011.095: [ParNew: 191744K->21248K(191744K), 0.0600530 secs] 2111344K->1956635K(2538752K), 0.0601820 secs] [Times: user=0.57 sys=0.02, real=0.06 secs]
08/11/10 04:50:00.826 INFO: [STDOUT] 838011.524: [CMS-concurrent-sweep: 1.751/1.978 secs] [Times: user=7.82 sys=0.10, real=1.98 secs]
08/11/10 04:50:00.826 INFO: [STDOUT] 838011.524: [CMS-concurrent-reset-start]
08/11/10 04:50:00.837 INFO: [STDOUT] 838011.535: [CMS-concurrent-reset: 0.011/0.011 secs] [Times: user=0.04 sys=0.00, real=0.01 secs]
08/11/10 04:50:00.905 INFO: [STDOUT] 838011.555: [GC 838011.555: [ParNew: 191744K->21248K(191744K), 0.0474180 secs] 1965205K->1809170K(2538752K), 0.0475450 secs] [Times: user=0.47 sys=0.01, real=0.05 secs]
...snipped more lines like the above...
08/11/10 04:50:09.294 INFO: [STDOUT] 838019.927: [GC 838019.927: [ParNew: 191744K->21248K(191744K), 0.0643710 secs] 2273040K->2120949K(2538752K), 0.0645110 secs] [Times: user=0.62 sys=0.00, real=0.07 secs]
08/11/10 04:50:09.365 INFO: [STDOUT] 838019.994: [GC [1 CMS-initial-mark: 2099701K(2347008K)] 2124402K(2538752K), 0.0687110 secs] [Times: user=0.07 sys=0.00, real=0.07 secs]
08/11/10 04:50:09.365 INFO: [STDOUT] 838020.063: [CMS-concurrent-mark-start]
08/11/10 04:50:09.776 INFO: [STDOUT] 838020.411: [GC 838020.411: [ParNew: 191593K->21248K(191744K), 0.0627830 secs] 2291295K->2133868K(2538752K), 0.0629180 secs] [Times: user=0.66 sys=0.00, real=0.06 secs]
08/11/10 04:50:10.232 INFO: [STDOUT] 838020.863: [GC 838020.863: [ParNew: 191744K->21248K(191744K), 0.0667860 secs] 2304364K->2149235K(2538752K), 0.0669190 secs] [Times: user=0.68 sys=0.02, real=0.07 secs]
08/11/10 04:50:10.639 INFO: [STDOUT] 838021.295: [GC 838021.295: [ParNew: 191744K->21248K(191744K), 0.0417360 secs] 2319731K->2161508K(2538752K), 0.0418720 secs] [Times: user=0.45 sys=0.01, real=0.04 secs]
08/11/10 04:50:11.070 INFO: [STDOUT] 838021.723: [GC 838021.723: [ParNew: 191744K->21248K(191744K), 0.0446820 secs] 2332004K->2172820K(2538752K), 0.0448170 secs] [Times: user=0.43 sys=0.01, real=0.05 secs]
08/11/10 04:50:11.297 INFO: [STDOUT] 838021.995: [CMS-concurrent-mark: 1.711/1.932 secs] [Times: user=12.69 sys=0.26, real=1.93 secs]
08/11/10 04:50:11.297 INFO: [STDOUT] 838021.995: [CMS-concurrent-preclean-start]
08/11/10 04:50:11.443 INFO: [STDOUT] 838022.141: [CMS-concurrent-preclean: 0.139/0.145 secs] [Times: user=0.42 sys=0.01, real=0.15 secs]
08/11/10 04:50:11.443 INFO: [STDOUT] 838022.141: [CMS-concurrent-abortable-preclean-start]
08/11/10 04:50:11.506 INFO: [STDOUT] 838022.159: [GC 838022.159: [ParNew: 191744K->21248K(191744K), 0.0451600 secs] 2343316K->2185305K(2538752K), 0.0452930 secs] [Times: user=0.49 sys=0.00, real=0.04 secs]
08/11/10 04:50:11.732 INFO: [STDOUT] 838022.430: [CMS-concurrent-abortable-preclean: 0.238/0.289 secs] [Times: user=1.25 sys=0.02, real=0.29 secs]
08/11/10 04:50:11.787 INFO: [STDOUT] 838022.431: [GC[YG occupancy: 116766 K (191744 K)]838022.431: [Rescan (parallel) , 0.0539670 secs]838022.485: [weak refs processing, 0.0000120 secs] [1 CMS-remark: 2164057K(2347008K)] 2280824K(2538752K), 0.0540660 secs] [Times: user=0.25 sys=0.01, real=0.05 secs]
08/11/10 04:50:11.787 INFO: [STDOUT] 838022.485: [CMS-concurrent-sweep-start]
08/11/10 04:50:12.007 INFO: [STDOUT] 838022.656: [GC 838022.656: [ParNew: 191744K->21248K(191744K), 0.0482630 secs] 2351996K->2194514K(2538752K), 0.0483850 secs] [Times: user=0.50 sys=0.00, real=0.05 secs]
08/11/10 04:50:12.466 INFO: [STDOUT] 838023.100: [GC 838023.100: [ParNew: 191744K->21248K(191744K), 0.0633230 secs] 2321717K->2169489K(2538752K), 0.0634400 secs] [Times: user=0.60 sys=0.01, real=0.06 secs]
08/11/10 04:50:12.939 INFO: [STDOUT] 838023.556: [GC 838023.556: [ParNew: 191744K->21248K(191744K), 0.0808710 secs] 2217912K->2068934K(2538752K), 0.0809900 secs] [Times: user=0.71 sys=0.01, real=0.08 secs]
08/11/10 04:50:13.401 INFO: [STDOUT] 838024.046: [GC 838024.047: [ParNew: 191744K->21248K(191744K), 0.0526430 secs] 2057771K->1903298K(2538752K), 0.0527570 secs] [Times: user=0.53 sys=0.00, real=0.06 secs]
08/11/10 04:50:13.698 INFO: [STDOUT] 838024.396: [CMS-concurrent-sweep: 1.661/1.911 secs] [Times: user=7.45 sys=0.06, real=1.91 secs]
08/11/10 04:50:13.698 INFO: [STDOUT] 838024.396: [CMS-concurrent-reset-start]
08/11/10 04:50:13.709 INFO: [STDOUT] 838024.407: [CMS-concurrent-reset: 0.011/0.011 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]
08/11/10 04:50:13.862 INFO: [STDOUT] 838024.504: [GC 838024.504: [ParNew: 191744K->21248K(191744K), 0.0555900 secs] 1955512K->1802116K(2538752K), 0.0557060 secs] [Times: user=0.55 sys=0.00, real=0.06 secs]
08/11/10 04:50:14.310 INFO: [STDOUT] 838024.955: [GC 838024.955: [ParNew: 191744K->21248K(191744K), 0.0529320 secs] 1972612K->1815824K(2538752K), 0.0530480 secs] [Times: user=0.54 sys=0.00, real=0.06 secs]
...snipped more lines like the above...
08/11/10 04:50:22.700 INFO: [STDOUT] 838033.338: [GC 838033.338: [ParNew: 191744K->21248K(191744K), 0.0605790 secs] 2301499K->2150625K(2538752K), 0.0607030 secs] [Times: user=0.55 sys=0.01, real=0.07 secs]
08/11/10 04:50:22.752 INFO: [STDOUT] 838033.399: [GC [1 CMS-initial-mark: 2129377K(2347008K)] 2151938K(2538752K), 0.0507410 secs] [Times: user=0.05 sys=0.00, real=0.05 secs]
08/11/10 04:50:22.752 INFO: [STDOUT] 838033.450: [CMS-concurrent-mark-start]
08/11/10 04:50:23.243 INFO: [STDOUT] 838033.858: [GC 838033.858: [ParNew: 191744K->21248K(191744K), 0.0819830 secs] 2321121K->2172986K(2538752K), 0.0821080 secs] [Times: user=0.76 sys=0.01, real=0.08 secs]
08/11/10 04:50:23.670 INFO: [STDOUT] 838034.322: [GC 838034.322: [ParNew: 191744K->21248K(191744K), 0.0449420 secs] 2343482K->2184104K(2538752K), 0.0450690 secs] [Times: user=0.48 sys=0.00, real=0.04 secs]
08/11/10 04:50:24.089 INFO: [STDOUT] 838034.738: [GC 838034.738: [ParNew: 191744K->21248K(191744K), 0.0479310 secs] 2354600K->2194712K(2538752K), 0.0480740 secs] [Times: user=0.47 sys=0.00, real=0.04 secs]
08/11/10 04:50:24.488 INFO: [STDOUT] 838035.142: [GC 838035.142: [ParNew: 191744K->21248K(191744K), 0.0431990 secs] 2365208K->2208919K(2538752K), 0.0433620 secs] [Times: user=0.41 sys=0.01, real=0.04 secs]
08/11/10 04:50:24.691 INFO: [STDOUT] 838035.388: [CMS-concurrent-mark: 1.714/1.938 secs] [Times: user=12.50 sys=0.32, real=1.94 secs]
08/11/10 04:50:24.691 INFO: [STDOUT] 838035.389: [CMS-concurrent-preclean-start]
08/11/10 04:50:24.910 INFO: [STDOUT] 838035.557: [GC 838035.557: [ParNew: 191744K->21248K(191744K), 0.0513880 secs] 2379415K->2231745K(2538752K), 0.0515200 secs] [Times: user=0.51 sys=0.01, real=0.06 secs]
08/11/10 04:50:24.994 INFO: [STDOUT] 838035.692: [CMS-concurrent-preclean: 0.235/0.303 secs] [Times: user=1.33 sys=0.05, real=0.30 secs]
08/11/10 04:50:24.994 INFO: [STDOUT] 838035.692: [CMS-concurrent-abortable-preclean-start]
08/11/10 04:50:25.195 INFO: [STDOUT] 838035.893: [CMS-concurrent-abortable-preclean: 0.097/0.201 secs] [Times: user=0.71 sys=0.00, real=0.20 secs]
08/11/10 04:50:25.384 INFO: [STDOUT] 838035.894: [GC[YG occupancy: 183171 K (191744 K)]838035.894: [Rescan (parallel) , 0.1881420 secs]838036.082: [weak refs processing, 0.0000180 secs] [1 CMS-remark: 2210497K(2347008K)] 2393668K(2538752K), 0.1882460 secs] [Times: user=0.44 sys=0.01, real=0.19 secs]
08/11/10 04:50:25.384 INFO: [STDOUT] 838036.082: [CMS-concurrent-sweep-start]
08/11/10 04:50:25.469 INFO: [STDOUT] 838036.091: [GC 838036.091: [ParNew: 191744K->21248K(191744K), 0.0758220 secs] 2402241K->2249696K(2538752K), 0.0759740 secs] [Times: user=0.78 sys=0.01, real=0.07 secs]
08/11/10 04:50:25.906 INFO: [STDOUT] 838036.498: [GC 838036.498: [ParNew: 191744K->21248K(191744K), 0.1048980 secs] 2416373K->2273939K(2538752K), 0.1050130 secs] [Times: user=0.68 sys=0.02, real=0.10 secs]
08/11/10 04:50:26.362 INFO: [STDOUT] 838037.003: [GC 838037.003: [ParNew: 191744K->21248K(191744K), 0.0564760 secs] 2392077K->2235695K(2538752K), 0.0565890 secs] [Times: user=0.54 sys=0.00, real=0.06 secs]
08/11/10 04:50:26.465 INFO: [STDOUT] 2010-08-11 04:50:26,464 [Logger@9219882 3.5.3/465p2] DEBUG Coherence - 2010-08-11 04:50:26.464/838037.072 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=35): Member 161 left service Management with senior member 1
08/11/10 04:50:26.837 INFO: [STDOUT] 838037.459: [GC 838037.459: [ParNew: 191744K->21248K(191744K), 0.0763280 secs] 2245791K->2097939K(2538752K), 0.0764460 secs] [Times: user=0.70 sys=0.01, real=0.07 secs]
08/11/10 04:50:27.194 INFO: [STDOUT] 2010-08-11 04:50:27,193 [Logger@9219882 3.5.3/465p2] DEBUG Coherence - 2010-08-11 04:50:27.186/838037.794 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=35): MemberLeft notification for Member 161 received from Member(Id=24, Timestamp=2010-08-01 12:02:55.514, Address=11.160.34.93:8091, MachineId=35165, Location=machine:lonrs06428,process:1622,member:lonrs06428:Data-7, Role=RbsOdcCoreDaoODCCacheServer)
08/11/10 04:50:27.194 INFO: [STDOUT] 2010-08-11 04:50:27,194 [Logger@9219882 3.5.3/465p2] DEBUG Coherence - 2010-08-11 04:50:27.186/838037.794 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=35): Member(Id=161, Timestamp=2010-08-11 04:50:27.186, Address=11.160.34.104:8088, MachineId=35176, Location=machine:lonrs06442,process:18540,member:lonrs06442:VEST-18, Role=RbsOdcVestCoreVestMain) left Cluster with senior member 1
08/11/10 04:50:27.351 INFO: [STDOUT] 838037.966: [GC 838037.966: [ParNew: 191744K->21248K(191744K), 0.0820220 secs] 2093359K->1944221K(2538752K), 0.0821380 secs] [Times: user=0.76 sys=0.01, real=0.09 secs]
08/11/10 04:50:27.661 INFO: [STDOUT] 838038.359: [CMS-concurrent-sweep: 1.867/2.277 secs] [Times: user=9.54 sys=0.12, real=2.28 secs]
08/11/10 04:50:27.661 INFO: [STDOUT] 838038.359: [CMS-concurrent-reset-start]
08/11/10 04:50:27.672 INFO: [STDOUT] 838038.370: [CMS-concurrent-reset: 0.011/0.011 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]
We have seen this situation a few times with different nodes and different services in the error message. Sometimes the RemainingMemberSet=[] list has one node in it sometimes it has many nodes.
Ruling out the obvious stuff like GCs etc... any other pointers to what might be happening would be appreciated.
Edited on 27th October 2010:
We are still seeing this and it is becoming more frequent if we attempt to get new processes to join an existing cluster.
I would be grateful if someone can explain what the time-out actually is and if it is possible to adjust it in any way.
Cheers,
JK