Manual intervention required
Hi,
We've had some problems with Coherence nodes crashing under high write load. I've attached the relevant log output below. Can anyone help? Cameron mentioned that he knew something about this bug.
Kindest regards,
Andrew.
--- TRACE 1
2009-01-16 09:44:49,140 ERROR [Logger@10222989 3.3.1/389] [Coherence] 2009-01-16 09:44:49.140 Oracle Coherence EE 3.3.1/389 <Error> (thread=Distrib
utedCache, member=19): validatePolls: Manual intervention is required to stop this node or the members that have not responded to this poll (gap si
ze=4194994): Poll
{
PollId=26185410, active
InitTimeMillis=1231755598556
Service=DistributedCache (3)
RespondedMemberSet=[]
LeftMemberSet=[]
RemainingMemberSet=[14]
Request=Message "GetRequest"
{
FromMember=Member(Id=19, Timestamp=2008-11-04 12:08:15.46, Address=10.236.1.14:29092, MachineId=59150, Location=machine:loneqesappp,process:2151@
loneqesappp8,member:web-1, Role=web)
FromMessageId=34529738
Internal=false
MessagePartCount=1
PendingCount=0
MessageType=6
ToPollId=0
Poll=null
Packets
{
Service=DistributedCache{Name=DistributedCache, State=(SERVICE_STARTED), Id=3, Version=3.2, OldestMemberId=1, LocalStorage=disabled
ToMemberSet=MemberSet(Size=1, BitSetCount=1
Member(Id=14, Timestamp=2008-11-04 12:05:32.039, Address=10.238.75.1:40022, MachineId=44801, Location=machine:loneqesappb,process:5379@loneqe
sappb7,member:cache-3, Role=cache)
)
NotifySent=false
within array capacity=8388608, expanions=17, size=30380407, window index=0, window size=2720399, first index=27660008, last index=30380406, opt
imistic gets=30479934, place holder allocations=693, waits=0, waiting threads=0
2009-01-16 14:45:57,171 ERROR [Logger@10222989 3.3.1/389] [Coherence] 2009-01-16 14:45:57.171 Oracle Coherence EE 3.3.1/389 <Error> (thread=Distrib
utedCache, member=19): validatePolls: Manual intervention is required to stop this node or the members that have not responded to this poll (gap si
ze=4194818): Poll
{
PollId=27660008, active
InitTimeMillis=1231840579761
Service=DistributedCache (3)
RespondedMemberSet=[]
LeftMemberSet=[]
RemainingMemberSet=[7]
Request=Message "GetRequest"
{
FromMember=Member(Id=19, Timestamp=2008-11-04 12:08:15.46, Address=10.236.1.14:29092, MachineId=59150, Location=machine:loneqesappp,process:2151@
loneqesappp8,member:web-1, Role=web)
FromMessageId=36124981
Internal=false
MessagePartCount=1
PendingCount=0
MessageType=6
ToPollId=0
Poll=null
Packets
{
Service=DistributedCache{Name=DistributedCache, State=(SERVICE_STARTED), Id=3, Version=3.2, OldestMemberId=1, LocalStorage=disabled
ToMemberSet=MemberSet(Size=1, BitSetCount=1
Member(Id=7, Timestamp=2008-11-04 12:04:12.485, Address=10.236.1.13:40032, MachineId=59149, Location=machine:loneqesappp,process:7740@loneqes
appp7,member:cache-4, Role=cache)
)
NotifySent=false
within array capacity=8388608, expanions=17, size=31854827, window index=0, window size=4194807, first index=27660020, last index=31854826, opt
imistic gets=31957402, place holder allocations=693, waits=0, waiting threads=0
2009-01-16 14:45:57,478 ERROR [Logger@10222989 3.3.1/389] [Coherence] 2009-01-16 14:45:57.477 Oracle Coherence EE 3.3.1/389 <Error> (thread=Distrib
utedCache, member=19): validatePolls: Manual intervention is required to stop this node or the members that have not responded to this poll (gap si
ze=4195428): Poll
{
PollId=27660020, active
InitTimeMillis=1231840881361
Service=DistributedCache (3)
RespondedMemberSet=[]
LeftMemberSet=[]
RemainingMemberSet=[13]
Request=Message "GetRequest"
{
FromMember=Member(Id=19, Timestamp=2008-11-04 12:08:15.46, Address=10.236.1.14:29092, MachineId=59150, Location=machine:loneqesappp,process:2151@
loneqesappp8,member:web-1, Role=web)
FromMessageId=36125305
Internal=false
MessagePartCount=1
PendingCount=0
MessageType=6
ToPollId=0
Poll=null
Packets
{
Service=DistributedCache{Name=DistributedCache, State=(SERVICE_STARTED), Id=3, Version=3.2, OldestMemberId=1, LocalStorage=disabled
ToMemberSet=MemberSet(Size=1, BitSetCount=1
Member(Id=13, Timestamp=2008-11-04 12:04:58.503, Address=10.238.75.2:40012, MachineId=44802, Location=machine:loneqesappb,process:20808@loneq
esappb8,member:cache-2, Role=cache)
)
NotifySent=false
within array capacity=8388608, expanions=17, size=31855451, window index=0, window size=4171842, first index=27683609, last index=31855450, opt
imistic gets=31958023, place holder allocations=693, waits=0, waiting threads=0
2009-01-16 14:46:02,942 ERROR [Logger@10222989 3.3.1/389] [Coherence] 2009-01-16 14:46:02.942 Oracle Coherence EE 3.3.1/389 <Error> (thread=Distrib
utedCache, member=19): validatePolls: Manual intervention is required to stop this node or the members that have not responded to this poll (gap si
ze=4194705): Poll
{
PollId=27683609, active
InitTimeMillis=1231845364476
Service=DistributedCache (3)
RespondedMemberSet=[]
LeftMemberSet=[]
RemainingMemberSet=[2]
Request=Message "GetRequest"
{
FromMember=Member(Id=19, Timestamp=2008-11-04 12:08:15.46, Address=10.236.1.14:29092, MachineId=59150, Location=machine:loneqesappp,process:2151@
loneqesappp8,member:web-1, Role=web)
FromMessageId=36153742
Internal=false
MessagePartCount=1
PendingCount=0
MessageType=6
ToPollId=0
Poll=null
Packets
{
Service=DistributedCache{Name=DistributedCache, State=(SERVICE_STARTED), Id=3, Version=3.2, OldestMemberId=1, LocalStorage=disabled
ToMemberSet=MemberSet(Size=1, BitSetCount=1
Member(Id=2, Timestamp=2008-11-04 12:01:30.892, Address=10.236.1.13:40022, MachineId=59149, Location=machine:loneqesappp,process:7712@loneqes
appp7,member:cache-3, Role=cache)
)
NotifySent=false
within array capacity=8388608, expanions=17, size=31878318, window index=0, window size=3524439, first index=28353879, last index=31878317, opt
imistic gets=31980935, place holder allocations=693, waits=0, waiting threads=0
--- TRACE 2
2009-01-16 14:45:55,118 WARN [Logger@10191909 3.3.1/389] [Coherence] 2009-01-16 14:45:55.117 Oracle Coherence EE 3.3.1/389 <Warning> (thread=Packe
tPublisher, member=13): Experienced a 1173 ms communication delay (probable remote GC) with Member(Id=19, Timestamp=2008-11-04 12:08:15.46, Address
=10.236.1.14:29092, MachineId=59150, Location=machine:loneqesappp,process:2151@loneqesappp8,member:web-1, Role=web); 32 packets rescheduled, PauseR
ate=0.0, Threshold=1785
--- TRACE 3
2009-01-13 13:31:40,828 ERROR [Logger@10226097 3.3.1/389] [Coherence] 2009-01-13 13:31:40.827 Oracle Coherence EE 3.3.1/389 <Error> (thread=Distrib
utedCache, member=4): validatePolls: Manual intervention is required to stop this node or the members that have not responded to this poll (gap siz
e=4194454): Poll
{
PollId=31653390, active
InitTimeMillis=1230994948717
Service=DistributedCache (3)
RespondedMemberSet=[]
LeftMemberSet=[]
RemainingMemberSet=[2]
Request=Message "GetRequest"
{
FromMember=Member(Id=4, Timestamp=2008-11-04 12:06:46.181, Address=10.238.75.2:29092, MachineId=44802, Location=machine:loneqesappb,process:21378
@loneqesappb8,member:web-1, Role=web)
FromMessageId=38869080
Internal=false
MessagePartCount=1
PendingCount=0
MessageType=6
ToPollId=0
Poll=null
Packets
{
Service=DistributedCache{Name=DistributedCache, State=(SERVICE_STARTED), Id=3, Version=3.2, OldestMemberId=1, LocalStorage=disabled
ToMemberSet=MemberSet(Size=1, BitSetCount=1
Member(Id=2, Timestamp=2008-11-04 12:01:30.892, Address=10.236.1.13:40022, MachineId=59149, Location=machine:loneqesappp,process:7712@loneqes
appp7,member:cache-3, Role=cache)
)
NotifySent=false
within array capacity=8388608, expanions=17, size=35847849, window index=0, window size=1970823, first index=33877026, last index=35847848, opt
imistic gets=35917395, place holder allocations=2022, waits=0, waiting threads=0
2009-01-16 14:45:19,994 ERROR [Logger@10226097 3.3.1/389] [Coherence] 2009-01-16 14:45:19.994 Oracle Coherence EE 3.3.1/389 <Error> (thread=Distrib
utedCache, member=4): validatePolls: Manual intervention is required to stop this node or the members that have not responded to this poll (gap siz
e=4194399): Poll
{
PollId=33877026, active
InitTimeMillis=1231402622390
Service=DistributedCache (3)
RespondedMemberSet=[]
LeftMemberSet=[]
RemainingMemberSet=[2]
Request=Message "GetRequest"
{
FromMember=Member(Id=4, Timestamp=2008-11-04 12:06:46.181, Address=10.238.75.2:29092, MachineId=44802, Location=machine:loneqesappb,process:21378
@loneqesappb8,member:web-1, Role=web)
FromMessageId=41643934
Internal=false
MessagePartCount=1
PendingCount=0
MessageType=6
ToPollId=0
Poll=null
Packets
{
Service=DistributedCache{Name=DistributedCache, State=(SERVICE_STARTED), Id=3, Version=3.2, OldestMemberId=1, LocalStorage=disabled
ToMemberSet=MemberSet(Size=1, BitSetCount=1
Member(Id=2, Timestamp=2008-11-04 12:01:30.892, Address=10.236.1.13:40022, MachineId=59149, Location=machine:loneqesappp,process:7712@loneqes
appp7,member:cache-3, Role=cache)
)
NotifySent=false
within array capacity=8388608, expanions=17, size=38071434, window index=0, window size=2223470, first index=35847964, last index=38071433, opt
imistic gets=38147372, place holder allocations=2022, waits=0, waiting threads=0
2009-01-16 12:38:32,297 ERROR [Logger@10172484 3.3.1/389] [Coherence] 2009-01-16 12:38:32.296 Oracle Coherence EE 3.3.1/389 <Error> (thread=Distrib
utedCache, member=5): validatePolls: Manual intervention is required to stop this node or the members that have not responded to this poll (gap siz
e=4194534): Poll
{
PollId=16095231, active
InitTimeMillis=1231382688649
Service=DistributedCache (3)
RespondedMemberSet=[]
LeftMemberSet=[]
RemainingMemberSet=[13]
Request=Message "GetRequest"
{
FromMember=Member(Id=5, Timestamp=2008-11-04 12:03:30.711, Address=10.236.1.13:29092, MachineId=59149, Location=machine:loneqesappp,process:7840@
loneqesappp7,member:web-1, Role=web)
FromMessageId=22909848
Internal=false
MessagePartCount=1
PendingCount=0
MessageType=6
ToPollId=0
Poll=null
Packets
{
Service=DistributedCache{Name=DistributedCache, State=(SERVICE_STARTED), Id=3, Version=3.2, OldestMemberId=1, LocalStorage=disabled
ToMemberSet=MemberSet(Size=1, BitSetCount=1
Member(Id=13, Timestamp=2008-11-04 12:04:58.503, Address=10.238.75.2:40012, MachineId=44802, Location=machine:loneqesappb,process:20808@loneq
esappb8,member:cache-2, Role=cache)
)
NotifySent=false
within array capacity=8388608, expanions=17, size=20289766, window index=0, window size=3539027, first index=16750739, last index=20289765, opt
imistic gets=20353716, place holder allocations=330, waits=0, waiting threads=0
2009-01-16 12:55:21,295 ERROR [Logger@10172484 3.3.1/389] [Coherence] 2009-01-16 12:55:21.294 Oracle Coherence EE 3.3.1/389 <Error> (thread=Distrib
utedCache, member=5): validatePolls: Manual intervention is required to stop this node or the members that have not responded to this poll (gap siz
e=4195153): Poll
{
PollId=16750739, active
InitTimeMillis=1231574887534
Service=DistributedCache (3)
RespondedMemberSet=[]
LeftMemberSet=[]
RemainingMemberSet=[2]
Request=Message "GetRequest"
{
FromMember=Member(Id=5, Timestamp=2008-11-04 12:03:30.711, Address=10.236.1.13:29092, MachineId=59149, Location=machine:loneqesappp,process:7840@
loneqesappp7,member:web-1, Role=web)
FromMessageId=23771743
Internal=false
MessagePartCount=1
PendingCount=0
MessageType=6
ToPollId=0
Poll=null
Packets
{
Service=DistributedCache{Name=DistributedCache, State=(SERVICE_STARTED), Id=3, Version=3.2, OldestMemberId=1, LocalStorage=disabled
ToMemberSet=MemberSet(Size=1, BitSetCount=1
Member(Id=2, Timestamp=2008-11-04 12:01:30.892, Address=10.236.1.13:40022, MachineId=59149, Location=machine:loneqesappp,process:7712@loneqes
appp7,member:cache-3, Role=cache)
)
NotifySent=false
within array capacity=8388608, expanions=17, size=20945893, window index=0, window size=4193137, first index=16752756, last index=20945892, opt
imistic gets=21010792, place holder allocations=330, waits=0, waiting threads=0
2009-01-16 12:55:23,593 ERROR [Logger@10172484 3.3.1/389] [Coherence] 2009-01-16 12:55:23.592 Oracle Coherence EE 3.3.1/389 <Error> (thread=Distrib
utedCache, member=5): validatePolls: Manual intervention is required to stop this node or the members that have not responded to this poll (gap siz
e=4195137): Poll
{
PollId=16752756, active
InitTimeMillis=1231576112604
Service=DistributedCache (3)
RespondedMemberSet=[]
LeftMemberSet=[]
RemainingMemberSet=[21]
Request=Message "GetRequest"
{
FromMember=Member(Id=5, Timestamp=2008-11-04 12:03:30.711, Address=10.236.1.13:29092, MachineId=59149, Location=machine:loneqesappp,process:7840@
loneqesappp7,member:web-1, Role=web)
FromMessageId=23775035
Internal=false
MessagePartCount=1
PendingCount=0
MessageType=6
ToPollId=0
Poll=null
Packets
{
Service=DistributedCache{Name=DistributedCache, State=(SERVICE_STARTED), Id=3, Version=3.2, OldestMemberId=1, LocalStorage=disabled
ToMemberSet=MemberSet(Size=1, BitSetCount=1
Member(Id=21, Timestamp=2008-11-04 12:08:41.438, Address=10.238.75.2:40002, MachineId=44802, Location=machine:loneqesappb,process:20787@loneq
esappb8,member:cache-1, Role=cache)
)
NotifySent=false
within array capacity=8388608, expanions=17, size=20947894, window index=0, window size=3558932, first index=17388962, last index=20947893, opt
imistic gets=21012793, place holder allocations=330, waits=0, waiting threads=0
2009-01-16 13:11:20,573 ERROR [Logger@10172484 3.3.1/389] [Coherence] 2009-01-16 13:11:20.573 Oracle Coherence EE 3.3.1/389 <Error> (thread=Distrib
utedCache, member=5): validatePolls: Manual intervention is required to stop this node or the members that have not responded to this poll (gap siz
e=4194448): Poll
{
PollId=17388962, active
InitTimeMillis=1231840981385
Service=DistributedCache (3)
RespondedMemberSet=[]
LeftMemberSet=[]
RemainingMemberSet=[7]
Request=Message "GetRequest"
{
FromMember=Member(Id=5, Timestamp=2008-11-04 12:03:30.711, Address=10.236.1.13:29092, MachineId=59149, Location=machine:loneqesappp,process:7840@
loneqesappp7,member:web-1, Role=web)
FromMessageId=24694954
Internal=false
MessagePartCount=1
PendingCount=0
MessageType=6
ToPollId=0
Poll=null
Packets
{
Service=DistributedCache{Name=DistributedCache, State=(SERVICE_STARTED), Id=3, Version=3.2, OldestMemberId=1, LocalStorage=disabled
ToMemberSet=MemberSet(Size=1, BitSetCount=1
Member(Id=7, Timestamp=2008-11-04 12:04:12.485, Address=10.236.1.13:40032, MachineId=59149, Location=machine:loneqesappp,process:7740@loneqes
appp7,member:cache-4, Role=cache)
)
NotifySent=false
within array capacity=8388608, expanions=17, size=21583411, window index=0, window size=1293624, first index=20289787, last index=21583410, opt
imistic gets=21649257, place holder allocations=330, waits=0, waiting threads=0
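
In case it helps whoever reads these traces: a rough way to see how long each poll had been outstanding when the error was logged is to compare InitTimeMillis with the log timestamp. This is only a sketch. It assumes InitTimeMillis is milliseconds since the Unix epoch and that the log timestamps are in UTC (neither is confirmed by the output above), and `poll_age` is a helper name made up for this post, not part of Coherence.

```python
from datetime import datetime, timedelta, timezone

# Unix epoch as a timezone-aware datetime.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def poll_age(log_time: str, init_time_millis: int) -> timedelta:
    """Return how long a poll had been open when the error was logged.

    ASSUMPTIONS (not confirmed by the logs themselves):
      * init_time_millis is milliseconds since the Unix epoch;
      * log_time, formatted like "2009-01-16 09:44:49.140", is in UTC.
    """
    logged = datetime.strptime(log_time, "%Y-%m-%d %H:%M:%S.%f")
    logged = logged.replace(tzinfo=timezone.utc)
    # Integer millisecond arithmetic avoids float rounding of the timestamp.
    started = EPOCH + timedelta(milliseconds=init_time_millis)
    return logged - started


if __name__ == "__main__":
    # Values copied from the first entry of trace 1 (PollId=26185410).
    print(poll_age("2009-01-16 09:44:49.140", 1231755598556))
```

If those assumptions hold, the first poll in trace 1 had been waiting on member 14 for several days rather than seconds by the time validatePolls complained, which might be worth checking against the other entries.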