Hi all, we are using BDB JE 6.4.9 in production environment. We bulid a ReplicationEnvironment with two node using HA mode and only the master node can provide service.
Our application runs well for all the past time before it met a network problem that the apllication runs on master node is still alive but couldn't connect to the replica and any other host. This last about 5 minutes, during this period the replica "DesignatePrimary" itself so that it can provide service. But when the network is rebuild, we meet some problem when the original master try to re-join the group:
com.sleepycat.je.rep.InsufficientLogException: (JE 6.4.9) Environment must be closed, caused by: com.sleepycat.je.rep.InsufficientLogException: Environment invalid because of previous exception: (JE 6.4.9) logic2-10.254.81.237(1):/logic2 INSUFFICIENT_LOG: Log files at this node are obsolete. Environment is invalid and must be closed. Originally thrown by HA thread: REPLICA logic2-10.254.81.237(1) Originally thrown by HA thread: REPLICA logic2-10.254.81.237(1)
at com.sleepycat.je.rep.InsufficientLogException.wrapSelf(InsufficientLogException.java:129)
at com.sleepycat.je.dbi.EnvironmentImpl.checkIfInvalid(EnvironmentImpl.java:1673)
at com.sleepycat.je.Environment.checkEnv(Environment.java:2486)
at com.sleepycat.je.rep.ReplicatedEnvironment.getGroup(ReplicatedEnvironment.java:784)
at com.tencent.hippo.broker.service.impl.DefStoreManagerService.getBrokerGroup(DefStoreManagerService.java:706)
at com.tencent.hippo.broker.service.impl.DefaultBrokerService$HeartBeatTask.run(DefaultBrokerService.java:1536)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.sleepycat.je.rep.InsufficientLogException: Environment invalid because of previous exception: (JE 6.4.9) logic2-10.254.81.237(1):/logic2 INSUFFICIENT_LOG: Log files at this node are obsolete. Environment is invalid and must be closed. Originally thrown by HA thread: REPLICA logic2-10.254.81.237(1) Originally thrown by HA thread: REPLICA logic2-10.254.81.237(1)
at com.sleepycat.je.rep.stream.ReplicaFeederSyncup.setupLogRefresh(ReplicaFeederSyncup.java:665)
at com.sleepycat.je.rep.stream.ReplicaFeederSyncup.verifyRollback(ReplicaFeederSyncup.java:335)
at com.sleepycat.je.rep.stream.ReplicaFeederSyncup.execute(ReplicaFeederSyncup.java:184)
at com.sleepycat.je.rep.impl.node.Replica.initReplicaLoop(Replica.java:742)
at com.sleepycat.je.rep.impl.node.Replica.runReplicaLoopInternal(Replica.java:505)
at com.sleepycat.je.rep.impl.node.Replica.runReplicaLoop(Replica.java:440)
at com.sleepycat.je.rep.impl.node.RepNode.run(RepNode.java:1746)
we search the je.info file and find something confusing. In the original master node which try to re-join to group as replica because the original replica has DesignatePrimary the je.info shows:
2017-08-09 08:02:48.320 UTC INFO [logic2-10.254.81.237] Replica loop started with master: logic2-10.254.89.154(2)
2017-08-09 08:02:48.321 UTC INFO [logic2-10.254.81.237] Replica-feeder handshake start
2017-08-09 08:02:48.335 UTC INFO [logic2-10.254.81.237] Replica-feeder logic2-10.254.89.154 handshake completed.
2017-08-09 08:02:48.336 UTC INFO [logic2-10.254.81.237] Replica syncup after election to verify master: logic2-10.254.89.154(2) elected master: logic2-10.254.89.154(2)
2017-08-09 08:02:48.336 UTC INFO [logic2-10.254.81.237] Replica-feeder logic2-10.254.89.154 syncup started. Replica range: first=52,024,819,277 last=52,753,505,140 sync=52,753,505,140 txnEnd=52,753,505,140
2017-08-09 08:02:49.151 UTC INFO [logic2-10.254.81.237] matchpointVLSN of 52,753,505,132 precedes a checkpoint end, needs network restore.
2017-08-09 08:02:49.152 UTC INFO [logic2-10.254.81.237] Releasing commit block latch
2017-08-09 08:02:49.152 UTC INFO [logic2-10.254.81.237] Replica-feeder logic2-10.254.89.154 syncup ended. Elapsed time: 816ms
2017-08-09 08:02:49.152 UTC WARNING [ logic2-10.254.81.237] Exiting inner Replica loop with exception com.sleepycat.je.rep.InsufficientLogException: (JE 6.4.9) logic2-10.254.81.237(1):/logic2 INSUFFICIENT_LOG: Log files at this node are obsolete. Environment is invalid and must be closed.
com.sleepycat.je.rep.InsufficientLogException: (JE 6.4.9) logic2-10.254.81.237(1): /logic2 INSUFFICIENT_LOG: Log files at this node are obsolete. Environment is invalid and must be closed.
at com.sleepycat.je.rep.stream.ReplicaFeederSyncup.setupLogRefresh(ReplicaFeederSyncup.java:665)
at com.sleepycat.je.rep.stream.ReplicaFeederSyncup.verifyRollback(ReplicaFeederSyncup.java:335)
at com.sleepycat.je.rep.stream.ReplicaFeederSyncup.execute(ReplicaFeederSyncup.java:184)
at com.sleepycat.je.rep.impl.node.Replica.initReplicaLoop(Replica.java:742)
at com.sleepycat.je.rep.impl.node.Replica.runReplicaLoopInternal(Replica.java:505)
at com.sleepycat.je.rep.impl.node.Replica.runReplicaLoop(Replica.java:440)
at com.sleepycat.je.rep.impl.node.RepNode.run(RepNode.java:1746)
2017-08-09 08:02:49.152 UTC INFO [ logic2-10.254.81.237] Replica stats - Lag waits: 0 Lag wait time: 0ms. VLSN waits: 0 Lag wait time: 0ms.
2017-08-09 08:02:49.153 UTC INFO [ logic2-10.254.81.237] RepNode main thread shutting down.
In the original replica which has performed DesignatePrimary, the je.info shows:
2017-08-09 08:02:48.382 UTC INFO [logic2-10.254.89.154] Feeder accepted connection from com.sleepycat.je.rep.utilint.net.SimpleDataChannel@7761796a
2017-08-09 08:02:48.393 UTC INFO [logic2-10.254.89.154] Feeder-replica handshake start
2017-08-09 08:02:48.395 UTC INFO [logic2-10.254.89.154] Feeder-replica logic2-10.254.81.237 handshake completed.
2017-08-09 08:02:48.395 UTC INFO [logic2-10.254.89.154] Feeder-replica logic2-10.254.81.237 syncup started. Feeder range: first=52,025,747,787 last=52,753,526,236 sync=52,753,526,236 txnEnd=52,753,526,236
2017-08-09 08:02:49.211 UTC INFO [logic2-10.254.89.154] Feeder-replica logic2-10.254.81.237 syncup ended. Elapsed time: 816ms
2017-08-09 08:02:49.211 UTC INFO [logic2-10.254.89.154] Matchpoint vlsn 52,753,505,132 requested by node: logic2-10.254.81.237(1) was outside the VLSN range: [52,025,747,787-52,753,526,236]
2017-08-09 08:02:49.212 UTC INFO [logic2-10.254.89.154] Shutting down feeder for replica logic2-10.254.81.237 Reason: Matchpoint vlsn 52,753,505,132 requested by node: logic2-10.254.81.237(1) was outside the VLSN range: [52,025,747,787-52,753,526,236] write time: 0ms Avg write time: 9us
2017-08-09 08:02:49.212 UTC INFO [logic2-10.254.89.154] Feeder Output soft shutdown initiated.
We can find that the original master report the matchpointVLSN of 52,753,505,132 , and the original replica's Feeder range: first=52,025,747,787 last=52,753,526,236 , but in the je.info of the original replica : Matchpoint vlsn 52,753,505,132 requested by node: logic2-10.254.81.237(1) was outside the VLSN range: [52,025,747,787-52,753,526,236] make us confused.
Can any one provide any suggestions?