Skip to Main Content

Berkeley DB Family

Announcement

For appeals, questions and feedback about Oracle Forums, please email oracle-forums-moderators_us@oracle.com. Technical questions should be asked in the appropriate category. Thank you!

Infinite loop during recovery of JE 4.1.10 database environment

user4547579Feb 2 2012 — edited Feb 2 2012
Hi there,

We have a JE 4.1.10 database environment which is over 20GB in size. In order to improve performance we increased je.log.fileCacheSize so that we could cache all of the 10MB DB log file descriptors in memory and prevent JE from having to constantly open/close log files (the environment is only partially cached in memory and there are more than 2000 log files). Unfortunately, we failed to increase the file descriptor ulimit from the Linux default of 1024 and our application failed when the ulimit was reached.

Since then we've reverted the settings and increased the JVM heap size so that we can fully cache everything in memory again. However, we are having problems recovering the DB environment. It looks like the environment recovery proceeds through the 10 recovery steps but gets stuck while loading the log file utilization meta-data:

Stack trace #1:

"main" prio=10 tid=0x000000005622f000 nid=0x5d94 runnable [0x00000000419fe000]
java.lang.Thread.State: RUNNABLE
at java.io.RandomAccessFile.seek(Native Method)
at com.sleepycat.je.log.FileManager.readFromFileInternal(FileManager.java:1605)
- locked <0x00002aaaf80dca88> (a com.sleepycat.je.log.FileManager$1)
at com.sleepycat.je.log.FileManager.readFromFile(FileManager.java:1560)
at com.sleepycat.je.log.FileManager.readFromFile(FileManager.java:1498)
at com.sleepycat.je.log.FileSource.getBytes(FileSource.java:56)
at com.sleepycat.je.log.LogManager.getLogEntryFromLogSource(LogManager.java:861)
at com.sleepycat.je.log.LogManager.getLogEntry(LogManager.java:790)
at com.sleepycat.je.log.LogManager.getLogEntryAllowInvisibleAtRecovery(LogManager.java:751)
at com.sleepycat.je.tree.IN.fetchTarget(IN.java:1320)
at com.sleepycat.je.tree.BIN.fetchTarget(BIN.java:1367)
at com.sleepycat.je.dbi.CursorImpl.fetchCurrent(CursorImpl.java:2499)
at com.sleepycat.je.dbi.CursorImpl.getCurrentAlreadyLatched(CursorImpl.java:1545)
at com.sleepycat.je.dbi.CursorImpl.getNextWithKeyChangeStatus(CursorImpl.java:1692)
at com.sleepycat.je.dbi.CursorImpl.getNext(CursorImpl.java:1617)
at com.sleepycat.je.cleaner.UtilizationProfile.getFirstFSLN(UtilizationProfile.java:1262)
at com.sleepycat.je.cleaner.UtilizationProfile.populateCache(UtilizationProfile.java:1200)
at com.sleepycat.je.recovery.RecoveryManager.recover(RecoveryManager.java:221)
at com.sleepycat.je.dbi.EnvironmentImpl.finishInit(EnvironmentImpl.java:549)
- locked <0x00002aaaf8009868> (a com.sleepycat.je.dbi.EnvironmentImpl)
at com.sleepycat.je.dbi.DbEnvPool.getEnvironment(DbEnvPool.java:237)
at com.sleepycat.je.Environment.makeEnvironmentImpl(Environment.java:229)
at com.sleepycat.je.Environment.<init>(Environment.java:211)
at com.sleepycat.je.Environment.<init>(Environment.java:165)

Stack trace #2:

"main" prio=10 tid=0x000000005622f000 nid=0x5d94 runnable [0x00000000419fe000]
java.lang.Thread.State: RUNNABLE
at com.sleepycat.je.tree.IN.findEntry(IN.java:2086)
at com.sleepycat.je.dbi.CursorImpl.searchAndPosition(CursorImpl.java:2194)
at com.sleepycat.je.cleaner.UtilizationProfile.getFirstFSLN(UtilizationProfile.java:1242)
at com.sleepycat.je.cleaner.UtilizationProfile.populateCache(UtilizationProfile.java:1200)
at com.sleepycat.je.recovery.RecoveryManager.recover(RecoveryManager.java:221)
at com.sleepycat.je.dbi.EnvironmentImpl.finishInit(EnvironmentImpl.java:549)
- locked <0x00002aaaf8009868> (a com.sleepycat.je.dbi.EnvironmentImpl)
at com.sleepycat.je.dbi.DbEnvPool.getEnvironment(DbEnvPool.java:237)
at com.sleepycat.je.Environment.makeEnvironmentImpl(Environment.java:229)
at com.sleepycat.je.Environment.<init>(Environment.java:211)
at com.sleepycat.je.Environment.<init>(Environment.java:165)

It looks like it is spinning in UtilizationProfile.getFirstFSLN(). Examining the syscalls using strace shows that JE is looping around the same 3 log files over and over again (note the lseek calls on fd 147):

14719 lseek(147, 2284064, SEEK_SET) = 2284064
14719 read(147, "\310\f9\311\v\0\330\331\"\0:\0\0\0\2{\231\302\36\27\0\340\24\0\0\33\201\230\0\340\24\0"..., 4096) = 4096
14719 lseek(95, 3030708, SEEK_SET) = 3030708
14719 read(95, "B\17\36\373\v\0\350<.\0B\0\0\0\2{3\323\36\27\0+\21\0\0\210\221\230\0+\21\0"..., 4096) = 2469
...
14719 lseek(95, 3025672, SEEK_SET) = 3025672
14719 read(95, "#(\2571\v\0b*.\0\202\0\0\0\2{\26\323\36\27\0\0160\0\0f\221\230\0Z\22\0"..., 4096) = 4096
14719 lseek(74, 35084, SEEK_SET) = 35084
14719 read(74, "\361\34s\272\v\0\300\210\0\0h\0\0\0\2{\240\323\36\27\0\346\36\0\0\215\225\230\0\350\22\0"..., 4096) = 4096
...
14719 lseek(74, 90663, SEEK_SET) = 90663
14719 read(74, "\2\16P\341\v\0#;\1\0B\0\0\0\2{\275\323\36\27\0\366/\0\0003H.\0$\3\0"..., 4096) = 4096
14719 lseek(147, 2284064, SEEK_SET) = 2284064
14719 read(147, "\310\f9\311\v\0\330\331\"\0:\0\0\0\2{\231\302\36\27\0\340\24\0\0\33\201\230\0\340\24\0"..., 4096) = 4096
14719 lseek(95, 3030708, SEEK_SET) = 3030708
14719 read(95, "B\17\36\373\v\0\350<.\0B\0\0\0\2{3\323\36\27\0+\21\0\0\210\221\230\0+\21\0"..., 4096) = 2469

From the output of lsof we can determine the names of the log files associated with the above file descriptors, but I'm not sure if this is useful or not. We also ran DbVerifyLog against the entire DB log and it produced no results, so this indicates to me that the DB is not corrupt after all (although it could be that DbVerifyLog simply does not check the utilization profile meta data).

I had a look at the change log for 4.1.17 but couldn't see any relevant bugs. It's probably worth noting that this DB does not use HA, duplicates, or secondary indexes.

So I guess I have two questions:

1) Is this a bug? Presumably the DB should recover even in the event of a human error induced failure like this?
2) Can we recover the DB somehow, e.g. using DbDump/DbLoad?

Let me know if you need any additional information.

Cheers,

Matt
Comments
Locked Post
New comments cannot be posted to this locked post.
Post Details
Locked on Mar 1 2012
Added on Feb 2 2012
3 comments
153 views