Skip to Main Content

Berkeley DB Family

Announcement

For appeals, questions and feedback about Oracle Forums, please email oracle-forums-moderators_us@oracle.com. Technical questions should be asked in the appropriate category. Thank you!

Slow startup - even after agressive checkpointing - how to troubleshoot?

kbrankoJul 5 2013 — edited Aug 22 2013

Hi,

We are evaluating Berkley DB for Java and we run into problem that recovery time is very very slow.

We first tried to speed up recovery using  je.checkpointer.highPriority, je.checkpointer.wakeupInterval, etc... - no luck - startup stil is very inefficient. It takes about 4 hours to recover and start 10 databases each which has less than 10M records.

So in order to understand the problem we built a small test app which will checkpoint and clean logs in the main application thread (and disable checkpoint and cleaner threads). So we set that we call bdbEnv.checkpoint(checkpointConfig) with setMinimizeRecoveryTime and setForce.

Again recovery was super slow: even if we were calling bdbEnv.cleanLog and bdbEnv.checkpoint every 3000 records. Each records is less than 1KB long - uniq keys.  Note that in our test each thread has its own database handle and its own transaction for adding records.

Now, the interesting thing, we noticed that during the startup (that is recovery of that environment) read rate of application was *only* 4 KB/s - and machine running the test was completely idle (quad core with quad-core).

Here is our stack trace (showing that we are doing recovery):

        at java.io.RandomAccessFile.readBytes(Native Method)

        at java.io.RandomAccessFile.read(RandomAccessFile.java:355)

        at com.sleepycat.je.log.FileManager.readFromFileInternal(FileManager.java:1918)

        - locked <0x00000007d2da9b80> (a com.sleepycat.je.log.FileManager$DefaultRandomAccessFile)

        at com.sleepycat.je.log.FileManager.readFromFile(FileManager.java:1869)

        at com.sleepycat.je.log.FileReader$ReadWindow.fillFromFile(FileReader.java:1146)

        at com.sleepycat.je.log.FileReader$ReadWindow.slideAndFill(FileReader.java:1048)

        at com.sleepycat.je.log.FileReader.setBackwardPosition(FileReader.java:572)

        at com.sleepycat.je.log.FileReader.getLogEntryInReadBuffer(FileReader.java:429)

        at com.sleepycat.je.log.FileReader.readNextEntryAllowExceptions(FileReader.java:256)

        at com.sleepycat.je.log.FileReader.readNextEntry(FileReader.java:229)

        at com.sleepycat.je.recovery.RecoveryManager.undoLNs(RecoveryManager.java:1006)

        at com.sleepycat.je.recovery.RecoveryManager.buildTree(RecoveryManager.java:530)

        at com.sleepycat.je.recovery.RecoveryManager.recover(RecoveryManager.java:198)

        at com.sleepycat.je.dbi.EnvironmentImpl.finishInit(EnvironmentImpl.java:611)

        - locked <0x000000077e4994d8> (a com.sleepycat.je.dbi.EnvironmentImpl)

        at com.sleepycat.je.dbi.DbEnvPool.getEnvironment(DbEnvPool.java:208)

The DbPrintLog output of one of our test database for which recovery takes very long time like this:

<DbPrintLog>

Log statistics:

                type               total         provisional               total                 min                 max                 avg             entries

                                   count               count               bytes               bytes               bytes               bytes         as % of log

               MapLN                 346                   0             683,346                 501               4,337               1,974                   0

              NameLN                   6                   0                 184                  30                  34                  30                   0

       FileSummaryLN               3,939                   0             766,896                  24               7,084                 194                   0

                  IN                 869                   0         172,255,390                  46             353,817             198,222                 4.9

                 BIN               3,670               3,648       1,690,627,274                  42             708,998             460,661                  48

              DbTree                  57                   0             107,333                 760               3,124               1,883                   0

              Commit                   9                   0                 262                  29                  30                  29                   0

           CkptStart                  52                   0               1,352                  26                  26                  26                   0

             CkptEnd                  53                   0               4,665                  55                 105                  88                   0

            BINDelta              33,437              32,383       1,186,036,703                  41             162,988              35,470                33.6

               Trace                 645                   0             107,702                  46                 381                 166                   0

          FileHeader                 341                   0              12,958                  38                  38                  38                   0

           INS_LN_TX             157,699                   0         220,123,663               1,390               1,396               1,395                 6.2

              UPD_LN             183,600                   0         254,469,600               1,386               1,386               1,386                 7.2

           key bytes                                                 467,579,630                                                                          (13.3)

          data bytes                                                           0                                                                             (0)

Total bytes in portion of log read: 3,525,197,328

Total number of entries: 384,723

How to troubleshoot these kind of problem? Is there a way to speed up the recovery?

If we want to do checkpoint and clean logs in the main application thread, how often do we need to check point? Or maybe our data is not really good for Berkeley DB?

This post has been answered by Bogdan Coman-Oracle on Jul 11 2013
Jump to Answer
Comments
Locked Post
New comments cannot be posted to this locked post.
Post Details
Locked on Sep 19 2013
Added on Jul 5 2013
7 comments
3,771 views