Corrupt log file, but how does the DB keep working?
We recently had a fairly devastating outage involving a hard drive failure, but we are a little mystified about the mechanics of what went on with Berkeley DB, which I hope someone here can clear up.
A hard drive running a production instance failed because of a disk error, and we had to do a hard reboot to get the system to come back up and right itself (we are running Red Hat Enterprise Linux). We actually had three production environments running on that machine; two came back just fine, but in one we would get this during recovery:
BDBStorage> Running recovery.
BerkeleyDB> : Log file corrupt at LSN: [4906][8294478]
BerkeleyDB> : PANIC: Invalid argument
BerkeleyDB> : /usr/local/BerkeleyDB.4.8/lib/libdb_java-4.8.so(__os_stack+0x20) [0x2c23af2380]
BerkeleyDB> : /usr/local/BerkeleyDB.4.8/lib/libdb_java-4.8.so(__os_abort+0x15) [0x2c23aee9c9]
BerkeleyDB> : /usr/local/BerkeleyDB.4.8/lib/libdb_java-4.8.so(__env_panic+0xef) [0x2c23a796f9]
BerkeleyDB> : /usr/local/BerkeleyDB.4.8/lib/libdb_java-4.8.so(__env_attach_regions+0x788) [0x2c23aae82c]
BerkeleyDB> : /usr/local/BerkeleyDB.4.8/lib/libdb_java-4.8.so(__env_open+0x130) [0x2c23aad1e7]
BerkeleyDB> : /usr/local/BerkeleyDB.4.8/lib/libdb_java-4.8.so(__env_open_pp+0x2e7) [0x2c23aad0af]
BerkeleyDB> : /usr/local/BerkeleyDB.4.8/lib/libdb_java-4.8.so [0x2c23949dc7]
BerkeleyDB> : /usr/local/BerkeleyDB.4.8/lib/libdb_java-4.8.so(Java_com_sleepycat_db_internal_db_1javaJNI_DbEnv_1open+0xbc) [0x2c239526ea]
BerkeleyDB> : [0x2a99596e77]
We thought, well, perhaps this is related to the disk error: it corrupted a log file and then the drive died. Luckily (or so we thought), we diligently take backups twice a day and keep a week's worth around. These are made using the standard backup procedure described in the developer's guide, and whenever we've had to restore them they have been just fine (we've been running essentially this setup for something like 9 years now).

However, as we retrieved backup after backup, going back three or four days, they all failed with similar errors, always at an LSN starting with [4906]. Then we noticed an odd log file, numbered 4906, which had been sitting in our logs directory ever since it was created. Eventually we found a good backup, but the customer lost several days' worth of work.
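For reference, the backup job follows the sequence from the guide: force a checkpoint, copy the database files, then copy the log files last. Simplified, and going from memory on the com.sleepycat.db method names (the copy helper and the envHome/backupDir arguments are just placeholders, not our exact code), it amounts to something like:

    import com.sleepycat.db.CheckpointConfig;
    import com.sleepycat.db.Environment;

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;

    public class HotBackupSketch {

        // Placeholder helper: byte-for-byte copy of src into destDir.
        static void copy(File src, File destDir) throws Exception {
            try (FileInputStream in = new FileInputStream(src);
                 FileOutputStream out = new FileOutputStream(new File(destDir, src.getName()))) {
                byte[] buf = new byte[64 * 1024];
                for (int n; (n = in.read(buf)) > 0; ) {
                    out.write(buf, 0, n);
                }
            }
        }

        static void backup(Environment env, File envHome, File backupDir) throws Exception {
            // 1. Force a checkpoint so the data files are reasonably current.
            CheckpointConfig cc = new CheckpointConfig();
            cc.setForce(true);
            env.checkpoint(cc);

            // 2. Copy the database (data) files first. The archive calls may
            //    return paths relative to the environment home, so resolve them.
            for (File db : env.getArchiveDatabases()) {
                copy(db.isAbsolute() ? db : new File(envHome, db.getPath()), backupDir);
            }

            // 3. Copy the log files last, so the logs in the backup are at
            //    least as new as the data files they describe.
            for (File log : env.getArchiveLogFiles(true /* include logs still in use */)) {
                copy(log.isAbsolute() ? log : new File(envHome, log.getPath()), backupDir);
            }
        }
    }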
My question here is: how could a log file sit there corrupted for days without being noticed, say during a checkpoint (which we run every minute or so)? Doesn't a checkpoint itself basically scan the logs, and shouldn't it have hit the corrupt record not long after it was written? The system was running without incident under fairly heavy use, so it really mystifies me that the damage could sit there unnoticed for that long.
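For what it's worth, the once-a-minute checkpoint is nothing fancier than something along these lines (a simplified sketch, not our exact code; it just gets scheduled on a timer):

    import com.sleepycat.db.CheckpointConfig;
    import com.sleepycat.db.Environment;

    public class MinuteCheckpoint implements Runnable {

        private final Environment env;

        MinuteCheckpoint(Environment env) {
            this.env = env;
        }

        public void run() {
            try {
                CheckpointConfig cc = new CheckpointConfig();
                cc.setForce(true);
                // My (possibly wrong) understanding: this flushes dirty pages
                // from the cache and appends a fresh checkpoint record to the
                // *current* log file; it never goes back and re-reads, let
                // alone verifies, log records written days earlier -- which
                // might explain how a bad record in log 4906 could sit there
                // until recovery finally replayed it.
                env.checkpoint(cc);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }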
For now, all we can promise the customer is this: we will automatically restore (and thereby verify) every backup as soon as it is made; if something like this happens again, we will immediately attempt a graceful shutdown; and if the environment doesn't come back up, we will fall back to the 12-hour-old backup. Perhaps we should have been doing that all along, but I would still like to understand what happened here. Any ideas?
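For concreteness, the automatic restore check I have in mind is roughly the following. This is only a sketch: the BackupCheckSketch class and backupLooksUsable method are illustrative names, and I'm assuming that opening a scratch copy of the backup with catastrophic recovery is the right way to force the old log records to actually be replayed.

    import com.sleepycat.db.Environment;
    import com.sleepycat.db.EnvironmentConfig;

    import java.io.File;

    public class BackupCheckSketch {

        // Open a scratch copy of the backup with catastrophic recovery.
        // As far as I understand, recovery is the only thing that replays the
        // old log records, so this is the step that would have tripped over
        // the corrupt record at [4906][8294478] days earlier.
        static boolean backupLooksUsable(File restoredCopyDir) {
            EnvironmentConfig cfg = new EnvironmentConfig();
            cfg.setAllowCreate(true);
            cfg.setTransactional(true);
            cfg.setInitializeCache(true);
            cfg.setInitializeLocking(true);
            cfg.setInitializeLogging(true);
            cfg.setRunFatalRecovery(true);   // replay all available log files
            try {
                Environment env = new Environment(restoredCopyDir, cfg);
                env.close();
                return true;
            } catch (Exception e) {
                // A corrupt log should land us here -- or, as in our outage,
                // PANIC and abort the process (note the __os_abort frame in
                // the trace above), which is why we'd run this check in a
                // throwaway JVM rather than inside the application itself.
                return false;
            }
        }
    }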