DB Migration Project
I'm currently writing a utility that migrates an older database to Berkeley DB for legacy customers. I'm finding that the conversion slows down significantly after a few million records. One key thing to note is that the Windows server becomes less responsive during the slow periods, so I figure it might be the file system cache. At first we're talking 200,000 to 400,000 inserts in 5-10 seconds, but after about 2 million records that same 200K takes about 10 minutes.
Our BDB configuration is as follows:
Two separate DBs, each in its own environment, with 1 and 3 secondary databases respectively. These DBs could end up holding hundreds of millions of records, so I can't fit everything into memory.
Cache Size: The first DB has a cache size of 600MB and the second 200MB.
Environment Flags: DB_REGION_INIT | DB_DIRECT_DB | DB_DSYNC_DB (the latter two because the system started to become unresponsive during the slowdown, so I've taken the file system cache out of the equation).
Env Open: DB_CREATE | DB_INIT_CDB | DB_INIT_MPOOL | DB_LOCKDOWN (normally the env uses transactions, but I wanted to take log files out of the equation).
Page Size: The primary DB has a 4K page size and the secondaries (with the exception of the sequence) are 8K. (I'd prefer not to change these since they are out in the field, but we can be flexible if it helps.) A rough sketch of this setup follows.
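For reference, a minimal sketch of how one of the environments and its primary database are opened with the settings above (the home directory is a placeholder, only one of the databases is shown, and error handling is omitted):

#include <db_cxx.h>

int main()
{
    DbEnv env(0);

    // 200MB cache for this environment (600MB for the other one).
    env.set_cachesize(0, 200 * 1024 * 1024, 1);

    // DB_DIRECT_DB / DB_DSYNC_DB take the OS file cache out of the picture;
    // DB_REGION_INIT pre-faults the region memory up front.
    env.set_flags(DB_REGION_INIT | DB_DIRECT_DB | DB_DSYNC_DB, 1);

    // CDS mode (no transactions or logging), region memory locked in RAM.
    env.open("env_home",
             DB_CREATE | DB_INIT_CDB | DB_INIT_MPOOL | DB_LOCKDOWN, 0);

    // Primary with 4K pages; the 8K secondaries are opened the same way
    // and attached with Db::associate (not shown).
    Db primary(&env, 0);
    primary.set_pagesize(4096);
    primary.open(NULL, "DB00.db", NULL, DB_BTREE, DB_CREATE, 0);

    primary.close(0);
    env.close(0);
    return 0;
}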
Interesting things about the utility:
I read records from the old DB in batches of about 200,000 (the batching follows the old DB's indexes, which are the same as the new DB's), then write those records into the new DB, then move on to the next batch (roughly as sketched in the code after this list). I didn't do simultaneous reads and inserts because I thought my own synchronization might be the bottleneck (and to isolate whether the slowness was on reads or puts), but it doesn't appear to have made any difference.
I call memp_trickle every 5 seconds with 30%.
I call memp_sync after every batch of 200,000.
After every 3 batches I tried closing and reopening the environment; that takes sub-seconds, so I don't think it's necessary (after all, I call memp_sync).
I thought I could get around this by not creating the secondary DBs during conversion and associating them afterward, but all that does is push the slowdown out further.
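The conversion loop looks roughly like the sketch below; the record type and the old-DB reader (read_batch_from_old_db) are placeholders standing in for the real code, and error handling is omitted:

#include <db_cxx.h>
#include <chrono>
#include <vector>
#include <string>

// Placeholder standing in for the legacy-DB side: fills `batch` with up to
// max_records records and returns false once the old DB is exhausted.
struct Record { std::string key; std::string value; };
bool read_batch_from_old_db(std::vector<Record> &batch, size_t max_records);

void convert(DbEnv &env, Db &primary)
{
    using clock = std::chrono::steady_clock;
    auto last_trickle = clock::now();

    std::vector<Record> batch;
    while (read_batch_from_old_db(batch, 200000)) {
        for (const Record &rec : batch) {
            Dbt key((void *)rec.key.data(), (u_int32_t)rec.key.size());
            Dbt data((void *)rec.value.data(), (u_int32_t)rec.value.size());
            primary.put(NULL, &key, &data, 0);

            // Roughly every 5 seconds, ask the mpool to write 30% of dirty pages.
            if (clock::now() - last_trickle > std::chrono::seconds(5)) {
                int wrote = 0;
                env.memp_trickle(30, &wrote);
                last_trickle = clock::now();
            }
        }
        // Flush the cache to the backing files after each batch of 200K.
        env.memp_sync(NULL);
        batch.clear();
    }
}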
Here is an example of db_stat -m on the environment of the second DB BEFORE the slowdown:
375MB 1KB 752B Total cache size
1 Number of caches
1 Maximum number of caches
375MB 8KB Pool individual cache size
0 Maximum memory-mapped file size
0 Maximum open file descriptors
0 Maximum sequential buffer writes
0 Sleep after writing maximum sequential buffers
0 Requested pages mapped into the process' address space
11M Requested pages found in the cache (99%)
6119 Requested pages not found in the cache
17852 Pages created in the cache
1489 Pages read into the cache
34843 Pages written from the cache to the backing file
0 Clean pages forced from the cache
0 Dirty pages forced from the cache
630 Dirty pages written by trickle-sync thread
19325 Current total page count
11619 Current clean page count
7706 Current dirty page count
65537 Number of hash buckets used for page location
10M Total number of times hash chains searched for a page (10736003)
2 The longest hash chain searched for a page
13M Total number of hash chain entries checked for page (13591003)
1 The number of hash bucket locks that required waiting (0%)
1 The maximum number of times any hash bucket lock was waited for (0%)
0 The number of region locks that required waiting (0%)
0 The number of buffers frozen
0 The number of buffers thawed
0 The number of frozen buffers freed
19361 The number of page allocations
0 The number of hash buckets examined during allocations
0 The maximum number of hash buckets examined for an allocation
0 The number of pages examined during allocations
0 The max number of pages examined for an allocation
3 Threads waited on page I/O
Pool File: DB02.db
8192 Page size
0 Requested pages mapped into the process' address space
3120842 Requested pages found in the cache (99%)
2 Requested pages not found in the cache
5101 Pages created in the cache
2 Pages read into the cache
4397 Pages written from the cache to the backing file
Pool File: DB01.db
8192 Page size
0 Requested pages mapped into the process' address space
3141470 Requested pages found in the cache (99%)
2 Requested pages not found in the cache
6024 Pages created in the cache
2 Pages read into the cache
5194 Pages written from the cache to the backing file
Pool File: DB00.db
4096 Page size
0 Requested pages mapped into the process' address space
2784429 Requested pages found in the cache (99%)
6113 Requested pages not found in the cache
6727 Pages created in the cache
1483 Pages read into the cache
25246 Pages written from the cache to the backing file
Pool File: DBSequence.db
512 Page size
0 Requested pages mapped into the process' address space
1664466 Requested pages found in the cache (99%)
2 Requested pages not found in the cache
0 Pages created in the cache
2 Pages read into the cache
6 Pages written from the cache to the backing file
Here is an example of db_stat -m on the environment of the second DB DURING the slowdown:
375MB 1KB 752B Total cache size
1 Number of caches
1 Maximum number of caches
375MB 8KB Pool individual cache size
0 Maximum memory-mapped file size
0 Maximum open file descriptors
0 Maximum sequential buffer writes
0 Sleep after writing maximum sequential buffers
0 Requested pages mapped into the process' address space
50M Requested pages found in the cache (99%)
61674 Requested pages not found in the cache
93639 Pages created in the cache
38529 Pages read into the cache
281962 Pages written from the cache to the backing file
43382 Clean pages forced from the cache
19635 Dirty pages forced from the cache
14696 Dirty pages written by trickle-sync thread
69134 Current total page count
19392 Current clean page count
49742 Current dirty page count
65537 Number of hash buckets used for page location
50M Total number of times hash chains searched for a page (50624302)
4 The longest hash chain searched for a page
97M Total number of hash chain entries checked for page (97904222)
5 The number of hash bucket locks that required waiting (0%)
4 The maximum number of times any hash bucket lock was waited for (0%)
0 The number of region locks that required waiting (0%)
0 The number of buffers frozen
0 The number of buffers thawed
0 The number of frozen buffers freed
132189 The number of page allocations
214720 The number of hash buckets examined during allocations
240 The maximum number of hash buckets examined for an allocation
63017 The number of pages examined during allocations
32 The max number of pages examined for an allocation
19 Threads waited on page I/O
Pool File: DB02.db
8192 Page size
0 Requested pages mapped into the process' address space
15M Requested pages found in the cache (99%)
8 Requested pages not found in the cache
25123 Pages created in the cache
8 Pages read into the cache
24190 Pages written from the cache to the backing file
Pool File: DB01.db
8192 Page size
0 Requested pages mapped into the process' address space
16M Requested pages found in the cache (99%)
8 Requested pages not found in the cache
29580 Pages created in the cache
8 Pages read into the cache
27919 Pages written from the cache to the backing file
Pool File: DB00.db
4096 Page size
0 Requested pages mapped into the process' address space
13M Requested pages found in the cache (99%)
61657 Requested pages not found in the cache
38936 Pages created in the cache
38512 Pages read into the cache
229827 Pages written from the cache to the backing file
Pool File: DBSequence.db
512 Page size
0 Requested pages mapped into the process' address space
7538529 Requested pages found in the cache (99%)
2 Requested pages not found in the cache
0 Pages created in the cache
2 Pages read into the cache
27 Pages written from the cache to the backing file
No matter what settings I use, I can't seem to avoid performance issues when scaling.
Also, randomly I get this:
Berkeley Error Callback: CDbBerkeley seek: 157085696: (38351 * 4096) + 0: The handle is invalid.
Berkeley Error: BerkeleyDbErrno=9 - Db::put: Bad file descriptor)
Any recommendations?