Using Vdbench to verify data contents after clone/snapshot/xcopy/etc.
Using journal=yes:
Throughout the years some users have used Data Validation Journal Recovery to validate the data integrity of files or volumes copied through different methods, something like:
- Create data contents targeting volumeA while using 'journal=yes'.
- Copy volumeA to volumeB.
- Using the journal files of volumeA, run journal recovery targeting volumeB.
Journaling requires two separate synchronous writes to a journal file for each write done to the target volume, so it can dramatically reduce your write throughput. Journal files can also become rather large: Vdbench writes two 8-byte journal records for each block written to the storage.
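To put a number on that, here is a small back-of-the-envelope sketch in Python. It only uses the two figures given above (two records per block written, 8 bytes per record); anything else a journal file may contain is ignored, so treat the result as a rough lower bound. The workload size is a made-up example.

```python
# Figures taken from the text above.
RECORDS_PER_BLOCK_WRITE = 2   # two journal records for each block written
JOURNAL_RECORD_BYTES = 8      # each journal record is 8 bytes


def journal_record_bytes(blocks_written):
    """Bytes of journal records produced by a number of block writes."""
    return blocks_written * RECORDS_PER_BLOCK_WRITE * JOURNAL_RECORD_BYTES


def synchronous_journal_writes(target_writes):
    """Extra synchronous journal writes caused by writes to the target volume."""
    return target_writes * 2


if __name__ == "__main__":
    blocks = 100_000_000  # hypothetical workload: 100 million block writes
    print(journal_record_bytes(blocks), "bytes of journal records")
    print(synchronous_journal_writes(blocks), "synchronous journal writes")
```

Every rewrite of a block adds another 16 bytes, so a long overwrite-heavy run grows the journal even when the volume itself is small.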
Warning:
The approach below ONLY works when you can count on Vdbench terminating normally, and of course on the OS properly flushing any pending mmap file updates before/if it reboots.
When you cannot count on this you MUST use journaling with its synchronous journal writes.
Note: I do not think I have published this, but journal recovery is not aware of dedup and will therefore consider i/o that was pending when Vdbench or the OS shut down as corrupted. I have an undocumented option that bypasses the verification of pending blocks, but that of course risks missing some real corruptions. So do not combine dedup with journaling, unless of course you can guarantee that neither Vdbench nor the OS will fail, but then you may not NEED journaling.
Is there another way?
Journaling was the only way until Vdbench503 when, by accident, it became possible to do this differently and more efficiently.
Vdbench Data Validation uses one byte per data block for a Data Validation Map storing information about what is written where (there is a huge chapter about this in the doc).
Until Vdbench502 this map was stored in Java heap space, and that, especially when using 32-bit Java, could cause Java to run out of heap space.
Vdbench503 moved this map to a memory-mapped file (mmap) to alleviate pressure on the Java heap.
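A quick sketch of why the map could hurt the heap: at one byte per data block, the map size is simply the number of blocks. The Python below is only arithmetic on that one-byte-per-block figure; the volume and block sizes are made-up inputs, not Vdbench defaults.

```python
def map_bytes(volume_bytes, block_bytes):
    """Size of a one-byte-per-block map for a volume (rounded up to whole blocks)."""
    if block_bytes <= 0:
        raise ValueError("block size must be positive")
    return -(-volume_bytes // block_bytes)  # ceiling division


if __name__ == "__main__":
    one_tib = 2 ** 40  # hypothetical 1 TiB volume
    for block in (512, 4096, 65536):
        mib = map_bytes(one_tib, block) / 2 ** 20
        print(f"block size {block:>6}: map is {mib:,.0f} MiB")
```

With small blocks on a large volume the map reaches gigabytes, which is more than a 32-bit Java heap can hold.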
A few weeks ago one of my users asked me "can you help test xcopy?", and that's when I realized that most of the time we do not NEED journaling, because as long as the memory mapped file is still there we can just tell Vdbench to continue using that same mmap file.
Slightly modifying the above journal example:
Using validate=continue:
- Create data contents targeting volumeA while using validate=yes and telling Vdbench where to store the mmap file (journal=/xxx SD parameter).
- Copy volumeA to volumeB, maybe even create one or more copies of the mmap file.
- Pointing to the mmap file of volumeA, do a 100% sequential read run (seekpct=eof) of volumeB, and specify 'validate=continue_old_map' to tell Vdbench not to erase the mmap file.
Note: it was during testing of this that I found a bug in the code which, for Vdbench generated read requests, did NOT validate the data, so use Vdbench50404rc2 or newer.
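The validate=continue steps above can be pictured with a toy model in Python. This is NOT Vdbench code and not its real map or data pattern format: the block size, the pattern() function, the file names and the meaning of the map byte (0 = never written, otherwise a write generation) are all assumptions made up for illustration. It writes a few blocks to a file standing in for volumeA while updating a one-byte-per-block mmap file, copies the file to volumeB, and then reads volumeB sequentially while checking it against the map of volumeA.

```python
import mmap
import os
import shutil
import tempfile

BLOCK = 4096   # hypothetical block size
BLOCKS = 64    # hypothetical volume size in blocks


def pattern(block_no, generation):
    """Made-up deterministic contents for a block at a given write generation."""
    seed = (block_no * 251 + generation) % 256
    return bytes((seed + i) % 256 for i in range(BLOCK))


def write_blocks(volume_path, map_path, block_numbers):
    """Write blocks to the volume and record each write in the mmap'ed map."""
    with open(volume_path, "r+b") as vol, open(map_path, "r+b") as mf:
        dv_map = mmap.mmap(mf.fileno(), BLOCKS)
        for b in block_numbers:
            gen = dv_map[b] % 255 + 1        # next generation, never 0
            vol.seek(b * BLOCK)
            vol.write(pattern(b, gen))
            dv_map[b] = gen                  # one byte per block
        dv_map.flush()
        dv_map.close()


def verify_sequential(volume_path, map_path):
    """Read the whole volume in order; return block numbers that mismatch the map."""
    bad = []
    with open(volume_path, "rb") as vol, open(map_path, "rb") as mf:
        dv_map = mf.read()                   # read only: the map is not erased
        for b in range(BLOCKS):
            data = vol.read(BLOCK)
            if dv_map[b] and data != pattern(b, dv_map[b]):
                bad.append(b)
    return bad


def demo():
    tmp = tempfile.mkdtemp()
    vol_a = os.path.join(tmp, "volumeA")
    vol_b = os.path.join(tmp, "volumeB")
    map_a = os.path.join(tmp, "volumeA.map")
    for path, size in ((vol_a, BLOCK * BLOCKS), (map_a, BLOCKS)):
        with open(path, "wb") as f:
            f.write(bytes(size))
    write_blocks(vol_a, map_a, [3, 7, 7, 42])   # block 7 is written twice
    shutil.copyfile(vol_a, vol_b)               # stands in for clone/snapshot/xcopy
    clean = verify_sequential(vol_b, map_a)
    with open(vol_b, "r+b") as f:               # damage one byte of the copy
        f.seek(7 * BLOCK + 10)
        old = f.read(1)[0]
        f.seek(7 * BLOCK + 10)
        f.write(bytes([old ^ 0xFF]))
    damaged = verify_sequential(vol_b, map_a)
    shutil.rmtree(tmp)
    return clean, damaged


if __name__ == "__main__":
    print(demo())
```

The point of the model is that the verification pass needs nothing from the run that wrote the data except the map file, which is why a normal termination and a properly flushed mmap file are enough.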
Multiple snapshots/copy/etc?
How to verify the data if you create ten copies of the same volume or file?
- Just create ten copies of the mmap file and then use those copies during the above mentioned 'seekpct=eof' run. Don't use the same physical mmap file for two different volumes though, because you'll have them stepping on each other's toes.
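Making those copies is a plain file copy. The Python sketch below is one possible way to do it, assuming a layout I made up for the example: one directory per volume copy, each holding a private copy of the map file under its original name. The directory and file names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def copy_map_per_volume(source_map, volume_names, out_dir):
    """Give every volume copy its own private copy of the map file.

    Returns a dict of {volume name: path of that volume's map copy}.
    """
    if len(set(volume_names)) != len(volume_names):
        raise ValueError("duplicate volume name: two volumes would share one map file")
    source_map = Path(source_map)
    result = {}
    for name in volume_names:
        target_dir = Path(out_dir) / name            # one directory per volume copy
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / source_map.name        # keep the original file name
        shutil.copyfile(source_map, target)
        result[name] = target
    return result


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "volumeA.map"           # hypothetical map file
        source.write_bytes(bytes([0, 1, 2, 0, 1]))
        names = [f"copy{i:02d}" for i in range(1, 11)]
        maps = copy_map_per_volume(source, names, Path(tmp) / "maps")
        distinct = len(set(maps.values()))
        identical = all(p.read_bytes() == source.read_bytes() for p in maps.values())
        return distinct, identical


if __name__ == "__main__":
    print(demo())
```

Each verification run then points at its own directory, so no two volumes ever touch the same physical map file.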
To be done:
One question that I honestly have not paid attention to (yet?): if you have only modified 1000 blocks of a 1 million block volume, why read the whole volume?
One thing at a time. No promises.