Hi Ulli,
Whew - I did go cross-eyed at points but managed to follow what you were saying. Not because you were unclear but because you are starting to tax my brain capacity :smileywink:
I agree that vmkfstools would not (should not) be including the dirty blocks. I'm obviously not operating at the level you are, but if we look at what happens on a lazy-zeroed VMDK as you have mentioned, when the guest OS first writes to a block, that block is zeroed before the write lands. The guest doesn't do this (correct me if I'm wrong on this) but rather ESXi. The actual state of the physical block is abstracted from the guest OS, otherwise you could provision a VMDK, then use forensic tools inside a VM to extract old data from the underlying storage.
Until a block is written to, the VMDK treats it as effectively zeroed, or perhaps better to say, available to be written to. On a new blank device the physical blocks may actually be zero, or they could contain dirty data from previously deleted files. From the VMDK's perspective, however, all it knows is that it has been allocated those blocks and that a block gets zeroed when the first write to it occurs. (This is more to confirm my understanding is okay than to teach you anything new :smileyhappy: )
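To check I have the model straight, here is a toy Python sketch of the behaviour described above. It is only an illustration of the idea, not how VMFS or ESXi actually implement it; the class name, the 4096-byte block size and the "written" set are all my own assumptions.

```python
BLOCK_SIZE = 4096  # assumed block size, purely for illustration


class LazyZeroedDisk:
    """Toy model of a lazy-zeroed virtual disk (hypothetical, not VMFS code)."""

    def __init__(self, physical_blocks):
        # The physical blocks may hold stale ("dirty") data from earlier use.
        self.physical = [bytearray(b) for b in physical_blocks]
        # Metadata: which blocks the guest has written to so far.
        self.written = set()

    def read(self, index):
        # A block that has never been written reads back as zeros,
        # whatever stale data is sitting on the physical storage.
        if index not in self.written:
            return bytes(BLOCK_SIZE)
        return bytes(self.physical[index])

    def write(self, index, offset, data):
        if offset < 0 or offset + len(data) > BLOCK_SIZE:
            raise ValueError("write does not fit inside one block")
        # On first write the hypervisor layer zeroes the block,
        # then applies the guest's data.
        if index not in self.written:
            self.physical[index][:] = bytes(BLOCK_SIZE)
            self.written.add(index)
        self.physical[index][offset:offset + len(data)] = data


# Usage: three physical blocks full of leftover 0xAA bytes.
disk = LazyZeroedDisk([b"\xAA" * BLOCK_SIZE for _ in range(3)])
untouched = disk.read(0)        # guest sees zeros, not the 0xAA bytes
disk.write(1, 10, b"hello")
partial = disk.read(1)          # "hello" surrounded by zeros
```

The point of the sketch is that the guest never gets to see the stale bytes: unwritten blocks read as zeros, and a partially written block is zeroed first, so nothing old leaks through around the new data.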
Based on this, I also agree that it is likely that vmkfstools either zeroes the destination block as it copies it or links to /dev/zero - I think /dev/zero is far more likely but have no proof. The end result at a file level would be the same, i.e. if the VMDK metadata says a block is unallocated (not used yet by the VM for data) then it is ready to be zeroed and have data written to it if required.
If we break this apart a little more, then it makes more sense for md5sum to operate in interpreted mode (I'd be almost tempted to call it file mode), i.e. reading only the actual data associated with the VMDK and not the "dirty" data at the block level, even if the actual data is a virtual zeroed block.
Even in a file system like NTFS, there will be slack space between the end of a file's data and the end of the last cluster allocated to it, and that space doesn't contain data from the file. Like you said, this is what forensic investigators love (and the bad guys as well) as you can find some amazing stuff in this supposedly wasted section of the file system. If this slack space were included when calculating file hashes, then we would likely never be able to get a matching hash.
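A quick way to see the slack space point is a small Python sketch. It is a simplified model only: the 4096-byte cluster size and the helper names are made up for illustration, and a real file system tracks the file length in its own metadata rather than as a function argument.

```python
import hashlib

CLUSTER_SIZE = 4096  # assumed cluster size, for illustration only


def make_cluster(file_data, leftover_byte):
    """Build one cluster: file data followed by slack holding stale bytes."""
    slack_length = CLUSTER_SIZE - len(file_data)
    return file_data + leftover_byte * slack_length


def file_hash(cluster, file_length):
    """Hash only the bytes that belong to the file ("file mode")."""
    return hashlib.md5(cluster[:file_length]).hexdigest()


def raw_hash(cluster):
    """Hash the whole cluster, slack space included (block level)."""
    return hashlib.md5(cluster).hexdigest()


data = b"the same file contents"
# Two copies of the same file whose clusters hold different leftovers.
copy_a = make_cluster(data, b"\x00")
copy_b = make_cluster(data, b"\xAA")

same_file_hash = file_hash(copy_a, len(data)) == file_hash(copy_b, len(data))
same_raw_hash = raw_hash(copy_a) == raw_hash(copy_b)
```

Here same_file_hash comes out True and same_raw_hash comes out False, which is the behaviour I would expect from md5sum reading a file through the file system rather than reading the raw blocks underneath it.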
I'm happy to pick this apart a little more and get some real life tests going. Filling in missing spots in knowledge is always good!
Hope I haven't muddled things with my statements above!
Glen