Hello All -
I am truly hoping that someone can help us unravel the mystery my group is currently trying to resolve. It doesn't make much sense, but I will do my best to explain it in succinct terms -
Late in the afternoon, across a span of 3 hours, 3 successive virtual machines reported that they had lost their underlying disks, and the message mentioned mounting an ISO file to install an operating system. 3 hours later, once the support team had begun triage, one of our engineers did a rescan of the HBAs, and vCenter reported that it was removing a number of datastores. My interpretation is that vCenter believed it no longer had access to the underlying LUNs and therefore decided to remove the datastores. At some point in the troubleshooting we could see that the underlying LUNs were still presented to the ESX hosts (4.0), but even more interestingly, when navigating to "add storage", the LUNs appeared as unformatted disks ready to be added as new datastores. I should mention here that our back-end storage array is an HP XP series model.
We engaged with VMware support and HP storage for hours upon hours, but no resolution was found. VMware did a data dump of one of the LUNs that had held all of the OS VMDKs of the affected VMs and saw that zeros had been written to the blocks.
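For anyone wanting to reproduce that kind of check on their own dump (e.g. one taken with dd from the device), here is a minimal sketch that scans a raw dump file and reports whether every byte is zero. The file name is a placeholder, not anything VMware support actually used:

```python
# Hedged sketch: scan a raw block dump for non-zero data.
# A LUN whose blocks were overwritten with zeros will return True here.
def all_zero(path, chunk_size=1 << 20):
    """Return True if every byte in the file at `path` is zero."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                # Reached end of file without finding non-zero data.
                return True
            # count(0) equals the chunk length only if all bytes are zero.
            if chunk.count(0) != len(chunk):
                return False
```

Reading in 1 MB chunks keeps memory flat even for a multi-hundred-GB dump; `chunk.count(0)` is a fast C-level scan, so this is close to I/O-bound.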
We are struggling to find the cause of what happened. If anyone has any ideas, I would be extraordinarily grateful for your feedback.
It definitely looks like a problem on the storage array side. Do you have any logs from the HP array? A few things could have happened on the array --
1) Someone accidentally took the LUN offline and never put it back - in this case you would have a permanent device loss (PDL). This is essentially unrecoverable from the ESX side.
2) The LUNs got unmapped from the ESX hosts accidentally. This would result in the same situation as above.
Since you also mention that you could see the disks as unformatted, ready to be added all over again, neither of the above has happened.
I see that the volume/LUN is intact and the mapping is intact - the storage simply disappeared, killing the VMs, and now presents itself as an unformatted disk. This happens if the partition table information is corrupt or overwritten with zeroes.
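To illustrate why a wiped partition table makes the LUN look like a blank, formattable disk: the first sector of an MBR-partitioned LUN carries the 0x55AA boot signature and four 16-byte partition entries starting at offset 446; if that sector is zeroed, the host sees no partitions at all. Here is a minimal sketch of such a check against a dump or device node (the path is a placeholder, and this only covers MBR layouts):

```python
# Hedged sketch: classify the first sector of a disk image/device.
# A zeroed sector means the partition table is gone, which is why the
# host offers the LUN as an unformatted disk under "add storage".
def inspect_mbr(path):
    with open(path, "rb") as f:
        sector = f.read(512)
    if len(sector) < 512:
        return "too short to hold an MBR"
    if sector == bytes(512):
        return "sector is all zeroes (partition table wiped)"
    if sector[510:512] != b"\x55\xaa":
        return "no 0x55AA boot signature (not a valid MBR)"
    # Four 16-byte partition entries start at byte offset 446.
    used = [i for i in range(4)
            if sector[446 + 16 * i: 446 + 16 * (i + 1)] != bytes(16)]
    return f"valid MBR, populated partition entries: {used or 'none'}"
```

On a healthy VMFS3 LUN you would expect a valid MBR with one populated entry; on the affected LUNs, per the dump described above, the all-zeroes case would match.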
Can you also check whether you had any sort of outage on the array side? Any faulty disks? Does the array show any warnings? Was the datastore available to another host in another vCenter? Were you trying to do an upgrade from VMFS3 to VMFS5?
Was the array running some scheduled task, like snapshotting the LUN (I mean array-side snapshotting)? The LUN would also go offline if it exceeded its maximum capacity - looking into this would help as well. That takes the LUN offline, but all the data on it should still be available.
The probability of ESX itself corrupting the volume is lower in this case.
Thanks so much for your reply. I can add some further detail -
** The three VMs which were affected have one thing in common - they are part of a "thick" storage\RAID pool on the back-end array. Meaning - 95% of our virtual machines now reside on thin LUNs on the back-end array. To confirm - we do not use thin VMDKs in our production environment; rather, we have thin LUNs on thin RAID groups. These were three of the last VMs still on datastores which were part of thick RAID\storage pools on the back-end array. Because at least two if not all three were running MSCS, and due to scheduling with system owners, these VMs were still running on the old pool.
As for the datastores, they were based on our old architecture - one datastore was for OS, one for PAGE, and another for DATA. It was specifically on the LUN which had held the OS datastore that VMware support ran the block-level dump which revealed all zeroes.
Understandably, the business wants a comprehensive post-incident review. The storage team has found nothing that appears amiss, and no VMFS upgrades or system changes were scheduled during the time of the incident. Clearly something is missing from this picture, but we can't figure out what that something is.
A little more info regarding datastore size - two datastores were 300GB (those were OS\PAGE), and the third, for transaction logs, was 200GB. (Again, we don't break our datastores out like this anymore; this architecture is from 2.5 years ago.)
Lastly - the VMs which crashed and lost their VMDKs: two if not all three of these machines had RDMs. We found that while the RDM pointer files on the datastores were lost along with the VMDKs, the raw LUNs they pointed to are okay.
Interesting. From your reply I understand that no maintenance operations were done on the storage array side, i.e. no upgrades, no accidental deletes, and no hardware failures in the array either. The size of the datastores doesn't look suspicious either; if one had been close to or greater than 64 TB, there might have been something to delve into. So it's just the loss of the filesystem - everything else seems intact.
"the VMware support ran the block-level dump which revealed all zeroes." -- This shouldnt have happened either.
Are there any logs from the array or from the ESX hosts that you think might be helpful?