Cluster crash in progress - what is the safest command to export partition tables and VMFS headers that will not trigger a rescan?
Does anybody know if a dd command against a Fibre Channel VMFS 5 volume can result in a rescan of the HBAs?
I expect that a rescan would start a cascade of hosts losing paths ...
Current state: the first hosts in each cluster are already non-responsive to vCenter and lots of VMs are already inaccessible, so extreme caution is necessary.
We have now started to power off all VMs that no longer react to vCenter, so that we can reboot all affected hosts as soon as possible.
Everything fine again.
At 18:00 we almost expected that 106 LUNs would get lost during the next few hours.
At 23:00 all unresponsive hosts and VMs were rebooted - no VMs were lost.
Only damage: one 2 TB LUN with VMFS heartbeat corruption.
In case anybody runs into something like this, where commands like partedUtil getptbl or a dd command against a LUN take forever or do not work at all ...
this method seems to reduce the extra stress on the ESXi hosts:
Boot a Linux LiveCD and use sshfs to mount the ESXi filesystem - then you can create VMFS header dumps or check partition tables by addressing them indirectly and read-only
on Linux:
mkdir /esxi
sshfs root@esxi-ip:/ /esxi
dd if=/esxi/dev/disks/naa.* of=/tmp/vmfs-headers.1536.dd bs=1M count=1536
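
A minimal sketch of how the dump step above could be scripted, assuming the sshfs mount at /esxi already exists. Note that naa.* in the command above stands for a single device name; in most shells a wildcard inside if= will not expand to the device names. The function names dump_headers and has_gpt are hypothetical. Skipping names that contain ":" assumes the usual ESXi convention that partitions show up as naa.xxx:N, and the GPT check assumes 512-byte logical sectors. Adding -o ro to the sshfs mount should additionally enforce read-only access on the Linux side.

```shell
#!/bin/sh
# Hypothetical helpers, not part of any VMware tooling.

# Dump the first COUNT MiB of every whole-disk naa.* device found in DISK_DIR.
# Usage: dump_headers DISK_DIR OUT_DIR [COUNT_MIB]
dump_headers() {
    disk_dir=$1
    out_dir=$2
    count=${3:-1536}
    mkdir -p "$out_dir" || return 1
    for dev in "$disk_dir"/naa.*; do
        [ -e "$dev" ] || continue            # glob matched nothing
        name=${dev##*/}
        case $name in
            *:*) continue ;;                 # assumed partition entry (naa.xxx:N), skip
        esac
        # dd only reads from the device; output goes to the local Linux side
        dd if="$dev" of="$out_dir/$name.vmfs-header.dd" bs=1M count="$count" 2>/dev/null \
            || echo "dump failed: $name" >&2
    done
}

# Return 0 if DEVICE carries a GPT signature at LBA 1 (assumes 512-byte sectors).
# Usage: has_gpt DEVICE
has_gpt() {
    sig=$(dd if="$1" bs=512 skip=1 count=1 2>/dev/null | head -c 8)
    [ "$sig" = "EFI PART" ]
}

# Example (assumed paths, mount done beforehand):
#   sshfs -o ro root@esxi-ip:/ /esxi
#   dump_headers /esxi/dev/disks /tmp/vmfs-headers 1536
#   has_gpt /tmp/vmfs-headers/naa.xxx.vmfs-header.dd && echo "GPT found"
```

Running the GPT check against the local dump rather than the device itself means the LUN is read only once.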
While we were at work VMware support logged in as well: the first suggestion they offered was to rescan for new HBAs and VMFS volumes and then read the vmkernel logs to inspect the error messages.
Good that we did not listen ...
Good you found out.
Your solution with the sshfs+dd trick is one of the nicest ones I've seen in a while.
One never knows when such tricks are needed.
Lars
Yep - that way to address the storage of a running ESXi is something that is really useful.
Before I started to use that I always had to ask my customers to power off one ESXi and reboot it into Linux when someone needed help with a recovery task.
Nowadays I am far more flexible and can offer to start recovery work without the need to power off production VMs.
Not really ideal, but if the user wants me to, I can even work on recoveries from LUNs that are still in production.
As a highly welcome side effect, this approach also reduces the chance of damaging anything with a bad command ...