Re: Cluster crash in progress - what is the safest...

continuum · ‎02-11-2015

Cluster crash in progress - what is the safest command to export partition tables and VMFS headers that will not trigger a rescan?

Does anybody know whether a dd command against a Fibre Channel VMFS 5 volume can result in a rescan of the HBAs?
I expect that a rescan would start a cascade of hosts losing paths ...

Current state: the first hosts in each cluster are already unresponsive to vCenter and lots of VMs are already inaccessible, so extreme caution is necessary.
We have started to power off all VMs that no longer react to vCenter, so that we can reboot all affected hosts as soon as possible.

________________________________________________
Do you need support with a VMFS recovery problem ? - send a message via skype "sanbarrow"
I do not support Workstation 16 at this time ...

continuum · ‎02-11-2015

Everything is fine again.
At 18:00 we almost expected that 106 LUNs would get lost during the next few hours.
At 23:00 all unresponsive hosts and VMs had been rebooted - no VMs were lost.

Only damage: one 2 TB LUN with VMFS heartbeat corruption.

In case anybody runs into something like this, where commands like partedUtil getptbl or a dd command against a LUN take forever or do not work at all,
this method seems to reduce the extra stress on the ESXi hosts:

A Linux live CD uses sshfs to mount the ESXi filesystem - then you can create VMFS header dumps or check partition tables by addressing them indirectly and read-only.

On Linux (substitute the full name of one device for naa.*):
mkdir /esxi
sshfs root@esxi-ip:/ /esxi
dd if=/esxi/dev/disks/naa.* of=/tmp/vmfs-headers.1536.dd bs=1M count=1536
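
When many LUNs need a header dump, the dd step above could be wrapped in a small loop. This is only a minimal sketch, not something from the original post: the function name dump_vmfs_headers, the output file naming, and the rule that /dev/disks entries containing a ":" are partitions (and are skipped) are assumptions. It reads one device at a time, so it should not add more load than running the dd commands by hand.

```shell
#!/bin/sh
# Sketch: dump the first COUNT MiB of every whole-disk naa.* device found
# under an sshfs mount of an ESXi host. Name and defaults are hypothetical.
dump_vmfs_headers() {
    src_dir=$1          # e.g. /esxi/dev/disks (the sshfs mount from above)
    out_dir=$2          # local directory that receives the dumps
    count=${3:-1536}    # MiB per device, same value as in the post

    mkdir -p "$out_dir" || return 1
    for dev in "$src_dir"/naa.*; do
        [ -e "$dev" ] || continue      # glob matched nothing
        name=${dev##*/}
        case $name in
            *:*) continue ;;           # assumption: "naa.X:N" entries are partitions
        esac
        # dd only opens the "if" side for reading, so the LUN is not written to.
        dd if="$dev" of="$out_dir/$name.$count.dd" bs=1M count="$count" 2>/dev/null \
            || echo "failed: $name" >&2
    done
}
```

Usage would be something like: dump_vmfs_headers /esxi/dev/disks /tmp/vmfs-headers
Mounting with "sshfs -o ro root@esxi-ip:/ /esxi" should additionally make the mount itself read-only.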

While we were at work, VMware support logged in as well. Their first suggestion was to rescan for new HBAs and VMFS volumes and then read the vmkernel logs to inspect the error messages.
Good that we did not listen ...

larstr · ‎02-12-2015

Good you found out.

Your solution with the sshfs+dd trick is one of the nicest I've seen in a while.

One never knows when such tricks are needed.

Lars

continuum · ‎02-12-2015

Yep - that way of addressing the storage of a running ESXi host is really useful.
Before I started to use it, I always had to ask my customers to power off one ESXi host and reboot it into Linux whenever someone needed help with a recovery task.

Nowadays I am far more flexible and can offer to start recovery work without the need to power off production VMs.
It is not really ideal, but if the user wants to, I can even work on recoveries from LUNs that are still in production.

As a highly welcome side effect, this approach also reduces the chance of damaging anything with a bad command ...
