VMware Cloud Community
Hompie
Contributor

ESX 3.0.1: VMFS3 lun disappeared - VM gone - Analysis tips?

Hi,

ESX 3.0.1. During a VMotion, one of the machines reported a problem: it could not access its vmx file.

After several attempts, and as we couldn't see an error, we stopped the VM and did a reboot, after which it didn't boot anymore.

At the same time we lost access to the LUN. It is still visible with a LUN ID, but there is no longer a VMFS3 file system mapped to it.

So we see the LUN as 100% free, and could create a new VMFS3 on it if we wanted.

In other words, the VMFS3 is gone. How, we don't know.

/var/log/messages shows nothing of interest.

The vmkernel log shows:

Nov 16 11:56:45 HOST001 vmkernel: 6:00:32:12.497 cpu2:1067)World: vm 1067: 3864: Killing self with status=0x0:Success

Nov 16 11:57:23 HOST001 vmkernel: 6:00:32:50.562 cpu5:1037)LVM: 2294: Could not open device vmhba1:0:0:1, vol [4552c6d7-8e1ea291-a708-00137258, 4552c6d7-8e1ea291-a708-00137258b8ba, 1]: Failure

Nov 16 11:57:23 HOST001 vmkernel: 6:00:32:50.723 cpu5:1037)FSS: 343: Failed with status 0xbad000e for f530 28 1 4552c6d8 8b2a5aa 1300125b bab85872 0 0 0 0 0 0 0

Nov 16 11:57:23 HOST001 vmkernel: 6:00:32:50.723 cpu5:1037)WARNING: Fil3: 1564: Failed to reserve volume f530 28 1 4552c6d8 8b2a5aa 1300125b bab85872 0 0 0 0 0 0 0

Nov 16 11:57:23 HOST001 vmkernel: 6:00:32:50.723 cpu5:1037)FSS: 343: Failed with status 0xbad000e for f530 28 2 4552c6d8 8b2a5aa 1300125b bab85872 4 1 0 0 0 0 0

Nov 16 11:57:23 HOST001 vmkernel: 6:00:32:50.723 cpu5:1037)LVM: 2294: Could not open device vmhba1:0:0:1, vol [4552c6d7-8e1ea291-a708-00137258, 4552c6d7-8e1ea291-a708-00137258b8ba, 1]: Failure

Nov 16 11:57:23 HOST001 vmkernel: 6:00:32:50.729 cpu5:1037)FSS: 343: Failed with status 0xbad000e for f530 28 1 4552c6d8 8b2a5aa 1300125b bab85872 0 0 0 0 0 0 0

Nov 16 11:57:23 HOST001 vmkernel: 6:00:32:50.729 cpu5:1037)WARNING: Fil3: 1564: Failed to reserve volume f530 28 1 4552c6d8 8b2a5aa 1300125b bab85872 0 0 0 0 0 0 0

Nov 16 11:57:23 HOST001 vmkernel: 6:00:32:50.729 cpu5:1037)FSS: 343: Failed with status 0xbad000e for f530 28 2 4552c6d8 8b2a5aa 1300125b bab85872 4 1 0 0 0 0 0

Nov 16 11:57:23 HOST001 vmkernel: 6:00:32:50.734 cpu5:1037)LVM: 2294: Could not open device vmhba1:0:0:1, vol [4552c6d7-8e1ea291-a708-00137258, 4552c6d7-8e1ea291-a708-00137258b8ba, 1]: Failure

Nov 16 11:57:23 HOST001 vmkernel: 6:00:32:50.741 cpu5:1037)FSS: 343: Failed with status 0xbad000e for f530 28 1 4552c6d8 8b2a5aa 1300125b bab85872 0 0 0 0 0 0 0
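
With a long vmkernel log, it can help to boil the repeated lines down to which components complained, about which device, and with which status codes. The following is a minimal sketch in Python. The line layout it assumes is inferred only from the excerpt above, and the function name `summarize` is hypothetical. It does not interpret what the status codes mean.

```python
import re
from collections import Counter

# Assumed layout, inferred only from the log excerpt in this post:
# "... vmkernel: <uptime> cpuN:<world>)[WARNING: ]<Component>: <message>"
LINE_RE = re.compile(
    r"vmkernel: (?P<uptime>\S+) cpu(?P<cpu>\d+):(?P<world>\d+)\)"
    r"(?P<warning>WARNING: )?(?P<component>\w+): (?P<message>.*)"
)
DEVICE_RE = re.compile(r"vmhba\d+:\d+:\d+:\d+")          # e.g. vmhba1:0:0:1
STATUS_RE = re.compile(r"status[ =](0x[0-9a-fA-F]+)")    # e.g. status 0xbad000e


def summarize(lines):
    """Count components, status codes and devices seen in vmkernel log lines."""
    components = Counter()
    statuses = Counter()
    devices = set()
    warnings = 0
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a vmkernel line in the assumed layout
        components[match.group("component")] += 1
        if match.group("warning"):
            warnings += 1
        message = match.group("message")
        devices.update(DEVICE_RE.findall(message))
        statuses.update(STATUS_RE.findall(message))
    return {
        "components": components,
        "statuses": statuses,
        "devices": devices,
        "warnings": warnings,
    }
```

Feeding it the lines of the vmkernel log file (for example `summarize(open(path))`, where `path` is wherever your log is stored) gives a quick overview before reading the log line by line.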

The VM has already been restored on another LUN for the time being, so we have some time to analyse the problem. But the LUN really is empty. What happened?

23 Replies
Saturnous
Enthusiast

Mainly taken from http://www.vmware-tsx.com/download.php?asset_id=50 plus my own experience.

Find the Linux device name of the LUN that disappeared.

Running esxcfg-vmhbadevs should do it.

- If in doubt, examine the dmesg output.

- Check whether fdisk -lu shows an empty or strange LUN.

fdisk /dev/sd[x]

n (to create a new partition)

p (to create a primary partition)

1 (to create the 1st partition)

smash enter to keep the default value

smash enter to keep the default value

t (to change the type of partition)

fb (to set the partition as VMFS)

x (to move to expert mode)

b (to change the beginning of the partition; enter the ALIGNMENT value here, mostly 63 or 128)

w (to save)

vmkfstools -V (for rescan)

Count slowly to 20, send a prayer to Saxnot or whatever you believe in, then check whether the datastore is available and what it contains.
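
To see what the partition table currently holds, before or after the steps above, the first sector of the device can be read without writing anything. This is a minimal read-only sketch in Python. It assumes a classic MBR (msdos) partition table and 512-byte sectors; the function names and the device path in the usage note are placeholders. It shows the same kind of information as fdisk -lu: partition type and starting sector.

```python
import struct

SECTOR_SIZE = 512   # assumed sector size
VMFS_TYPE = 0xFB    # the partition type set with "t" / "fb" in the steps above


def parse_mbr(sector):
    """Return the non-empty primary partition entries of an MBR sector."""
    if len(sector) < SECTOR_SIZE:
        raise ValueError("need at least 512 bytes")
    if sector[510:512] != b"\x55\xaa":
        return []  # no MBR signature, so no usable partition table
    entries = []
    for index in range(4):
        raw = sector[446 + 16 * index:446 + 16 * (index + 1)]
        # Entry layout: boot flag, CHS start (skipped), type,
        # CHS end (skipped), start LBA, sector count (little-endian).
        _boot, ptype, start_lba, sectors = struct.unpack("<B3xB3xII", raw)
        if ptype == 0:
            continue  # unused slot
        entries.append({
            "number": index + 1,
            "type": ptype,
            "start": start_lba,
            "sectors": sectors,
            "is_vmfs": ptype == VMFS_TYPE,
        })
    return entries


def read_first_sector(device):
    """Read the first sector of a block device (read-only)."""
    with open(device, "rb") as handle:
        return handle.read(SECTOR_SIZE)
```

For example, `parse_mbr(read_first_sector("/dev/sdg"))` returning an empty list would match the "100% free" LUN described in the first post, while an entry with `is_vmfs` set and the expected `start` suggests the table is back in place. The device name is only an example; use the one reported by esxcfg-vmhbadevs.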

Finding the correct ALIGNMENT value is always the hard part. As a hint, I would suggest always taking a snapshot of the LUN at the storage level first, then looking in the

/proc/vmware/scsi/vmhba[x]/[y]_[z] file for a hint about where the partition begins. If there is none, look at a similar LUN (size / storage type / disk group).

But trial and error has rarely done any harm :-D
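
The two usual start values are given in sectors, so it can help to see them as byte offsets when comparing against what the storage side reports. A small sketch in Python, assuming 512-byte sectors; the 64 KiB boundary used for the alignment check is only an illustrative assumption, and the function name is hypothetical.

```python
SECTOR_SIZE = 512  # bytes per sector, assumed


def describe_start(start_sector, boundary=64 * 1024):
    """Convert a partition start sector to a byte offset and check alignment."""
    offset = start_sector * SECTOR_SIZE
    return {
        "sector": start_sector,
        "offset_bytes": offset,
        # True when the partition starts exactly on the given boundary
        "aligned": offset % boundary == 0,
    }


for candidate in (63, 128):
    print(describe_start(candidate))
```

Sector 63 corresponds to byte 32256 and sector 128 to byte 65536, so only the second falls on a 64 KiB boundary.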

It's always a good idea to stop any management agents before doing such things from the service console. For example, the cmafcad process SHOULD NEVER RUN unless you have ONLY one or more MSA 1x00 and nothing else in your SAN. Also check: is there any Windows host (storage appliance, VCB proxy) in your SAN that sees these LUNs and on which you forgot to run "diskpart automount disable"?

sachC
Contributor

Thanks very much, really appreciate it.

Jwoods
Expert

Saturnous, thanks a million for putting this together! An additional step to verify whether the partition has been overwritten is to run the command strings /dev/sd[x] | more. For example, strings /dev/sdg | more. You can page through with the space bar, and generally the first couple of lines will display the VMFS volume name. At least, hope that it displays the volume name! Further down it should display VM guest names and vmx data for those VMs.

Thanks again, Saturnous and everyone who contributed... see you in Vegas!!!
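
If you would rather not page through the whole device, the same idea can be limited to the first part of the disk. The following is a rough Python sketch of what strings does, restricted to the first megabyte: it reads the device read-only and lists runs of printable ASCII characters. The function names, the one-megabyte limit and the minimum length of 4 are assumptions chosen for illustration.

```python
import re


def printable_runs(data, min_len=4):
    """Return runs of printable ASCII of at least min_len characters."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


def read_head(device, limit=1024 * 1024):
    """Read the first `limit` bytes of a device (read-only)."""
    with open(device, "rb") as handle:
        return handle.read(limit)


def head_strings(device, limit=1024 * 1024, min_len=4):
    """Printable strings found in the first `limit` bytes of a device."""
    return printable_runs(read_head(device, limit), min_len)
```

For example, `head_strings("/dev/sdg")[:20]` would list the first twenty strings, which is where the volume name is expected to show up according to the description above. If the name does not appear within the first megabyte, raise `limit` or fall back to the plain strings command.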

mdsullivan
Contributor

When these commands are run, do they format the partition? I have this problem with 3.5 and have 20 or so VMs on this one LUN... Just wondering if I'm going to lose everything or if the VMs stay on the underlying disk.
