VMware Cloud Community
rphilippe
Contributor

SAN Crashed ... Some VMs won't start

Hello,

We just had a SAN crash (one of the two controllers failed), and since then some VMs won't start (two Active Directory servers). When I try to remove the swap file, I get an error:

cat: myfile: Invalid argument

and in the vmkernel log I get:

vmkernel: gen 23, mode 1, owner 461f1953-33925b70-6270-001a4baa48f4 mtime 1176502190]

vmkernel: 0:00:31:21.058 cpu1:1075)FS3: 865: Error 0xbad0006 reading HB addr 48ffffcccd15ffff

I suppose there must be some kind of corruption in the table. Is there an fsck or something similar I could use?

Regards,

Rémi

Message was edited by: rphilippe

Accepted Solutions
grasshopper
Virtuoso

Since it appears you have a file lock situation, the easy answer is to reboot the host that holds the lock.

If you have more than one ESX host accessing the LUN which contains the VMs in question, you should identify the offending host by reviewing the vmkernel logs on the hosts. Then simply reboot the ESX host that holds the lock.

Example command to search for offending host:

grep -i lock /var/log/vmkernel
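Once grep turns up a lock message, the "owner" field is the part that helps identify the host. Below is a minimal sketch that pulls the owner ID out of one of the vmkernel lines quoted in this thread. It assumes that the last field of the owner ID is the MAC address of a network adapter on the host holding the lock; that is commonly the case on VMFS, but verify it in your environment before rebooting anything. The variable names are only illustrative.

```shell
#!/bin/sh
# Sample line copied from the vmkernel log excerpt quoted in this thread.
line='Jul 4 16:03:53 esx-g-8 vmkernel: gen 29, mode 1, owner 4688d97e-3a14f09c-5ae5-001b782e89c0 mtime 1183440992]'

# Extract the token that follows the word "owner".
owner=$(printf '%s\n' "$line" | sed -n 's/.*owner \([0-9a-f-]*\).*/\1/p')

# ASSUMPTION: the last dash-separated field is a MAC address of the owning host.
# Insert a colon after every two hex digits, then drop the trailing colon.
mac=$(printf '%s\n' "${owner##*-}" | sed 's/../&:/g; s/:$//')

echo "owner=$owner"
echo "mac=$mac"
```

You can then compare the resulting MAC address with the network adapters of each ESX host that sees the LUN to decide which host to reboot.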

mcassiano
Contributor

Do you have problems only on the two AD VMs, or on the ESX hosts as well?

Did you check whether the /var/log partition on the ESX hosts is full?

Best regards

Marco

rphilippe
Contributor

Well, the errors are on the ESX host, but the ESX hosts work OK.

Only those two VMs won't start and give this error (the fact that they are AD servers is not very important).

And no, /var/log isn't full.

boydd
Champion

Try re-registering them?

DB

rphilippe
Contributor

Already tried that.

I just noticed I didn't paste the right line. Here is the error:

vmkernel: 0:00:31:21.058 cpu1:1075)FS3: 865: Error 0xbad0006 reading HB addr 48ffffcccd15ffff

grasshopper
Virtuoso

0xBAD0006 (a.k.a. "195887110") translates to "Limit exceeded".
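If you want to check the hex-to-decimal conversion yourself, here is a quick sketch using the shell's printf. It only converts the number between bases; the meaning ("Limit exceeded") still has to be looked up in an error code listing.

```shell
#!/bin/sh
# Error code as it appears in the vmkernel log.
hex=0xBAD0006

# Hex to decimal (printf accepts C-style 0x constants for %d).
dec=$(printf '%d' "$hex")

# Decimal back to hex, to confirm the round trip.
back=$(printf '0x%x' "$dec")

echo "$hex = $dec = $back"
```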

I would also check the vmware.log files from the VM's config file directory. There may be more specific information there.

You should probably make a backup of the disks in question (e.g. with vmkfstools -i ).

Then, you may consider attaching the failing .vmdk as a secondary disk on a helper VM (of the same OS flavor) and then performing a chkdsk /r on the failing disk. Then try to attach the disk back to the original VM and boot to it.

rphilippe
Contributor

I'll try connecting it to another machine.

But how come I get a "Limit exceeded" error when trying to delete a file on the VMFS?

grasshopper
Virtuoso

Not sure... but if the files are related to the VM, you'll want to ensure that the VM is completely powered off. If the VM is in a hung state, you'll need to reboot the host (or kill -9 PID_OF_VM).

Then perform a mv .

Then try and delete the file.

rphilippe
Contributor

The VM is stopped, and I can't connect the vmdk to another VM.

I can't even delete the files; I get the same error. Any ideas?

Also, where can I find the meanings of the 0xbad codes?

rphilippe
Contributor

Forgot to post this ...

Here is the complete error. The same thing happens even when I try to delete the file:

Jul 4 16:03:53 esx-g-8 vmkernel: 0:00:24:13.393 cpu2:1037)FS3: 1692: Checking if lock holders are live for lock [type 10c00001 offset 30033920 v 5, hb offset 4105728

Jul 4 16:03:53 esx-g-8 vmkernel: gen 29, mode 1, owner 4688d97e-3a14f09c-5ae5-001b782e89c0 mtime 1183440992]

Jul 4 16:03:57 esx-g-8 vmkernel: 0:00:24:17.395 cpu2:1037)FS3: 865: Error 0xbad0006 reading HB addr 48ffffcccd15ffff

grasshopper
Virtuoso

"Also, where can I find the meanings of the 0xbad codes?"

I have published the error code mappings for common versions of ESX at the following locations:

Soft copies:

http://www.vmguru.com/files/10/whitepapers/entry9.aspx

Interactive version:

http://www.vmprofessional.com/index.php?content=resources