VMware Cloud Community
llamatron
Contributor

VMotion failure due to alleged metadata corruption

One 3.5 host out of eight will not allow me to VMotion VMs onto it. I eventually get an 'Operation timed out' message. The VM can be powered up on the offending host; it just can't be migrated to it live. When I looked in the VMkernel log I found a bucket load of these:

Apr 2 11:14:17 toast vmkernel: 0:00:20:26.816 cpu3:1100)WARNING: Migrate: 1345: 1207131238094348: Migration considered a failure by the VMX. It is most likely a timeout, but check the VMX log for the true error.

Apr 2 11:14:17 toast vmkernel: 0:00:20:26.816 cpu3:1100)WARNING: Migrate: 1242: 1207131238094348: Failed: Migration determined a failure by the VMX (0xbad0091) @0x9e28c5

Apr 2 11:15:33 toast vmkernel: Invalid totalResources 0 (cluster 0).[type 1] Invalid nextFreeIdx 0 (cluster 0).0:00:21:42.771 cpu3:1035)WARNING: Res3: 1930: FS 460e7753-0265ee07-ea7e-00118513517f may be damaged. Resource cluster metadata corruption detected

Apr 2 11:15:33 toast vmkernel: Invalid totalResources 0 (cluster 0).[type 1] Invalid nextFreeIdx 0 (cluster 0).0:00:21:42.774 cpu3:1035)WARNING: Res3: 1930: FS 460e7753-0265ee07-ea7e-00118513517f may be damaged. Resource cluster metadata corruption detected

Apr 2 11:15:33 toast vmkernel: Invalid totalResources 0 (cluster 0).[type 1] Invalid nextFreeIdx 0 (cluster 0).0:00:21:42.774 cpu3:1035)WARNING: Res3: 1930: FS 460e7753-0265ee07-ea7e-00118513517f may be damaged. Resource cluster metadata corruption detected

Apr 2 11:15:33 toast vmkernel: Invalid totalResources 0 (cluster 0).[type 1] Invalid nextFreeIdx 0 (cluster 0).0:00:21:42.819 cpu3:1036)WARNING: Res3: 1930: FS 460e7754-1c626222-2c35-00118513517f may be damaged. Resource cluster metadata corruption detected

Apr 2 11:15:33 toast vmkernel: Invalid totalResources 0 (cluster 0).[type 1] Invalid nextFreeIdx 0 (cluster 0).0:00:21:42.819 cpu3:1036)WARNING: Res3: 1930: FS 460e7754-1c626222-2c35-00118513517f may be damaged. Resource cluster metadata corruption detected

Apr 2 11:15:33 toast vmkernel: Invalid totalResources 0 (cluster 0).[type 1] Invalid nextFreeIdx 0 (cluster 0).0:00:21:42.819 cpu3:1036)WARNING: Res3: 1930: FS 460e7754-1c626222-2c35-00118513517f may be damaged. Resource cluster metadata corruption detected

It's always the same two disk serials it gives, even when I try to VMotion a VM that lives on a totally different partition.
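To confirm that only two volumes are involved, the warnings can be tallied per volume ID rather than eyeballed. The following is a minimal sketch, not a VMware tool: the function name and the regular expression are my own, and the sample lines are trimmed copies of the warnings quoted above. It assumes the ID after "FS" always has the 8-8-4-12 hex layout seen in this log.

```python
import re
from collections import Counter

# Assumed pattern: the volume ID that follows "FS" in the Res3 warnings above.
FS_DAMAGED = re.compile(
    r"FS ([0-9a-f]{8}-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{12}) may be damaged"
)


def count_damaged_volumes(log_lines):
    """Return a Counter mapping volume ID -> number of 'may be damaged' warnings."""
    counts = Counter()
    for line in log_lines:
        match = FS_DAMAGED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Trimmed copies of lines from the log excerpt in this post.
SAMPLE_LINES = [
    "0:00:20:26.816 cpu3:1100)WARNING: Migrate: 1242: 1207131238094348: "
    "Failed: Migration determined a failure by the VMX (0xbad0091) @0x9e28c5",
    "0:00:21:42.771 cpu3:1035)WARNING: Res3: 1930: FS "
    "460e7753-0265ee07-ea7e-00118513517f may be damaged. "
    "Resource cluster metadata corruption detected",
    "0:00:21:42.774 cpu3:1035)WARNING: Res3: 1930: FS "
    "460e7753-0265ee07-ea7e-00118513517f may be damaged. "
    "Resource cluster metadata corruption detected",
    "0:00:21:42.819 cpu3:1036)WARNING: Res3: 1930: FS "
    "460e7754-1c626222-2c35-00118513517f may be damaged. "
    "Resource cluster metadata corruption detected",
]

counts = count_damaged_volumes(SAMPLE_LINES)
for volume_id, hits in counts.most_common():
    print(volume_id, hits)
```

To run it against a real log, pass the open file object instead of SAMPLE_LINES; the function only needs an iterable of lines.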

I've looked in the VMkernel log of some of the other hosts and they have maybe one or two of these messages but have no problem with VMotion. Do I have a couple of corrupt VMFS partitions, and if so, why is it that only one host is bothered about it?

Ta in advance

Mark

11 Replies
llamatron
Contributor

Anyone?

jgalexan
Enthusiast

What is your environment like? What type of shared storage are you using? How many Hosts do you have? Does this host have Prod VMs on it currently? Have you tried to remove this host from Virtual Center and add it back in? What is different about this server from the others?

Natsidan
Enthusiast

Can you vmkping the VMkernel IP address?

llamatron
Contributor

Yes I can.

Although until you'd asked I'd never heard of vmkping.

llamatron
Contributor

The problem has changed a little. I still can't VMotion VMs across to this particular host, but it's no longer erroring due to metadata corruption. However, the vmkernel log file on all servers is loaded with corrupt metadata messages for the same two partitions.

I logged the problem with VMware this morning and they got back to me really quickly. Currently there are no tools to fix the corruption, so all I can do is migrate the VMs and reformat the corrupt partitions.
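For anyone wanting to check that every host really is complaining about the same partitions, here is a rough sketch of the comparison. It is only an illustration: the function names are my own, "host-b" is a made-up host name, and its log text simply reuses warnings quoted earlier in this thread, since the other hosts' logs were never posted.

```python
import re

# Assumed pattern: the volume ID that follows "FS" in the Res3 warnings.
FS_DAMAGED = re.compile(
    r"FS ([0-9a-f]{8}-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{12}) may be damaged"
)


def damaged_volumes_by_host(logs_by_host):
    """Map each host name to the set of volume IDs its log flags as damaged."""
    return {
        host: set(FS_DAMAGED.findall(text))
        for host, text in logs_by_host.items()
    }


def reported_by_every_host(logs_by_host):
    """Return the volume IDs that appear in every host's log."""
    per_host = damaged_volumes_by_host(logs_by_host)
    if not per_host:
        return set()
    return set.intersection(*per_host.values())


WARNING_A = (
    "WARNING: Res3: 1930: FS 460e7753-0265ee07-ea7e-00118513517f "
    "may be damaged. Resource cluster metadata corruption detected\n"
)
WARNING_B = (
    "WARNING: Res3: 1930: FS 460e7754-1c626222-2c35-00118513517f "
    "may be damaged. Resource cluster metadata corruption detected\n"
)

# "toast" is the host from the original log; "host-b" is hypothetical.
LOGS = {
    "toast": WARNING_A * 3 + WARNING_B * 3,
    "host-b": WARNING_A + WARNING_B,
}

common = reported_by_every_host(LOGS)
print(sorted(common))
```

If the set of common IDs matches each host's own set, the corruption is being seen identically from every host, which points at the volumes rather than at any one server.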

Natsidan
Enthusiast

OK, that's useful to know regarding the tools and corruption. Are the LUNs newly created? Do you know how they became corrupted?

kjb007
Immortal

What kind of array are you using? Also, what host type is listed on the array as the OS type for the LUN?

-KjB

vExpert/VCP/VCAP vmwise.com / @vmwise -KjB
llamatron
Contributor

Not a clue I'm afraid.

llamatron
Contributor

We're using a pair of NetApp filers running as a mirrored pair over Fibre Channel. And yes, VMware is selected as the OS type.

One thing VMware told me was that they are working on tools to repair corruption like this. They want me to create a dump of the affected partitions so they have an example to work on.

Which is nice.

kjb007
Immortal

OK, good to know. I think the safest approach is the one you're taking: get the VMs off while you can and rebuild the VMFS there. I would definitely feel more comfortable recreating the VMFS at this point.

-KjB

vExpert/VCP/VCAP vmwise.com / @vmwise -KjB
dkimber
Contributor

Hi there,

We have this issue, and this thread is the only info I can find on the error. We had VMware support online today and they had no answer, but it seems you got further. Did you ever hear back from VMware regarding the tool they were to create? Did you ever find the cause of the issue? We also have NetApp filers, and we are assuming they aren't to blame, yet VMware support was pointing at them. Your support at least took responsibility for the issue.

Please let me know what you can.

Regards

DK
