Hi
I'm getting an issue where VMs are failing to VMotion at 10%. I've had a look through the network settings and all seems OK; I think it's more of a storage issue. I'm getting the following errors in the vmkernel log:
Jun 25 09:16:12 MSC05ESX-VMSC vmkernel: 115:18:46:05.062 cpu0:1168)StorageMonitor: 196: vmhba1:1:0:0 status = 24/0 0x0 0x0 0x0
Jun 25 09:17:13 MSC05ESX-VMSC vmkernel: 115:18:47:05.736 cpu0:1024)StorageMonitor: 196: vmhba1:1:0:0 status = 24/0 0x0 0x0 0x0
Jun 25 09:17:13 MSC05ESX-VMSC vmkernel: 115:18:47:05.762 cpu0:1024)StorageMonitor: 196: vmhba1:1:0:0 status = 24/0 0x0 0x0 0x0
Jun 25 10:22:24 MSC05ESX-VMSC vmkernel: 115:19:52:16.434 cpu5:1041)Config: 416: "HostLocalSwapDirEnabled" = 0, Old Value: 0, (Status: 0x0)
Jun 25 10:33:13 MSC05ESX-VMSC vmkernel: 115:20:03:05.565 cpu0:1228)StorageMonitor: 196: vmhba1:1:0:0 status = 24/0 0x0 0x0 0x0
Jun 25 10:35:50 MSC05ESX-VMSC vmkernel: 115:20:05:43.018 cpu0:1024)StorageMonitor: 196: vmhba1:1:0:0 status = 24/0 0x0 0x0 0x0
Any idea what the issue might be?
I do not think it is a storage problem, since the virtual disks stay where they are when VMotioning. Typically, when VMotion fails at 10% it is a communication problem between the VMkernel NICs that are enabled for VMotion. Are your servers ESX or ESXi? If they are ESX, go to the command line and issue the vmkping command to see if you can reach the VMkernel interface on the other ESX host. Ensure that the VMkernel ports are on the same subnet.
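To illustrate, a quick connectivity check from the ESX service console might look like the following. These are ESX 3.x console commands, and the IP address is a placeholder - substitute the other host's actual VMkernel IP:

```
# From the service console of host A, ping host B's VMkernel interface.
# vmkping uses the VMkernel TCP/IP stack, not the service console's,
# so it exercises the same path VMotion uses.
vmkping 10.11.97.13        # placeholder: host B's VMkernel IP

# List the VMkernel ports and their IPs/netmasks to confirm both hosts
# are on the same subnet
esxcfg-vmknic -l
```

If vmkping fails while a normal ping works, the problem is on the VMkernel network (vSwitch, VLAN, or physical path), not the service console network.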
If you find this or any other answer useful please consider awarding points by marking the answer correct or helpful
Here's a KB article worth checking out.
How's your hardware? The source and destination ESX hosts need to have a number of things that are compatible, including CPU makes and families (to a certain degree).
Thanks
Lofty
PS. If you found my response helpful, think about awarding some points.
24/0 indicates a reservation conflict on the VMFS LUN. Make sure the LUN is not hard locked (my own term for a SCSI-3 reservation on the storage array) - you may wish to get the SAN guys involved (I assume it's SAN).
If there is no persistent lock on the storage array, then you may have a bigger problem (one that will take much more time to put a long-term solution in place for).
Ask the SAN guys which hosts are conflicting in an attempt to reserve the LUN. Check /var/log/vmkernel on the other hosts in the farm - if it's something like "unable to power on - no swap file", then it's an HA problem. Reply back and I will walk you through it further.
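As a minimal sketch of hunting for those conflicts yourself: SCSI status 24 (0x18) is RESERVATION CONFLICT, so you can grep each host's vmkernel log for it and see which device path is affected. The sample lines below are the ones quoted earlier in this thread; on a real host, point `LOG` at /var/log/vmkernel instead.

```shell
# Scan a vmkernel log for SCSI status 24 (reservation conflict) entries
# and report which device path is affected.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
Jun 25 09:16:12 MSC05ESX-VMSC vmkernel: 115:18:46:05.062 cpu0:1168)StorageMonitor: 196: vmhba1:1:0:0 status = 24/0 0x0 0x0 0x0
Jun 25 10:22:24 MSC05ESX-VMSC vmkernel: 115:19:52:16.434 cpu5:1041)Config: 416: "HostLocalSwapDirEnabled" = 0, Old Value: 0, (Status: 0x0)
Jun 25 10:33:13 MSC05ESX-VMSC vmkernel: 115:20:03:05.565 cpu0:1228)StorageMonitor: 196: vmhba1:1:0:0 status = 24/0 0x0 0x0 0x0
EOF
# Field 9 of each StorageMonitor line is the vmhba device path;
# count conflict lines per device
grep 'status = 24' "$LOG" | awk '{print $9}' | sort | uniq -c
conflicts=$(grep -c 'status = 24' "$LOG")
rm -f "$LOG"
echo "reservation-conflict lines: $conflicts"
```

If a stale reservation is confirmed, ESX 3.x also has `vmkfstools -L lunreset` to reset it, but that is disruptive to anything using the LUN - best done with the SAN team and/or VMware support involved.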
I hope this helps
cheers!
Thanks for your updates. I am able to vmkping between the two of them, no worries. There are only 2 hosts in the cluster and I can ping to and from both of them.
I think there is a reservation on the storage side, as one of you has said. I'm unable to find which host is causing it. I'm pretty sure it was working a while back when I first built the cluster. I hadn't done anything with the cluster for a while and now I'm trying to upgrade it to U4. I've got a number of other clusters around the world working fine.
Thanks
I've had this in the past and the issue turned out to be ACLs preventing ICMP ping to the ESX servers' default gateway...
Thanks, I just confirmed I'm able to ping the default gateway on both boxes.
Hi - were you able to find the host holding the locks?
Also, check the SC is not running out of memory.
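For reference, the Service Console is a standard Linux environment, so checking its memory is straightforward. A quick sketch to run on the ESX console (the 272 MB figure is the ESX 3.x default SC allocation):

```shell
# Show Service Console memory and swap usage in MB
free -m
# If free memory is consistently near zero and swap is heavily used,
# consider raising the SC memory allocation from the default (272 MB
# on ESX 3.x) via the VI Client host configuration.
```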
-Thanks.
Also check permissions on the SAN; we had similar issues and it was permission related.
I'm not sure how to determine what is holding the reservations on the LUN. There is only 1 VMFS volume presented to the cluster at the moment.
Here is some log info from the vmkernel log on the other host:
tail -f /var/log/vmkernel
Jun 26 10:25:21 MSC04ESX-VMSC vmkernel: 2:20:10:56.143 cpu4:1512)Sched: vm 1513: 5366: moved group 198 to be under group 17
Jun 26 10:25:21 MSC04ESX-VMSC vmkernel: 2:20:10:56.158 cpu4:1512)Swap: vm 1513: 2169: extending swap to 1048576 KB
Jun 26 10:25:21 MSC04ESX-VMSC vmkernel: 2:20:10:56.535 cpu5:1512)Migrate: vm 1513: 7338: Setting migration info ts = 1245975808481713, src ip = <10.11.97.12> dest ip = <0.0.0.0> Dest wid = -1 using SHARED swap
Jun 26 10:25:21 MSC04ESX-VMSC vmkernel: 2:20:10:56.535 cpu5:1512)World: vm 1514: 900: Starting world migSendHelper-1513 with flags 1
Jun 26 10:25:21 MSC04ESX-VMSC vmkernel: 2:20:10:56.535 cpu5:1512)World: vm 1515: 900: Starting world migRecvHelper-1513 with flags 1
Jun 26 10:26:21 MSC04ESX-VMSC vmkernel: 2:20:11:56.547 cpu6:1512)WARNING: Migrate: 1346: 1245975808481713: Migration considered a failure by the VMX. It is most likely a timeout, but check the VMX log for the true error.
Jun 26 10:26:21 MSC04ESX-VMSC vmkernel: 2:20:11:56.547 cpu6:1512)WARNING: Migrate: 1243: 1245975808481713: Failed: Migration determined a failure by the VMX (0xbad0091) @0xa048e5
Jun 26 10:26:21 MSC04ESX-VMSC vmkernel: 2:20:11:56.548 cpu6:1512)Sched: vm 1513: 1031: name='vmm0:MSC01AAC'
Jun 26 10:26:21 MSC04ESX-VMSC vmkernel: 2:20:11:56.548 cpu6:1512)CpuSched: vm 1513: 13864: zombified unscheduled world: runState=NEW
Jun 26 10:26:21 MSC04ESX-VMSC vmkernel: 2:20:11:56.548 cpu6:1512)World: vm 1513: 2488: deathPending set; world not running, scheduling reap
Interesting - dest ip 0.0.0.0. Is that expected?
Jun 26 10:25:21 MSC04ESX-VMSC vmkernel: 2:20:10:56.535 cpu5:1512)Migrate: vm 1513: 7338: Setting migration info ts = 1245975808481713, src ip = <10.11.97.12> *dest ip = <0.0.0.0>* Dest wid = -1 using SHARED swap
Can you try a cold migration? It should work.
Let us know the results.
Do you receive any errors in the VI Client? How loaded are your hosts? I have also seen this with an improperly set reservation on a resource pool, but in that case you get a "not enough resources" error in the VI Client.
I have done cold migrations and they worked fine to and from both hosts. The current load on the cluster is only about 15-20%. The only error I get in the VI Client is that it times out after about 5 or 10 minutes.
Can I ask you to post up the hardware specifications of the hosts?
Specifically, each host's CPU make/model/type etc.
Thanks
Lofty
What happens if you vmotion the other way - same error message?
Have you got multiple datastores? Can you move the machine to a different datastore in the cluster, fire it up on the same host, and then do the migration (VMotion) again?
If you have HA in place, are your hosts added with IP addresses or FQDNs?
Try this and let me know...
Hope this helps.
Are you running it with "high priority/reserve CPU for optimal VMotion performance" or "low priority/perform with available resources"? Possibly a limit on the resource pool on either the host you're migrating from or the one you're migrating to can't reserve the amount of processing needed to allow the migration. You said it works when cold migrating, so it sounds like a resource allocation problem to me. Maybe try the "perform with available resources" option. Also, try migrating it to the top level of the ESX host instead of putting it into a resource pool, as that may be limiting you, especially if there is a large number of machines in the destination location.
If you found this or other information useful, please consider awarding points for "Correct" or "Helpful".
Gregg
Hey guys,
Thanks for all your responses. I've logged a case with VMware; they have advised that for some reason it is backtracing.
AdapterServer caught unexpected exception: Invalid state
I have confirmed the licence is OK and I can see the traffic flowing through the firewall. If I keep trying, I'm able to VMotion the boxes, however it takes a few goes at it.
Thanks