CrashDodson
Contributor

Compute vMotion failing on ESXi 6.7 Update 1

For a little background, we have been using ESXi with the same storage vendor, Coraid, since 5.5. Coraid uses ATA over Ethernet (AoE) to present storage to VMware, and it's extremely fast and simple. It uses Coraid HBA cards, which are Intel 10G NICs with Coraid firmware on them and a Coraid driver loaded in ESXi.

We never had any problems until moving to ESXi 6.5. Since 6.5, compute-only vMotions fail: they hang at 18 or 20% until they time out. If I do a compute + storage vMotion to a different LUN on the same piece of storage, or to a different piece of storage (all Coraid), the vMotion works. Compute-only vMotions fail every time. I have tested this with the vMotion network going through a switch and direct-connected, and at both MTU 1500 and MTU 9000 (adjusting switch settings accordingly). I can vmkping the vMotion interfaces without any issues, with sub-millisecond latency, including jumbo packets; the checks I ran are sketched below.
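
In case it helps anyone reproduce the MTU checks, this is roughly what I ran from the ESXi shell (8972 is 9000 minus 28 bytes of ICMP/IP header overhead, and -S vmotion forces the ping out the vMotion TCP/IP stack):

[root@EPYC-1:~] esxcli network ip interface list       # check the MTU on each vmkernel port
[root@EPYC-1:~] esxcli network vswitch standard list   # confirm the vSwitch MTU matches (for a dvSwitch this lives in vCenter)
[root@EPYC-1:~] vmkping -d -S vmotion -s 8972 192.168.254.240   # jumbo-frame test with don't-fragment set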

We see the exact same issue on 6.5, 6.7, and 6.7 Update 1.

We were using Dell R715 and R815 servers, which are no longer on the hardware compatibility list for 6.5 and 6.7, so we bought new Dell R7425 servers with EPYC processors and Mellanox MT27710 network cards. Connecting these servers to the same storage (the only SANs I own), we have the exact same issue: compute-only vMotions fail, and compute + storage vMotions succeed.

> Description:
>
> 02/13/2019, 2:35:28 PM
>
> Alarm 'Migration error' on Test triggered by event 1454851 'Cannot migrate Test from epyc-2, 200.1 to epyc-1, 200.1 in Vsphere6'
>
> Related events:
> There are no related events.
>
> 02/13/2019, 2:35:28 PM
> User: VSPHERE.LOCAL\Administrator
> Test
>
> Description:
>
> 02/13/2019, 2:35:28 PM
> Cannot migrate Test from epyc-2, 200.1 to epyc-1, 200.1 in Vsphere6
>
> Event Type Description:
> Failed to migrate the virtual machine for reasons described in the event message
>
> Possible Causes:
> The virtual machine did not migrate. This condition can occur if vMotion IPs are not configured, the source and destination hosts are not accessible, and so on.
>
> Action: Check the reason in the event message to find the cause of the failure. Ensure that the vMotion IPs are configured on source and destination hosts, the hosts are accessible, and so on.

Host EPYC-1 has a vMotion IP of 192.168.254.239.

Host EPYC-2 has a vMotion IP of 192.168.254.240.
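
For anyone who wants to rule out a tagging problem, this is how you can map vmk ports to those IPs and confirm vMotion is actually enabled on them; vmk1 is just an example, so check which vmk carries the IP first:

[root@EPYC-1:~] esxcli network ip interface ipv4 get        # map each vmk port to its IPv4 address
[root@EPYC-1:~] esxcli network ip interface tag get -i vmk1 # tags should include VMotion (not needed if the vmk lives on the dedicated vmotion netstack, as ours does)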

> [root@EPYC-1:~] vmkping -d -S vmotion -s 8972 192.168.254.240
> PING 192.168.254.240 (192.168.254.240): 8972 data bytes
> 8980 bytes from 192.168.254.240: icmp_seq=0 ttl=64 time=0.378 ms
> 8980 bytes from 192.168.254.240: icmp_seq=1 ttl=64 time=0.289 ms
>
> [root@EPYC-2:~] vmkping -d -S vmotion -s 8972 192.168.254.239
> PING 192.168.254.239 (192.168.254.239): 8972 data bytes
> 8980 bytes from 192.168.254.239: icmp_seq=0 ttl=64 time=0.454 ms
> 8980 bytes from 192.168.254.239: icmp_seq=1 ttl=64 time=0.317 ms

I have a case open with VMware, going on the second day now; they are still "analyzing logs". I opened a case previously for the R715 and R815 servers, but they would not help because the hardware was unsupported.

To me it seems like it has to be storage-related, possibly an issue with the Coraid HBA drivers. It is almost as if the source host holds a lock on the storage and won't release it, so the vMotion eventually times out and fails.
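
For anyone hitting the same wall, a rough way to see where the migration actually stalls is to tail the vmkernel log on both hosts during a failed compute-only vMotion (these are the default ESXi log paths; the grep pattern is just a starting point):

[root@EPYC-1:~] tail -f /var/log/vmkernel.log | grep -iE 'migrate|vmotion'   # low-level migration progress and errors
[root@EPYC-1:~] tail -f /var/log/hostd.log                                   # management-agent view of the same attempt

The VM's own vmware.log in its datastore directory is also worth a look; file-lock complaints tend to show up there.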

Anyone have any ideas?

2 Replies
cymotorola
Contributor

Did you ever get to the bottom of this? I have the same issue at the moment. I suspect it is down to some DRS rules that appear to have been automatically created by SimpliVity.

Delta816
Contributor

I have a couple of old Coraid boxes, as well as SRXs. Do you know where to get drivers to update them for ESXi 6.7?

 
