CristianTeica
Contributor

VM migration (vMotion) failing for Windows cluster VMs that share RDM disks – ESXi 6.7.0 Update 3

Hi everyone.

Hope all of you are ok and safe.

I need to bring up this issue that I have discovered on my upgraded ESXi cluster.

I have recently upgraded all ESXi hosts (6 hosts, PowerEdge R630) plus the VCSA server from ESXi 6.0 U3 to ESXi 6.7.0 Update 3 - Build 17700523. Everything was done as an in-place upgrade, and I did not have to do a clean install on any server.

Now, the issues:

When I try to migrate a VM that is part of my Windows cluster with vMotion, the task gets stuck at 20% and eventually fails after 30 minutes with the error message "A general system error occurred: Invalid fault". During this period, the source ESXi host becomes unresponsive to my monitoring triggers, but all other VMs running on it are fine. If I shut down the VM, the migration task succeeds, but it takes 30 minutes to finish and another 30 minutes to power the VM back on. Basically, every time I shut down or start any cluster VM, each action takes about 30 minutes.

My Windows cluster has 2 VMs. The cluster shares some physical LUNs from the EMC storage (model CT-SC4020), added as RDM disks to both VMs in the cluster.

The OS on both VMs is Windows Server 2016 Core.

The good news is that I can still migrate (vMotion) all other VMs, Windows and Linux, without issues.

The issue appears only for the VMs that share physical LUNs / RDM disks.

It looks like it is related to the RDM disks shared between the Windows cluster VMs, connectivity performance, or EMC storage compatibility.

Things that I tried:

- I upgraded the VM hardware and VMware Tools on the VM to match the ESXi host, and also tried unregistering the VM from the inventory and re-registering it on a different host, but unfortunately that didn't solve the issue.

- I was able to reproduce the issue by deploying 2 more VMs from the same template and creating a new cluster. I also created new physical LUNs and attached them to the new cluster. I noticed that I couldn't take a VM snapshot while the VM was running, and this was before adding the RDM disks to the new Windows VMs.

On the old version, ESXi 6.0 U3, vMotion worked for all VMs without exception.

I would appreciate it if anyone could help, or if somebody who had a similar issue could explain what needs to be done.

Let me know if you need anything else,

Thanks in advance,

Nelu.

Below are the error from the migration task and some log entries from hostd.log (from the host).

Error message in the vCenter migration task:

Relocate virtual machine

Status: A general system error occurred: Invalid fault

Initiator: VSPHERE.LOCAL\Administrator

Target: test-clu-2

Server: VCENTER

Related events:

05/12/2021, 5:42:03 AM: Cannot migrate test-clu-2 from ESX3, Volume1_SSD to ESX2, Volume1_SSD in DataCenter1

05/12/2021, 5:25:59 AM: Hot migrating test-clu-2 from ESX3, Volume1_SSD in DataCenter1 to ESX2, Volume1_SSD in DataCenter1 with encryption

05/12/2021, 5:25:59 AM: Task: Relocate virtual machine

Error message in hostd.log:

2021-05-11T12:36:43.120Z info hostd[2099715] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5xxxxx0fb33-09xxxxea-bxxxx-f4xxxxxxxb90/test-clu-2/test-clu-2.vmx] VigorMigrateNotifyCb:: hostlog state changed from emigrating to failure
2021-05-11T12:36:43.121Z info hostd[2099715] [Originator@6876 sub=Vcsvc.VMotionSrc.503673809652417251] ResolveCb: VMX reports needsUnregister = false for migrateType MIGRATE_TYPE_VMOTION
2021-05-11T12:36:43.121Z info hostd[2099715] [Originator@6876 sub=Vcsvc.VMotionSrc.503673809652417251] ResolveCb: Failed with fault: (vim.fault.GenericVmConfigFault) {
-->    faultCause = (vmodl.MethodFault) null,
-->    faultMessage = (vmodl.LocalizableMessage) [
-->       (vmodl.LocalizableMessage) {
-->          key = "msg.migrate.expired",
-->          arg = <unset>,
-->          message = "Timed out waiting for migration start request.
--> "
-->       }
-->    ],
-->    reason = "Timed out waiting for migration start request.
--> "
-->    msg = "Timed out waiting for migration start request.
--> "
--> }
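
For anyone comparing their own logs against this excerpt, here is a minimal Python sketch that pulls the migration state changes, fault types, and fault message keys out of hostd.log lines. It assumes the log layout shown above and nothing more; MigrationEvent and scan_hostd are hypothetical helper names made up for illustration, not part of any VMware tool.

```python
import re
from typing import Iterable, Iterator, NamedTuple

# Patterns are derived only from the hostd.log excerpt in this post (assumption:
# other builds may format these lines differently).
TS = r"(\d{4}-\d{2}-\d{2}T[\d:.]+Z)"
STATE_RE = re.compile(TS + r".*hostlog state changed from (\w+) to (\w+)")
FAULT_RE = re.compile(TS + r".*Failed with fault: \(([\w.]+)\)")
KEY_RE = re.compile(r'-->\s+key = "([^"]+)"')


class MigrationEvent(NamedTuple):
    timestamp: str
    kind: str    # "state", "fault", or "key"
    detail: str


def scan_hostd(lines: Iterable[str]) -> Iterator[MigrationEvent]:
    """Yield migration state changes, fault types, and fault message keys."""
    last_ts = ""  # "-->" continuation lines carry no timestamp of their own
    for line in lines:
        m = STATE_RE.match(line)
        if m:
            last_ts = m.group(1)
            yield MigrationEvent(last_ts, "state", f"{m.group(2)} -> {m.group(3)}")
            continue
        m = FAULT_RE.match(line)
        if m:
            last_ts = m.group(1)
            yield MigrationEvent(last_ts, "fault", m.group(2))
            continue
        m = KEY_RE.match(line)
        if m:
            # Attribute the key to the most recent timestamped entry.
            yield MigrationEvent(last_ts, "key", m.group(1))


# Lines copied from the excerpt above (datastore path left masked as posted).
SAMPLE = """\
2021-05-11T12:36:43.120Z info hostd[2099715] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5xxxxx0fb33-09xxxxea-bxxxx-f4xxxxxxxb90/test-clu-2/test-clu-2.vmx] VigorMigrateNotifyCb:: hostlog state changed from emigrating to failure
2021-05-11T12:36:43.121Z info hostd[2099715] [Originator@6876 sub=Vcsvc.VMotionSrc.503673809652417251] ResolveCb: Failed with fault: (vim.fault.GenericVmConfigFault) {
-->    faultMessage = (vmodl.LocalizableMessage) [
-->       (vmodl.LocalizableMessage) {
-->          key = "msg.migrate.expired",
-->       }
-->    ],
"""

if __name__ == "__main__":
    for event in scan_hostd(SAMPLE.splitlines()):
        print(event.timestamp, event.kind, event.detail)
```

To scan a real log, the same function can be given the lines of a copy of hostd.log opened with open(); filtering on the "key" events makes it easy to see whether every failed attempt ends in msg.migrate.expired.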

12 Replies
BoPesala
Contributor

Hi,

From a quick look at your message, I would say you should check whether the destination cluster/host has the same protogroup.

 

CristianTeica
Contributor

Hi BoPesala,

What do you mean by "same protogroup"?

Thanks.

BoPesala
Contributor

Sorry for the typo. I am referring to the portgroup of the vSwitch. 

When you migrate from one cluster to another, or between hosts inside a cluster, make sure the VM network is accessible from the destination, i.e. that the portgroup is available there.
Thanks

CristianTeica
Contributor

The portgroups are ok.

For example, a VM in the Windows cluster uses 2 NICs on different portgroups, and these portgroups are available (with the same names) on the destination host.

Is this what you mean?

If I shut down the VM I can migrate it, but it takes a long time to finish.

The migration task with the option "Change the compute resource only" reports "Compatibility checks succeeded", but it still fails at 20%, and only when using vMotion (VM running).

What else could it be?

Thanks.

BoPesala
Contributor

Hi,

OK. Migrating a powered-off VM should not take a long time. Unlike Storage vMotion, no storage changes happen; vMotion only syncs the running VM's memory (the OS loaded into RAM) to the destination host. A powered-off VM does not have any RAM data, so it should be very fast. Since this operation succeeds anyway, you should have a working management network. However, check the following.

Check the VMkernel adapter configuration. This is responsible for the vMotion traffic.

Do you have a separate portgroup for management traffic? (Ideally you should.)
This portgroup should have the VMkernel adapter configured. It should look like the screenshot below.

(Screenshot: BoPesala_1-1620913689169.png)

When you click on the 3 dots next to vmkX (it is vmk0 in my screenshot), you should see the port settings. Here you need to enable vMotion and management network traffic.

(Screenshot: BoPesala_2-1620913785424.png)

 

 

 

CristianTeica
Contributor

Yes, I'm using dedicated VMkernel adapters, 1 for management and 1 for vMotion.

Network communication has also been tested by pinging the VMkernel IPs from one host to another (from the console).

Thanks.

CristianTeica
Contributor

If vMotion did not work at all, I could understand it. Since it fails only for these VMs, I think it is related to the shared disks.

When one VM is running and I start the second VM, it takes a long time to start. This is not normal; the VM should start immediately.

It is as if the shared disks become locked when one of the machines is using them.

Any other suggestions?

Thank you.

 

SubathraL
Contributor

Hi CristianTeica,

I am having exactly the same issue as you. After I upgraded to build 17700523, my VMs that are part of a Windows cluster using physical RDMs fail live migration. I get the same error as you. As you said, cold migration (VM shut down) also takes close to 45 minutes to complete, and power-on also takes some time. I had no issues with these VMs until now. I am pretty sure it started after 17700523. Have you found a solution yet? I still have the issue.

flucidi
Contributor

Hi

I am having the same issue. After I upgraded to 17700523, the VMs that are part of a Windows cluster using physical RDMs fail live migration. The same slowness appears when moving them powered off, and they need a very long time to power on. The issue started after 17700523. Has anyone found a solution or a workaround for this issue?

Regards

Fabrizio

SubathraL
Contributor

Hi Flucidi,

I happened to come across the article below and asked VMware whether it can resolve the issue. After checking internally, they said yes. However, I haven't tried it in my environment yet.

https://kb.vmware.com/s/article/84347?lang=en_US

If you can test this in your test environment, please let us know the results as well.
I do not have a test environment that is similar to prod :(

flucidi
Contributor

Hi SubathraL,
After some research, I was able to find the same KB.

I applied the suggested configuration in our production environment and that solved the issue.

Now we are able to vMotion the cluster members without any issues.

SubathraL
Contributor

Hi Flucidi,

 

When you applied the configuration, did you do it online, or did you first move all the virtual machines off the ESXi host and shut down the cluster member? Can you please confirm?
