Raudi
Expert

vMotion issue when VM on a VVOL Datastore

Hi,

I installed NetApp VSC 9.7.1 in my lab, which runs vSphere 7.0.1. VSC brings the VASA provider for VVOLs, which I want to test on my NetApp storage connected via NFS.

Installation worked without issues, and so did the migration from the NFS 4.1 datastores to the VVOL datastore.

Then I noticed a problem with the VCSA: VCSA on a VVOL? Is this supported?

After a bit more testing I noticed that this is a general vMotion problem when the VM is on a VVOL datastore. When the same VM is on an NFS 4.1 datastore, I have no issue.

I first opened a case with NetApp, but they didn't find anything in the logs and told me to open a case with VMware too. I opened one there a week ago and provided all the logs, but they took a very long time to decide which department should work on it. Yesterday I talked to support for the first time, and today we finished the tests to confirm that the problem occurs only during a normal vMotion and only when the VM is on a VVOL. The engineer will now forward the case to the storage team.

While I'm waiting for support, I thought I'd write down my issue here; perhaps someone else has had a similar problem...

Problem:

vMotion from one host to another gets stuck at 85%, then the VM freezes for about 30 seconds; after that the vMotion continues and finishes, and the VM is running again.

Here are some log excerpts:

Log of the VM from source-host:

2020-10-14T15:52:14.200Z| vmx| W003: VMX has left the building: 0.

VMKernel from source-host:

2020-10-14T15:52:14.251Z cpu4:2105329)VVol: VVolRemoveDev:7163: Unlinking (VVOL_OBJTYPE_VMDK) VVol device rfc4122.80207299-548e-459c-bc0c-4d45318cfae2

2020-10-14T15:52:14.332Z cpu18:2099869)VVol: VVolRemoveDev:7163: Unlinking (VVOL_OBJTYPE_CONFIG) VVol device rfc4122.1edaed3d-4db9-44d6-a945-79567334ffa0

The VM left the source host at 17:52:14 local time (15:52:14Z in the logs), so it must have started on the destination within the same second...

Log of the VM from the destination-host:

2020-10-14T15:52:14.190Z| vcpu-0| I005: Transitioned vmx/execState/val to poweredOn

2020-10-14T15:52:14.191Z| vcpu-0| I005: MigrateSetState: Transitioning from state 12 to 0.

2020-10-14T15:52:54.205Z| vmx| I005: DiskUpgradeMultiwriter: Upgraded open disk 'scsi0:0' from multiwriter.

There is a large gap here, between second 14 and second 54, with no log messages at all.
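For longer logs, a gap like this can be found with a small script instead of by eye. The following is only a sketch: it assumes every relevant line starts with an ISO timestamp ending in `Z`, as in the excerpts above, and the function names and the 5-second threshold are my own choices for illustration.

```python
from datetime import datetime

def parse_ts(line):
    """Return the leading UTC timestamp of a log line, or None if there is none."""
    stamp = line.split("Z", 1)[0]
    try:
        return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%f")
    except ValueError:
        return None

def find_gaps(lines, threshold_s=5.0):
    """Yield (seconds, previous_line, next_line) for silent periods longer than threshold_s."""
    prev_ts, prev_line = None, None
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue  # skip blank lines and continuation lines without a timestamp
        if prev_ts is not None:
            delta = (ts - prev_ts).total_seconds()
            if delta > threshold_s:
                yield delta, prev_line, line
        prev_ts, prev_line = ts, line

# The three vmware.log lines from the destination host quoted above.
sample = [
    "2020-10-14T15:52:14.190Z| vcpu-0| I005: Transitioned vmx/execState/val to poweredOn",
    "2020-10-14T15:52:14.191Z| vcpu-0| I005: MigrateSetState: Transitioning from state 12 to 0.",
    "2020-10-14T15:52:54.205Z| vmx| I005: DiskUpgradeMultiwriter: Upgraded open disk 'scsi0:0' from multiwriter.",
]

gaps = list(find_gaps(sample))
for seconds, before, after in gaps:
    print(f"{seconds:.3f} s of silence before: {after}")
```

On the three lines above this reports a single gap of roughly 40 seconds, ending at the DiskUpgradeMultiwriter message.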

VMKernel from the destination-host:

2020-10-14T15:52:12.956Z cpu3:2103898)VVol: VVolMakeDev:6740: Creating a device for rfc4122.1edaed3d-4db9-44d6-a945-79567334ffa0 (Type VVOL_OBJTYPE_CONFIG)

2020-10-14T15:52:13.264Z cpu16:2103911)VVol: VVolMakeDev:6740: Creating a device for rfc4122.80207299-548e-459c-bc0c-4d45318cfae2 (Type VVOL_OBJTYPE_UNKNOWN)

2020-10-14T15:52:14.190Z cpu25:2103920)Hbr: 3731: Migration end received (worldID=2103906) (migrateType=1) (event=1) (isSource=0) (sharedConfig=1)

2020-10-14T15:52:14.191Z cpu8:2103915)VMotion: 3230: 8288837917254555216 D: VMotion bandwidth in last 1s: 27 MB/s,

2020-10-14T15:52:14.194Z cpu3:2103923)Swap: vm 2103906: 5135: Finish swapping in migration swap file. (faulted 0 pages). Success.

2020-10-14T15:52:44.200Z cpu25:2103905)NFSLock: 3302: lock .lck-1c7bdce900000000 expired: counter prev 584 3fc5805f-1e9c2009-3763-ac1f6bc58788 : curr 584 3fc5805f-1e9c2009-3763-ac1f6bc58788 (loop count 3)

This is the message I'm wondering about...
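To see where the lock message falls inside the freeze, the timestamps can be put on one timeline relative to the "Migration end received" line. This is a rough sketch only: the log lines are shortened copies of the excerpts above, the event names are labels I made up, and the numbers show timing, not a proven cause.

```python
from datetime import datetime

def ts(line):
    """Parse the leading UTC timestamp (everything before the first 'Z')."""
    return datetime.strptime(line.split("Z", 1)[0], "%Y-%m-%dT%H:%M:%S.%f")

# Shortened copies of the destination-host lines; the keys are illustrative labels.
events = {
    "migration_end": "2020-10-14T15:52:14.190Z cpu25:2103920)Hbr: 3731: Migration end received",
    "nfs_lock_expired": "2020-10-14T15:52:44.200Z cpu25:2103905)NFSLock: 3302: lock .lck-1c7bdce900000000 expired",
    "disk_upgrade_logged": "2020-10-14T15:52:54.205Z| vmx| I005: DiskUpgradeMultiwriter: Upgraded open disk 'scsi0:0'",
}

start = ts(events["migration_end"])
offsets = {name: (ts(line) - start).total_seconds() for name, line in events.items()}

for name, seconds in offsets.items():
    print(f"{name:20s} +{seconds:.3f} s")
```

With these timestamps, the lock expiry is logged about 30 seconds after the migration end, and the next VM log message follows about 10 seconds later.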

Hostd from the destination-host:

2020-10-14T15:52:13.138Z verbose hostd[2099792] [Originator@6876 sub=Vigor.Vmsvc.vm:/vmfs/volumes/vvol:fb1e3913ec4448e4-bf4e00000098990c/rfc4122.1edaed3d-4db9-44d6-a945-79567334ffa0/srv15 - Web-Server.vmx] VMotion destination started; powering on

2020-10-14T15:52:13.213Z info hostd[2100209] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/vvol:fb1e3913ec4448e4-bf4e00000098990c/rfc4122.1edaed3d-4db9-44d6-a945-79567334ffa0/srv15 - Web-Server.vmx] VigorMigrateNotifyCb:: hostlog state changed from emigrating to none

2020-10-14T15:52:54.219Z verbose hostd[2100094] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/vvol:fb1e3913ec4448e4-bf4e00000098990c/rfc4122.1edaed3d-4db9-44d6-a945-79567334ffa0/srv15 - Web-Server.vmx] VMotionStatusCb [8288837917254555216]: Succeeded

2020-10-14T15:52:54.219Z verbose hostd[2100094] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/vvol:fb1e3913ec4448e4-bf4e00000098990c/rfc4122.1edaed3d-4db9-44d6-a945-79567334ffa0/srv15 - Web-Server.vmx] VMotionStatusCb: Firing ResolveCb

2020-10-14T15:52:54.219Z info hostd[2100094] [Originator@6876 sub=Vcsvc.VMotionDst.8288837917254555216] ResolveCb: VMX reports needsUnregister = false for migrateType MIGRATE_TYPE_VMOTION

2020-10-14T15:52:54.219Z info hostd[2100094] [Originator@6876 sub=Vcsvc.VMotionDst.8288837917254555216] ResolveCb: Succeeded

2020-10-14T15:52:54.220Z info hostd[2100094] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/vvol:fb1e3913ec4448e4-bf4e00000098990c/rfc4122.1edaed3d-4db9-44d6-a945-79567334ffa0/srv15 - Web-Server.vmx] Disk access enabled.

2020-10-14T15:52:54.221Z info hostd[2100094] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/vvol:fb1e3913ec4448e4-bf4e00000098990c/rfc4122.1edaed3d-4db9-44d6-a945-79567334ffa0/srv15 - Web-Server.vmx] State Transition (VM_STATE_IMMIGRATING -> VM_STATE_ON)

2020-10-14T15:52:54.225Z info hostd[2100094] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/vvol:fb1e3913ec4448e4-bf4e00000098990c/rfc4122.1edaed3d-4db9-44d6-a945-79567334ffa0/srv15 - Web-Server.vmx] Send config update invoked

The same gap appears here. I'm also wondering about the message "Disk access enabled" at second 54: why so late?

The main question: what happens between second 14 and second 54, and how can I fix it?

Kind regards

Stefan

7 Replies
Raudi
Expert

Still no solution...

NetApp closed the case some weeks ago because VMware was too slow and they didn't want to wait any longer. Now VMware has closed its case too, because they didn't find anything in my logs. To escalate the case, I must first bring the versions I use to the latest versions listed in the VMware HCL:

  • ESXi 6.5 U3
  • VASA Provider 7.1P1 (end of support since Jan 2019)
  • ONTAP 9.4 (end of support since Jun 2019)

O.k., so NetApp failed to update the VMware HCL with the information from their IMT. They will do that now, I hope quickly, so that I can reopen the case...

Raudi
Expert

A short update: according to the NetApp IMT, VSC 9.7.1p1 now supports vSphere 7.0.1, but the information is still missing from the VMware HCL...

And I still have the problem: when a VM is on a VVOL datastore, it freezes during vMotion for 30 to 40 seconds.

So no change and no solution regarding the vMotion problem...

depping
Leadership

With these types of situations, unfortunately only support can really help you. You would indeed need to have all components on a supported version, and then file an SR. I doubt there's anything anyone here can help with, considering the level at which your problem occurs.

depping
Leadership

I would recommend posting your SR number here, so that any VMware employee can look into it, or potentially escalate it when it is not getting the right traction.

Raudi
Expert

Only NetApp can help: they must first send updates to VMware so that the VMware HCL shows the same as the NetApp IMT.

Or someone tells support to ignore the VMware HCL and trust the NetApp IMT.

This was the SR: 20164311910

But it was closed by VMware in November...

 

Raudi
Expert

I got a notice in the NetApp community that this is caused by bug 2668244 and that a fix is targeted for ESXi 7.0 U2. There is a problem with NFS file locking that slows down the migration of VMs on VVOL datastores.

So I tested the December beta:

host1 = ESXi 7.0.1 (U1c)
host2 = ESXi 7.0.2 (december beta)

When I do a vMotion from host1 to host2, it takes only 5 seconds.

But going back from host2 to host1 takes 40 seconds, and the VM is frozen most of that time.

So I was hitting this bug, and it is fixed in the next update release...

depping
Leadership

Somehow my reply wasn't posted last week. I checked the PR and it is indeed supposed to be included in U2. I also see that a hotpatch was requested by another customer for this problem. I don't know what the requirements are for being able to request a hotpatch, but it may be something you could do as well...
