Hi all,
After running for a long time without any problems, some VMs started to hang.
They don't react to anything, the CPU usage is 0 Hz, and if you try to take over the VMware console of the VM, the connection gets interrupted.
So I started digging through the logs of the virtual machine and found a lot of entries like this:
Log for VMware ESX version=6.7.0 build=build-14320388
2019-12-07T14:14:01.622Z| vcpu-2| W115: Memory regions (0xfc000000, 0xfcfff000) and (0xfc810000, 0xfc81f000) overlap (0x54f0024000 0x5520026000).vcs = 0xfff, vpcuId = 0xffffffff
2019-12-07T14:14:01.627Z| vcpu-0| W115: Memory regions (0xfc000000, 0xfcfff000) and (0xfc810000, 0xfc81f000) overlap (0x54f0024000 0x5520026000).vcs = 0xfff, vpcuId = 0xffffffff
2019-12-07T14:14:01.630Z| vcpu-0| W115: Memory regions (0xfc000000, 0xfcfff000) and (0xfc810000, 0xfc81f000) overlap (0x54f0024000 0x5520026000).vcs = 0xfff, vpcuId = 0xffffffff
2019-12-07T14:14:01.634Z| vcpu-4| W115: Memory regions (0xfc000000, 0xfcfff000) and (0xfc810000, 0xfc81f000) overlap (0x54f0024000 0x5520026000).vcs = 0xfff, vpcuId = 0xffffffff
2019-12-07T14:14:01.637Z| vcpu-0| W115: Memory regions (0xfc000000, 0xfcfff000) and (0xfc810000, 0xfc81f000) overlap (0x54f0024000 0x5520026000).vcs = 0xfff, vpcuId = 0xffffffff
2019-12-07T14:14:01.639Z| vcpu-0| W115: Memory regions (0xfc000000, 0xfcfff000) and (0xfc810000, 0xfc81f000) overlap (0x54f0024000 0x5520026000).vcs = 0xfff, vpcuId = 0xffffffff
2019-12-07T14:14:01.643Z| vcpu-3| W115: Memory regions (0xfc000000, 0xfcfff000) and (0xfc810000, 0xfc81f000) overlap (0x54f0024000 0x5520026000).vcs = 0xfff, vpcuId = 0xffffffff
2019-12-07T14:14:01.647Z| vcpu-2| W115: Memory regions (0xfc000000, 0xfcfff000) and (0xfc810000, 0xfc81f000) overlap (0x54f0024000 0x5520026000).vcs = 0xfff, vpcuId = 0xffffffff
2019-12-07T14:14:01.650Z| vcpu-0| W115: Memory regions (0xfc000000, 0xfcfff000) and (0xfc810000, 0xfc81f000) overlap (0x54f0024000 0x5520026000).vcs = 0xfff, vpcuId = 0xffffffff
2019-12-07T14:14:01.653Z| vcpu-2| W115: Memory regions (0xfc000000, 0xfcfff000) and (0xfc810000, 0xfc81f000) overlap (0x54f0024000 0x5520026000).vcs = 0xfff, vpcuId = 0xffffffff
2019-12-07T14:14:01.657Z| vcpu-2| W115: Memory regions (0xfc000000, 0xfcfff000) and (0xfc810000, 0xfc81f000) overlap (0x54f0024000 0x5520026000).vcs = 0xfff, vpcuId = 0xffffffff
2019-12-07T14:14:01.660Z| vcpu-2| W115: Memory regions (0xfc000000, 0xfcfff000) and (0xfc810000, 0xfc81f000) overlap (0x54f0024000 0x5520026000).vcs = 0xfff, vpcuId = 0xffffffff
2019-12-07T14:14:01.662Z| vcpu-2| W115: Memory regions (0xfc000000, 0xfcfff000) and (0xfc810000, 0xfc81f000) overlap (0x54f0024000 0x5520026000).vcs = 0xfff, vpcuId = 0xffffffff
2019-12-07T14:14:01.668Z| vcpu-2| W115: Memory regions (0xfc000000, 0xfcfff000) and (0xfc810000, 0xfc81f000) overlap (0x54f0024000 0x5520026000).vcs = 0xfff, vpcuId = 0xffffffff
2019-12-07T14:14:01.673Z| vcpu-2| W115: Memory regions (0xfc000000, 0xfcfff000) and (0xfc810000, 0xfc81f000) overlap (0x54f0024000 0x5520026000).vcs = 0xfff, vpcuId = 0xffffffff
2019-12-07T14:14:01.679Z| vcpu-4| W115: Memory regions (0xfc000000, 0xfcfff000) and (0xfc810000, 0xfc81f000) overlap (0x54f0024000 0x5520026000).vcs = 0xfff, vpcuId = 0xffffffff
And after this:
2019-12-07T14:14:01.679Z| vcpu-4| E105: PANIC: PhysMem: creating too many Global lookups.
2019-12-07T14:14:08.634Z| vcpu-4| W115: A core file is available in "/vmfs/volumes/5cdd51ee-fd4310f2-58c4-24b6fd652bce/0-pg-virtgpu008/vmx-zdump.000"
2019-12-07T14:14:08.634Z| mks| W115: Panic in progress... ungrabbing
2019-12-07T14:14:08.634Z| mks| I125: MKS: Release starting (Panic)
2019-12-07T14:14:08.634Z| mks| I125: MKS: Release finished (Panic)
2019-12-07T14:14:08.643Z| vcpu-4| I125: Writing monitor file `vmmcores.gz`
2019-12-07T14:14:08.722Z| vcpu-4| W115: Dumping core for vcpu-0
2019-12-07T14:14:08.722Z| vcpu-4| I125: VMK Stack for vcpu 0 is at 0x451ae7c93000
2019-12-07T14:14:08.722Z| vcpu-4| I125: Beginning monitor coredump
2019-12-07T14:14:09.115Z| vcpu-4| I125: End monitor coredump
2019-12-07T14:14:09.116Z| vcpu-4| W115: Dumping core for vcpu-1
2019-12-07T14:14:09.116Z| vcpu-4| I125: VMK Stack for vcpu 1 is at 0x451af3b13000
2019-12-07T14:14:09.116Z| vcpu-4| I125: Beginning monitor coredump
2019-12-07T14:14:09.510Z| vcpu-4| I125: End monitor coredump
2019-12-07T14:14:09.510Z| vcpu-4| W115: Dumping core for vcpu-2
2019-12-07T14:14:09.510Z| vcpu-4| I125: VMK Stack for vcpu 2 is at 0x451aebb13000
2019-12-07T14:14:09.510Z| vcpu-4| I125: Beginning monitor coredump
2019-12-07T14:14:09.904Z| vcpu-4| I125: End monitor coredump
2019-12-07T14:14:09.905Z| vcpu-4| W115: Dumping core for vcpu-3
2019-12-07T14:14:09.905Z| vcpu-4| I125: VMK Stack for vcpu 3 is at 0x451aeb713000
2019-12-07T14:14:09.905Z| vcpu-4| I125: Beginning monitor coredump
2019-12-07T14:14:10.300Z| vcpu-4| I125: End monitor coredump
2019-12-07T14:14:10.300Z| vcpu-4| W115: Dumping core for vcpu-4
2019-12-07T14:14:10.300Z| vcpu-4| I125: VMK Stack for vcpu 4 is at 0x451affd13000
2019-12-07T14:14:10.300Z| vcpu-4| I125: Beginning monitor coredump
2019-12-07T14:14:10.692Z| vcpu-4| I125: End monitor coredump
2019-12-07T14:14:10.693Z| vcpu-4| W115: Dumping core for vcpu-5
2019-12-07T14:14:10.693Z| vcpu-4| I125: VMK Stack for vcpu 5 is at 0x451af0313000
2019-12-07T14:14:10.693Z| vcpu-4| I125: Beginning monitor coredump
2019-12-07T14:14:11.085Z| vcpu-4| I125: End monitor coredump
2019-12-07T14:14:11.085Z| vcpu-4| W115: Dumping core for vcpu-6
2019-12-07T14:14:11.085Z| vcpu-4| I125: VMK Stack for vcpu 6 is at 0x451afcb13000
2019-12-07T14:14:11.085Z| vcpu-4| I125: Beginning monitor coredump
2019-12-07T14:14:11.474Z| vcpu-4| I125: End monitor coredump
2019-12-07T14:14:11.474Z| vcpu-4| W115: Dumping core for vcpu-7
2019-12-07T14:14:11.474Z| vcpu-4| I125: VMK Stack for vcpu 7 is at 0x451af2813000
2019-12-07T14:14:11.474Z| vcpu-4| I125: Beginning monitor coredump
2019-12-07T14:14:11.941Z| vcpu-4| I125: End monitor coredump
2019-12-07T14:14:11.941Z| vcpu-4| W115: Dumping core for vcpu-8
2019-12-07T14:14:11.941Z| vcpu-4| I125: VMK Stack for vcpu 8 is at 0x451af4913000
2019-12-07T14:14:11.941Z| vcpu-4| I125: Beginning monitor coredump
2019-12-07T14:14:12.334Z| vcpu-4| I125: End monitor coredump
2019-12-07T14:14:12.334Z| vcpu-4| W115: Dumping core for vcpu-9
2019-12-07T14:14:12.335Z| vcpu-4| I125: VMK Stack for vcpu 9 is at 0x451ae7293000
2019-12-07T14:14:12.335Z| vcpu-4| I125: Beginning monitor coredump
2019-12-07T14:14:12.728Z| vcpu-4| I125: End monitor coredump
2019-12-07T14:14:12.728Z| vcpu-4| W115: Dumping core for vcpu-10
2019-12-07T14:14:12.728Z| vcpu-4| I125: VMK Stack for vcpu a is at 0x451af4f93000
2019-12-07T14:14:12.728Z| vcpu-4| I125: Beginning monitor coredump
2019-12-07T14:14:13.121Z| vcpu-4| I125: End monitor coredump
2019-12-07T14:14:13.122Z| vcpu-4| W115: Dumping core for vcpu-11
2019-12-07T14:14:13.122Z| vcpu-4| I125: VMK Stack for vcpu b is at 0x451ae8913000
2019-12-07T14:14:13.122Z| vcpu-4| I125: Beginning monitor coredump
2019-12-07T14:14:13.514Z| vcpu-4| I125: End monitor coredump
2019-12-07T14:14:34.966Z| vcpu-4| I125: Printing loaded objects
So the VM has crashed, and it looks memory related.
I have more VMs behaving like this.
Does anyone have any idea?
Thanks!!
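In case anyone wants to check whether other VMs on their hosts are logging the same warning before they panic, this is roughly how you can scan a vmware.log for it. The sample below writes a couple of illustrative lines to a temp file just to demonstrate the grep; on a real host you would point it at the log files under your datastores (e.g. /vmfs/volumes/&lt;datastore&gt;/&lt;vm&gt;/vmware*.log -- paths will differ per setup):

```shell
# Illustrative only: count W115 overlap warnings in a vmware.log.
# The sample log content below mimics the entries from this thread.
log=/tmp/sample_vmware.log
printf '%s\n' \
  '2019-12-07T14:14:01.622Z| vcpu-2| W115: Memory regions (0xfc000000, 0xfcfff000) and (0xfc810000, 0xfc81f000) overlap (0x54f0024000 0x5520026000).vcs = 0xfff, vpcuId = 0xffffffff' \
  '2019-12-07T14:14:01.679Z| vcpu-4| E105: PANIC: PhysMem: creating too many Global lookups.' > "$log"

# Count lines that match the overlap warning
count=$(grep -c 'W115: Memory regions .* overlap' "$log")
echo "overlap warnings in $log: $count"
```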
Hi Blaze4up,
Did you ever get a fix or a response for this problem?
We are experiencing exactly the same issue:
ESXi 6.7 U3 Host with 2x Tesla V100 GPUs
NVIDIA GRID Host Driver 10.3
VMs with vGPUs are randomly failing/crashing due to Memory region overlap errors.
VMs on the same hosts without vGPUs seem to be running fine.
What type of firmware is your virtual machine configured to use – BIOS or [U]EFI? Our EFI implementation is much better at handling passthrough devices with large MMIO regions such as GPUs.
If you are currently using BIOS, though, it might be worth trying to set up a VM with EFI firmware instead. Unfortunately most OSes can't simply be switched from BIOS boot to EFI boot without reinstalling the OS. 😞
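For reference, the firmware type is controlled by the `firmware` option in the VM's .vmx file (also settable under VM Options > Boot Options in the vSphere client). A minimal sketch, assuming the VM is powered off when you edit it; note that, as mentioned above, flipping this on an already-installed guest will usually leave it unbootable unless the OS was installed for EFI:

```
# In the VM's .vmx file; the default (or an absent entry) means BIOS
firmware = "efi"
```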
--
Darius
Hi Darius,
Thanks for your reply! And sorry for my late one; apparently I wasn't notified about it.
The VMs use BIOS firmware and run CentOS 7. We use the GPUs with vGPU and the largest profile, the 32 GB one.
We currently run NVIDIA GRID 10.0, so I will see whether the issue still appears.
If it does, do you suggest we try EFI?
thanks, Gemma
Hi,
No, we don't have a fix yet.
We have one GPU in each host.
Indeed, VMs without vGPU are luckily running fine.
If you find a fix and would like to share it, I would much appreciate it!
Thanks, Gemma
If you have the time to create a VM with EFI firmware and install the OS into that, it might be a worthwhile experiment.
--
Darius