We have an HA cluster of two IBM x3550 M4 hosts with Intel Xeon CPU E5-2650 v2 and two Lenovo x3550 M5 hosts with Intel Xeon CPU E5-2650 v4 (ESXi version 6.0.0, 8934903), with EVC mode on (Intel® "Ivy Bridge" Generation).
Over 100 VMs are hosted on the cluster, but there is an issue with one of them.
Every week at the same time, the VM crashes with the error: vcpu-0| W115: MONITOR PANIC: vcpu-1:VERIFY vmcore/vmm/platform/common/platform.c:30 bugNr=17332
Guest OS is SUSE Linux Enterprise 11 (64-bit), Compatibility: ESXi 5.5 and later (VM version 10)
There is no useful information in vmware.log (zipped log files are attached).
All memory accessed by the VM is local (see the esxtop screenshot with NUMA stats).
We tried relocating the VM to a different host, but it did not help.
Hello BB_IT,
Do you have a memory affinity set for the VCSA VM? The VCSA VM has around 98 GB of memory and 8 vCPUs, but I am not sure why someone set this affinity.
sched.mem.affinity = "1"
Can you remove this and check whether the VCSA is stable, or let me know why this was set up in the first place?
Thanks,
MS
Thank you for the answer!
The VM was pinned to a specific NUMA node (node 1), and there was not enough RAM on that node.
The VM configuration parameters were:
sched.cpu.affinity = "16-31"
sched.mem.affinity = "1"
We cleared the affinity parameters and the problem was solved.
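
For anyone who wants to check other VMs for the same kind of pinning, here is a minimal sketch in Python that scans the text of a VM configuration for the two parameters quoted above. The function name `find_affinity_settings` and the sample text are made up for illustration; only `sched.cpu.affinity` and `sched.mem.affinity` come from this thread, and the sketch assumes the settings appear as plain `key = "value"` lines.

```python
import re

# Matches configuration lines such as: sched.mem.affinity = "1"
# Only the two parameter names mentioned in this thread are considered.
_AFFINITY_RE = re.compile(
    r'^\s*(sched\.(?:cpu|mem)\.affinity)\s*=\s*"([^"]*)"\s*$'
)


def find_affinity_settings(config_text):
    """Return {parameter: value} for every non-empty affinity line found.

    Hypothetical helper for illustration; it only inspects text and does
    not talk to vCenter or ESXi.
    """
    found = {}
    for line in config_text.splitlines():
        match = _AFFINITY_RE.match(line)
        if match and match.group(2).strip():
            found[match.group(1)] = match.group(2).strip()
    return found


# Sample text (assumed layout) using the values reported in this thread.
sample = '''
numvcpus = "8"
sched.cpu.affinity = "16-31"
sched.mem.affinity = "1"
'''

settings = find_affinity_settings(sample)
for name, value in sorted(settings.items()):
    print(f"{name} is set to {value!r}")
```

An empty result means neither parameter was found in the text. It does not by itself prove that the VM has no affinity configured elsewhere.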