VMware Cloud Community
GeoPerkins
Enthusiast

EVC Mode (ESXi vSphere 6.5) Issues

I have had some 'strange' symptoms after a planned data center power outage (all ESXi hosts had to be powered off).

Clusters are set with EVC Westmere (L3) or Sandy-Bridge (L4). ESXi 6.5 at several different build levels.

When some hosts rejoined their cluster after power-up, they did not seem to pick up the cluster's EVC mode and behaved as if EVC were disabled. Consequently, the VMs that powered up on these rogue hosts took on the highest available processor features, e.g. Haswell (L6). We then could not vMotion those VMs anywhere and had to schedule power-off migrations to bring them back down to Westmere (L3) or Sandy-Bridge (L4). Needless to say, having to schedule a power-off was disruptive. The affected host was removed from the cluster, restarted, and added back to the cluster, after which it was constrained by the cluster's EVC setting.

To simplify I want to set all clusters to Sandy-Bridge (L4) but I am concerned that some VMs are operating at Haswell (L6) in "stealth" mode and I won't see this until I get partially through the process. I know they will need to be powered off to complete the reduction to Sandy-Bridge L4.

Furthermore, I have read other posts that seem to implicate the Spectre/Meltdown patches as making EVC mode invalid as a true test of compatibility. How do I report the status of my CPU compatibility if EVC mode by itself is not a true measure? This makes me uncertain and nervous. Prior to the data center power restart, vMotion was working as expected. Now we have problems.

Please continue for questions at bottom.

When I view the vmware.log after power on, I see:

vmware.log

2019-08-24T14:00:20.312Z| vmx| I125: FeatureCompat: EVC masks:

2019-08-24T14:00:20.312Z| vmx| I125: Masking Feature: cpuid.STEPPING - Val:0

2019-08-24T14:00:20.312Z| vmx| I125: Masking Feature: cpuid.Intel - Val:1

2019-08-24T14:00:20.312Z| vmx| I125: Masking Feature: cpuid.RDRAND - Val:1

2019-08-24T14:00:20.312Z| vmx| I125: Masking Feature: cpuid.FCMD - Val:1

2019-08-24T14:00:20.312Z| vmx| I125: Masking Feature: cpuid.MODEL - Val:0x3a

2019-08-24T14:00:20.312Z| vmx| I125: Masking Feature: cpuid.XSAVE - Val:1

2019-08-24T14:00:20.312Z| vmx| I125: Masking Feature: cpuid.LM - Val:1

2019-08-24T14:00:20.312Z| vmx| I125: Masking Feature: cpuid.NUM_EXT_LEVELS - Val:0x80000008

2019-08-24T14:00:20.312Z| vmx| I125: Masking Feature: cpuid.ENFSTRG - Val:1

2019-08-24T14:00:20.312Z| vmx| I125: Masking Feature: cpuid.MWAIT - Val:1

2019-08-24T14:00:20.312Z| vmx| I125: Masking Feature: cpuid.FAMILY - Val:6

2019-08-24T14:00:20.312Z| vmx| I125: Masking Feature: cpuid.VMX - Val:1

2019-08-24T14:00:20.312Z| vmx| I125: Masking Feature: cpuid.XCR0_MASTER_YMM_H - Val:1

2019-08-24T14:00:20.312Z| vmx| I125: Masking Feature: misc.cpuidFaulting - Val:1

2019-08-24T14:00:20.312Z| vmx| I125: Masking Feature: cpuid.PCID - Val:1

2019-08-24T14:00:20.312Z| vmx| I125: Masking Feature: cpuid.SSBD - Val:1

2019-08-24T14:00:20.312Z| vmx| I125: Masking Feature: cpuid.SSSE3 - Val:1

2019-08-24T14:00:20.312Z| vmx| I125: Masking Feature: cpuid.SSE3 - Val:1

2019-08-24T14:00:20.312Z| vmx| I125: Masking Feature: cpuid.NX - Val:1

2019-08-24T14:00:20.312Z| vmx| I125: Masking Feature: cpuid.SSE41 - Val:1

2019-08-24T14:00:20.312Z| vmx| I125: Masking Feature: cpuid.AES - Val:1

2019-08-24T14:00:20.312Z| vmx| I125: Masking Feature: cpuid.STIBP - Val:1

2019-08-24T14:00:20.312Z| vmx| I125: Masking Feature: cpuid.PCLMULQDQ - Val:1

2019-08-24T14:00:20.312Z| vmx| I125: Masking Feature: cpuid.SS - Val:1

2019-08-24T14:00:20.312Z| vmx| I125: Masking Feature: cpuid.POPCNT - Val:1

2019-08-24T14:00:20.312Z| vmx| I125: Masking Feature: cpuid.AVX - Val:1

2019-08-24T14:00:20.312Z| vmx| I125: Masking Feature: vt.realmode - Val:1

2019-08-24T14:00:20.312Z| vmx| I125: Masking Feature: cpuid.F16C - Val:1

2019-08-24T14:00:20.312Z| vmx| I125: Masking Feature: cpuid.FSGSBASE - Val:1

2019-08-24T14:00:20.312Z| vmx| I125: Masking Feature: cpuid.DS - Val:1

2019-08-24T14:00:20.312Z| vmx| I125: Masking Feature: cpuid.RDTSCP - Val:1

2019-08-24T14:00:20.312Z| vmx| I125: Masking Feature: cpuid.LAHF64 - Val:1

2019-08-24T14:00:20.312Z| vmx| I125: Masking Feature: cpuid.IBPB - Val:1

2019-08-24T14:00:20.312Z| vmx| I125: Masking Feature: hv.capable - Val:1

2019-08-24T14:00:20.312Z| vmx| I125: Masking Feature: cpuid.CMPXCHG16B - Val:1

2019-08-24T14:00:20.312Z| vmx| I125: Masking Feature: cpuid.SMEP - Val:1

2019-08-24T14:00:20.312Z| vmx| I125: Masking Feature: cpuid.SSE42 - Val:1

2019-08-24T14:00:20.312Z| vmx| I125: Masking Feature: cpuid.XCR0_MASTER_SSE - Val:1

2019-08-24T14:00:20.312Z| vmx| I125: Masking Feature: cpuid.IBRS - Val:1

2019-08-24T14:00:20.312Z| vmx| I125: Masking Feature: cpuid.NUMLEVELS - Val:0xd

2019-08-24T14:00:20.312Z| vmx| I125: Masking Unknown Feature: Max:0

2019-08-24T14:00:20.315Z| vmx| I125: hostCPUID vendor: GenuineIntel

2019-08-24T14:00:20.315Z| vmx| I125: hostCPUID family: 0x6 model: 0x3f stepping: 0x2

2019-08-24T14:00:20.315Z| vmx| I125: hostCPUID codename: Haswell EP/EN/EX

2019-08-24T14:00:20.315Z| vmx| I125: hostCPUID name: Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz

That VM was started on a host in a cluster set to EVC Ivy Bridge (L5) - but the log says the running VM is at Haswell (L6).

When I use vCenter and view the Cluster configuration, the EVC mode is set to Ivy-Bridge (L5). See screenshot.

pastedImage_5.png

When I use vCenter and view the host's VMs (after making the EVC mode column visible), the VM is shown as Sandy-Bridge (L4). See screenshot.

pastedImage_6.png

Three different EVC modes are reported! Obviously, the results do not agree.

Questions:

1. What is the truth? How can I verify the actual EVC setting of the VM?

2. To avoid Spectre/Meltdown EVC issues, which updates do I need on all hosts and all firmware so that vMotion completes without problems? Will this also fix the problems with EVC?

3 Replies
vXav
Expert

What you are seeing in the UI is correct.

The EVC level is defined when the VM boots. So most likely those VMs last booted on hosts with these CPUs.

The log says "hostCPUID codename: Haswell EP/EN/EX". It doesn't say that the VM's vCPU is Haswell. Since Sandy Bridge is older than both Haswell (the host CPU) and Ivy Bridge (the cluster's EVC level), the lowest level is used. VMs running at an older CPU level are compatible with newer CPU hosts, not the other way around (if that makes sense...).
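To make that distinction concrete, here is a minimal Python sketch that separates the "Masking Feature" entries (the EVC mask applied to the VM) from the "hostCPUID" entries (the physical host). It assumes the line layout shown in the vmware.log excerpt from the original post; the function name `split_mask_and_host` and the shortened `SAMPLE_LOG` are made up for illustration and are not part of any VMware tool.

```python
import re

# A few lines copied from the vmware.log excerpt in the original post.
SAMPLE_LOG = """\
2019-08-24T14:00:20.312Z| vmx| I125: FeatureCompat: EVC masks:
2019-08-24T14:00:20.312Z| vmx| I125: Masking Feature: cpuid.MODEL - Val:0x3a
2019-08-24T14:00:20.312Z| vmx| I125: Masking Feature: cpuid.AVX - Val:1
2019-08-24T14:00:20.312Z| vmx| I125: Masking Feature: cpuid.FAMILY - Val:6
2019-08-24T14:00:20.315Z| vmx| I125: hostCPUID family: 0x6 model: 0x3f stepping: 0x2
2019-08-24T14:00:20.315Z| vmx| I125: hostCPUID codename: Haswell EP/EN/EX
"""

# Assumed patterns, derived only from the excerpt above.
MASK_RE = re.compile(r"Masking Feature: (\S+) - Val:(\S+)")
HOST_RE = re.compile(r"hostCPUID (\w+): (.*)")


def split_mask_and_host(log_text):
    """Return (evc_mask_features, host_cpu_fields) as two dicts."""
    masks, host = {}, {}
    for line in log_text.splitlines():
        m = MASK_RE.search(line)
        if m:
            masks[m.group(1)] = m.group(2)
            continue
        h = HOST_RE.search(line)
        if h:
            host[h.group(1)] = h.group(2).strip()
    return masks, host


masks, host = split_mask_and_host(SAMPLE_LOG)
print("EVC mask model:", masks.get("cpuid.MODEL"))
print("Host CPU:", host.get("family"), "/", host.get("codename"))
```

In the posted excerpt the mask carries cpuid.MODEL 0x3a while the host reports model 0x3f, so the two sections of the log describe different things and should not be read as one statement about the VM.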

The "post-meltdown" builds expose CPU instructions that were not present in pre-meltdown builds. Enabling EVC effectively hides this difference.

When I upgraded my hosts from a pre-meltdown to a post-meltdown patch, I had to enable EVC to avoid VMs booting on patched hosts that couldn't be moved to non-patched hosts as the patching was over several days.

Shrikant_Gavhan
Enthusiast

Below are the answers to your queries.

1. What is the truth? How can I verify the actual EVC setting of the VM?

Ans - The EVC settings you see in the cluster are the real ones.

The EVC mode shown in vmware.log is updated only after the VM reboots. This article explains it: Configure EVC on vSphere 6.5
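For a per-VM check before lowering a cluster's EVC level, the comparison can be sketched in Python as below. This is only an illustration under stated assumptions: the input is a plain mapping of VM name to EVC mode key (the vSphere API exposes a `minRequiredEVCModeKey` property on a VM's runtime info, which is one place such keys could come from), the VM names are hypothetical, and `vms_above_target` is a made-up helper, not a VMware function. The key order follows the Intel EVC levels mentioned in this thread (Westmere L3, Sandy Bridge L4, Ivy Bridge L5, Haswell L6).

```python
# Intel EVC mode keys in ascending order; list index == EVC level (L0..L7).
EVC_ORDER = [
    "intel-merom",        # L0
    "intel-penryn",       # L1
    "intel-nehalem",      # L2
    "intel-westmere",     # L3
    "intel-sandybridge",  # L4
    "intel-ivybridge",    # L5
    "intel-haswell",      # L6
    "intel-broadwell",    # L7
]


def vms_above_target(vm_modes, target_key):
    """Split VMs into those exceeding the target EVC level and those unknown.

    vm_modes maps VM name -> EVC mode key (or None if nothing is reported).
    Returns (too_high, unknown) as two sorted lists of VM names.
    """
    target_level = EVC_ORDER.index(target_key)
    too_high, unknown = [], []
    for name, key in vm_modes.items():
        if key not in EVC_ORDER:
            unknown.append(name)
        elif EVC_ORDER.index(key) > target_level:
            too_high.append(name)
    return sorted(too_high), sorted(unknown)


# Hypothetical inventory, for illustration only.
inventory = {
    "app01": "intel-westmere",
    "db01": "intel-haswell",
    "web01": "intel-sandybridge",
    "test01": None,
}
too_high, unknown = vms_above_target(inventory, "intel-sandybridge")
print("Need power-off before joining a Sandy Bridge (L4) cluster:", too_high)
print("No EVC key reported, check manually:", unknown)
```

VMs in the first list would need a power cycle on a correctly constrained host before the target level can apply to them; VMs in the second list are the "stealth" candidates worth inspecting by hand.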

2. To avoid Spectre/Meltdown EVC issues, which updates do I need on all hosts and all firmware so that vMotion completes without problems? Will this also fix the problems with EVC?

Ans - Please check the links below. (The VMware page routes to the blog mentioned here.)

VMware link: Vulnerabilities – How to Fix Meltdown and Spectre on VMware vSphere ? - VMware Blogs

Blog link: https://www.unixarena.com/2018/01/vulnerabilities-fix-meltdown-spectre-vmware-vsphere.html

pastedImage_3.png

Thanks and Regards, Shrikant Gavhane
GeoPerkins
Enthusiast

Thanks Shrikant for your reply, but I still can't find any relevant information in any of VMware's KBs or blogs regarding the impact of the Spectre/Meltdown patches on EVC. Your links did give me momentum to find what appears to be the final word from VMware on Spectre/Meltdown: https://kb.vmware.com/s/article/55806 and https://www.vmware.com/techpapers/2018/scheduler-options-vsphere67u2-perf.html But these articles by themselves do not provide guidance for troubleshooting errors such as:

Example Error

The virtual machine requires hardware features that are unsupported or disabled on the target host:

* General incompatibilities

If possible, use a cluster with Enhanced vMotion Compatibility (EVC) enabled; see KB article 1003212.

CPUID details: incompatibility at level 0x1 register 'ecx'.

Host bits: 0000:0010:1001:1000:0010:0010:0000:0011

Required:  x001:x11x:10x1:1xx0:xx10:xx1x:xx0x:xx11

So is there a way to report the VM guest's required bits and also report the hosts' advertised bits? It seems as if the GUI and vmware.log don't give these bits. I can't seem to find an esxcli or PowerCLI command to report them. I'm not sure whether the bits are part of the Spectre/Meltdown changes or whether something is misconfigured elsewhere in the vSphere EVC mode settings. This is a head-scratcher.
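The two bit strings in the error message can at least be decoded by hand. The Python sketch below compares them position by position, assuming the strings are written most significant bit first (bit 31 on the left) and that "x" means "don't care"; both are assumptions about the message format rather than something the message states. The bit names are the standard CPUID leaf 1 ECX feature flags from Intel's documentation (only a subset is listed), and `mismatched_bits` is a made-up helper name.

```python
# Subset of CPUID leaf 0x1, register ECX feature flags (bit -> name).
LEAF1_ECX = {
    0: "SSE3", 1: "PCLMULQDQ", 9: "SSSE3", 12: "FMA", 13: "CMPXCHG16B",
    19: "SSE4.1", 20: "SSE4.2", 22: "MOVBE", 23: "POPCNT", 25: "AES",
    26: "XSAVE", 27: "OSXSAVE", 28: "AVX", 29: "F16C", 30: "RDRAND",
}


def mismatched_bits(host_bits, required):
    """Return [(bit, name, required_value, host_value)] where host != required.

    Assumes bit 31 is leftmost and 'x' in `required` means "don't care".
    """
    host = host_bits.replace(":", "")
    req = required.replace(":", "")
    if len(host) != 32 or len(req) != 32:
        raise ValueError("expected 32 bits in each string")
    result = []
    for i, (h, r) in enumerate(zip(host, req)):
        bit = 31 - i
        if r in "01" and r != h:
            result.append((bit, LEAF1_ECX.get(bit, "unknown"), r, h))
    return result


# Strings copied from the error message above.
HOST = "0000:0010:1001:1000:0010:0010:0000:0011"
REQUIRED = "x001:x11x:10x1:1xx0:xx10:xx1x:xx0x:xx11"

diff = mismatched_bits(HOST, REQUIRED)
for bit, name, want, have in diff:
    print(f"ECX bit {bit} ({name}): VM requires {want}, host offers {have}")
```

Under those assumptions the only mismatches are bit 28 (AVX) and bit 26 (XSAVE): the VM expects them and the target host does not offer them. That would point to a target without AVX support rather than to the Spectre/Meltdown bits (IBRS, IBPB, STIBP, SSBD), which are reported through a different CPUID leaf.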
