Hey jmattson (or similar brilliant VMware guru):
We have a VM that's failing to resume, despite my current understanding that it should. ESXi build is 2068190 on both hosts.
The VM was started on a Haswell:
2015-03-26T07:02:43.172Z| vmx| I120: FeatureCompat: No EVC masks.
2015-03-26T07:02:43.181Z| vmx| I120: hostCPUID vendor: GenuineIntel
2015-03-26T07:02:43.181Z| vmx| I120: hostCPUID family: 0x6 model: 0x3f stepping: 0x2
2015-03-26T07:02:43.181Z| vmx| I120: hostCPUID codename: Haswell EP/EN/EX
2015-03-26T07:02:43.181Z| vmx| I120: hostCPUID name: Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
With these CPUID masks in the VMX (they should match exactly what the UI reports as the "Westmere" mask):
2015-03-26T07:02:43.363Z| vmx| I120: DICT cpuid.1.eax = 00000000000000100000011001010001
2015-03-26T07:02:43.363Z| vmx| I120: DICT cpuid.1.ecx = 00000010100110001110001000111111
2015-03-26T07:02:43.363Z| vmx| I120: DICT cpuid.1.edx = 10001111111010111111101111111111
2015-03-26T07:02:43.363Z| vmx| I120: DICT cpuid.80000001.ecx = 00000000000000000000000000000001
2015-03-26T07:02:43.363Z| vmx| I120: DICT cpuid.80000001.edx = 00101000000100000000100000000000
2015-03-26T07:02:43.363Z| vmx| I120: DICT cpuid.d.eax = 00000000000000000000000000000000
2015-03-26T07:02:43.363Z| vmx| I120: DICT cpuid.d.ecx = 00000000000000000000000000000000
2015-03-26T07:02:43.363Z| vmx| I120: DICT cpuid.d.edx = 00000000000000000000000000000000
2015-03-26T07:02:43.363Z| vmx| I120: DICT checkpoint.disableCpuCheck = true
We then try to resume the VM on an older host (same CPUID masking in effect). The log identifies it as Westmere EP:
2015-03-26T08:14:47.044Z| vmx| I120: FeatureCompat: No EVC masks.
2015-03-26T08:14:47.044Z| vmx| I120: hostCPUID vendor: GenuineIntel
2015-03-26T08:14:47.044Z| vmx| I120: hostCPUID family: 0x6 model: 0x2c stepping: 0x2
2015-03-26T08:14:47.044Z| vmx| I120: hostCPUID codename: Westmere EP
2015-03-26T08:14:47.044Z| vmx| I120: hostCPUID name: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
And it fails to run:
2015-03-26T08:14:47.619Z| vmx| I120: FeatureCompat: No VM masks.
2015-03-26T08:14:47.619Z| vmx| I120: MonPmc: ctrBase 0xc1 selBase 0x186/1 PGC 1/1 SMM 1 drain 1 flush 0
2015-03-26T08:14:47.619Z| vmx| I120+ MonPmc: gen counters num: 4 width 48 write width 32
2015-03-26T08:14:47.619Z| vmx| I120+ MonPmc: fix counters num: 3 width 48
2015-03-26T08:14:47.619Z| vmx| I120+ MonPmc: unavailable counters: 0x600000000
2015-03-26T08:14:47.620Z| vmx| I120: CPT: Restoring checkpoint /vmfs/volumes/21fbd13f-cf16dd67/session-295fb5c4.vmss
2015-03-26T08:14:47.624Z| vmx| I120: DUMPER: Restoring checkpoint version 8.
2015-03-26T08:14:47.635Z| vmx| I120: guestCpuFeatures = 0x2000fd
2015-03-26T08:14:47.635Z| vmx| I120: Msg_Question:
2015-03-26T08:14:47.635Z| vmx| I120: [msg.checkpoint.cpucheck.fail.feature] The features supported by the processors in this machine are different from the features supported by the processors in the machine on which the virtual machine state was saved.
2015-03-26T08:14:47.635Z| vmx| I120: [msg.checkpoint.cpucheck.fail.hard] Resume on a machine with similar processors.
2015-03-26T08:14:47.635Z| vmx| I120: [msg.checkpoint.restore.cpufail] An error occurred while restoring the CPU state from file "/vmfs/volumes/21fbd13f-cf16dd67/session-295fb5c4.vmss".
2015-03-26T08:14:47.635Z| vmx| I120: [msg.checkpoint.resume.softError] Your virtual machine did not resume because of a correctable error. Preserve the suspended state and correct the error, or discard the suspended state.
Based on my understanding (Westmere is a subset of both Sandybridge and Haswell), it should have worked.
Thanks,
Matt
MattPietrek wrote:
Thanks Jim - this was incredibly helpful. Exactly the details I need to make informed choices about how to proceed.
A follow up question, if I may: When a VM fails to resume because of CPU compat, is there some way of knowing exactly what features the .vmss requires that the host can't satisfy?
If you enable debugging for the VM, you should see something like the following in the vmware.log file:
2015-03-27T16:04:57.196-07:00| vmx| I120: FeatureCompat: Capabilities:
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: vpmc.fixctr.1 = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: vpmc.genctr.6 = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.Intel = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.RDRAND = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.PDPE1GB = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.XSAVE = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: vpmc.fixctr.0 = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.LM = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: vpmc.genctr.7 = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.ENFSTRG = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.MWAIT = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.MOVBE = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: vpmc.numGenCtrs = 0x8
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: vpmc.fixedWidth = 0x30
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: vpmc.genctr.4 = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.VMX = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.ABM = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: vpmc.genWidth = 0x30
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: vpmc.version = 0x3
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.XCR0_MASTER_YMM_H = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: vpmc.fixctr.2 = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: vpmc.genctr.5 = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: misc.cpuidFaulting = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.PCID = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.FMA = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.AVX2 = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.XSAVEOPT = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: vpmc.microarchitecture.haswell = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.RTM = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: vpmc.genctr.2 = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.SSSE3 = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.SSE3 = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.NX = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.SSE41 = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.BMI2 = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.AES = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.PCLMULQDQ = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.SS = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.POPCNT = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.AVX = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: vpmc.genctr.3 = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: vt.realmode = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.F16C = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.FSGSBASE = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.DS = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: vpmc.genctr.0 = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.RDTSCP = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.LAHF64 = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: hv.capable = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.CMPXCHG16B = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.INVPCID = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.SMEP = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: vpmc.numFixedCtrs = 0x3
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: vpmc.genctr.1 = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.SSE42 = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.BMI1 = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.HLE = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.XCR0_MASTER_SSE = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: FeatureCompat: Requirements:
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.SSE3 - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.PCLMULQDQ - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.SSSE3 - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.FMA - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.CMPXCHG16B - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.PCID - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.SSE41 - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.SSE42 - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.MOVBE - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.POPCNT - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.AES - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.XSAVE - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.AVX - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.F16C - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.RDRAND - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.DS - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.SS - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.FSGSBASE - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.BMI1 - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.AVX2 - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.SMEP - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.BMI2 - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.ENFSTRG - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.INVPCID - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.XCR0_MASTER_SSE - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.XCR0_MASTER_YMM_H - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.XSAVEOPT - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.LAHF64 - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.ABM - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.NX - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.PDPE1GB - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.RDTSCP - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.LM - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.Intel - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: misc.cpuidFaulting - Bool:Min:1
Anything in the requirements list that's not in the capabilities list is a problem.
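That comparison is easy to script. Here is a minimal sketch, assuming the log lines keep the "Capability Found: NAME = VALUE" and "VM Features Required: NAME - Bool:Min:1" shapes shown above. The function name `missing_features` is made up for illustration, and only the boolean Min:1 requirement form is handled; real logs may contain other forms.

```python
import re
import sys

# Patterns mirror the line shapes in the excerpt above (an assumption about
# the general format; other requirement types are not handled here).
CAP_RE = re.compile(r"Capability Found: (\S+) = (0x[0-9a-fA-F]+)")
REQ_RE = re.compile(r"VM Features Required: (\S+) - Bool:Min:1")


def missing_features(log_text):
    """Return required features that the host did not report with a value >= 1."""
    capabilities = {}
    required = []
    for line in log_text.splitlines():
        cap = CAP_RE.search(line)
        if cap:
            capabilities[cap.group(1)] = int(cap.group(2), 16)
            continue
        req = REQ_RE.search(line)
        if req:
            required.append(req.group(1))
    # A Bool:Min:1 requirement is unmet if the capability is absent or zero.
    return sorted(name for name in required if capabilities.get(name, 0) < 1)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_features.py /path/to/vmware.log
    with open(sys.argv[1], errors="replace") as handle:
        for name in missing_features(handle.read()):
            print("required but not available:", name)
```

Run against the log from the host where the resume fails, this should print one line per feature the checkpoint needs and the host lacks.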
I suspect that you need to mask off leaf 7, as I mentioned towards the end of my last posting.
The Westmere masks that you refer to define host capabilities, not guest capabilities. In general, host capabilities are a superset of guest capabilities.
If you replace all of the 1's in your masks with -'s, I think you'll be okay.
Hey Jim,
I'm a little confused by your reply, and perhaps I'm misunderstanding something fundamental.
Here's exactly what we're putting in the .vmx file:
cpuid.1.eax = "00000000000000100000011001010001"
cpuid.1.ecx = "00000010100110001110001000111111"
cpuid.1.edx = "10001111111010111111101111111111"
cpuid.80000001.ecx = "00000000000000000000000000000001"
cpuid.80000001.edx = "00101000000100000000100000000000"
cpuid.d.eax = "00000000000000000000000000000000"
cpuid.d.ecx = "00000000000000000000000000000000"
cpuid.d.edx = "00000000000000000000000000000000"
Now, in a prior thread (https://communities.vmware.com/thread/503236), you replied this:
> There is a big difference between the cpuid options and the cpuidMask options. The cpuid options are used to modify guest requirements and the cpuidMask options are used to modify host capabilities. They are not interchangeable. That, I believe, is the crux of your problem.
If that's correct, then I think we are specifying the guest requirements. And, empirically, by altering these masks in the specified manner, we've successfully been able to suspend/resume/migrate across all Westmere/Sandybridge hosts. If we don't specify the masks this way, suspend/resume/migrate doesn't work.
Matt
MattPietrek wrote:
Hey Jim,
I'm a little confused by your reply, and perhaps I'm misunderstanding something fundamental.
Here's exactly what we're putting in the .vmx file:
cpuid.1.eax = "00000000000000100000011001010001"
cpuid.1.ecx = "00000010100110001110001000111111"
cpuid.1.edx = "10001111111010111111101111111111"
cpuid.80000001.ecx = "00000000000000000000000000000001"
cpuid.80000001.edx = "00101000000100000000100000000000"
cpuid.d.eax = "00000000000000000000000000000000"
cpuid.d.ecx = "00000000000000000000000000000000"
cpuid.d.edx = "00000000000000000000000000000000"
Now, in a prior thread (https://communities.vmware.com/thread/503236), you replied this:
> There is a big difference between the cpuid options and the cpuidMask options. The cpuid options are used to modify guest requirements and the cpuidMask options are used to modify host capabilities. They are not interchangeable. That, I believe, is the crux of your problem.
If that's correct, then I think we are specifying the guest requirements. And, empirically, by altering these masks in the specified manner, we've successfully been able to suspend/resume/migrate across all Westmere/Sandybridge hosts. If we don't specify the masks this way, suspend/resume/migrate doesn't work.
Yes, you are, in fact, specifying guest requirements with the cpuid options. However, the guest requirements that you are specifying cannot be met, because you are requiring features that we have never virtualized.
A physical Haswell CPU implements a set of features, H. A physical Westmere CPU implements a different set of features, W. Using cpuidMask options, you can make it look like your Haswell CPU implements W rather than H.
A VM started on a Haswell CPU implements only a subset of the available Haswell features, H'. A VM started on a Westmere CPU implements only a subset of the available Westmere features, W'. (A VM started on a Haswell system that is masquerading as a Westmere system also implements the feature set W'.)
You want your VMs to use only the W' features, even when started on a Haswell system, so that they can be warm-migrated to a Westmere system. However, the cpuid options you are specifying are for the host feature set W, which is a superset of the usual guest feature set W'.
The derivation of W' from W depends on several factors, including virtual hardware version, guest OS type, and cpuid options.
cpuid.1.eax specifies the family, model, and stepping of the processor. For this option, your specification is fine. It forces a valid Westmere family, model, and stepping.
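For anyone who wants to check a cpuid.1.eax value by hand, here is a small sketch that splits a fully specified bit string into family, model, and stepping. It follows the standard Intel layout of CPUID leaf 1 EAX (stepping in bits 3:0, model in 7:4, family in 11:8, extended model in 19:16, extended family in 27:20). The helper name `decode_signature` is made up for illustration, and it only accepts strings made of 0 and 1 (no '-' characters).

```python
def decode_signature(mask):
    """Split a 32-character 0/1 string for cpuid.1.eax into (family, model, stepping)."""
    bits = mask.replace(":", "")
    if len(bits) != 32 or set(bits) - {"0", "1"}:
        raise ValueError("expected 32 characters, each 0 or 1")
    eax = int(bits, 2)
    stepping = eax & 0xF
    model = (eax >> 4) & 0xF
    family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF
    # The extended model field only counts for family 0x6 and 0xF;
    # the extended family field only counts for family 0xF.
    if family in (0x6, 0xF):
        model |= ext_model << 4
    if family == 0xF:
        family += ext_family
    return family, model, stepping


family, model, stepping = decode_signature("00000000000000100000011001010001")
print(f"family: {family:#x} model: {model:#x} stepping: {stepping:#x}")
```

For the value in the .vmx above this gives family 0x6, model 0x25, stepping 0x1.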
cpuid.1.ecx is a 32-bit array indicating the availability of a variety of features. Because of the 1's in your configuration option, you have specified that the following features are supported by the guest:
0 - SSE3
1 - PCLMULQDQ
2 - DTES64
3 - MONITOR
4 - DS-CPL
5 - VMX
9 - SSSE3
13 - CMPXCHG16B
14 - xTPR Update Control
15 - PDCM
19 - SSE4.1
20 - SSE4.2
23 - POPCNT
25 - AESNI
Of these, we do not virtualize DTES64, DS-CPL, xTPR Update Control, or PDCM. We do virtualize MONITOR, but only by default for Mac OS X and ESXi guest OS types. Due to the poor performance of virtualized MONITOR/MWAIT, it is not recommended to virtualize MONITOR for Linux guests, which will use it for processor scheduling. We do virtualize VMX, but only when the option vhv.enable=TRUE is specified on an Intel host.
Because of the 0's in your configuration option, you have specified that the remaining features enumerated in cpuid.1.ecx are not supported by the guest. That should not cause migration issues, but may be sub-optimal. For example, we typically virtualize x2APIC on all hosts, regardless of whether or not x2APIC is supported by the physical CPU. A guest OS that is cognizant of x2APIC can generally achieve better performance by using the feature.
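To see which bits a given mask string forces on or off, a few lines of Python are enough. This is only a sketch: it assumes the leftmost character of the string is bit 31, and it treats any character other than 0 or 1 (such as '-') as "not forced". The helper name `forced_bits` is hypothetical.

```python
def forced_bits(mask):
    """Return (forced_on, forced_off) bit numbers for a 32-character mask string."""
    chars = mask.replace(":", "")
    if len(chars) != 32:
        raise ValueError("expected 32 mask characters")
    # Leftmost character is bit 31, rightmost is bit 0.
    forced_on = sorted(31 - i for i, c in enumerate(chars) if c == "1")
    forced_off = sorted(31 - i for i, c in enumerate(chars) if c == "0")
    return forced_on, forced_off


on, off = forced_bits("00000010100110001110001000111111")
print("forced on: ", on)
print("forced off:", off)
```

For the cpuid.1.ecx value from the .vmx, the forced-on bits come out as 0, 1, 2, 3, 4, 5, 9, 13, 14, 15, 19, 20, 23, 25, which matches the feature list above; every other bit is forced off.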
I believe that what you actually want to specify is that the virtual CPU should not implement the following post-Westmere features: PCID (bit 17), MOVBE (bit 22), TSC-Deadline (bit 24), XSAVE (bit 26), AVX (bit 28), F16C (bit 29), RDRAND (bit 30). For the rest of the features, you should let the normal derivation take place. This suggests the following configuration option:
cpuid.1.ecx = "-000:-0-0:-0--:--0-:----:----:----:----"
In other words, you should force only the post-Westmere features to be off, and you should not force any features to be on.
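If you would rather generate such an option than type it by hand, a sketch like the following builds the string from a list of bit numbers to force off and leaves everything else as '-'. The helper name `mask_forcing_off` and the dictionary are made up for illustration; the bit numbers are the ones listed above.

```python
def mask_forcing_off(bits_off):
    """Build a colon-grouped 32-character mask: '0' for listed bits, '-' elsewhere."""
    chars = ["-"] * 32
    for bit in bits_off:
        if not 0 <= bit <= 31:
            raise ValueError(f"bit out of range: {bit}")
        chars[31 - bit] = "0"  # leftmost character is bit 31
    text = "".join(chars)
    return ":".join(text[i:i + 4] for i in range(0, 32, 4))


# Post-Westmere cpuid.1.ecx features named in the post above.
POST_WESTMERE_ECX = {
    "PCID": 17, "MOVBE": 22, "TSC-Deadline": 24, "XSAVE": 26,
    "AVX": 28, "F16C": 29, "RDRAND": 30,
}

print('cpuid.1.ecx = "%s"' % mask_forcing_off(POST_WESTMERE_ECX.values()))
```

This prints the same cpuid.1.ecx option as the one given above.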
I don't believe there are any post-Westmere features in cpuid.1.edx, cpuid.80000001.ecx or cpuid.80000001.edx, so you shouldn't need those options.
Clearing out the 'd' leaf is reasonable, since all of the options in the 'd' leaf are post-Westmere. An alternative is to use a mask for cpuid.0.eax to reduce the maximum input value for basic CPUID information. I believe Westmere systems had a maximum input value of 0xb, suggesting the following option:
cpuid.0.eax = "0000:0000:0000:0000:0000:0000:0000:1011"
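A fully specified register value like this one is just the number written out in binary, so it can be produced from the hex value. A minimal sketch (the helper name `value_to_mask` is made up for illustration):

```python
def value_to_mask(value):
    """Format a 32-bit register value as a colon-grouped binary mask string."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value does not fit in 32 bits")
    bits = format(value, "032b")  # zero-padded, most significant bit first
    return ":".join(bits[i:i + 4] for i in range(0, 32, 4))


print('cpuid.0.eax = "%s"' % value_to_mask(0xB))
```

With 0xb as input, this reproduces the cpuid.0.eax option shown above.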
Another problem you might be running into is leaf 7. I believe that all of leaf 7 is post-Westmere. Unfortunately, leaf 7 falls under the basic input values supported by Westmere, so the cpuid.0.eax trick won't work. To mask off the leaf 7 features introduced post-Westmere, you should use:
cpuid.7.eax = "0000:0000:0000:0000:0000:0000:0000:0000"
cpuid.7.ebx = "0000:0000:0000:0000:0000:0000:0000:0000"
cpuid.7.ecx = "0000:0000:0000:0000:0000:0000:0000:0000"
cpuid.7.edx = "0000:0000:0000:0000:0000:0000:0000:0000"
I hope that this helps to clarify things.
Thanks Jim - this was incredibly helpful. Exactly the details I need to make informed choices about how to proceed.
A follow up question, if I may: When a VM fails to resume because of CPU compat, is there some way of knowing exactly what features the .vmss requires that the host can't satisfy?
The error in vmware.log shows as:
2015-04-02T19:38:21.281Z| vmx| I120: guestCpuFeatures = 0x2002fd # -->>>>> How can I interpret these?????
2015-04-02T19:38:21.281Z| vmx| I120: Msg_Question:
2015-04-02T19:38:21.281Z| vmx| I120: [msg.checkpoint.cpucheck.fail.feature] The features supported by the processors in this machine are different from the features supported by the processors in the machine on which the virtual machine state was saved.
2015-04-02T19:38:21.281Z| vmx| I120: [msg.checkpoint.cpucheck.fail.hard] Resume on a machine with similar processors.
2015-04-02T19:38:21.281Z| vmx| I120: [msg.checkpoint.restore.cpufail] An error occurred while restoring the CPU state from file "/vmfs/volumes/7379c234-ed540caf/session-07cc2548.vmss".
2015-04-02T19:38:21.281Z| vmx| I120: [msg.checkpoint.resume.softError] Your virtual machine did not resume because of a correctable error. Preserve the suspended state and correct the error, or discard the suspended state.
2015-04-02T19:38:21.281Z| vmx| I120: ----------------------------------------
2015-04-02T19:38:21.733Z| vmx| I120: MsgQuestion: msg.checkpoint.resume.softError reply=0
2015-04-02T19:38:21.733Z| vmx| I120: Module CheckpointLate power on failed.
Earlier in the file, I see:
2015-04-02T19:38:20.865Z| vmx| I120: hostCpuFeatures = 0xc6000fd # (Not sure if this is relevant, and not sure how to interpret those bits, but grasping at straws).
This was suspended on Haswell, resuming on Westmere.
As for the mask in play for the above, prior to getting your reply, I had been playing with a more restrictive mask that:
Set DTES64, xTPR, PDCM to '0'
Set MWAIT to '-'
Set PBE to '-'
Here's the mask for this iteration:
2015-04-02T19:38:20.846Z| vmx| I120: DICT cpuid.0.eax = 00000000000000000000000000001011
2015-04-02T19:38:20.846Z| vmx| I120: DICT cpuid.1.eax = 00000000000000100000011001010001
2015-04-02T19:38:20.846Z| vmx| I120: DICT cpuid.1.ecx = 0000001010011000001000100011-011
2015-04-02T19:38:20.846Z| vmx| I120: DICT cpuid.1.edx = -0001111111010111111101111111111
2015-04-02T19:38:20.846Z| vmx| I120: DICT cpuid.80000001.ecx = 00000000000000000000000000000001
2015-04-02T19:38:20.846Z| vmx| I120: DICT cpuid.80000001.edx = 00101000000100000000100000000000
2015-04-02T19:38:20.846Z| vmx| I120: DICT cpuid.d.eax = 00000000000000000000000000000000
2015-04-02T19:38:20.846Z| vmx| I120: DICT cpuid.d.ecx = 00000000000000000000000000000000
2015-04-02T19:38:20.846Z| vmx| I120: DICT cpuid.d.edx = 00000000000000000000000000000000
2015-04-02T19:38:20.846Z| vmx| I120: DICT cpuid.80000002.0.eax = 0110:0101:0111:0100:0110:1110:0100:1001
2015-04-02T19:38:20.846Z| vmx| I120: DICT cpuid.80000002.0.ebx = 0010:1001:0101:0010:0010:1000:0110:1100
2015-04-02T19:38:20.846Z| vmx| I120: DICT cpuid.80000002.0.ecx = 0110:1111:0110:0101:0101:1000:0010:0000
2015-04-02T19:38:20.846Z| vmx| I120: DICT cpuid.80000002.0.edx = 0010:1001:0101:0010:0010:1000:0110:1110
2015-04-02T19:38:20.846Z| vmx| I120: DICT cpuid.80000003.0.eax = 0101:0101:0101:0000:0100:0011:0010:0000
2015-04-02T19:38:20.846Z| vmx| I120: DICT cpuid.80000003.0.ebx = 0010:0000:0010:0000:0010:0000:0010:0000
2015-04-02T19:38:20.846Z| vmx| I120: DICT cpuid.80000003.0.ecx = 0010:0000:0010:0000:0010:0000:0010:0000
2015-04-02T19:38:20.846Z| vmx| I120: DICT cpuid.80000003.0.edx = 0101:1000:0010:0000:0010:0000:0010:0000
2015-04-02T19:38:20.846Z| vmx| I120: DICT cpuid.80000004.0.eax = 0011:0000:0011:0101:0011:0110:0011:0101
2015-04-02T19:38:20.846Z| vmx| I120: DICT cpuid.80000004.0.ebx = 0010:0000:0100:0000:0010:0000:0010:0000
2015-04-02T19:38:20.846Z| vmx| I120: DICT cpuid.80000004.0.ecx = 0011:0111:0011:0110:0010:1110:0011:0010
2015-04-02T19:38:20.846Z| vmx| I120: DICT cpuid.80000004.0.edx = 0000:0000:0111:1010:0100:1000:0100:0111
I promise to experiment with the approach you suggested, only masking off post-Westmere and using '-' for everything else. But for now, just need to understand what's happening with our current masking.
Thanks!
Matt
MattPietrek wrote:
Thanks Jim - this was incredibly helpful. Exactly the details I need to make informed choices about how to proceed.
A follow up question, if I may: When a VM fails to resume because of CPU compat, is there some way of knowing exactly what features the .vmss requires that the host can't satisfy?
If you enable debugging for the VM, you should see something like the following in the vmware.log file:
2015-03-27T16:04:57.196-07:00| vmx| I120: FeatureCompat: Capabilities:
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: vpmc.fixctr.1 = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: vpmc.genctr.6 = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.Intel = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.RDRAND = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.PDPE1GB = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.XSAVE = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: vpmc.fixctr.0 = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.LM = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: vpmc.genctr.7 = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.ENFSTRG = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.MWAIT = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.MOVBE = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: vpmc.numGenCtrs = 0x8
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: vpmc.fixedWidth = 0x30
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: vpmc.genctr.4 = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.VMX = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.ABM = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: vpmc.genWidth = 0x30
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: vpmc.version = 0x3
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.XCR0_MASTER_YMM_H = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: vpmc.fixctr.2 = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: vpmc.genctr.5 = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: misc.cpuidFaulting = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.PCID = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.FMA = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.AVX2 = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.XSAVEOPT = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: vpmc.microarchitecture.haswell = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.RTM = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: vpmc.genctr.2 = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.SSSE3 = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.SSE3 = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.NX = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.SSE41 = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.BMI2 = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.AES = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.PCLMULQDQ = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.SS = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.POPCNT = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.AVX = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: vpmc.genctr.3 = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: vt.realmode = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.F16C = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.FSGSBASE = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.DS = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: vpmc.genctr.0 = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.RDTSCP = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.LAHF64 = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: hv.capable = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.CMPXCHG16B = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.INVPCID = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.SMEP = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: vpmc.numFixedCtrs = 0x3
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: vpmc.genctr.1 = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.SSE42 = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.BMI1 = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.HLE = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: Capability Found: cpuid.XCR0_MASTER_SSE = 0x1
2015-03-27T16:04:57.196-07:00| vmx| I120: FeatureCompat: Requirements:
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.SSE3 - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.PCLMULQDQ - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.SSSE3 - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.FMA - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.CMPXCHG16B - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.PCID - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.SSE41 - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.SSE42 - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.MOVBE - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.POPCNT - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.AES - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.XSAVE - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.AVX - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.F16C - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.RDRAND - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.DS - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.SS - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.FSGSBASE - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.BMI1 - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.AVX2 - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.SMEP - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.BMI2 - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.ENFSTRG - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.INVPCID - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.XCR0_MASTER_SSE - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.XCR0_MASTER_YMM_H - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.XSAVEOPT - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.LAHF64 - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.ABM - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.NX - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.PDPE1GB - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.RDTSCP - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.LM - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: cpuid.Intel - Bool:Min:1
2015-03-27T16:04:57.196-07:00| vmx| I120: VM Features Required: misc.cpuidFaulting - Bool:Min:1
Anything in the requirements list that's not in the capabilities list is a problem.
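The comparison described above can be scripted rather than done by eye. This is a minimal sketch, assuming the log lines look exactly like the "Capability Found:" and "VM Features Required:" lines quoted in this thread and that every requirement is of the "Bool:Min:1" kind; the function name `missing_features` is made up for illustration, and other rule types (if they exist) are not handled.

```python
import re

# Matches lines such as "... I120: Capability Found: cpuid.SSE42 = 0x1".
FOUND_RE = re.compile(r"Capability Found:\s+(\S+)\s*=\s*(\S+)")
# Matches lines such as "... I120: VM Features Required: cpuid.AVX2 - Bool:Min:1".
REQUIRED_RE = re.compile(r"VM Features Required:\s+(\S+)\s+-\s+(\S+)")


def missing_features(log_lines):
    """Return required feature names with no non-zero capability (hypothetical helper)."""
    found = {}
    required = []
    for line in log_lines:
        m = FOUND_RE.search(line)
        if m:
            # Capability values are logged as hex, e.g. 0x1 or 0x3.
            found[m.group(1)] = int(m.group(2), 16)
            continue
        m = REQUIRED_RE.search(line)
        if m:
            required.append(m.group(1))
    # Assumption: every rule is "Bool:Min:1", so any non-zero capability satisfies it.
    return [name for name in required if found.get(name, 0) == 0]
```

Feeding it the lines of a vmware.log (for example `missing_features(open(path, errors="replace"))`, where `path` is wherever the log lives) should list the features the destination host cannot supply.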
Here's the mask for this iteration:
2015-04-02T19:38:20.846Z| vmx| I120: DICT cpuid.0.eax = 00000000000000000000000000001011
2015-04-02T19:38:20.846Z| vmx| I120: DICT cpuid.1.eax = 00000000000000100000011001010001
2015-04-02T19:38:20.846Z| vmx| I120: DICT cpuid.1.ecx = 0000001010011000001000100011-011
2015-04-02T19:38:20.846Z| vmx| I120: DICT cpuid.1.edx = -0001111111010111111101111111111
2015-04-02T19:38:20.846Z| vmx| I120: DICT cpuid.80000001.ecx = 00000000000000000000000000000001
2015-04-02T19:38:20.846Z| vmx| I120: DICT cpuid.80000001.edx = 00101000000100000000100000000000
2015-04-02T19:38:20.846Z| vmx| I120: DICT cpuid.d.eax = 00000000000000000000000000000000
2015-04-02T19:38:20.846Z| vmx| I120: DICT cpuid.d.ecx = 00000000000000000000000000000000
2015-04-02T19:38:20.846Z| vmx| I120: DICT cpuid.d.edx = 00000000000000000000000000000000
2015-04-02T19:38:20.846Z| vmx| I120: DICT cpuid.80000002.0.eax = 0110:0101:0111:0100:0110:1110:0100:1001
2015-04-02T19:38:20.846Z| vmx| I120: DICT cpuid.80000002.0.ebx = 0010:1001:0101:0010:0010:1000:0110:1100
2015-04-02T19:38:20.846Z| vmx| I120: DICT cpuid.80000002.0.ecx = 0110:1111:0110:0101:0101:1000:0010:0000
2015-04-02T19:38:20.846Z| vmx| I120: DICT cpuid.80000002.0.edx = 0010:1001:0101:0010:0010:1000:0110:1110
2015-04-02T19:38:20.846Z| vmx| I120: DICT cpuid.80000003.0.eax = 0101:0101:0101:0000:0100:0011:0010:0000
2015-04-02T19:38:20.846Z| vmx| I120: DICT cpuid.80000003.0.ebx = 0010:0000:0010:0000:0010:0000:0010:0000
2015-04-02T19:38:20.846Z| vmx| I120: DICT cpuid.80000003.0.ecx = 0010:0000:0010:0000:0010:0000:0010:0000
2015-04-02T19:38:20.846Z| vmx| I120: DICT cpuid.80000003.0.edx = 0101:1000:0010:0000:0010:0000:0010:0000
2015-04-02T19:38:20.846Z| vmx| I120: DICT cpuid.80000004.0.eax = 0011:0000:0011:0101:0011:0110:0011:0101
2015-04-02T19:38:20.846Z| vmx| I120: DICT cpuid.80000004.0.ebx = 0010:0000:0100:0000:0010:0000:0010:0000
2015-04-02T19:38:20.846Z| vmx| I120: DICT cpuid.80000004.0.ecx = 0011:0111:0011:0110:0010:1110:0011:0010
2015-04-02T19:38:20.846Z| vmx| I120: DICT cpuid.80000004.0.edx = 0000:0000:0111:1010:0100:1000:0100:0111
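For anyone reading along, the cpuid.80000002 through cpuid.80000004 entries above are the processor brand string. A rough sketch of decoding them, assuming the masks are written most-significant bit first with ':' as a purely cosmetic separator, and using the standard CPUID layout (EAX, EBX, ECX, EDX in order, each holding four ASCII bytes, low byte first). The helper names are invented for this example.

```python
# The twelve brand-string masks from the log above, in leaf/register order.
BRAND_MASKS = [
    "0110:0101:0111:0100:0110:1110:0100:1001",  # 80000002 eax
    "0010:1001:0101:0010:0010:1000:0110:1100",  # 80000002 ebx
    "0110:1111:0110:0101:0101:1000:0010:0000",  # 80000002 ecx
    "0010:1001:0101:0010:0010:1000:0110:1110",  # 80000002 edx
    "0101:0101:0101:0000:0100:0011:0010:0000",  # 80000003 eax
    "0010:0000:0010:0000:0010:0000:0010:0000",  # 80000003 ebx
    "0010:0000:0010:0000:0010:0000:0010:0000",  # 80000003 ecx
    "0101:1000:0010:0000:0010:0000:0010:0000",  # 80000003 edx
    "0011:0000:0011:0101:0011:0110:0011:0101",  # 80000004 eax
    "0010:0000:0100:0000:0010:0000:0010:0000",  # 80000004 ebx
    "0011:0111:0011:0110:0010:1110:0011:0010",  # 80000004 ecx
    "0000:0000:0111:1010:0100:1000:0100:0111",  # 80000004 edx
]


def mask_to_int(mask):
    """Convert a fully specified 32-bit mask (only 0/1, optional ':') to an int."""
    bits = mask.replace(":", "")
    if len(bits) != 32 or set(bits) - {"0", "1"}:
        raise ValueError("expected 32 literal 0/1 bits, got %r" % mask)
    return int(bits, 2)


def decode_brand(masks):
    """Join the registers low byte first and cut at the first NUL terminator."""
    raw = b"".join(mask_to_int(m).to_bytes(4, "little") for m in masks)
    return raw.split(b"\0", 1)[0].decode("ascii")


# Collapse the padding spaces so the result is easy to compare with the log.
print(" ".join(decode_brand(BRAND_MASKS).split()))
```

With whitespace collapsed, this decodes to the same X5650 name that the Westmere host reports as its hostCPUID name.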
I promise to experiment with the approach you suggested: masking off only the post-Westmere features and using '-' for everything else. But for now, I just need to understand what's happening with our current masking.
Thanks!
Matt
I suspect that you need to mask off leaf 7, as I mentioned towards the end of my last posting.
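To make the leaf 7 point concrete: several of the features in the requirements list (FSGSBASE, BMI1, AVX2, SMEP, BMI2, ENFSTRG, INVPCID) are reported in CPUID leaf 7, subleaf 0, register EBX, which none of the masks shown in this thread cover. Below is a hedged sketch that builds a 32-character mask value with '0' for the bits to hide and '-' for bits left alone, following the mask notation used elsewhere in this thread. The bit positions come from the Intel CPUID documentation; treating ENFSTRG as the "enhanced REP MOVSB/STOSB" bit (bit 9) is an assumption, and the exact .vmx key for leaf 7 is not shown here, so only the value is produced.

```python
# Bit positions in CPUID.(EAX=7, ECX=0):EBX per the Intel SDM.
LEAF7_EBX_BITS = {
    "FSGSBASE": 0,
    "BMI1": 3,
    "HLE": 4,
    "AVX2": 5,
    "SMEP": 7,
    "BMI2": 8,
    "ENFSTRG": 9,   # assumption: VMware's name for enhanced REP MOVSB/STOSB
    "INVPCID": 10,
    "RTM": 11,
}


def leaf7_ebx_mask(features_to_hide):
    """Build a mask string, bit 31 first: '0' hides a bit, '-' leaves it alone."""
    chars = ["-"] * 32
    for name in features_to_hide:
        bit = LEAF7_EBX_BITS[name]
        chars[31 - bit] = "0"   # leftmost character is bit 31
    return "".join(chars)


# Hide every leaf 7 EBX feature in the table above.
print(leaf7_ebx_mask(LEAF7_EBX_BITS))
```

Whether hiding these bits is enough for the resume to succeed is something only a test on the actual hosts can confirm.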
Actually, the feature compatibility information should be in the log file even without enabling debugging.
> Actually, the feature compatibility information should be in the log file even without enabling debugging.
Ah... I'd seen the extra information for some VMs, but not others. And when I did have it, it helped me figure out we had a host with AES disabled in the BIOS. Perhaps the "feature compat" default settings changed between ESXi 5.5 versions?
In any event, is there some way to force "debug" output via a .VMX or host setting? Google's not much help here.
Setting it via the UI doesn't work for us because our VMs aren't persistent on a given host. Our orchestrator deploys/runs/stops/undeploys without human intervention.
Thanks again - you've been most helpful!
Matt
The feature compatibility information is new with virtual hardware version 9. You are probably not seeing it on some VMs because they are HWv8 or older.
MattPietrek wrote:
In any event, is there some way to force "debug" output via a .VMX or host setting? Google's not much help here.
Probably not relevant to this particular issue, but this is the .vmx setting to force debug output:
vmx.buildType = debug
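Since the VMs in question are deployed by an orchestrator rather than by hand, the setting would have to be written into the .vmx file before power-on. A minimal sketch of doing that as a text transformation, assuming the usual one-setting-per-line .vmx layout with double-quoted values; `set_vmx_option` is a hypothetical helper, not part of any VMware tool.

```python
def set_vmx_option(vmx_text, key, value):
    """Return vmx_text with `key` set to `value`, replacing any existing entry."""
    new_line = '%s = "%s"' % (key, value)
    out = []
    replaced = False
    for line in vmx_text.splitlines():
        name = line.split("=", 1)[0].strip()
        # Compare case-insensitively so a differently cased duplicate is not left behind.
        if name.lower() == key.lower():
            if not replaced:
                out.append(new_line)
                replaced = True
            continue  # drop any further duplicates of the same key
        out.append(line)
    if not replaced:
        out.append(new_line)
    return "\n".join(out) + "\n"


example = 'displayName = "test-vm"\nvirtualHW.version = "9"\n'
print(set_vmx_option(example, "vmx.buildType", "debug"), end="")
```

The orchestrator would read the deployed .vmx, pass its contents through a function like this, and write the result back before starting or resuming the VM.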
Well, this is odd. Regarding the debug output (e.g. "Capability Found:" and "VM Features Required"): after many tests on the same host, it appears I only get that output when powering on a VM from a powered-off state.
If I resume the VM from suspend, I don't get those lines. Incredibly frustrating, as that's the situation where I most need to know what compat checks ESXi is doing.
Any other insights?
Thanks again,
Matt
Resuming from suspend is where the debug build type should help. It should tell you the requirements in the checkpoint file, at least.