VMware Cloud Community
MattPietrek
Contributor

Can't migrate VM between ESXi 5.5 versions: The product version of the destination host does not support one or more CPU features

We're getting an error message that I'm trying to make sense of. I understand very well the notions of CPU compatibility, CPUID, masking and so forth, having worked on a competing hypervisor. However, this error seems erroneous (or needs better wording) based on my understanding:

Here's the message:

----

A general system error occurred: The product version of the destination host does not support one or more CPU features currently in use by the virtual machine.

Such features from CPUID level 0x1 register 'ecx' are indicated with a '1' bit: x00x:xxx0:xx0x:x000:11xx:00x0:00xx:11xx

----


This occurs when migrating between two ESXi hosts with the same physical processors, in this case an X5650 Westmere.


The source host is on ESXi build 2068190 (5.5). The destination host is on ESXi build 1474528.


The bits that it seems to be complaining about are:


DTES64

Monitor/MWait

xTPR

PDCM
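For anyone who wants to check the decoding, here is a rough Python sketch of how the bit pattern in the message maps to feature names. The name table is partial and follows the CPUID leaf 1 ECX layout in the Intel SDM; `flagged_bits` and `ECX_LEAF1_NAMES` are names made up for this illustration, not anything from VMware's tooling.

```python
# Partial table of CPUID leaf 1 ECX feature bits (bit numbering per the Intel SDM).
ECX_LEAF1_NAMES = {
    0: "SSE3", 1: "PCLMULQDQ", 2: "DTES64", 3: "MONITOR/MWAIT",
    4: "DS-CPL", 5: "VMX", 9: "SSSE3", 13: "CMPXCHG16B",
    14: "xTPR", 15: "PDCM", 19: "SSE4.1", 20: "SSE4.2",
    23: "POPCNT", 25: "AES",
}


def flagged_bits(pattern):
    """Return the bit numbers marked '1' in a pattern from the error message.

    The pattern lists bit 31 first and bit 0 last, in colon-separated nibbles.
    """
    bits = pattern.replace(":", "")
    if len(bits) != 32:
        raise ValueError("expected 32 bit positions")
    return sorted(31 - i for i, ch in enumerate(bits) if ch == "1")


pattern = "x00x:xxx0:xx0x:x000:11xx:00x0:00xx:11xx"
for bit in flagged_bits(pattern):
    print(bit, ECX_LEAF1_NAMES.get(bit, "unknown"))
```

Run against the pattern from the message, this prints bits 2, 3, 14 and 15 with the four names listed above.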

The punch line appears to be: "The product version of the destination host". However, I find it hard to believe that support for these features was added to ESXi between the two releases. If it was, VMware was certainly silent about it.

For what it's worth, we explicitly set the CPUID masks in our VMs (No, EVC isn't an option for us at this time.) Here's the mask:

cpuid.1.eax = "00000000000000100000011001010001"

cpuid.1.ecx = "00000010100110001110001000111111"

cpuid.1.edx = "10001111111010111111101111111111"

cpuid.80000001.ecx = "00000000000000000000000000000001"

cpuid.80000001.edx = "00101000000100000000100000000000"

cpuid.d.eax = "00000000000000000000000000000000"

cpuid.d.ecx = "00000000000000000000000000000000"

cpuid.d.edx = "00000000000000000000000000000000"

Note that the bits the message complains about (2, 3, 14, 15) *are* in fact forced to '1' in our cpuid.1.ecx mask. Meaning (as I understand it), ESXi won't run the VM unless the host processor supports the feature. And in our case, the VM will happily start on either node, with either version. It just won't migrate between them.
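A quick way to confirm that claim is to read the '1' positions straight out of the mask string. This is only a sketch: `forced_on_bits` is a hypothetical helper, and it assumes the mask is written with bit 31 first, as in the .vmx entries above.

```python
def forced_on_bits(mask):
    """Bit numbers set to '1' in a 32-character mask (bit 31 is written first)."""
    if len(mask) != 32:
        raise ValueError("expected a 32-character mask")
    return {31 - i for i, ch in enumerate(mask) if ch == "1"}


ecx_mask = "00000010100110001110001000111111"
reported = {2, 3, 14, 15}  # DTES64, MONITOR/MWAIT, xTPR, PDCM

# True when every bit named in the error is forced to '1' by the mask.
print(reported <= forced_on_bits(ecx_mask))

# The same mask as a hex value, handy for comparing against raw CPUID dumps.
print(hex(int(ecx_mask, 2)))
```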

So, long story short, is this an ESXi bug? Is the error message misleading? Am I not understanding something?

Thanks,


Matt

admin
Immortal

Can you even vMotion these machines between two systems running ESXi build 2068190?


Both ESXi builds do support MONITOR/MWAIT (though this feature is typically hidden for all but Mac OS X and ESXi guests).  However, neither ESXi build supports DTES64, xTPR or PDCM.  Specifying any of these three features should be sufficient to prevent any vMotion of the VM.

admin
Immortal

MattPietrek wrote:

> Note that the bits the message complains about (2, 3, 14, 15) *are* in fact forced to '1' in our cpuid.1.ecx mask. Meaning (as I understand it), ESXi won't run the VM unless the host processor supports the feature. And in our case, the VM will happily start on either node, with either version. It just won't migrate between them.

Actually, ESXi will allow you to specify any CPUID feature bits you like, including unsupported features. However, for vMotion compatibility, vCenter (VC) checks whether the hypervisor running on the destination host advertises support for the features.

> So, long story short, is this an ESXi bug? Is the error message misleading? Am I not understanding something?

The error message is pretty accurate.  Though the physical processor supports the features, the hypervisor is not capable of virtualizing them.

Is this an ESXi bug?  Probably.  These configurations shouldn't really power on at all, since they advertise CPU features that are not properly virtualized.

What are you trying to accomplish by specifying DTES64, xTPR and PDCM?

MattPietrek
Contributor

Thanks much for the quick answer. It's very helpful, at least with regard to expectations.

> Can you even vMotion these machines between two systems running ESXi build 2068190?

Indeed, we can vMotion between two 2068190 hosts. In fact, I can go between Westmere and Sandy Bridge hosts, if they're both on 2068190.

For what it's worth, the masks we specify in the .vmx are (to the best of my knowledge) something akin to VMware's EVC mask for Westmere.

Looking at that cpuid.1.ecx value, i.e. 00000010100110001110001000111111, it appears that the DTES64, Monitor/MWait, xTPR and PDCM bits are set in it.

Long story short, we're doing a "poor man's EVC" by specifying the mask we do. It's worked perfectly for supporting suspend/resume across hosts with different processors. But now with vMotion, it's causing problems.

Matt

MattPietrek
Contributor

Jim,

Wanted to follow up with you to make sure this thread doesn't go dormant.


I think we're on to something really important here. Specifically, (as I understand it) the Westmere EVC (L3) flags, when specified in a VM's config, can actually prevent migration of a VM between ESXi 5.5 versions.

Is there something I'm overlooking in this summation? And is there some way to force vMotion to ignore those bits?

Thanks,

Matt

admin
Immortal

What hardware version are your VMs?  I do recall that there were some bugs with the old-style EVC masks.  You may find that your masks work for hardware version 8 and older VMs.

The new-style Westmere EVC masks (for hardware version 9 and newer) are:

featMask.evc.cpuid.Intel = "Val:1"

featMask.evc.cpuid.FAMILY = "Val:6"

featMask.evc.cpuid.MODEL = "Val:0x25"

featMask.evc.cpuid.STEPPING = "Val:1"

featMask.evc.cpuid.NUMLEVELS = "Val:0xb"

featMask.evc.cpuid.NUM_EXT_LEVELS = "Val:0x80000008"

featMask.evc.cpuid.CMPXCHG16B = "Val:1"

featMask.evc.cpuid.DS = "Val:1"

featMask.evc.cpuid.LAHF64 = "Val:1"

featMask.evc.cpuid.LM = "Val:1"

featMask.evc.cpuid.MWAIT = "Val:1"

featMask.evc.cpuid.NX = "Val:1"

featMask.evc.cpuid.SS = "Val:1"

featMask.evc.cpuid.SSE3 = "Val:1"

featMask.evc.cpuid.SSSE3 = "Val:1"

featMask.evc.cpuid.SSE41 = "Val:1"

featMask.evc.cpuid.POPCNT = "Val:1"

featMask.evc.cpuid.RDTSCP = "Val:1"

featMask.evc.cpuid.SSE42 = "Val:1"

featMask.evc.cpuid.VMX = "Val:1"

featMask.evc.hv.capable = "Val:1"

featMask.evc.cpuid.AES = "Val:1"

featMask.evc.cpuid.PCLMULQDQ = "Val:1"

featMask.evc.vt.realmode = "Val:1"


These should be specified in the /etc/vmware/config file on the host.

Message was edited by: Jim Mattson -- Corrected the options.

MattPietrek
Contributor

Thanks Jim - This is incredibly helpful. Making progress here. Some answers & more questions:

> What hardware version are your VMs?

They are a mix of VMs between HW 7 and HW 10. We don't force our customers to upgrade to HW V10.

> You may find that your masks work for hardware version 8 and older VMs.

Interesting. So does this mean those "extra" 4 features (DTES64, etc.) are completely ignored by "old" HW versions, but HW v9, v10, and v11 are aware of them, and so run into vMotion compatibility check issues?

> ...some bugs with the old-style EVC masks

When you say "old-style", do you mean these guys: cpuid.1.ecx = "00000010100110001110001000111111"


Thanks again,

Matt


admin
Immortal

MattPietrek wrote:

> You may find that your masks work for hardware version 8 and older VMs.

> Interesting. So does this mean those "extra" 4 features (DTES64, etc.) are completely ignored by "old" HW versions, but HW v9, v10, and v11 are aware of them, and so run into vMotion compatibility check issues?

First, let's separate MWAIT from the others. MWAIT support was introduced for Mac OS X guests, which won't boot without it. However, virtualized MWAIT has a number of issues that make it generally undesirable. The monitor is implemented by page-based write-protection, rather than by the hardware cache-coherency system. This means waking up from a virtual MWAIT is quite a bit slower than waking up from a physical MWAIT. Moreover, since the monitor can be triggered by any access to the page, we report (in CPUID leaf 5) a minimum and maximum monitor line size of one page (4096 bytes) rather than one cache line (64 bytes). Most operating systems that use MONITOR/MWAIT are not clever enough to inspect the attributes reported in CPUID leaf 5. They assume cache line granularity, which may result in spurious wake events.

At any rate, MWAIT should not be a problem, unless you are running ESXi in a virtual machine without MWAIT support.

So, what about DTES64, xTPR, and PDCM?  Yes, these were included in the old EVC masks for Westmere.  However, these were bugs in the Westmere baseline definition, because the virtual CPU never supported these features.  In essence, any VM powered on in an EVC cluster was lying about its support for these features.

Before hardware version 9, the compatibility check was as you had previously surmised: the physical CPU had to support the features in use by the VM.

When we revamped EVC in hardware version 9, we made a few changes that couldn't be implemented for older hardware versions without breaking existing installations.  First, we audited all of the EVC baseline definitions and removed features that had been erroneously enabled (and required of the physical processor).  Second, we changed the compatibility checks so that a *virtual* CPU at the destination had to support the features rather than a *physical* CPU at the destination.

So, VC now uses a different compatibility check depending on the VM hardware version.

To make a long story short, if you want to emulate EVC with CPUID masks for hardware version 9 and later, you should probably derive the masks from the new-style EVC features that I listed previously.  It wouldn't hurt to drop the unimplemented features from the masks used for previous hardware versions if you want to use consistent masks across the board.

> When you say "old-style", do you mean these guys: cpuid.1.ecx = "00000010100110001110001000111111"

Yes, this style of masking should probably be considered as deprecated (though it's still the only way to mask certain features, like the processor brand string).

MattPietrek
Contributor

Thanks again Jim. You've provided exactly the sort of technical details I need in order to move forward. One more observation and a question:

Observation: I've noticed that all (or nearly all) VMs that have failed to migrate with the "The product version of the destination host does not support ....x000:x0x0:xx0x:x000:11x0:00x0:000x:11xx" message have the following things in common:

  • The guest OS is Windows XP or Windows Server 2003.
  • The migration is between two different ESXi versions.
  • The hardware version was v7 or v8.

Interestingly, I can migrate these VMs between hosts that have the same ESXi version. Any insight on this particular behavior, particularly with regard to the guest OS?

Next:

We specify our masking in the .vmx file, rather than on the host.

Question 1: Should the new-style EVC masks work by simply dropping them into the .vmx file exactly like this:  evc.featMask.cpuid.Intel = "Val:1"

Question 2: For features that are specifically disabled, do I need to explicitly set them to "Val:0"?

Question 3: Can I mix the old/new styles, so as to explicitly force the brand string?


Thanks again,


Matt


admin
Immortal

MattPietrek wrote:

> Question 1: Should the new-style EVC masks work by simply dropping them into the .vmx file exactly like this: evc.featMask.cpuid.Intel = "Val:1"

Oops.  I mixed up the prefix.  That should be featMask.evc rather than evc.featMask. 

You cannot specify EVC options in a .vmx file.  They will be ignored.

You can specify per-VM options with a 'featmask.vm' prefix, but the semantics are different.

Featmask.evc options are used to modify host capabilities.

Featmask.vm options are used to modify guest requirements.

Featmask.evc options only indirectly affect guest requirements, in that the guest requirements will normally be a subset of the (masked) host capabilities.

For example, if you specify featMask.evc.cpuid.MWAIT = "Val:1", that means that the host is capable of virtualizing MWAIT.  By default, most guests (except Mac OS X and ESXi types) will still have MWAIT masked out of their feature sets.  However, if you specify featMask.vm.cpuid.MWAIT = "Val:1", that means that the guest requires MWAIT virtualization.  Any guest with this setting will report the MWAIT capability in its CPUID info.
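The capability/requirement split described here can be pictured with a toy Python model. This is purely illustrative: the feature sets are hypothetical, `can_migrate` is a made-up helper, and the real ESXi/vCenter checks are more involved than set arithmetic.

```python
def can_migrate(guest_requirements, dest_host_capabilities):
    """A guest fits a destination if every feature it requires is offered there."""
    return guest_requirements <= dest_host_capabilities


# Hypothetical set of features a destination hypervisor reports it can virtualize.
dest_capabilities = {"SSE3", "SSSE3", "SSE41", "SSE42", "POPCNT", "AES", "MWAIT"}

# A guest that only requires features the destination can virtualize.
ordinary_guest = {"SSE3", "SSSE3", "SSE41", "SSE42", "POPCNT", "AES"}

# The same guest with a feature forced on that the destination does not offer.
guest_forcing_pdcm = ordinary_guest | {"PDCM"}

print(can_migrate(ordinary_guest, dest_capabilities))
print(can_migrate(guest_forcing_pdcm, dest_capabilities))
```

In this model, host-side masks would shrink `dest_capabilities`, while per-VM settings would change the guest's requirement set.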

> Question 2: For features that are specifically disabled, do I need to explicitly set them to "Val:0"?

You can leave off the Val:0 settings if you specify:

featureCompat.evc.completeMasks = "TRUE"


> Question 3: Can I mix the old/new styles, so as to explicitly force the brand string?

Yes.  Featmask.evc and the old-style cpuid masks do different things, and in combination they behave as you'd expect.  Featmask.vm masks are applied after the old-style cpuid masks, so the featMask.vm masks take precedence.

admin
Immortal

Jim Mattson wrote:

> Before hardware version 9, the compatibility check was as you had previously surmised: the physical CPU had to support the features in use by the VM.

Sorry.  This statement is incorrect.  All of that old EVC masking stuff is beginning to come back to me.

You said that you are specifying the following options in your .vmx file:

cpuid.1.eax = "00000000000000100000011001010001"

cpuid.1.ecx = "00000010100110001110001000111111"

cpuid.1.edx = "10001111111010111111101111111111"

cpuid.80000001.ecx = "00000000000000000000000000000001"

cpuid.80000001.edx = "00101000000100000000100000000000"

cpuid.d.eax = "00000000000000000000000000000000"

cpuid.d.ecx = "00000000000000000000000000000000"

cpuid.d.edx = "00000000000000000000000000000000"


It should have occurred to me that these are not EVC masks.  They are VM feature masks.

The EVC masks you are trying to emulate would be:

cpuidMask.1.eax = "00000000000000100000011001010001"

cpuidMask.1.ecx = "00000010100110001110001000111111"

cpuidMask.1.edx = "10001111111010111111101111111111"

cpuidMask.80000001.ecx = "00000000000000000000000000000001"

cpuidMask.80000001.edx = "00101000000100000000100000000000"

cpuidMask.d.eax = "00000000000000000000000000000000"

cpuidMask.d.ecx = "00000000000000000000000000000000"

cpuidMask.d.edx = "00000000000000000000000000000000"



There is a big difference between the cpuid options and the cpuidMask options.  The cpuid options are used to modify guest requirements and the cpuidMask options are used to modify host capabilities.  They are not interchangeable.  That, I believe, is the crux of your problem.

MattPietrek
Contributor

Jim,

Thanks yet again for all the info. I may not have been completely clear when I said "poor man's EVC". Here's the exact scenario.

We have pools of hosts. Some Westmere, some Sandy Bridge, some Ivy Bridge. These hosts are not clustered, and also (by definition) don't have EVC enabled on them.

We want our VMs to suspend/resume on any host. We've had that working for a while with the .vmx options I mentioned, i.e. "cpuid.1.eax=....."

We now want our VMs to be able to vMotion between hosts. For the most part they do, with the exception of moving VMs between ESXi versions.

Based on my understanding, I *think* we want each VM to be constrained to Westmere features, since these are our lowest-common-denominator hosts.

Following on that understanding, I *think* what we want (in the .vmx) is VM feature masks. I also *think* that the cpuidMask isn't relevant here.

Make sense? And does this change any of your advice?

Thanks,

Matt

admin
Immortal

To use VM feature masks in this way, you should replace all of the 1's in your masks with -'s:

cpuid.1.eax = "00000000000000-000000--00-0-000-"

cpuid.1.ecx = "000000-0-00--000---000-000------"

cpuid.1.edx = "-000-------0-0-------0----------"

cpuid.80000001.ecx = "0000000000000000000000000000000-"

cpuid.80000001.edx = "00-0-000000-00000000-00000000000"

cpuid.d.eax = "00000000000000000000000000000000"

cpuid.d.ecx = "00000000000000000000000000000000"

cpuid.d.edx = "00000000000000000000000000000000"


The zeroes will clear the features that are not available on your Westmere hosts, and the dashes will leave the other features alone.  The problem with the ones in your masks was that you forced some features on which would normally have been off.


admin
Immortal

Oops...With the exception of the family/model/stepping in cpuid.1.eax:

cpuid.1.eax = "00000000000000100000011001010001"
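For anyone with many .vmx files to update, the rewrite can be scripted. The following is a minimal Python sketch, assuming each entry sits on one line in the `key = "value"` form shown in this thread; `relax_forced_ones` is a hypothetical helper, not a VMware tool.

```python
def relax_forced_ones(vmx_lines, keep_verbatim=("cpuid.1.eax",)):
    """Rewrite '1' as '-' in old-style cpuid.* mask values.

    Keys listed in keep_verbatim pass through unchanged, because
    cpuid.1.eax carries family/model/stepping rather than feature bits.
    """
    out = []
    for line in vmx_lines:
        key, sep, value = line.partition("=")
        key = key.strip()
        if sep and key.startswith("cpuid.") and key not in keep_verbatim:
            # Only the quoted value is touched; the '1' in the key name stays.
            value = value.strip().strip('"').replace("1", "-")
            line = f'{key} = "{value}"'
        out.append(line)
    return out


original = [
    'cpuid.1.eax = "00000000000000100000011001010001"',
    'cpuid.1.ecx = "00000010100110001110001000111111"',
    'cpuid.1.edx = "10001111111010111111101111111111"',
]
for line in relax_forced_ones(original):
    print(line)
```

Lines that are not cpuid.* entries are returned untouched, so the helper can be run over a whole file's worth of lines.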

MattPietrek
Contributor

Jim,

I think we're really close here. However, in your latest suggested masks:

cpuid.1.eax = "00000000000000-000000--00-0-000-"

cpuid.1.ecx = "000000-0-00--000---000-000------"

cpuid.1.edx = "-000-------0-0-------0----------"

cpuid.80000001.ecx = "0000000000000000000000000000000-"

cpuid.80000001.edx = "00-0-000000-00000000-00000000000"

cpuid.d.eax = "00000000000000000000000000000000"

cpuid.d.ecx = "00000000000000000000000000000000"

cpuid.d.edx = "00000000000000000000000000000000"

There are '-' for MWAIT, DTES64, xTPR and PDCM.  While I'm guessing we'd be fine using them with ESXi 5.5 across all hosts, is it fair to say that some later ESXi (say, 6) could support them, in which case vMotion from a 6.0 to 5.5 box might not work?

Or to put it another way, would it be safer, version-compat-wise, to just 0 out those bits rather than '-' them?

Thanks,

Matt

admin
Immortal

Our policy is to introduce support for new virtual CPU features only with a new virtual hardware version. If some later ESXi supported these features, it would only be for some future hardware version not supported on ESXi 5.5.  So, I don't think you need to worry.

MattPietrek
Contributor

Hey Jim,

Sorry to dig up an old thread, but another question: After fixing the cpuid.1.ecx mask, the migration between versions doesn't complain about the four bits in question.

However, it now errors like this:

The product version of the destination host does not support one or more CPU features currently in use by the virtual machine. Such features from CPUID level 0x1 register 'edx' are indicated with a '1' bit: 100x:xxxx:xxx0:xxxx:xxxx:x0xx:xxxx:xxxx


This looks like the PBE (page break enable) feature, which has been around forever, as best I can tell.


Before I go mask it off, any ideas why this feature would trigger problems?


Thanks!


Matt


admin
Immortal

We have never virtualized PBE (pending break enable).  You should just mask it off.
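Masking it off means putting a '0' at the PBE position (CPUID leaf 1 EDX bit 31) in the cpuid.1.edx mask. Here is a small Python sketch of that edit, applied to the dash-style mask suggested earlier in the thread; `clear_bit` and `PBE_BIT` are hypothetical names used only for illustration.

```python
def clear_bit(mask, bit):
    """Return the mask with the given bit position forced to '0'.

    Masks are written with bit 31 first, so bit n lives at index 31 - n.
    """
    if len(mask) != 32 or not 0 <= bit <= 31:
        raise ValueError("expected a 32-character mask and a bit in 0..31")
    idx = 31 - bit
    return mask[:idx] + "0" + mask[idx + 1:]


PBE_BIT = 31  # CPUID leaf 1 EDX bit 31, Pending Break Enable

edx_mask = "-000-------0-0-------0----------"
print(clear_bit(edx_mask, PBE_BIT))
```

The result differs from the input only in its first character, which changes from '-' to '0'.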
