Solved: Re: Nested ESXi 5.5 crashes with EPT misconfiguration

justme00 · ‎06-21-2014

Hi,

I have ESXi 5.5 installed on a physical machine (CPU: Intel i7-4770s, 32 GB of RAM), on which I run 2 nested ESXi servers. All of them (physical and nested) run the latest ESXi 5.5 build 1881737, and from time to time a nested ESXi crashes with: [msg.log.error.unrecoverable] VMware ESX unrecoverable error: (vcpu-0) 2014-06-21T20:03:51.580Z| vcpu-0| I120+ vcpu-3:EPT misconfiguration:

On the 2 nested ESXi VMs, CPU/MMU virtualization was first set to Automatic, then to the last option (Use Intel VT-x ... and Intel EPT for MMU virtualization). The result was the same. The VMs use hardware version 9.

I have not tried the other 2 options yet. Might one of them get rid of this?

I know this is not supported, but maybe someone has an idea on how to fix it. I found this VMware KB: Virtual machines abruptly shut down with an error similar to: MONITOR PANIC: vcpu-0:EPT m... , but it is not applicable.

I'm also attaching the vmware.log, zdump and vmmcores.

Thank you for your time.

admin · ‎06-21-2014

Please try adding the following configuration option to /etc/vmware/config on the physical host:

monitor_control.disable_gphys_abit = TRUE

a_p_ · ‎06-21-2014

Discussion moved from VMware ESXi 5 to Nested Virtualization

justme00 · ‎06-21-2014

Thank you for the support. I've added it, and one of the nested ESXi VMs crashed again. I'll reboot the physical host now, assuming that a reboot is needed for the option to take effect. This is what my /etc/vmware/config looks like on the physical host:

~ # cat /etc/vmware/config
libdir = "/usr/lib/vmware"
authd.proxy.nfc = "vmware-hostd:ha-nfc"
authd.proxy.nfcssl = "vmware-hostd:ha-nfcssl"
authd.proxy.vpxa-nfcssl = "vmware-vpxa:vpxa-nfcssl"
authd.proxy.vpxa-nfc = "vmware-vpxa:vpxa-nfc"
authd.fullpath = "/sbin/authd"
monitor_control.disable_gphys_abit = "TRUE"

What I forgot to mention is that the crashes usually occur when I perform actions on the VMs running on the nested ESXi (for example, Shut Down Guest on all VMs running on the nested ESXi, or powering on VMs).

Any other suggestions are more than welcome!

Thank you again.

admin · ‎06-21-2014

Rebooting the nested ESXi VMs is necessary for the option to take effect. Rebooting the physical machine should not be necessary. Make sure that the option doesn't disappear after a reboot.
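A quick way to confirm that the option is still present after a reboot is to parse the file and re-add the line if it is missing. The following is only a minimal sketch: the helper names `parse_config` and `ensure_option` are hypothetical, the parser assumes the simple `key = "value"` layout shown in the `cat` output above, and nothing here is a VMware-provided tool.

```python
from pathlib import Path

# Option name taken from this thread; the helpers below are hypothetical.
OPTION = "monitor_control.disable_gphys_abit"


def parse_config(text):
    """Parse 'key = value' lines into a dict, dropping optional double quotes."""
    options = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # ignore blank lines and anything that is not key = value
        key, _, value = line.partition("=")
        options[key.strip()] = value.strip().strip('"')
    return options


def ensure_option(path, key=OPTION, value="TRUE"):
    """Make sure the file contains key = "value". Return True if the file was changed."""
    path = Path(path)
    text = path.read_text() if path.exists() else ""
    if parse_config(text).get(key, "").upper() == value.upper():
        return False  # already set, nothing to do
    # Drop any existing line for this key, then append the desired setting.
    kept = [l for l in text.splitlines() if l.partition("=")[0].strip() != key]
    kept.append(f'{key} = "{value}"')
    path.write_text("\n".join(kept) + "\n")
    return True
```

Calling `ensure_option("/etc/vmware/config")` on the physical host would leave the file untouched when the option is already there. The nested ESXi VMs still need to be rebooted for a newly added option to take effect, as noted above.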

justme00 · ‎06-21-2014

Thank you very much! It seems fixed: I powered the VMs on and off several times, with no issue so far. After some more testing, I will mark your answer as the solution for others who have this problem.

May I know what the issue was, and what monitor_control.disable_gphys_abit = "TRUE" does? If you have time, of course, to give me some basic info. If not, it does not matter.

Again, thank you very much!

admin · ‎06-21-2014

We are still investigating this issue. It only appears to happen on newer Intel CPUs which support accessed and dirty bits in the extended page tables. The cause is still unknown.

The configuration option that I suggested disables the code in ESXi that uses accessed bits in the extended page tables (EPT) to identify regions of guest memory that are good candidates for promotion from 4K pages to 2M pages. Note that this optimization was not even possible on older Intel CPUs, though it is available on all AMD CPUs that support RVI (AMD's equivalent of EPT).
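To make the idea of accessed-bit-driven promotion concrete, here is a toy model. It is not ESXi's heuristic, which the thread does not describe: the function name `promotion_candidates`, the 90% threshold, and the input format are all assumptions chosen for illustration. The only facts it relies on are the page sizes, so one 2M region covers 512 4K pages.

```python
from collections import Counter

PAGE_4K = 4 * 1024
PAGE_2M = 2 * 1024 * 1024
PAGES_PER_REGION = PAGE_2M // PAGE_4K  # 512 small pages per 2M region


def promotion_candidates(accessed_pfns, threshold=0.9):
    """Return the 2M region indexes whose 4K pages are mostly marked accessed.

    accessed_pfns: guest-physical 4K page frame numbers with the accessed bit set.
    threshold: hypothetical fraction of a region that must be accessed.
    """
    # Count distinct accessed 4K pages per 2M-aligned region.
    counts = Counter(pfn // PAGES_PER_REGION for pfn in set(accessed_pfns))
    return sorted(
        region
        for region, n in counts.items()
        if n / PAGES_PER_REGION >= threshold
    )
```

In this model, a region where all 512 small pages were touched is reported as a candidate, while a region with only a handful of touched pages is not. With the accessed-bit code disabled, as the option above does, the monitor would simply have no such information to act on.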

justme00 · ‎06-21-2014

That is kind of weird, because on my physical and nested ESXi hosts I disabled the advanced setting Mem -> AllocGuestLargePage (parameter set to 0).

Therefore, I assume it should not "search" for candidates for promotion from 4K pages to 2M pages. Or am I wrong?

admin · ‎06-21-2014

Unfortunately, I don't believe that the VMM knows about that setting in the vmkernel. It will back off on large page requests if they are always denied, but it will still try. If you disable large pages through the VMM option, "monitor_control.disable_mmu_largepages = TRUE", then it won't ever try to allocate them. I'll file a bug report on this misbehavior.

Setting Mem -> AllocGuestLargePage to 0 may actually exacerbate the EPT misconfiguration issue. I'll suggest that to those who are investigating this problem. Thanks!

justme00 · ‎06-21-2014

"Setting Mem -> AllocGuestLargePage to 0 may actually exacerbate the EPT misconfiguration issue. I'll suggest that to those who are investigating this problem. "

I already had that set to 0 from the beginning, and I still had the EPT misconfiguration issue. But after setting the parameter that you suggested, everything seems stable. I've been doing tests, reboots, etc., and I no longer have the issue.

So I'll mark your answer as the solution, and thank you again for your time and support!

admin · ‎06-21-2014

justme00 wrote:

I already had that set to 0 from the beginning, and I still had the EPT misconfiguration issue

Yes. I believe this setting actually makes the problem worse, which is why you seem to be having better luck reproducing it than we have had up until now.

florindespa · ‎04-06-2015

Hi,

I think the bug is back in ESXi 6.0. I've attached the zdump and core file. Pasting from vmkernel.log:

2015-04-07T01:03:08.734Z| vcpu-1| W110: MONITOR PANIC: vcpu-2:EPT misconfiguration: PA 1e0efa000
2015-04-07T01:03:08.734Z| vcpu-1| I120: Core dump with build build-2494585
2015-04-07T01:03:08.734Z| vcpu-2| I120: Exiting vcpu-2
2015-04-07T01:03:08.734Z| vcpu-1| W110: Writing monitor corefile "/vmfs/volumes/55048f3b-177bdffb-a760-7c0507110edb/ESXi01/vmmcores.gz"
2015-04-07T01:03:08.736Z| vcpu-0| I120: Exiting vcpu-0
2015-04-07T01:03:08.736Z| vcpu-3| I120: Exiting vcpu-3

I will try the workaround with monitor_control.disable_gphys_abit = TRUE and see how it goes.
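For anyone who wants to scan their own logs for this panic, the sketch below pulls the timestamp, the faulting vCPU, and the physical address out of lines shaped like the MONITOR PANIC line above. The regular expression is an assumption based only on that one line, and `find_ept_panics` is a hypothetical helper name; other builds may format the message differently (the 5.5 message at the top of this thread, for instance, shows no PA).

```python
import re

# Pattern modelled on the single MONITOR PANIC line quoted in this thread.
PANIC_RE = re.compile(
    r"^(?P<timestamp>\S+?)\|\s*(?P<thread>vcpu-\d+)\|\s*W\d+:\s*MONITOR PANIC:\s*"
    r"(?P<vcpu>vcpu-\d+):EPT misconfiguration:\s*PA\s+(?P<pa>[0-9a-fA-F]+)"
)


def find_ept_panics(lines):
    """Return one dict per EPT misconfiguration panic found in the given log lines."""
    panics = []
    for line in lines:
        match = PANIC_RE.match(line.strip())
        if match:
            panics.append(
                {
                    "timestamp": match["timestamp"],
                    "logging_thread": match["thread"],  # thread that wrote the line
                    "vcpu": match["vcpu"],              # vCPU named in the panic
                    "pa": int(match["pa"], 16),         # physical address as an int
                }
            )
    return panics
```

A typical use would be `find_ept_panics(open("vmware.log"))`, which ignores every line that does not match the pattern.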

Thank you.

admin · ‎04-07-2015

Your log file shows that your microcode is quite dated. In fact, your CPU still has RTM support, which is broken on Haswell CPUs and should have been disabled by a microcode update on all production parts. You might be running into Intel erratum HSD132. I would suggest updating your BIOS.

florindespa · ‎04-15-2015

Actually, I'm running the latest BIOS. The funny part is that the latest BIOS was released in October 2014, so it's not old, and in total this motherboard has received 12 BIOS updates since 2013. So they did not fix it in all this time? It is an Intel DQ87PG. I will try to ask for an update, but I'm fairly sure they will not answer.

Is there anything that I can do from a software point of view? Some "magic parameters"? To be honest, I've only had this issue once, so maybe it will not happen again (or only very rarely; since this is just my test lab, I'm OK with that).

Thank you.
