VMware Cloud Community
justme00
Enthusiast

Nested ESXi 5.5 crashes with EPT misconfiguration

Hi,

I have ESXi 5.5 installed on a physical machine (CPU: Intel i7-4770S, 32 GB of RAM) that hosts 2 nested ESXi servers. All of them (physical and nested) run the latest ESXi 5.5 build, 1881737. From time to time, a nested ESXi crashes with:

[msg.log.error.unrecoverable] VMware ESX unrecoverable error: (vcpu-0) 2014-06-21T20:03:51.580Z| vcpu-0| I120+ vcpu-3:EPT misconfiguration:

On both nested ESXi VMs, CPU/MMU virtualization was first set to Automatic, and then to the last option (Use Intel VT-x ... and Intel EPT for MMU virtualization). The result was the same in both cases. The VMs use hardware version 9.

I have not tried the other 2 options yet. Could one of them be the answer to get rid of this?

I know this is not supported, but maybe someone has an idea of how to fix it. I found this VMware KB: Virtual machines abruptly shut down with an error similar to: MONITOR PANIC: vcpu-0:EPT m... , but it is not applicable :)

I'm also attaching the vmware.log, zdump and vmmcores.

Thank you for your time.

a_p_
Leadership

Discussion moved from VMware ESXi 5 to Nested Virtualization

admin
Immortal

Please try adding the following configuration option to /etc/vmware/config on the physical host:

monitor_control.disable_gphys_abit = TRUE
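
For anyone applying this from the ESXi shell, here is a minimal sketch of adding the option without duplicating it on repeated runs. The helper name add_config_option is hypothetical (it is not a VMware tool), and the quoted value form is an assumption taken from how the other entries in /etc/vmware/config are written. Back up the file before changing it.

```shell
#!/bin/sh
# Sketch only: append "key = "value"" to a VMware-style config file,
# but only if the key is not already present.
# add_config_option is an illustrative helper name, not a VMware command.
add_config_option() {
    file=$1
    key=$2
    value=$3
    # Look for the key at the start of a line, followed by "=".
    # Note: dots in the key are treated as regex wildcards here, which is
    # loose but sufficient for this check.
    if grep -q "^[[:space:]]*${key}[[:space:]]*=" "$file" 2>/dev/null; then
        return 0    # already present; leave the existing line untouched
    fi
    printf '%s = "%s"\n' "$key" "$value" >> "$file"
}

# Example usage on the physical host:
#   cp /etc/vmware/config /etc/vmware/config.bak
#   add_config_option /etc/vmware/config monitor_control.disable_gphys_abit TRUE
```

Running the helper twice leaves a single entry in the file, so it is safe to re-run after a reboot.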

justme00
Enthusiast

Thank you for the support. I've added it, and one of the nested ESXi hosts crashed again :) I'll reboot the physical host now, assuming that a reboot is needed for the option to take effect. This is what my /etc/vmware/config looks like on the physical host:

~ # cat /etc/vmware/config
libdir = "/usr/lib/vmware"
authd.proxy.nfc = "vmware-hostd:ha-nfc"
authd.proxy.nfcssl = "vmware-hostd:ha-nfcssl"
authd.proxy.vpxa-nfcssl = "vmware-vpxa:vpxa-nfcssl"
authd.proxy.vpxa-nfc = "vmware-vpxa:vpxa-nfc"
authd.fullpath = "/sbin/authd"
monitor_control.disable_gphys_abit = "TRUE"

What I forgot to mention is that the crashes usually happen when I perform actions on the VMs running on the nested ESXi (for example, Shut Down Guest on all of those VMs, or powering VMs on).

Any other suggestion is more than welcome! :)

Thank you again.

admin
Immortal

Rebooting the nested ESXi VMs is necessary for the option to take effect.  Rebooting the physical machine should not be necessary.  Make sure that the option doesn't disappear after a reboot.
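
To confirm that the option is still there after a reboot, a small check like the following can help. This is only a sketch: option_is_true is a hypothetical helper name. Accepting both the quoted and unquoted forms reflects the two spellings that appear in this thread; matching the value case-insensitively is purely a convenience assumption.

```shell
#!/bin/sh
# Sketch only: succeed (exit status 0) if the given key is set to TRUE
# in a VMware-style config file, with or without double quotes.
# option_is_true is an illustrative helper name, not a VMware command.
option_is_true() {
    file=$1
    key=$2
    # -i: ignore letter case (an assumption made here for convenience).
    # -E: extended regex, so "? means an optional double quote.
    grep -E -i -q "^[[:space:]]*${key}[[:space:]]*=[[:space:]]*\"?true\"?[[:space:]]*\$" "$file"
}

# Example usage on the physical host:
#   if option_is_true /etc/vmware/config monitor_control.disable_gphys_abit; then
#       echo "option still present"
#   else
#       echo "option missing; add it again"
#   fi
```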

justme00
Enthusiast

Thank you very much! It seems fixed: I powered the VMs on and off several times, with no issue so far. After some more testing, I will mark your answer as the solution for anyone who runs into this problem in the future :)

May I ask what the issue was, and what monitor_control.disable_gphys_abit = "TRUE" does? Only if you have time to give me some basic info, of course; if not, it doesn't matter.

Again, thank you very much!

admin
Immortal

We are still investigating this issue.  It only appears to happen on newer Intel CPUs which support accessed and dirty bits in the extended page tables.  The cause is still unknown.

The configuration option that I suggested disables the code in ESXi that uses accessed bits in the extended page tables (EPT) to identify regions of guest memory that are good candidates for promotion from 4K pages to 2M pages.  Note that this optimization was not even possible on older Intel CPUs, though it is available on all AMD CPUs that support RVI (AMD's equivalent of EPT).

justme00
Enthusiast

That is kind of weird, because on my physical and nested ESXi hosts I disabled the advanced setting Mem -> AllocGuestLargePage (parameter set to 0).

Therefore, I assume it should not "search" for candidates for promotion from 4K pages to 2M pages. Or am I wrong?

admin
Immortal

Unfortunately, I don't believe that the VMM knows about that setting in the vmkernel.  It will back off on large page requests if they are always denied, but it will still try.  If you disable large pages through the VMM option, "monitor_control.disable_mmu_largepages = TRUE", then it won't ever try to allocate them.  I'll file a bug report on this misbehavior.

Setting Mem -> AllocGuestLargePage to 0 may actually exacerbate the EPT misconfiguration issue.  I'll suggest that to those who are investigating this problem.  Thanks!

justme00
Enthusiast

"Setting Mem -> AllocGuestLargePage to 0 may actually exacerbate the EPT misconfiguration issue.  I'll suggest that to those who are investigating this problem. "


I already had that set to 0 from the beginning, and I still had the EPT misconfiguration issue :( But after setting the parameter that you suggested, everything seems stable. I've been running tests, reboots, etc., and I no longer have the issue.

So I'll mark your answer as the solution. Thank you again for your time and support!

admin
Immortal

justme00 wrote:

I already had that set to 0 from the beginning , and still I had the EPT misconfiguration issue :(

Yes.  I believe this setting actually makes the problem worse, which is why you seem to be having better luck reproducing it than we have had up until now.

florindespa
Enthusiast

Hi,

I think the bug is back in ESXi 6.0. I've attached the zdump and core file. Here is an excerpt from vmkernel.log:

2015-04-07T01:03:08.734Z| vcpu-1| W110: MONITOR PANIC: vcpu-2:EPT misconfiguration: PA 1e0efa000
2015-04-07T01:03:08.734Z| vcpu-1| I120: Core dump with build build-2494585
2015-04-07T01:03:08.734Z| vcpu-2| I120: Exiting vcpu-2
2015-04-07T01:03:08.734Z| vcpu-1| W110: Writing monitor corefile "/vmfs/volumes/55048f3b-177bdffb-a760-7c0507110edb/ESXi01/vmmcores.gz"
2015-04-07T01:03:08.736Z| vcpu-0| I120: Exiting vcpu-0
2015-04-07T01:03:08.736Z| vcpu-3| I120: Exiting vcpu-3

I will apply the workaround with monitor_control.disable_gphys_abit = TRUE and see how it goes.

Thank you.
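
For tracking how often this panic recurs, a log scan along these lines may be useful. It is a rough sketch: the helper names count_ept_panics and ept_panic_addresses are hypothetical, the log path in the usage comment is a placeholder, and the pattern is based only on the log lines quoted in this thread.

```shell
#!/bin/sh
# Sketch only: scan a log file for the "EPT misconfiguration" message
# quoted in this thread. Helper names are illustrative, not VMware tools.

# Print the number of lines that mention an EPT misconfiguration.
count_ept_panics() {
    grep -c 'EPT misconfiguration' "$1"
}

# Print the hexadecimal value reported after "PA" on each such line,
# one per line. Lines without a "PA <hex>" part are skipped.
ept_panic_addresses() {
    sed -n 's/.*EPT misconfiguration: PA \([0-9a-fA-F][0-9a-fA-F]*\).*/\1/p' "$1"
}

# Example usage (the path is a placeholder for the nested VM's log file):
#   count_ept_panics /path/to/vmware.log
#   ept_panic_addresses /path/to/vmware.log
```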

admin
Immortal

Your log file shows that your microcode is quite dated.  In fact, your CPU still has RTM support, which is broken on Haswell CPUs and should have been disabled by a microcode update on all production parts.  You might be running into Intel erratum HSD132.  I would suggest updating your BIOS.

florindespa
Enthusiast

Actually, I'm running the latest BIOS. The funny part is that the latest BIOS was released in October 2014, so it's not old, and this motherboard has received 12 BIOS updates in total since 2013. So they did not fix it in all this time? It is an Intel DQ87PG. I will try to ask for an update, but I'm quite confident that they will not answer :)

Is there anything that I can do from a software point of view? Some "magic parameters"? To be honest, I've only had this issue once, so maybe it will not happen again (or only very rarely; since this is just my test lab, I'm OK with that).

Thank you.