VMware Cloud Community
rene_bos
Contributor

VM boots into EFI environment after crash

Hi everyone,

I've been racking my brain over something that seems like a bug to me.

A few weeks ago, the ESXi host I was running some VMs on crashed (power failure). After powering the standalone host back up and starting the VMs, one of them didn't boot up properly.

This VM runs Windows Server 2012, and this time it booted into the EFI environment instead.

After some investigation, it turns out the EFI configuration is gone. When I open the Boot Maintenance Manager and choose "Configure boot options", no options are listed. When I add a new boot option and use the save function, the settings are lost instantly. Comparing the settings with other 2012 VMs I have running, a lot of configuration is missing, like the boot options and some driver options.

Neither option is saved after exiting the menu used to add it. When choosing "Boot from a file", I can start up the VM using the "NO VOLUME LABEL\EFI\Boot\bootx64.efi" file, but after rebooting, the VM is back at the EFI environment.

It's very hard to find information about the EFI firmware, so I hope someone here has some experience with it.

I even tried restoring the VM files from about two weeks before the crash, but even with those files the VM boots into the EFI environment with no settings at all. So maybe the problem was caused earlier; very weird! Not being able to save new settings to the EFI environment also seems very strange to me.

Thanks for any replies! 😃

14 Replies
dariusd
VMware Employee

Hi rene_bos, and welcome to the VMware Communities!

I have some experience with EFI... I'm one of the VMware developers who work on the virtual EFI firmware. 😀

I'd like to take a look at the vmware.log for that VM, and also the VM's ".nvram" file.  Could you look inside the VM's directory and grab the latest vmware.log and the NVRAM file, and attach them both to your reply here?  (Just use the Browse... button below to attach the two files, please!)

I'd also be interested to know if you're really using the Driver Options, and, if so, what you're using them for. Normally there are no Driver Options; they will only be present if you have manually added them, and we don't anticipate many users wanting to do that in a virtual machine except possibly for PCI passthrough or some 3rd-party EFI software.

Cheers,

--

Darius

rene_bos
Contributor

Hi Darius,

Wow; fast reply! 😃

Cool to see that developers are participating in the communities!

Anyway, I was about to upload the logs, but I see the "Enable logging" checkbox isn't ticked for this VM.

Does it help if I enable it now, then shut down and start up the VM?

The .nvram file is attached to this post.

Regarding the driver options: no, I don't use them, but it seems the Windows Server 2012 setup did create some entries on my other VMs 🙂.

Thanks!

René

dariusd
VMware Employee

Hi René,

If you can enable logging and restart the VM, that'd be great.  Let me know if that would be too disruptive.  I assume you will need to follow the same steps as you mentioned above in order to get the VM to boot into Windows when you restart it...

I hadn't noticed Windows Server 2012 using Driver Options before... Interesting.  I'll have to take a look at that.

Cheers,

--

Darius

rene_bos
Contributor

Hey Darius,

No problem at all. I just shut down the VM, enabled logging and booted it back into Windows using the manual boot-from-file option.

After the VM started up, I shut it down once more and exported the log file; it's attached to this post.

And, whoops, you are right: no driver options are active.

The only active boot option should be "Windows Boot Manager". Inside are multiple "devices":

EFI Virtual disk (0.0)

EFI VMware Virtual IDE CDROM Drive

EFI Internal Shell

EFI Network

That's what confused me about the driver options 😃

Thanks,

René

dariusd
VMware Employee

Hi René,

Hmmm... There is something very, very wrong with that NVRAM file.  It looks like the power outage has corrupted the NVRAM file in a way that confuses the firmware unrecoverably.  I'll look into this further next week (next year!) when I'm back in the office, and will file an internal bug report if I find something that we can make more robust.

Although I generally do not advise this for EFI virtual machines, in your situation your best plan of action is probably to power off the VM, delete the .nvram file from the VM's directory (or rename it to something else if you want to be extra careful), and then power it back on again.  A fresh .nvram file will be created automatically when the VM starts.  For most reasonably normal Windows installations, the VM should then automatically boot into the installed OS.

(For the benefit of other forum readers who might come across this thread, I must emphasize: Do not delete the NVRAM of an EFI virtual machine unless you understand the consequences!  It looks like it is appropriate to delete NVRAM in this particular situation and very few others.)
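If you prefer to script the rename rather than do it by hand in the datastore browser, something along these lines should work. This is only a rough sketch of the idea, not an official tool: the directory path below is a hypothetical placeholder you'd need to adjust, and it assumes the VM is powered off and that you run it somewhere that can see the VM's directory (the ESXi Shell usually has a Python interpreter available).

# Rough sketch only: rename a powered-off VM's .nvram file(s) so the firmware
# recreates a fresh one on the next power-on. VM_DIR is a placeholder --
# point it at your own VM's directory before running anything like this.
import glob
import os
import time

VM_DIR = "/vmfs/volumes/datastore1/MyWin2012VM"  # hypothetical example path

for nvram_path in glob.glob(os.path.join(VM_DIR, "*.nvram")):
    backup_path = nvram_path + ".corrupt-" + time.strftime("%Y%m%d-%H%M%S")
    os.rename(nvram_path, backup_path)  # keep the old file around for later analysis
    print("Renamed %s -> %s" % (nvram_path, backup_path))

Renaming rather than deleting means the corrupted file is still available if someone wants to inspect it afterwards.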

Please let me know if that gets things going again!

And I'll save one of our forum regulars, WoodyZ, from having to make his regular comment for this situation: If you haven't already done so, you may wish to consider investing in an uninterruptible power supply (UPS) to keep your system running through power failures or to at least allow an orderly shutdown instead of a hard crash. 😉

Cheers,

--

Darius

rene_bos
Contributor

Hey Darius,

Well, shoot me! Such an easy solution after hours of searching for how to restore the EFI configuration! 🙂

Deleted the .nvram file, started up the VM, and everything is working perfectly now!

Thanks a lot! 🙂

Cheers,

René

lejim
Contributor

Hi, I was going to file an SR after I just ran into this situation.

I have just set up a Veeam backup infrastructure, and the exact same problem appears randomly after a backup (so a snapshot) on my 2012 EFI VMs: they can't boot anymore since the .nvram file gets corrupted. I must delete it and let the ESXi host recreate it. So SureBackup is effectively inoperative, because the backed-up VMs can't boot without manual modification.

Have you got any update on this issue now?

Regards,

Edit:

My 2008 R2 VMs seem unaffected, though.

dariusd
VMware Employee

Wow... thanks for providing that information!  I'll add your observations to our internal bug report on the issue.

It sure seems like Veeam has some role to play in this corruption.

I'll investigate adding a workaround to our virtual EFI firmware to try to prevent or recover from this situation, but it's looking like we should report it to Veeam as well.

Cheers,

--

Darius

lejim
Contributor

I would not blame Veeam just yet: on those VMs I have never taken "regular" snapshots, so I'm not sure if it's Veeam-related.

If I get time tomorrow, I will deploy a new 2012 VM and play with it: regular snapshots and Veeam backups.

I'll post everything I can find out.

Regards

Edit: Using ESXi 5.1 with the latest patches.

dariusd
VMware Employee

Thanks, I'd greatly appreciate hearing what you discover.

What we know so far is that something is causing the EFI "Boot Options" (i.e. the entries visible in the EFI Boot Manager) to replicate until eventually the VM's EFI NVRAM runs out of space and the VM fails to boot.  It might be a few snapshots or a few VM reboot cycles before the problem can be first seen as multiple entries in EFI Boot Manager, and it might be many more snapshots or reboot cycles before the VM fails to boot due to insufficient NVRAM space.

A number of the instances of this failure that we've seen here (maybe all of them?) have Veeam in the picture in one way or another.  I haven't been able to replicate the problem here at VMware, although I have not yet had the opportunity to try with Veeam to test if there is an interaction there.  Your feedback suggests that I should definitely look more closely at the Veeam angle.
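If anyone wants to keep an eye on a suspect VM before it stops booting, a rough way to do it from inside the Windows guest is to count the firmware boot entries that "bcdedit /enum firmware" reports and watch for duplicates. The little Python sketch below is just my illustration of that idea (run it as Administrator inside the guest), not a supported tool.

# Rough illustration: count firmware boot entries reported by "bcdedit /enum firmware"
# and flag duplicates, which would match the replicating-Boot-Options symptom.
import subprocess
from collections import Counter

output = subprocess.check_output(
    ["bcdedit", "/enum", "firmware"], text=True, errors="replace"
)

descriptions = []
for line in output.splitlines():
    parts = line.split(None, 1)
    if len(parts) == 2 and parts[0].lower() == "description":
        descriptions.append(parts[1].strip())

for description, count in Counter(descriptions).most_common():
    print("%4d x %s" % (count, description))

Seeing the same entry dozens of times would suggest the NVRAM is filling up with duplicated boot options.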

Please do keep us posted on your findings!

Cheers,

--

Darius

lejim
Contributor

Hi,

I ran tests all day.

I set up a fresh new W2012 VM.

Took a lot of snapshots (quiesced and not quiesced) from the VI Client: no issue occurred.

Ran a "continuous" backup job from Veeam: no issue after 1 hour (nearly 30 snapshots), using application-aware guest processing.

Ran a "continuous" replication job: no issue after 30 replications, using application-aware guest processing.

Since the issue happens (right now) only on my 2012 AD domain controller, I set up the test VM as a DC.

Again, no issue after the same routine for 4 hours.

OK, not the best test protocol, but hey, I had some work to do 😉

Meanwhile, about 1 time in 5 my "real" DCs get their EFI corrupted...

Maybe it's not related to snapshots after all...

dariusd wrote: "What we know so far is that something is causing the EFI "Boot Options" (i.e. the entries visible in the EFI Boot Manager) to replicate until eventually the VM's EFI NVRAM runs out of space and the VM fails to boot. It might be a few snapshots or a few VM reboot cycles before the problem can be first seen as multiple entries in EFI Boot Manager, and it might be many more snapshots or reboot cycles before the VM fails to boot due to insufficient NVRAM space."

About that, I can say that rebooting does not seem to be the culprit, since my corrupted VMs are my DCs and they almost never get rebooted.

For instance, I had a VM get corrupted, fixed the NVRAM and booted it up, and it got corrupted again in my Veeam lab after some backups. It was also corrupted on the host: I rebooted it to see what would happen and tada, the NVRAM was dead.

dariusd
VMware Employee

Hi folks,

A positive but somewhat-belated update on this matter -- I should have posted this a few months ago...

The Veeam folks figured out the cause of the problem, and Veeam Backup & Replication 6.5 patch 3 contains a workaround for the issue: KB1751: Patch 3 Release Notes for Veeam Backup & Replication 6.5.  The underlying cause appears to have been a quirk in a Windows API call used by B&R.

If you continue to encounter this issue after updating, please post back here!

Thanks,

--

Darius

abruso
Contributor

I'm so glad I found this thread! I have been banging my head against this problem for the past week now (ever since updating to Veeam 7.0).

A little background on our problem. One host is using local storage. We have about 8 VMs, 5-6 of them Windows 2012. All but one use legacy BIOS. The one that uses EFI firmware is also a domain controller. About a month ago, we thought the VM had crashed; when it rebooted we got that boot manager menu, and no matter what we did we had to boot it manually using that NO VOLUME LABEL option. We didn't have any problems with Veeam backups or replications after that, but whenever we rebooted that VM we had to boot it up manually.

This past week, I updated Veeam to 7.0 R2. That's when all hell broke loose. During Veeam backup/replication jobs, our host would pink screen. I opened a case with Veeam, and they said there wasn't much they could do because their logs don't reveal anything about what's going on on the VMware host side. So I opened a case with VMware to see if they could help (SR 13403911911). The tech working with me really didn't have any ideas after looking at the logs, because we didn't have a scratch partition set up, so when the host crashed and rebooted the logs were pretty much empty.

Meanwhile, the Veeam engineer pointed me towards the article that states there is a bug in ESXi 5.x concerning e1000e vNIC drivers with Windows 2012 and how they can crash hosts, causing the PSOD. So I changed all the vNICs on the Windows 2012 VMs to vmxnet3. Since then we have had no more pink screens; however, the Veeam backup/replication job seems to come to a crawl when replicating the ONE VM that has the EFI boot menu issue. The Veeam server (which is also a VM) constantly loses its network connectivity while the job is running. It actually brings the host and all other VMs to a crawl too. As soon as I stop the job, everything goes back to normal.

It seems to me that all of our problems relate to this one VM that uses EFI firmware. I'm thinking maybe I should try deleting the NVRAM file as stated above? Any thoughts? Ideas? 🙂

Thanks.

dariusd
VMware Employee

Hi abruso,

If your EFI Windows VM was ever being backed up by VB&R prior to 6.5 Patch 3, its NVRAM could very well have been corrupted along the way. Upgrading Veeam does not fix any existing corruption; it merely prevents additional corruption from occurring. Your description suggests you are having the same problem as the others in this thread, in which case you should power down the VM and go ahead and rename/delete its .nvram file. It should boot automatically when next powered on.

I have no information on whether this would be related to the other issues you describe with backing up your EFI VM... it is possible, I guess... I suspect you will find out once you rename/delete the .nvram file and try again.

Cheers,

--

Darius
