VMware Cloud Community
kipz
Contributor

bootbank corruption on SAN-bootable ESXi 5

hello,

does anybody have any experience with SAN-bootable ESXi 5 on Cisco UCS B200 blades?

i have recently started to experience very nasty behavior. we are using UCS 2.0 with B200 M2 blade servers using M81KR FCoE converged network adapters. all blades are booting from EMC VNX storage (over FC). everything worked well until we started to patch those servers using VUM.

after applying patches and rebooting, ESXi usually does not reconnect to vCenter. it stays in the "disconnected" state even though the KVM shows that it is up and running. after a manual reconnect, some configuration data is usually lost, for example recently created port groups or recently applied patches, and the DVS goes out of sync between ESXi and vCenter. the result can differ every time, but in most cases some patches are lost as well. if i try to re-apply the missing patches, i get an error that esxupdate failed, and in /var/log/esxupdate.log i can see the following errors:

2012-04-30T09:27:59Z esxupdate: esxupdate: ERROR: An esxupdate error exception was caught:
2012-04-30T09:27:59Z esxupdate: esxupdate: ERROR: Traceback (most recent call last):
2012-04-30T09:27:59Z esxupdate: esxupdate: ERROR:   File "/usr/sbin/esxupdate", line 216, in main
2012-04-30T09:27:59Z esxupdate: esxupdate: ERROR:     cmd.Run()
2012-04-30T09:27:59Z esxupdate: esxupdate: ERROR:   File "/build/mts/release/bora-608089/bora/build/esx/release/python-2.6-lib-zip-stage/608089/visor/pylib/python2.6/site-packages/vmware/esx5update/Cmdline.py", line 144, in Run
2012-04-30T09:27:59Z esxupdate: esxupdate: ERROR:   File "/build/mts/release/bora-608089/bora/build/esx/release/python-2.6-lib-zip-stage/608089/visor/pylib/python2.6/site-packages/vmware/esximage/Installer/BootBankInstaller.py", line 614, in CheckBootState
2012-04-30T09:27:59Z esxupdate: esxupdate: ERROR: InstallationError: ('', 'Current bootbank /bootbank is not verified and most likely a serious problem was encountered during boot, it is not safe to continue altbootbank install. bootstate is 2, expected value is 0.')
2012-04-30T09:28:03Z esxupdate: BootBankInstaller.pyc: INFO: Unrecognized value "title=Loading VMware ESXi" in boot.cfg
2012-04-30T09:28:04Z esxupdate: HostImage: DEBUG: Live image has been updated but /altbootbank image has not.  This means a reboot is not safe.
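
to pull that error out of a longer log, something like the following could work. it is only a minimal sketch: it matches the "bootstate is N, expected value is M" wording seen above, and the function name and the sample list are illustrative, not part of any VMware tooling.

```python
import re

# Matches the wording of the InstallationError in the log excerpt above.
BOOTSTATE_RE = re.compile(r"bootstate is (\d+), expected value is (\d+)")


def find_bootstate_errors(lines):
    """Return (timestamp, actual, expected) for each bootstate mismatch."""
    hits = []
    for line in lines:
        match = BOOTSTATE_RE.search(line)
        if match:
            # The timestamp is the first space-separated field of the line.
            timestamp = line.split(" ", 1)[0]
            hits.append((timestamp, int(match.group(1)), int(match.group(2))))
    return hits


# Two lines copied from the excerpt above, used as sample input.
sample = [
    "2012-04-30T09:27:59Z esxupdate: esxupdate: ERROR: InstallationError: "
    "('', 'Current bootbank /bootbank is not verified and most likely a "
    "serious problem was encountered during boot, it is not safe to continue "
    "altbootbank install. bootstate is 2, expected value is 0.')",
    "2012-04-30T09:28:04Z esxupdate: HostImage: DEBUG: Live image has been "
    "updated but /altbootbank image has not.  This means a reboot is not safe.",
]

errors = find_bootstate_errors(sample)
```

in practice the lines would come from reading /var/log/esxupdate.log instead of the hard-coded sample.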

so somehow my bootbanks got corrupted and my ESXi hosts come up with an older state of the configuration. i'm experiencing this on all of my servers. i can restore the configuration manually, but i can't reboot my hosts any longer: on the next reboot the configuration is lost again.
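
a quick way to check a bootbank before rebooting might be to read the bootstate value from its boot.cfg. the sketch below assumes boot.cfg is a plain key=value text file (the title=Loading VMware ESXi line in the log suggests as much) and treats 0 as the only good value, since that is what esxupdate says it expects. the function names and the sample text are made up for illustration.

```python
def read_boot_cfg(text):
    """Parse key=value lines from boot.cfg text into a dict of strings."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def bootbank_is_verified(text):
    """True only when bootstate is 0, the value esxupdate expects."""
    return read_boot_cfg(text).get("bootstate") == "0"


# Hypothetical boot.cfg content resembling the failing state from the log.
bad_cfg = "bootstate=2\ntitle=Loading VMware ESXi\n"
good_cfg = "bootstate=0\ntitle=Loading VMware ESXi\n"
```

on a host the text would come from the boot.cfg inside /bootbank or /altbootbank; the exact set of other keys in that file is not assumed here.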

i opened an SR with VMware support and they are investigating, but there is no resolution so far. so i thought that maybe i'm not the only one in the world having something like this. does anybody have any ideas?

regards,

kipz

8 Replies
kmitchguru
Contributor

Did you ever figure out what was going on?  I have the same problem.

tidaltides
Enthusiast

Same thing happened to me earlier. Any update on this?

kipz
Contributor

hello,

yes, as i mentioned i had an SR open with VMware support. we did countless different tests and they finally admitted that it is a bug in ESXi. the GA version of ESXi 5.0.0 works fine, but already the first updates (specifically ESXi500-201109401-BG) triggered this issue.

engineering came up with a debug patch (build 7xxxxx something) which actually fixed the issue. but as it was just a debug patch, it was not supported in production environments. so i'm currently running the unpatched GA version and waiting for the official patch, which should be released (according to support) in Q3 or Q4. so half a year to go...

kipz

kipz
Contributor

but just out of curiosity, do you also have Cisco UCS blades, or just SAN boot? i'd like to know the cause of the issue (support did not enlighten me on that).

kipz

kmitchguru
Contributor

Yes, I'm using Cisco blades (the B200-M1/B200-M2/B230M2 specifically, using the M81KR converged NIC).

I've done a fair amount of experimentation and believe it's the boot-from-SAN disk that is causing the problem.

If I install 5.0U1 onto a dedicated SAN disk (not using Auto Deploy... just using UCS blades and creating a boot volume per blade containing ESXi) then the bootbanks (where the config is stored) get mapped to the RAM disk as opposed to the hard disk. This means that every time I reboot, my config gets lost.
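
One way to check where the bootbanks ended up is to resolve the /bootbank and /altbootbank links and look at the target. The sketch below is not an official tool. It assumes that a disk-backed bootbank resolves to something under /vmfs/volumes/ and that a RAM-disk one resolves to /tmp; both path conventions are assumptions for illustration, so adjust them to what you actually see on your hosts. The function names are made up.

```python
import os


def bootbank_backing(resolved_path):
    """Classify a resolved bootbank path as 'disk', 'ramdisk', or 'unknown'."""
    # Assumption: disk-backed bootbanks live under /vmfs/volumes/.
    if resolved_path.startswith("/vmfs/volumes/"):
        return "disk"
    # Assumption: a bootbank that fell back to the RAM disk resolves to /tmp.
    if resolved_path == "/tmp" or resolved_path.startswith("/tmp/"):
        return "ramdisk"
    return "unknown"


def check_bootbanks(paths=("/bootbank", "/altbootbank")):
    """Resolve each bootbank link and classify where it points."""
    # dict(...) over a generator keeps this usable on old Python 2 interpreters.
    return dict((p, bootbank_backing(os.path.realpath(p))) for p in paths)
```

Calling check_bootbanks() on a host and seeing 'ramdisk' for either entry would match the behaviour described here, where the config does not survive a reboot.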

With 5.0GA the bootbanks go to the disk, but when I update to 5.0U1 (using esxcli), after it's done we are back at a RAM disk. So it seems that something in 5.0U1 (or one of the patches between 5.0GA and 5.0U1) causes this to occur.

Also, when starting with 5.0GA and using VUM to update, the bootbanks seem to get corrupted during the process and the updates fail, rendering the install trashed.

If I put local disks in the blade then it all works fine... but to support stateless UCS blades I'd really like to use the boot-from-SAN option. It seems as though 4.x and 5.0GA generally work fine. It's some update after 5.0GA that causes everything to go south.

I've seen a few other reports of this using Dell servers as well, so I don't think the UCS has anything to do with it... just something changed in the way SAN disks are treated between 5.0GA and 5.0U1... I suppose I can set up Auto Deploy and use that to push down configs, but I'd rather not have to go through that trouble.

mikko201110141
Contributor

Hi,

Has there been any news on this bug / SR? I seem to have hit it as well, on our new UCS B200 M3 servers with the Cisco integrated VIC 1240.

kipz
Contributor

hi,

for me it has now appeared with the GA release of ESXi 5 as well. so i re-opened the case with VMware support and requested a quicker patch. they put the patch development into their work queue and promised to release it as quickly as possible; they mentioned mid-August. i hope to have it sooner. but as far as i understood, it will not be publicly available before late Q3/Q4. so if you need it sooner, you will probably have to open an SR too to get the private patch over the summer.

regards,

kipz

kipz
Contributor

for some servers i followed one of support's recommendations and deployed VMware Auto Deploy. so instead of SAN boot i use PXE-booted stateless ESXi servers, so there are no bootbanks left to be corrupted. but unfortunately i cannot use Auto Deploy everywhere, so i still need that patch.
