wolfwolf
Contributor

ESXi lost configuration on reboot

Hi,

Recently we've installed ESXi 4.1 Update 1 embedded version on the internal SD card (SanDisk 2GB) for a few Dell PowerEdge M610 blade servers using the ESXi recovery CD from Dell, also available for download from the VMware site:

http://downloads.vmware.com/d/details/dell_esxi_recoverycd/ZHcqYip3ZWJkKmV3

Yesterday we rebooted the servers to make sure everything would come back up.

To our surprise, when ESXi came back up on each server, the whole configuration was gone. IP address, server name, password: everything was lost, and the configuration was back to what it would be right after installing ESXi.

Any ideas why that happened?

We did a proper reboot for ESXi: one of the servers was rebooted using vCenter, the others were rebooted using the ESXi DCUI using the F11 key.

Before the reboot, one of the ESXi servers had been up with its configuration for a few days and was managed through vCenter. The others had been installed an hour or so before the reboot and weren't connected to vCenter yet. No matter how long they had been up, all of them lost their configuration after the reboot.

We've been unable to reproduce the issue so far. Today we reconfigured ESXi on a couple of those servers and rebooted them from the ESXi DCUI: each of them came back online just fine this time, and the configuration is still there.

We didn't reconfigure all of them, so let me know if there is anything we can check on the ones still untouched since the reboot to work out why they lost the configuration.

Thanks.

krishnaprasad
Hot Shot

Hello,

I do see a Dell document that covers the compatibility of USB keys supported on Dell PowerEdge servers: http://support.dell.com/support/edocs/software/eslvmwre/VS/docs/compat/Com_Mat.pdf

As per this document, the SanDisk 2GB card is not supported on an M610 system. Is it worth trying a Kingston 2GB / 1GB SD card?

This doesn't explain why the failure is seen, but it's a pointer in case you are using an unsupported configuration.

It might be useful to capture the logs at the time of the system reboot, when the failure occurs.

Thanks,

Krishnaprasad

wolfwolf
Contributor

krishnaprasad wrote:

I do see a Dell document that covers the compatibility of USB keys supported on Dell PowerEdge servers: http://support.dell.com/support/edocs/software/eslvmwre/VS/docs/compat/Com_Mat.pdf

As per this document, the SanDisk 2GB card is not supported on an M610 system. Is it worth trying a Kingston 2GB / 1GB SD card?

This doesn't explain why the failure is seen, but it's a pointer in case you are using an unsupported configuration.

It might be useful to capture the logs at the time of the system reboot, when the failure occurs.

We've already called Dell, and for some reason they couldn't find the Kingston 2GB SD card matching the part number (738M1) from that PDF document. Anyway, we've ordered from them what appears to be the same item, just with a different part number, and we should be getting it soon.

Meanwhile, any ideas on what we could do to further investigate the issue, in case it's not related to the SD card compatibility?

We still have access to the systems that lost the ESXi configuration on reboot.

Thanks.

krishnaprasad
Hot Shot

It looks like /bootbank was not updated on your faulty systems. /bootbank contains all the .tgz files where the ESXi state is stored.

There is a script in ESXi called backup.sh which takes care of writing the ESXi state (network reconfiguration, for example) to /bootbank. backup.sh normally runs every hour, I think. Ideally it should have saved the configuration files modified when you updated the network info, password, etc. (the user-configurable settings in ESXi). In your case, though, it looks like the backup script may have failed to update the state on the SD card with the new configuration. Can you run backup.sh manually on those faulty systems and see how long it takes with the SanDisk SD card?

Ideally, replace the card with the supported SD model and see whether you can still reproduce the issue.

Thanks,

Krishnaprasad

wolfwolf
Contributor

On every system, we've replaced the SanDisk 2GB SD card with a Kingston 2GB SD card ordered from Dell, which should be supported.

And we can still reproduce the issue on all systems.

It appears that the configuration is only lost on the first reboot: after the first reboot, if you reconfigure the system and reboot it again, the configuration will stick just fine. But we're still not very confident putting these systems into production, until we figure out what's causing the problem.

Also, the issue appears to affect only the Dell ESXi ISO. We cannot reproduce it with the VMware ESXi ISO (but we need the Dell version, since the VMware version doesn't include the drivers for the Intel 10Gbit NIC).

Any ideas on what to try next to figure out the issue?

Thanks.

krishnaprasad
Hot Shot

Interesting. Can you check whether running the command /sbin/auto-backup.sh solves the problem?

That is, right after the first boot, make the necessary configuration changes (set the password, change the IP, etc.). Then, before rebooting, run the above command and see if it makes any difference. Also, if you don't mind, post the output of the command from your system.
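One way to confirm that a manual backup actually rewrote the saved state is to fingerprint the archive before and after running the script. The Python sketch below only illustrates that check: the helper names are made up for this example, the demo uses a temporary file as a stand-in for the real archive, and whether the Python interpreter on a given ESXi build can run it is an assumption (comparing the file size and timestamp before and after gives a rougher version of the same information).

```python
import hashlib
import os
import tempfile


def fingerprint(path):
    """Return (size in bytes, SHA-256 hex digest) for the file at path."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so a large archive is never held in memory at once.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return os.path.getsize(path), digest.hexdigest()


def content_changed(before, after):
    """True when two fingerprints describe different file content."""
    return before != after


if __name__ == "__main__":
    # Stand-in for the state archive; on a host you would pass its real path.
    with tempfile.TemporaryDirectory() as tmp:
        archive = os.path.join(tmp, "state-archive.tgz")
        with open(archive, "wb") as handle:
            handle.write(b"old configuration")
        before = fingerprint(archive)

        # Simulates the backup script rewriting the archive.
        with open(archive, "wb") as handle:
            handle.write(b"new configuration")
        after = fingerprint(archive)

        print("archive changed:", content_changed(before, after))
```

If the fingerprint is identical before and after the backup run, the configuration changes never reached the archive, which would point at the backup step rather than at the reboot.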

http://deinoscloud.wordpress.com/2010/02/17/esxi-automatic-backups/

Dave_Mishchenko
Immortal

So you're following this process:

1) boot Dell ISO

2) install

3) reboot

4) configure ESXi

5) reboot

6) lose config changes

As suggested, I would manually run the backup before step 4 or 5 and also check that bootbank is properly mounted at that point.

antonvn
Contributor

Colleagues,

We hit exactly the same issue with M610, ESXi4U1 (Dell patched) and 2GB SanDisk cards last week.

Depending on luck, after 1-3 configuration attempts followed by a reboot, the system "memorizes" its settings (IP, password, etc.). We tried to force a configuration save with /sbin/auto-backup.sh 0 /bootbank/ (run a few times, then sync; sync; reboot). What became clear is that /bootbank/ sometimes gets re-initialized, and we see the /altbootbank/ partition as well. We also tried installing ESXi onto a 20GB SATA SSD system drive, with the same result.

Once the settings are "remembered" after a reboot, subsequent reboots no longer wipe the configuration.

We are still not happy to put the affected systems into production.

Anton.

bulletprooffool
Champion

Morning,

This is definitely something we have seen before, and in almost all cases it was related to bootbank not backing up as it should (generally a faulty filesystem).

I have a blog post for this with a fix and a more complete explanation of what is happening:

http://www.get-virtual.info/2011/02/02/esx-host-losing-settings-at-reboot-checking-system-partitions...

One day I will virtualise myself . . .

wolfwolf
Contributor

Dave Mishchenko wrote:

So you're following this process:

1) boot Dell ISO

2) install

3) reboot

4) configure ESXi

5) reboot

6) lose config changes

As suggested, I would manually run the backup before step 4 or 5 and also check that bootbank is properly mounted at that point.

Yes, that's exactly the process we followed:

1) Boot Dell ESXi installable edition ISO

2) Install ESXi

3) When ESXi installation is completed, press enter to reboot

4) Configure ESXi: network, etc.

5) Reboot from ESXi DCUI with F12 key (let's call this first reboot from now on)

6) ESXi comes back up and all configuration appears to be lost

As per your suggestion, we also tried running the /sbin/auto-backup.sh script between step 4 and step 5. It ran fine, but it didn't help with the problem.

Also, if after step 6, we reconfigure ESXi again and reboot, then the configuration stays in place: the issue appears to happen just for the first reboot (after the first time ESXi is booted on the system).

We've been able to reproduce the issue even by installing the Dell ESXi installable edition in a VM (on VMware Fusion).

If you want to give it a try, here is the Dell ESXi installable edition ISO:

http://support.dell.com/support/downloads/download.aspx?c=us&cs=555&l=en&s=biz&releaseid=R297945&Sys...

http://ftp.us.dell.com/esg%20solutions/VMware-VMvisor-Installer-4.1.0.update1-348481.x86_64-Dell_Cus...

The issue doesn't happen with the VMware ESXi installable edition ISO, so there is definitely something wrong with the Dell ISO.

We compared ESXi from Dell installed in a VM with ESXi from VMware installed in another VM and found some differences:

- On VMware ESXi, the initial /bootbank and /altbootbank partitions remain the same after the first reboot. On Dell ESXi, the initial /bootbank partition becomes the /altbootbank partition and the /altbootbank partition becomes the /bootbank partition after the first reboot (we verified this by creating a test file with a different name in each partition before the reboot).

- When ESXi is started for the first time after the installation, on VMware ESXi the /altbootbank partition is almost empty: it only contains the boot.cfg file. On Dell ESXi, the /altbootbank partition appears to contain all the files that are also in the /bootbank partition (although a recursive diff shows differences between the two file sets).

- On Dell ESXi, when ESXi is started for the first time, the modification dates of the files in the /altbootbank partition appear to be more recent than those of the files in the /bootbank partition, except for the /bootbank/local.tgz file. That is also the file that gets updated if we manually run the /sbin/auto-backup.sh script. It is as if the /altbootbank partition were actually the active partition, except for the config file.

Also, in the process above, if after step 5 we press Shift+R while ESXi is loading, ESXi comes back online with the correct configuration!
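The comparison described in the points above (which files exist in each bank, which differ, and which copy is newer) can be sketched in Python as follows. This is a minimal illustration, not the method used in this post: the function names are hypothetical, only top-level files are compared, and the demo builds two temporary directories with made-up file contents in place of real /bootbank and /altbootbank partitions.

```python
import filecmp
import os
import tempfile


def compare_banks(bank_a, bank_b):
    """Compare the top-level entries of two directories by content."""
    names_a = set(os.listdir(bank_a))
    names_b = set(os.listdir(bank_b))
    common = sorted(names_a & names_b)
    # shallow=False forces a byte-by-byte comparison instead of trusting stat().
    same, different, unreadable = filecmp.cmpfiles(
        bank_a, bank_b, common, shallow=False
    )
    return {
        "only_in_a": sorted(names_a - names_b),
        "only_in_b": sorted(names_b - names_a),
        "same": sorted(same),
        "different": sorted(different),
        "unreadable": sorted(unreadable),
    }


def newer_in_first(bank_a, bank_b, names):
    """Return the names whose copy in bank_a was modified later than in bank_b."""
    return [
        name
        for name in names
        if os.path.getmtime(os.path.join(bank_a, name))
        > os.path.getmtime(os.path.join(bank_b, name))
    ]


def write_file(directory, name, data, mtime):
    """Create a file with the given content and modification time."""
    path = os.path.join(directory, name)
    with open(path, "wb") as handle:
        handle.write(data)
    os.utime(path, (mtime, mtime))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
        # Made-up contents standing in for two boot banks.
        write_file(a, "boot.cfg", b"updated=1\n", 1000)
        write_file(b, "boot.cfg", b"updated=2\n", 2000)
        write_file(a, "local.tgz", b"saved state", 3000)
        write_file(b, "local.tgz", b"saved state", 2000)
        report = compare_banks(a, b)
        print(report)
        print(newer_in_first(a, b, report["same"] + report["different"]))
```

Running the same comparison before and after the first reboot would show whether the two banks have swapped roles, without relying on marker files.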

Any ideas on what's wrong with the Dell ISO? How come the partitions get switched at the first reboot?

How does ESXi pick which partition becomes active?

Thanks.

DSTAVERT
Immortal

See what happens with a recovery boot. Try Shift + R when the loading Hypervisor screen appears. That should give you the ability to go to the previous update/version.

-- David -- VMware Communities Moderator

wolfwolf
Contributor

David Stavert wrote:

See what happens with a recovery boot. Try Shift + R when the loading Hypervisor screen appears. That should give you the ability to go to the previous update/version.

As per my previous message, in the process above, if after step 5, we press Shift+R when ESXi is loading, ESXi comes back online with the correct configuration!

I'm still not sure what's broken with the Dell ISO and whether it's safe to assume that the issue only affects the first reboot.

Thanks.

DSTAVERT
Immortal

Missed that part. I would go through and make configuration changes and restart to see whether you have a viable install.

-- David -- VMware Communities Moderator

Dave_Mishchenko
Immortal

The boot.cfg file is read in both Hypervisor1 and Hypervisor2 to determine which one has the variable updated = 2. That one gets mounted as bootbank. The boot.cfg in altbootbank has a value of updated = 1. It may be that the Dell image includes some scripts, etc. that they just want run once, and that their cleanup process for those scripts is not properly dealing with the bootbanks.
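The selection rule described above can be modelled in a few lines of Python. This is only a sketch of the rule as stated in this post (the bank whose boot.cfg has the higher updated value is mounted as bootbank); the real boot loader may take other fields into account, and the function names and sample boot.cfg contents are invented for illustration.

```python
def parse_boot_cfg(text):
    """Parse 'key=value' lines from boot.cfg-style text into a dict."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def updated_counter(text):
    """Return the integer 'updated' value, or -1 if it is missing or malformed."""
    try:
        return int(parse_boot_cfg(text)["updated"])
    except (KeyError, ValueError):
        return -1


def pick_active_bank(banks):
    """Given {bank name: boot.cfg text}, return the name with the highest 'updated'."""
    return max(banks, key=lambda name: updated_counter(banks[name]))


if __name__ == "__main__":
    # Hypothetical boot.cfg contents; real files carry more keys than this.
    banks = {
        "Hypervisor1": "updated=1\n",
        "Hypervisor2": "updated=2\n",
    }
    print("mounted as bootbank:", pick_active_bank(banks))
```

Under this model, anything that raises the counter in the other bank after installation changes which bank is mounted as bootbank on the next boot, regardless of where the configuration was last saved.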

DSTAVERT
Immortal

Just out of curiosity, did you do an MD5 checksum on the downloaded ISO?

-- David -- VMware Communities Moderator

wolfwolf
Contributor

Yes, the MD5 sum is correct.

And we can reproduce the issue installing Dell ESXi on our M610 servers and also in a VM on VMware Fusion.

DSTAVERT
Immortal

I can also confirm the same behavior using your Dell download link. Shift + R does boot the correct bootbank, and a subsequent reboot does come up in a configured state.

Does this server have the dual SD card slots?

-- David -- VMware Communities Moderator

Dave_Mishchenko
Immortal

oem.tgz on Hypervisor1 contains a tweak for MD36xxi support in the form of a VIB, which means that /altbootbank being set as the active boot partition for the next reboot is correct. They should have included a reboot command in the customization. Since the host doesn't reboot, your changes get written to /bootbank either when you reboot or when the auto backup process runs (1 min past the hour).

krishnaprasad
Hot Shot

That's exactly the case, Dave.

In this case, the 'updated' variable in /bootbank becomes '1' and the same variable in /altbootbank becomes '2'. Hence, on the next boot the system uncompresses the files from /altbootbank, which don't carry changes such as the password and IP settings.

As you mentioned, Dell carries an init script for configuring their MD36xxi array, and this scenario happens during the cleanup of the VIB (which contains the script) via esxupdate.

I also see an esxupdate command, 'esxupdate clearpending', executed in a VMware script 130.-x.x... Does this remove the locks or bring the VIB database into sync?

So, other than rebooting the server to make the changes (removal of the VIB) effective, is there any other way to remove the VIB from the init script while keeping the updated variable at '1' for altbootbank and '2' for bootbank?

BTW, wolfwolf, you can call Dell tech support so that they can change the ISO images.

wolfwolf
Contributor

Dave Mishchenko wrote:

oem.tgz on Hypervisor1 contains a tweak for MD36xxi support in the form of a VIB, which means that /altbootbank being set as the active boot partition for the next reboot is correct. They should have included a reboot command in the customization. Since the host doesn't reboot, your changes get written to /bootbank either when you reboot or when the auto backup process runs (1 min past the hour).

Thanks for the explanation.

At this point, is it safe to go with the Dell ISO (doing a manual first reboot, after the initial boot, before applying any configuration) or is it better to go with the stock VMware ISO to avoid any risks of losing the configuration at future ESXi reboots/updates?

Besides the Intel 10G NIC driver missing from the VMware ISO, what are the advantages of the Dell ISO?

It looks like you can download the Intel 10G NIC driver from the VMware site and apply it to the stock ESXi installation from VMware.

If you do so, will the driver be carried over when applying future ESXi updates to the host?

About the Dell ISO bug, what would be the best way to report it to Dell? Any specific email address?

Thanks.
