During the upgrade of the ESX host, you may notice that the reboot time is excessive compared to the previous version of ESX.
I noted very poor performance in two areas:
- On the console, when the boot process gets to 'Advanced-Config' it appears to hang for minutes.
- In addition, post upgrade, the process of adding the server back into a HA cluster can take several minutes and often times out.
On the console you may see errors along the lines of:
- Error interacting with configuration file /etc/vmware/esx.conf: Failed attempting to lock file. Another process has locked the file for more than 10 seconds.
This is associated with the process esx-config.xml -u, which runs during the upgrade. You will also see a lock file in the same directory as esx.conf.
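To spot the lock file quickly, a small Python sketch like the one below could list candidates next to esx.conf. The exact lock file name is not given above, so the matching rule (a name that starts with "esx.conf" and contains "lock") is purely an assumption, as is the helper name find_lock_files. It assumes Python 3, so it may be easier to run against a copy of the directory than on the service console itself.

```python
import os


def find_lock_files(directory, base="esx.conf"):
    """Return names in 'directory' that look like lock files for 'base'.

    The naming rule is an assumption: starts with the base name and
    contains 'lock' (any case). Adjust it to whatever you actually see.
    """
    names = []
    for name in sorted(os.listdir(directory)):
        if name != base and name.startswith(base) and "lock" in name.lower():
            names.append(name)
    return names


if __name__ == "__main__":
    directory = "/etc/vmware"  # directory holding esx.conf, as named above
    try:
        found = find_lock_files(directory)
    except OSError as exc:
        print("Could not list", directory, "-", exc)
    else:
        print("Possible lock files:", found if found else "none")
```

If a lock file is present while hostd is still busy, that fits the "Failed attempting to lock file" error above.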
In my case the cause was a corrupt esx.conf file, approximately 25000 lines long and full of garbage resource pool entries. Running 'top' showed the hostd process at 100% CPU for the duration of the incident, with a warning logged to hostd.log for each invalid resource pool:
- 2010-01-22 12:08:07.251 F66D16D0 warning 'HostsvcPlugin' Destroying unregistered VMkernel resource group 'host/user/pool0/pool4/pool7'
These errors were being corrected at a rate of about one every 0.5 seconds as the esx.conf file was parsed.
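To get a feel for how long the clean-up might take, the warnings can be counted and multiplied by the observed rate. The Python sketch below is only an illustration: the pattern is modelled on the single hostd.log line quoted above, other hostd versions may word it differently, and the function names are made up for this example. It assumes Python 3 and a list of log lines (for example, read from a copy of hostd.log).

```python
import re

# Pattern modelled on the one hostd.log line quoted above; treat it as an
# assumption about the message format, not a documented one.
POOL_WARNING = re.compile(
    r"Destroying unregistered VMkernel resource group '([^']+)'"
)


def unregistered_pools(lines):
    """Return the resource group paths named in matching warning lines."""
    pools = []
    for line in lines:
        match = POOL_WARNING.search(line)
        if match:
            pools.append(match.group(1))
    return pools


def estimated_seconds(count, seconds_per_fix=0.5):
    """Rough clean-up time, using the ~0.5 s per correction noted above."""
    return count * seconds_per_fix


if __name__ == "__main__":
    sample = [
        # First line is the warning quoted above.
        "2010-01-22 12:08:07.251 F66D16D0 warning 'HostsvcPlugin' "
        "Destroying unregistered VMkernel resource group "
        "'host/user/pool0/pool4/pool7'",
        # Second line is an invented non-matching line for the demo.
        "some other log line",
    ]
    pools = unregistered_pools(sample)
    print(len(pools), "warning(s), roughly",
          estimated_seconds(len(pools)), "seconds to clear")
```

As a sanity check on the numbers: if most of the roughly 25000 lines were bad entries, 25000 x 0.5 s is about 12500 seconds, or close to three and a half hours, which is in the same range as the 4+ hours reported below.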
The solution turned out to be simple, but it took a bit of research to get there:
- Be patient - let the hostd service do its thing and attempt to resolve the issue. In my case this took over 4 hours!! Bear in mind I had spare capacity so was lucky not to affect service since my other hosts could take the load.
- Once the process has calmed down, the server should exit maintenance mode and be part of the cluster group again.
- Remove Host from Cluster
- Remove Host from vCenter
- Check the esx.conf file - it should now have been rebuilt and be around 1500 lines long
- Add the server back into the cluster again.
- All should now be well - reboots should be timely and no unregistered resource pools should be present
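The esx.conf size check in the steps above can be sketched in Python as follows. This is a minimal example under stated assumptions: the "about 1500 lines" figure comes from my host only, and the threshold (five times that) and the function names are arbitrary choices for illustration, not VMware guidance. It assumes Python 3, so running it against a copy of the file may be easiest.

```python
import sys


def count_lines(path):
    """Count the lines in a text file without loading it all into memory."""
    with open(path, "r", errors="replace") as handle:
        return sum(1 for _ in handle)


def looks_bloated(line_count, expected=1500, factor=5):
    """Flag a file far larger than the ~1500 lines seen after the rebuild.

    'expected' and 'factor' are illustrative assumptions only.
    """
    return line_count > expected * factor


if __name__ == "__main__":
    # Pass a path to a copy of esx.conf, or fall back to the path named above.
    path = sys.argv[1] if len(sys.argv) > 1 else "/etc/vmware/esx.conf"
    try:
        total = count_lines(path)
    except OSError as exc:
        print("Could not read", path, "-", exc)
    else:
        verdict = "looks bloated" if looks_bloated(total) else "looks normal"
        print(path, "has", total, "lines;", verdict)
```

With these defaults, a 25000 line file is flagged and a 1500 line file is not.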
By this stage the server should have been added back into HA if you have configured it (and not timed out as it did previously), so you can start loading the server with machines as part of DRS.
Hope that helps. If this is a known issue, please remove this doc but send me the link, since no end of googling found it!!