During and after a host upgrade to ESX 4.0, reboots take an excessive time and the host performs sluggishly.

During the upgrade of the ESX host, you may notice that the reboot time is excessive compared to the previous version of ESX.

I noticed very poor performance in two areas:

  1. On the console, when the boot process gets to 'Advanced-Config', it appears to hang for several minutes.
  2. After the upgrade, adding the server back into an HA cluster can take several minutes and often times out.


On the console you may see errors along the lines of:

  • Error interacting with configuration file /etc/vmware/esx.conf:  Failed attempting to lock file.  Another process has locked the file for  more than 10 seconds.


This is associated with the process esx-config.xml -u during the upgrade. In addition, you will see a lock file in the same directory as esx.conf.

In my situation the cause was a corrupt esx.conf file, approximately 25000 lines long and full of garbage resource pool entries. 'top' showed the hostd process running at 100% for the duration of the incident, with an entry logged to hostd.log for each invalid resource pool:

  • 2010-01-22 12:08:07.251 F66D16D0 warning 'HostsvcPlugin' Destroying unregistered VMkernel resource group 'host/user/pool0/pool4/pool7'


These entries were being corrected roughly once every 0.5 seconds as the esx.conf file was parsed.
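To get a feel for how far hostd has progressed, the warnings can be counted from a copy of hostd.log. The following is only a rough sketch: the pattern is derived from the single sample line quoted above, so other hostd builds may log differently, and the function name `summarize` is my own.

```python
import re
from datetime import datetime

# Pattern built from the one sample line in this post; treat it as an assumption.
WARNING = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) .*"
    r"Destroying unregistered VMkernel resource group '([^']+)'"
)

def summarize(lines):
    """Return (warning count, seconds from first to last warning, set of pool paths)."""
    stamps, pools = [], set()
    for line in lines:
        match = WARNING.search(line)
        if match:
            stamps.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f"))
            pools.add(match.group(2))
    span = (stamps[-1] - stamps[0]).total_seconds() if stamps else 0.0
    return len(stamps), span, pools
```

Calling `summarize(open("hostd.log"))` on a copied log gives the number of pools cleaned up so far; dividing the span by the count should land near the 0.5 seconds per entry mentioned above if your host behaves like mine did.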

The fix turned out to be simple, but it took a bit of research to get there:

  1. Be patient: let the hostd service do its thing and attempt to resolve the issue. In my case this took over 4 hours! Bear in mind that I had spare capacity, so I was lucky not to affect service, since my other hosts could take the load.
  2. Once the process has calmed down, the server should exit maintenance mode and rejoin the cluster.
  3. Remove the host from the cluster.
  4. Remove the host from vCenter.
  5. Check the esx.conf file: it should now have been rebuilt and be around 1500 lines long.
  6. Add the server back into the cluster.
  7. All should now be well: reboots should be timely and no unregistered resource pools should be present.
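The check in step 5 can be scripted if you have several hosts to look at. This is a minimal sketch that only counts lines and bytes. The 5000-line cut-off is an arbitrary assumption of mine, picked between the roughly 1500 lines of a rebuilt file and the roughly 25000 lines of the corrupt one; it is not a VMware limit, and `inspect_conf` is a made-up helper name.

```python
import os

# Assumed threshold: somewhere between ~1500 (healthy) and ~25000 (corrupt) lines.
SUSPECT_LINES = 5000

def inspect_conf(path):
    """Return (line count, size in bytes, looks_bloated) for a config file."""
    with open(path, "rb") as handle:
        line_count = sum(1 for _ in handle)
    size = os.path.getsize(path)
    return line_count, size, line_count > SUSPECT_LINES
```

For example, `inspect_conf("/etc/vmware/esx.conf")` on the host, or the same call against a copy of the file pulled to a workstation, shows at a glance whether the file has shrunk after the rebuild.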


By this stage the server should have been added back into HA, if you have configured it (without timing out as it did before), so you can start loading the server with machines as part of DRS.

Hope that helps. If this is a known issue, please remove this doc, but send me the link, since no end of googling found it!

Comments

I have the same problem, but with ESX 3.5 U4 servers which now report to vCenter 4 U1 (was VirtualCenter 2.5).

The esx.conf files are thousands of lines long, mostly filled with empty resource pool entries. On some servers the file is over 9MB. When I reboot the server it takes hours before it rejoins vCenter as it takes so long to parse the file.

When I called VMware last week they said they hadn't seen this before, and that I should rebuild my server. However, I've checked my hosts and 15 of them have this problem! I tried the steps above but unfortunately removing from the cluster and from vCenter didn't rebuild the esx.conf file.

Any ideas?

Hi

Did you let the hostd service finish its initial parsing of the esx.conf file? It will run at 100% utilisation until it does (only on one core, so it's not a killer).

Tail hostd.log until it has stopped logging the resource pool errors, then try the remove and re-add. Basically, wait until the host is back in the cluster, connected, and HA is initialised again.
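If you would rather not watch the tail by hand, something like the following can poll the log for you. It is a sketch under assumptions: the marker text is taken from the sample warning earlier in this doc, the 60-second interval is arbitrary, and log rotation is not handled beyond carrying on polling. The function names are my own.

```python
import time

# Marker copied from the sample warning in the original post.
MARKER = "Destroying unregistered VMkernel resource group"

def count_warnings(path):
    """Count the lines in the log that contain the resource group warning."""
    with open(path, errors="replace") as handle:
        return sum(1 for line in handle if MARKER in line)

def wait_until_quiet(path, interval=60, sleep=time.sleep):
    """Block until one full interval passes with no new warnings; return the count."""
    previous = count_warnings(path)
    while True:
        sleep(interval)
        current = count_warnings(path)
        if current == previous:
            return current
        previous = current
```

Once `wait_until_quiet("hostd.log")` returns, hostd has gone at least one interval without logging a new warning, which is the point at which I would try the remove and re-add.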

It could be that you need to rebuild esx.conf manually, so it's worth a look at http://bit.ly/cAGkHl.

In addition, I probably would have experienced the issue as you have if I had rebooted the ESX3.5 hosts prior to upgrade, but only found it when I had upgraded to ESX4 on the host. ESX4 is a full install anyway (rather than an 'upgrade') so you may be better just upgrading, then attempting the remove/add fix later.

Yes, I let it finish the parsing, and removed the server from the cluster, then from vCenter, but the esx.conf file still stays huge (9MB). I even created a new cluster and added it to that, but it made no difference. Out of the 15 servers in two clusters, 4 have esx.conf over 3MB and the others are 300-600KB (so also with spurious empty resource entries).

I looked at a host which was in a different cluster (which never had resource groups) and the esx.conf is a more healthy 40KB.

VMware support are looking at my logs to see what's going on, so fingers crossed they'll find something.

Version history: revision 1 of 1, last updated 01-23-2010 05:06 AM