During the upgrade of the ESX host, you may notice that the reboot time is excessive compared to the previous version of ESX.
In two areas I noted very poor performance:
On the console you may see errors along the lines of:
This is associated with the process esx-config.xml -u during the upgrade. In addition, you will see a lock file in the same directory as esx.conf.
In my situation the cause was a corrupt esx.conf file, approximately 25000 lines long and full of garbage resource pool indicators. 'top' showed the hostd process running at 100% for the duration of the incident, with an error logged to hostd.log for each incorrect resource pool:
These errors were being corrected at a rate of roughly one every 0.5 seconds as the esx.conf file was parsed.
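To get a feel for how long that parsing might take, here is a rough Python sketch that counts the suspect lines in a copy of esx.conf and multiplies by the 0.5 seconds per entry noted above. The function name `estimate_parse_time` and the `pattern` argument are my own inventions for illustration; the real format of the bad resource pool lines isn't shown here, so you'd need to supply a regular expression that matches what you actually see in your file. It also assumes each bad line costs about the same time, which may not hold.

```python
import re


def estimate_parse_time(conf_text, pattern, seconds_per_entry=0.5):
    """Count lines matching `pattern` and estimate the time spent on them.

    `pattern` is a caller-supplied regular expression (hypothetical here);
    `seconds_per_entry` defaults to the ~0.5 s per error observed above.
    """
    regex = re.compile(pattern)
    lines = conf_text.splitlines()
    suspect = sum(1 for line in lines if regex.search(line))
    return {
        "total_lines": len(lines),
        "suspect_lines": suspect,
        "estimated_seconds": suspect * seconds_per_entry,
    }
```

If nearly all of a 25000 line file were bad entries, this would come out at about 12500 seconds, or roughly three and a half hours.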
The solution was simple from my perspective but took a bit of research to get to this stage:
By this stage the server should be added back into HA if you have configured it (and not time out as it did previously), so all should now be well to start loading the server with machines as part of DRS.
Hope that helps. If this is a known issue, please remove this doc, but send me the link, since no amount of googling found it!
I have the same problem, but with ESX 3.5 U4 servers which now report to vCenter 4 U1 (was VirtualCenter 2.5).
The esx.conf files are thousands of lines long, mostly filled with empty resource pool entries. On some servers the file is over 9MB. When I reboot the server it takes hours before it rejoins vCenter as it takes so long to parse the file.
When I called VMware last week they said they hadn't seen this before, and that I should rebuild my server. However, I've checked my hosts and 15 of them have this problem! I tried the steps above but unfortunately removing from the cluster and from vCenter didn't rebuild the esx.conf file.
Any ideas?
Hi
Did you let the hostd service finish its initial parsing of the esx.conf file? It will run at 100% util until it does (only on one core, so it's not a killer).
Tail hostd.log until it has stopped logging the resource pool errors, then try the remove and re-add. Basically, wait until it's back in the cluster, connected, and HA is initialised again.
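If you'd rather not sit watching the tail, something like the following Python sketch could do the waiting for you. It is only an illustration: `wait_until_quiet` is a made-up helper, `marker` is whatever text identifies the resource pool errors in your own hostd.log (not reproduced here), and the 30 second quiet period is an arbitrary guess. It starts reading at the current end of the file, as tail -f would, and returns once no new matching text has appeared for the quiet period.

```python
import os
import time


def wait_until_quiet(log_path, marker, quiet_seconds=30.0, poll_seconds=1.0):
    """Block until no new text containing `marker` has been appended to
    `log_path` for `quiet_seconds`. Returns the seconds since the last hit.
    """
    needle = marker.encode()
    offset = os.path.getsize(log_path)  # start at the current end of the log
    last_hit = time.monotonic()
    while time.monotonic() - last_hit < quiet_seconds:
        time.sleep(poll_seconds)
        if os.path.getsize(log_path) < offset:
            offset = 0  # file shrank, so assume it was rotated
        with open(log_path, "rb") as fh:
            fh.seek(offset)
            chunk = fh.read()
            offset = fh.tell()
        if needle in chunk:
            last_hit = time.monotonic()  # still logging errors, keep waiting
    return time.monotonic() - last_hit
```

One known gap: an error whose text is split across two reads would be missed, which should not matter when errors arrive every half second or so.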
It could be that you need to manually rebuild esx.conf, so it is worth a look at http://bit.ly/cAGkHl.
In addition, I probably would have experienced the issue as you have if I had rebooted the ESX3.5 hosts prior to upgrade, but only found it when I had upgraded to ESX4 on the host. ESX4 is a full install anyway (rather than an 'upgrade') so you may be better just upgrading, then attempting the remove/add fix later.
Yes, I let it finish the parsing, and removed the server from the cluster, then from vCenter, but the esx.conf file still stays huge (9MB). I even created a new cluster and added it to that, but it made no difference. Out of the 15 servers in two clusters, 4 have esx.conf over 3MB and the others are 300-600KB (so also with spurious empty resource entries).
I looked at a host in a different cluster (which never had resource groups) and its esx.conf is a healthier 40KB.
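For anyone with a lot of hosts to check, here is a small Python sketch of the comparison I did by hand: given the esx.conf size for each host, it flags the ones well above the 40KB baseline. The helper name `flag_oversized`, the host names in the example, and the "5 times the baseline" cut-off are all assumptions picked for illustration; how you collect the sizes from each host is up to you.

```python
def flag_oversized(sizes_bytes, baseline_bytes=40 * 1024, factor=5):
    """Return (host, size) pairs whose esx.conf size exceeds
    `factor` times the baseline, largest first.

    `sizes_bytes` maps a host name to its esx.conf size in bytes.
    """
    limit = baseline_bytes * factor
    flagged = [(host, size) for host, size in sizes_bytes.items() if size > limit]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical host names, with sizes in the ranges mentioned above.
sizes = {
    "host-a": 9 * 1024 * 1024,  # 9MB
    "host-b": 450 * 1024,       # 450KB
    "host-c": 40 * 1024,        # 40KB, the healthy baseline
}
suspects = flag_oversized(sizes)
```

Here `suspects` would contain host-a and host-b but not host-c.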
VMware support are looking at my logs to see what's going on, so fingers crossed they'll find something.