Since we updated our ESX hosts from ESX 3.5 Update 2 to ESX 3.5 Update 3 (plus the December 2008 patches), we have been experiencing abnormally high RAM usage by the hostd management process on the hosts.
The symptoms are that hostd hits its default hard memory limit of 200 MB and shuts itself down. This is indicated by the following error message in /var/log/vmware/hostd.log:
[2009-01-06 06:25:47.707 'Memory checker' 19123120 error] Current value 207184 exceeds hard limit 204800. Shutting down process.
The service is automatically restarted, but as a consequence the VirtualCenter agent also restarts, and the host temporarily becomes unresponsive in VirtualCenter (shown as "not responding" or "disconnected").
To fix the problem we increased the hard memory limit of hostd by following the recommendations we found here:
We increased the hard limit to 250 MB, but within several hours this limit was also reached on some hosts and the problem reappeared. So I suspect that there is some kind of memory leak in hostd, and we need to find its cause to finally solve the problem.
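For reference, the change amounts to editing the memory-limit values in hostd's configuration file. The tag names below follow the KB workaround as I remember it; please verify them against your own /etc/vmware/hostd/config.xml before editing, and restart the management service afterwards:

```xml
<!-- /etc/vmware/hostd/config.xml (excerpt; tag names assumed from the KB
     workaround, so verify against your own file before editing) -->
<config>
  <!-- soft limit: logs "exceeds soft limit" warnings only (default 120 MB) -->
  <hostdWarnMemInMB>120</hostdWarnMemInMB>
  <!-- hard limit: hostd shuts itself down above this (raised from 200 to 250) -->
  <hostdStopMemInMB>250</hostdStopMemInMB>
</config>
```

followed by "service mgmt-vmware restart" in the service console.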
According to VMware support there are all sorts of things that might cause an increased RAM usage of hostd, because there are many other processes/applications using it: The VirtualCenter agent, the HA agent, hardware management agents, and other external applications that use the web-API to talk to the host.
We have several hosts that are all configured the same and belong to the same cluster in VirtualCenter. These hosts (let's call them type A) all show the problem. However, there is one host that is not part of the cluster, but is also managed by the same VirtualCenter instance, and this one does NOT show the problem (let's call it type B). I hope to find out the reason for hostd's increased RAM usage by comparing these two types of hosts:
Both types are installed with ESX 3.5 Update 3 (+ Dec 2008 patches) using the same automated build.
Both are HP hardware and have the HP Management agents version 8.1.1 installed.
Both servers are monitored by the same HP System Insight Manager server that queries the installed HP management agents.
Type A is an HP ProLiant DL585 (G1 or G2) with four dual-core AMD Opteron CPUs. Type B is an HP ProLiant DL360 G5 with two quad-core Intel Xeon CPUs.
Type A has SAN connections (using QLogic adapters) and uses two NFS datastores for ISO images. Type B uses only local hard disks for VM and ISO storage.
Type A is in a DRS- and HA-enabled cluster (with EVC enabled). Type B is stand-alone.
I'm trying to find the problem's cause by process of elimination. I already disabled HA on the cluster, and this did NOT fix the problem. Now I have stopped the HP agents on one host to see if that makes a difference (although I do not expect it to, since both type A and type B have them running).
While I'm going down this path, I'd like some input from the community that might also lead to the cause of my problem:
Is anyone out there also experiencing high hostd RAM usage? What is your hard limit, and is your configuration comparable?
Is anyone out there with a configuration comparable to type A NOT seeing this problem (I guess there are many ...)? What might be the difference that causes it?
Any other helpful comments? Does anyone know a way to increase the debug level of hostd? (I also asked VMware support about this, but have not yet received an answer.)
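On the debug-level question: as far as I know (an assumption from memory of hostd's config format, not something confirmed by support), hostd's log verbosity is controlled by the log section of /etc/vmware/hostd/config.xml:

```xml
<!-- /etc/vmware/hostd/config.xml (excerpt; structure assumed from memory,
     so check your own file before changing anything) -->
<log>
  <!-- raise from the default "info" to "verbose" ("trivia" is the most detailed) -->
  <level>verbose</level>
</log>
```

A "service mgmt-vmware restart" should be needed to pick up the change.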
You can check RAM usage of hostd by using
ps waux | grep vmware-hostd
in the service console. It outputs something like
root 19285 0.4 8.1 81576 65488 ? S 07:44 0:37 /usr/lib/vmware/hostd/vmware-hostd /etc/vmware/hostd/config.xml -u
The bold number (the RSS column) is the RAM usage in KB, not MB (65488 KB here is about 64 MB). You can also check /var/log/vmware/hostd.log for messages like "... exceeds soft limit ..." (a warning only) and "... exceeds hard limit ..." (which causes a service restart).
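Since ps reports the RSS column in KB, a small helper (just a sketch, not anything shipped with ESX) makes the figure easier to read in MB:

```shell
# Sketch: convert the RSS column (field 6 of `ps waux`, reported in KB) to MB.
rss_to_mb() {
  awk '{ printf "%.1f MB\n", $6 / 1024 }'
}

# On the host you would pipe ps output through it, e.g.:
#   ps waux | grep '[v]mware-hostd' | rss_to_mb
# (the [v] pattern keeps grep from matching its own process)
# For the sample line above (RSS 65488 KB):
echo "root 19285 0.4 8.1 81576 65488 ? S 07:44 0:37 /usr/lib/vmware/hostd/vmware-hostd /etc/vmware/hostd/config.xml -u" | rss_to_mb
```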
I'll keep this thread updated with everything I find out myself or receive from VMware support. Thank you for any contributions.
We noticed that hostd eats up more and more memory, depending on the number of VMs running in the cluster. Memory in the hostd process is not released, even when the server is put into maintenance mode and no virtual machines are running on it! The only way to release the memory is to restart the service or reboot the ESX host. Why this behaviour? Why is the memory not released properly once a machine is no longer running on the ESX box?
If the hard limit is set to 200 MB, hostd restarts. However, the restart sometimes fails and the host becomes permanently disconnected; there is no way to reconnect it to VirtualCenter by restarting hostd again. Moreover, when using "service mgmt-vmware restart" and then issuing top in the service console, the hostd process appears to be spawned twice. Even a connection with the VI Client directly to the ESX host itself fails. By raising the hard limit to 400 MB we can avoid the problem but not solve it: this is on a cluster with a 50:1 VM-to-host ratio, and if we consolidate further (say 100:1 on bigger servers) we will have the same problem all over again.
Another issue we are seeing is the recurrence of the following message in hostd.log:
This results in a CPU0 time of 90%, and when we trigger a VMotion of XV00XXXX to another host, hostd jumps from 160 MB (running 15 virtual machines) to 220 MB, resulting in a reboot/disconnect! Strange stuff...
It looks like there are many people affected by this memory leak issue, but most of them probably do not even know it ...
PLEASE file a support request with VMware to raise the pressure on them and get this bug investigated and fixed quickly!
Mine is SR# 1151768711.
VMware TechSupport is still struggling to understand what the problem is. They still think that EMC software installed inside the service console is causing the problem (although I have told them more than once that no ControlCenter software is installed inside the service console and that only remote API calls are happening).
Yes, they even had me open a case with EMC (no. 27624004) so that EMC could investigate the problem, and then asked me for the answers I got from EMC. This is ridiculous given the fact that VMware is majority-owned by EMC.
Thanks for the info. We are running ESX 3.5 U3 on two IBM x3550 servers managed by VC 2.5 U3, with a total of 10 VMs hosted. Our attached iSCSI SAN is an EMC CLARiiON AX4i.
I've also noticed the hostd memory leak, growing from a default of ~86 MB up to the 200 MB hard limit over the course of approximately 30 days, but only on one of the two servers. Both servers are (to the best of my knowledge) configured identically. The unaffected server has run for over 70 days with hostd occupying no more than 86 MB. On the affected box, when the hostd service exceeds the hard limit it is shut down and restarted. Consequently the hosted VMs automatically reboot, causing much bother for users (even though VM auto start/stop is disabled).
An excerpt from the hostd.log of the affected server:
2009-02-03 07:55:15.577 'Memory checker' 22748080 warning Current value 204624 exceeds soft limit 122880.
2009-02-03 07:55:45.587 'Memory checker' 3076436896 warning Current value 204624 exceeds soft limit 122880.
2009-02-03 07:56:15.598 'Memory checker' 23280560 warning Current value 204624 exceeds soft limit 122880.
2009-02-03 07:56:26.632 'EnvironmentBrowser' 20896688 info Hw info file: /etc/vmware/hostd/hwInfo.xml
2009-02-03 07:56:26.634 'EnvironmentBrowser' 20896688 info Config target info loaded
2009-02-03 07:56:45.608 'Memory checker' 111623088 error Current value 204884 exceeds hard limit 204800. Shutting down process.
2009-02-03 07:56:45.618 'App' 3076436896 info END SERVICES
2009-02-03 07:56:45.638 'BlklistsvcPlugin' 3076436896 info Block List Service Plugin stopping
2009-02-03 07:56:45.658 'BlklistsvcPlugin' 3076436896 info Block List Service Plugin stopped
2009-02-03 07:56:45.674 'DirectorysvcPlugin' 3076436896 info Plugin stopped
2009-02-03 07:56:46.083 'HostsvcPlugin' 3076436896 info Plugin stopped
2009-02-03 07:56:46.093 'InternalsvcPlugin' 3076436896 info Plugin stopped
2009-02-03 07:56:46.161 'Nfc' 3076436896 info Plugin stopped
2009-02-03 07:56:46.211 'Proxysvc' 3076436896 info Stopped Proxy service
I can only recall having witnessed this problem since we moved from U2 to U3. We are not continuously running any IBM or EMC management agents on these servers. The only third-party item I can think of is the EMC Navisphere utility suite (command-line utilities for SAN registration and snapshots). I'm assuming that increasing the hard and soft memory limits will only buy extra time until the leak hits the hard limit again. I will continue to compare the two servers to see if any configuration mismatch arises, and will post back.
For additional information: I'm experiencing this issue on U2 (build 110268). I have no special SAN software installed, but I am running HP Insight Agents 8.11. The VM load is pretty low on all servers. The latest occurrence (where the server shows as disconnected in VirtualCenter and then reconnects) was on a quad-processor server with 16 GB of RAM supporting only three VMs accounting for a total of 3 GB. Service console memory (on all of my ESX servers) has been at 512 MB for a long time.
Here is a short update on this issue:
We noticed that one of the two NFS datastores we have mounted on the ESX hosts contained a very deeply nested directory structure with more than 200,000 files in it. The ECC agent obviously browsed all these directories and files, which took a very long time and consumed lots of hostd memory. We removed the files from the datastore (it really is intended for storing some ISO files only) and our problem was gone...
Anyway, hostd's behaviour looks buggy in this case: it should return the memory that the browsing query consumed once the query finishes (even if it times out or errors out), but it does not.
VMware support admitted that there is a memory leak problem in hostd, and it looks like they have received other support requests describing the same problem, though these may trace back to different causes. In any case, it is worth checking whether you have large directory/file structures somewhere on your datastores that are being browsed by some third-party tool.
Good catch on the NFS datastore thing. We too have a large NFS datastore that we mount on all our hosts, and we have been experiencing the same issue with hostd hanging while ECC is browsing the datastores. We have also seen hostd flakiness for other reasons as well. VMware, please fix this!
Mothers, don't let your children do production support for a living!
Concerning the memory leak in hostd: it appears that there is an issue caused by OpenPegasus / the CIM server. This will be fixed in ESX 3.5 U5 (yes, U5). The issue is seen more frequently when using the HP agents, as they integrate with OpenPegasus.
I got this feedback recommending disabling OpenPegasus / the CIM server:
I wanted to follow up with you on this case regarding these hosts' PSODs/hangs. The problem report that I added this SR to, prior to opening a storage case, has been addressed as a memory leak in ESX 3.5 U3 and will be corrected in the ESX 3.5 U5 release scheduled for April. This will correct the problems you have experienced with regard to the PSODs; however, please follow the workaround below in the event that you still have concerns until this patch is available.
Disable Pegasus from starting on reboot.
Note: This will affect the reporting of third-party SIM server tools such as the HP Insight Manager server, but it protects against problems that might otherwise be caused by the memory leak known to exist in Pegasus in the ESX console.
# Check the current runlevel configuration
chkconfig --list pegasus
pegasus 0:off 1:off 2:off 3:on 4:off 5:on 6:off
# Disable Pegasus at boot
chkconfig pegasus off
# Confirm that it is disabled (off) in all runlevels
chkconfig --list pegasus
pegasus 0:off 1:off 2:off 3:off 4:off 5:off 6:off
The root cause of this case has been identified and will be corrected permanently in the April release. At this stage there is nothing more I can provide other than to check for an update at . I will move to close this case today, as the workaround has been provided; however, please do not hesitate to contact VMware if you require immediate attention, and open a new SR if required.
So, how are you monitoring the hardware in the meantime? Is it just a wait-and-pray thing?
One way to prevent this is to avoid having the monitoring server 'probe' the ESX server and instead the ESX server sends SNMP traps to the monitoring server. The traps are converted to alerts, emails, etc to the administrator.
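For the trap-based approach, the service console runs a standard net-snmp daemon, so the configuration would look roughly like this (the destination host name and community string are placeholders, and you should check your own snmpd.conf layout):

```
# /etc/snmp/snmpd.conf on the service console (net-snmp syntax;
# destination host and community string are placeholders)
trapsink  monitor.example.com  public
```

followed by "service snmpd restart" (and "chkconfig snmpd on" so it survives reboots).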
What is impacted by turning off the Pegasus service? With the issue I am currently having, I am not concerned about monitoring the hardware for a little while. Can vCenter still see the hardware under Configuration | Health Status?
I've created a little DOS batch script to get a list of ESX servers and their current vmware-hostd usage, so you can determine which servers need a manual vmware-hostd restart.
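For anyone without a Windows box handy, the same idea can be sketched in plain shell (the host names, SSH key setup, and the 200 MB default hard limit are all assumptions to adapt to your environment):

```shell
# hostd's default hard limit (200 MB), expressed in KB as ps reports RSS
THRESHOLD_KB=204800

# Succeeds when the given RSS value (in KB) has reached the hard limit
needs_restart() {
  [ "$1" -ge "$THRESHOLD_KB" ]
}

# Usage across hosts (hypothetical names; requires SSH keys for root):
#   for host in esx01 esx02; do
#     rss=$(ssh "root@$host" "ps waux | grep '[v]mware-hostd'" | awk '{print $6}')
#     needs_restart "$rss" && echo "$host: hostd at $((rss / 1024)) MB, restart it"
#   done
# For the RSS value from the log excerpt above (207184 KB):
needs_restart 207184 && echo "restart recommended"
```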