since we updated our ESX hosts from ESX 3.5 Update 2 to ESX 3.5 Update 3 (+ Dec 2008 patches) we experience an abnormally high RAM usage of the hostd management process on the hosts.
The symptoms are that hostd hits its default hard memory limit of 200 MB and shuts itself down. This is indicated by the following error message in /var/log/vmware/hostd.log:
[2009-01-06 06:25:47.707 'Memory checker' 19123120 error] Current value 207184 exceeds hard limit 204800. Shutting down process.
The service is automatically restarted, but as a consequence the VirtualCenter agent also restarts and the host becomes temporarily unresponsive in VirtualCenter (shown as "not responding" or "disconnected").
To fix the problem we increased the hard memory limit of hostd by following the recommendations we found here:
We increased the hard limit to 250 MB, but within several hours this limit was also reached on some hosts and the problem re-appeared. So, I suspect that there is some kind of memory leak in hostd, and we need to find its cause to finally solve the problem.
According to VMware support, all sorts of things might cause increased RAM usage in hostd, because many other processes/applications use it: the VirtualCenter agent, the HA agent, hardware management agents, and other external applications that use the web API to talk to the host.
We have several hosts that are all configured the same and belong to the same cluster in VirtualCenter. These hosts (let's call them type A) all show the problem. However, there is one host that is not part of the cluster, but is also managed by the same VirtualCenter instance, and this one does NOT show the problem (let's call it type B). I hope to find out the reason for hostd's increased RAM usage by comparing these two types of hosts:
Both types are installed with ESX 3.5 Update 3 (+ Dec 2008 patches) using the same automated build.
Both are HP hardware and have the HP Management agents version 8.1.1 installed.
Both servers are monitored by the same HP System Insight Manager server that queries the installed HP management agents.
Type A is an HP ProLiant DL585 (G1 or G2) with four dual-core AMD Opterons. Type B is an HP ProLiant DL360 G5 with two quad-core Intel Xeons.
Type A has SAN connections (using QLogic adapters) and uses two NFS datastores for ISO images. Type B uses only local hard disks for VM and ISO storage.
Type A is in a DRS- and HA-enabled cluster (with EVC enabled). Type B is stand-alone.
I'm trying to find the problem's cause by the process of elimination. I already disabled HA on the cluster, and this did NOT fix the problem. Now I stopped the HP agents on one host to see if it makes a difference (although I do not expect it since both types A and B have them running).
While I'm going down this path, I'd like to get some input from the community that might also lead to the cause of my problem:
Anyone out there that is also experiencing high hostd RAM usage? What is your hard limit, and is your configuration comparable?
Anyone out there with configurations comparable to type A, but NOT seeing this problem (I guess there are many ...)? What might be the difference causing the problem?
Any other helpful comments? Does anyone know a way to increase the debug level of hostd? (I also asked VMware support for this, but have not yet received an answer)
You can check RAM usage of hostd by using
ps waux | grep vmware-hostd
in the service console. It outputs something like
root 19285 0.4 8.1 81576 65488 ? S 07:44 0:37 /usr/lib/vmware/hostd/vmware-hostd /etc/vmware/hostd/config.xml -u
The sixth number (the RSS column, 65488 in this example) is the RAM usage in KB, so roughly 64 MB here. You can also check /var/log/vmware/hostd.log for messages like "... exceeds soft limit ..." (warning only) and "... exceeds hard limit ..." (will cause a service restart).
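If you want to track the trend rather than spot-check, a small helper can convert the RSS column to MB for you. This is only a sketch, assuming the standard ps and awk available in the ESX 3.5 service console; the sample line is the one quoted above.

```shell
#!/bin/sh
# Sketch: pull the RSS column (reported by ps in KB) from a `ps waux` line
# and print it in MB. Assumes standard ps/awk in the service console.
hostd_rss_mb() {
  # $1 is one full `ps waux` output line; field 6 is RSS in KB.
  echo "$1" | awk '{ printf "%d\n", $6 / 1024 }'
}

# The example line from above (RSS 65488 KB):
hostd_rss_mb "root 19285 0.4 8.1 81576 65488 ? S 07:44 0:37 /usr/lib/vmware/hostd/vmware-hostd /etc/vmware/hostd/config.xml -u"
# prints 63
```

Called from a cron job (or a loop with `sleep`) and redirected to a file, this gives you the slope of the growth, which makes it easier to correlate with events in hostd.log.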
I'll keep this thread updated with all the information I find out myself or receive from VMware support. Thank you for any contributions.
I am running ESX 3.5 U3 (build 130756) in a test environment; both of my ESX nodes show:
root 1659 0.6 25.0 158068 127916 ? S 2008 269:45 /usr/lib/vmware/hostd/vmware-hostd /etc/vmware/hostd/config.xml -u
root 1654 0.7 26.8 168132 137004 ? S 2008 339:59 /usr/lib/vmware/hostd/vmware-hostd /etc/vmware/hostd/config.xml -u
I am seeing the "Memory checker" warnings (since RSS is above 122880 KB, i.e. the 120 MB soft limit), but not the errors.
On another test environment (build 110181):
root 17925 0.6 15.3 148888 122780 ? S 2008 134:02 /usr/lib/vmware/hostd/vmware-hostd /etc/vmware/hostd/config.xml -u
root 28728 0.8 12.7 129868 101868 ? S 2008 124:05 /usr/lib/vmware/hostd/vmware-hostd /etc/vmware/hostd/config.xml -u
Here, too, I see the warnings but not the errors. The two environments are very different (one has an FC setup, the other parallel SCSI; one is a whitebox, the other supported Sun hardware), so I do not think it is a memory leak issue. But what is causing it? Also not very nice: the warnings pretty much fill up the logs.
Where have you seen similar problems? Do you have any more information on this?
Have you tried uninstalling the HP Insight Agents? It seems that each new version fixes one issue and creates a whole new issue...
Thank you for your answer.
Do you have HP hardware and the management agents 8.1.1 installed?
Same problem here. We have only Dell PowerEdge R900 servers in our ESX infrastructure, and we see the memory problem only on some of them (warnings only, not the error). We use the default configuration for the hard limit.
Bye and thanks,
I think the problem could be the HP hardware and the current bug with the CIM agent reporting a ProcHot warning to VMware Health Status. This discussion claims there is no fix for the bug yet: http://communities.vmware.com/thread/159063?tstart=0&start=0
I filed an SR on Monday to resolve the same issue you are seeing, where hostd hits a hard limit. Our VMware support person is suggesting that SCSI reservation errors may be partly to blame, because we have a lot of disk contention during our backups at night: this customer has many vRanger and vReplicator backups running at the same time. We have not yet cut back on the backups or Storage VMotioned VMs to new LUNs to resolve the contention issue, so I'm not sure if that is the problem or not.
Are there any HP Intel server people seeing this same bug? For that matter, is anyone seeing this bug who doesn't have HP AMD servers?
Will report back with what I find out...
I'm having this and other problems since upgrading to U3. I have the hosts not responding, VCB problems, and VMotion problems. Everything was solid with U2, and all the problems started with U3. I have this on HP DL580 G5 servers and DL460 blades. It seems like the hosts go to sleep: the first attempt at VMotion or VCB will fail, but from that point on everything works OK until I leave them alone for a while. I've been working on it in my lab: I increased the hostd memory limit to 512 MB, which did not fix it. I switched the COS and VMotion NICs from Broadcom to Intel, still no fix. The last thing I did was upgrade the HP agents to 8.1.1; this might have fixed it, but I'm not sure yet. I noticed in HP's release notes that 8.1.1 supports U3. So put me down as HP hardware, Intel CPUs, Emulex HBAs.
I've seen this problem in U3 and in previous releases. The first time I had this problem it was due to using the EMC Control Center software. It was constantly connecting to all of our ESX systems even though most of our systems weren't set up for it. We would get continual failed logins in hostd.log, and it seems that each failure would eat a little more memory that never got released, until hostd crashed and restarted. The second time I saw this was because a couple of VMs were stuck in a state of reconfiguring themselves over and over.
We are seeing the same problem. We are using DL580 G4/G5 hardware running ESX 3.5 U2 with the 8.1 agents. The vmware-hostd process hits the hard limit of 200 MB and stops.
The strange thing is that we have been running fine since Update 2; nothing has changed. One cluster is running at a 50:1 consolidation ratio (30:1 on the others), and that is the cluster suffering from the hostd problem. So I assume it has something to do with the number of VMs running on the box.
We are investigating the following workaround (and opened a support case for it):
Edit config.xml under /etc/vmware/hostd and add the relevant settings to the <config> section:
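The exact lines from the support case were not preserved in this post. For illustration only, the hostd memory limits are commonly raised with tags like the following; treat the tag names as an assumption and verify them against your own build's /etc/vmware/hostd/config.xml before editing:

```xml
<config>
  <!-- Hypothetical example: raise hostd's soft (warn) and hard (stop)
       memory limits. Verify tag names against your build's config.xml. -->
  <hostdWarnMemInMB> 250 </hostdWarnMemInMB>
  <hostdStopMemInMB> 300 </hostdStopMemInMB>
</config>
```

A restart of the management service (service mgmt-vmware restart) is needed for the change to take effect, and note that this only raises the ceiling; it does not address the underlying growth.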
I will keep you updated!
The host on which I stopped the HP agents looked good for the first four hours: hostd RAM usage stayed at about 80 to 90 MB. But overnight it climbed over 200 MB again. So, the HP agents are at least not the only reason for the increased RAM usage.
I also do not see a relation with the Hardware Health display in VirtualCenter: we have the ProcHot warning on our DL585 G2 servers, but not on the DL585 G1 servers. However, the hostd RAM usage problem occurs on both of them.
Thanks for this hint.
We also have queries from EMC Controlcenter to the hosts. The hosts are configured for these queries and have a local user that is used for them and is successfully logged in (and out).
However, I noticed in hostd.log that these queries sometimes take very long (> 2 hours!), and shortly before the user logs off again hostd RAM usage jumps up by about 40 MB! So, this is definitely impacting hostd RAM usage ...
The strange thing is that this memory is not released when the EMC user logs off. In fact, I have never seen hostd RAM usage going down again; it is ever-increasing ...
To me this looks like a leak in hostd that is not releasing RAM after certain types of operations.
... up to about 40. It is normal that hostd RAM usage increases with the number of VMs you run on the host.
However, the problem I described also occurred on one host while it was running only 1 (one) VM.
We finally identified the trigger for the increased RAM usage of hostd: it's not the HPIM agent, but the queries of the EMC Controlcenter agent software.
No software is installed for this inside the service consoles; we just created a user with administrative rights on each ESX host that is used for remote logins by EMC Controlcenter.
This morning at 4:00 AM the user logged in, ran its queries (which obviously involve browsing all datastores), and logged out at around 6:15 AM. During this time hostd RAM usage climbed from 90 MB to 212 MB. Before and after the user's login, hostd RAM usage stayed nearly unchanged.
Please note that this problem did NOT occur with ESX 3.5 Update 2, although we already used EMC Controlcenter then. It started with our upgrade to ESX 3.5 Update 3, and the EMC Controlcenter software was not changed recently.
You can find more information on EMC Controlcenter here:
We have suspended the EMC Controlcenter queries for now, thus eliminating the cause of the problem, and I asked VMware support how to proceed with this issue.
I will post any more information here. For now the bottom line is: Don't let EMC Controlcenter query your ESX hosts if they are on 3.5 Update 3.
We are also having the same issue, yet we don't have the EMC Control Center software querying our ESX hosts. While those queries probably consume memory, they are certainly not the only reason people are seeing the soft/hard limits get hit.
We noticed something interesting in our test and production environments. Every time we VMotion a new virtual machine (i.e. one that did not previously run on that particular host) onto a host, the hostd process memory usage increases by 1 MB, sometimes 2 MB. When we move the virtual machine away, the memory is not released. In our production clusters (and we have several of those) we have way over 200 VMs, so in combination with heavy load and DRS it is very possible that a lot of virtual machines visit many hosts... leading to hitting the hostd memory barrier.
It is not clear to us why there would be such a thing as a "memory hard limit". Is this to disguise a memory leak? Just let the process' memory usage grow, and when it gets too large, kill it to start over again? We did notice that on a restart the hostd memory usage climbs up to 70 MB (as a starting value) plus 1 MB (sometimes 2 MB) per VM running on that host... thus clearing the "history memory usage" of previously visited machines for the hostd process.
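A quick back-of-the-envelope check of these numbers shows how easily the default limit is reached; the figures are assumptions taken from the observations above, not measured values:

```shell
#!/bin/sh
# Assumed figures: ~70 MB hostd baseline after a restart, plus 1-2 MB for
# every VM that ever runs on (or visits) the host.
BASELINE_MB=70
VMS_VISITED=200   # VMs that DRS may eventually migrate onto one host

LOW=$((BASELINE_MB + VMS_VISITED * 1))
HIGH=$((BASELINE_MB + VMS_VISITED * 2))
echo "expected hostd RSS: ${LOW}-${HIGH} MB (default hard limit: 200 MB)"
# prints: expected hostd RSS: 270-470 MB (default hard limit: 200 MB)
```

So even at the optimistic 1 MB per VM, a busy DRS cluster blows well past the 200 MB hard limit long before all VMs have visited every host.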
That is a workaround that we are implementing now, but unfortunately, sometimes the hostd process refuses to start again, leading to multiple hostd processes running on the ESX host... which goes totally bananas at that point. We have a SR currently open but there is not really much progress there.
You are right. It is certainly not the EMC agent that originally causes the problem, since it only sends read queries to the host (via the SDK and CIM) that can be considered "legal".
The problem is hostd not correctly releasing memory again. We also noticed that hostd RAM usage climbs up with every new VM on the host, but does not drop anymore, even if we put the host into maintenance mode and move all VMs away from it. So, this *is* a memory leak in hostd that can probably be triggered by other operations as well.
One reason why VMware has implemented soft and hard memory limits for processes (it's not only hostd, but also the VirtualCenter agent process that has such limits) is probably security: DoS attacks typically lead to increased memory usage of the attacked processes, and shutting them down would stop the attack.
It is interesting though that in ESXi the hostd memory check is apparently disabled by default. If you look at /etc/vmware/hostd/config.xml on an ESXi host you will find the following interesting lines there:
<!-- Frequency of memory checker -->
<!-- Disabled pending resolution of stack size issue -->
<memoryCheckerTimeInSecs> 0 </memoryCheckerTimeInSecs>
I have not tried to use these lines on an ESX host, and I don't know if the comment (stack size issue?) also applies to ESX ...