Dear all,
our ESXi environment is facing a rather annoying problem. After an indeterminate uptime (around one week), the hosts no longer respond to SSH/HTTP(S) requests, rendering them unmanageable. Looking at /var/log/messages, errors like
May 4 10:07:36 vmkernel: 4:00:30:58.966 cpu1:4808)WARNING: Tcpip_Socket: 1619: socreate(type=1, proto=6) failed with error 55
or
May 4 10:07:36 vmkernel: 4:00:30:58.966 cpu1:4808)WARNING: VMKStateLogger: 7723: Can't create accept socket: Out of resources
appear around the time each request is sent to the host. Issuing a ping from the CLI produces the errors below:
# ping 10.0.0.1
socket() returns -1 (Cannot allocate memory)
# ping foo.bar
getaddrinfo() for "foo.bar" failed (Cannot allocate memory)
VM connectivity is not affected and the hosts are still responding to ICMP echo requests.
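This split symptom (ICMP answered, socket-based services dead) can be watched from any box on the management LAN. A minimal sketch that tells the two apart; the management IP is hypothetical, and the `-W` flag assumes Linux iputils ping:

```python
import socket
import subprocess

HOST = "10.0.0.1"  # hypothetical management IP; substitute your own host


def icmp_reachable(host: str) -> bool:
    """ICMP echo is still answered even when the host can't allocate sockets."""
    return subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
    ).returncode == 0


def tcp_reachable(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """hostd listens on 443; a refused or timed-out connect hints at the failure."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Logging both results once a minute should show exactly when the host slides from "fully manageable" into "ICMP only".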
Please note that although hardware, ESXi build and network configuration differ on each host (see the bottom section of this post), the exact same problem occurs on every host.
I have recently added a dedicated NIC and vSwitch for the Management Network, so I am confident in ruling out cabling as the culprit (the problem already occurred before this change, when the Management Network and VM Network were on the same NIC/vSwitch). The switch itself also shows no indication of a general network error (received/sent errors: zero).
The problem and symptoms are similar to these posts:
"ESXi 4.0 U1 - Management network unstable after a few days - help requested" - http://communities.vmware.com/message/1508649#1508649 (Broadcom driver issue, not applicable to my scenario because I am on the e1000 driver, which should be up to date if I am not mistaken: 8.0.3.1-NAPI)
"Suspend VMs on a sick ESXi?" - http://communities.vmware.com/message/1534725#1534725 (no answer)
I know the hardware listed here is not 100% HCL-compatible, but I would really like to advocate VMware and ESXi in my company, so I hope this is one "simple" configuration mistake somewhere.
Thanks a lot for your time, any help is really appreciated.
Thanks and best regards,
1bM
Hardware, Build and Network configuration information:
---
Host #1
ESXi Build 208167
Custom-Built
Supermicro X7SB4/E Mainboard
1x Intel Xeon X3220 (2.4 GHz)
8GB RAM
ESXi installed on USB Stick
NICs
vmnic0:
Connected to VLAN A on Physical Switch
PRO/1000 MT Desktop Adapter
vSwitch3 configured for Virtual Machine Network #1
vmnic1:
Connected to VLAN A on Physical Switch
82573E Gigabit Ethernet Controller
vSwitch0 configured for Management Network
vmnic2:
Connected to VLAN C on Physical Switch
82573L Gigabit Ethernet Controller
vSwitch2 configured for Virtual Machine Network #2
vmnic3:
Connected to VLAN B on Physical Switch
PRO/1000 PT Server Adapter
vSwitch1 configured for iSCSI Network
Host #2
ESXi Build 244038
Custom-Built
Supermicro X7SB4/E Mainboard
1x Intel Xeon X3220 (2.4 GHz)
8GB RAM
ESXi installed on USB Stick
NICs
vmnic0:
Connected to VLAN A on Physical Switch
Gigabit CT Desktop Adapter
vSwitch0 configured for Management Network
vmnic1:
Connected to VLAN B on Physical Switch
PRO/1000 PT Server Adapter
vSwitch1 configured for iSCSI Network
vmnic2:
Connected to VLAN A on Physical Switch
82573E Gigabit Ethernet Controller
vSwitch3 configured for Virtual Machine Network #1
vmnic3:
Connected to VLAN C on Physical Switch
82573L Gigabit Ethernet Controller
vSwitch2 configured for Virtual Machine Network #2
Host #3 (Testing only!)
ESXi Build 244038
Custom-Built
Gigabyte F15B Mainboard
AMD Athlon 64X2 3800
3 GB RAM
ESXi installed on SATA Disk
NICs
vmnic0:
Connected to VLAN A on Physical Switch
PRO/1000 MT Desktop Adapter
vSwitch0 configured for Management Network and Virtual Machine Network
vmnic1:
Connected to VLAN B on Physical Switch
PRO/1000 GT Desktop Adapter
vSwitch1 configured for iSCSI Network
---
Hi - I'm the guy from this post: http://communities.vmware.com/message/1508649#1508649.
Unfortunately, I'm still experiencing the exact same issue as you report. For a while I was convinced the issue was solved by updating the bnx2 drivers. Then the issue returned and I disabled the CIM agents (suspecting a memory leak there). That appeared to work for a while, and then the issue returned once again.
However, I have one observation that might be useful:
We're running two (identical) servers. It appears the issue only occurs when both ESXi hosts are up (in the sense that vSphere can connect). This chain of events leads me to believe that:
- both servers were inaccessible due to the issue at hand (VMs running fine, but vSphere couldn't connect)
- I disabled the CIM agents on one server and rebooted it
- this worked for weeks; vSphere was able to connect to that server (and not the other, which was still to be fixed)
- I became confident this had solved the issue and implemented it on the other server as well
- rebooted the second server
- vSphere was able to connect to both servers
- a few days later, both servers were inaccessible through vSphere again... same story with 'cannot allocate memory'
Hope this helps; I've been struggling with this recurring issue for weeks now.
Hi Thijs,
thanks for your feedback. Sorry to hear you are still affected by this issue 😐
Regards/Groetjes,
1bM
Hi 1bitMemory,
Did you consider a DDoS attack? We noticed a high spike in network traffic on the day the servers became unmanageable. Could this explain the problem? Looking at the logs, it seems there is a (memory?) problem 'somewhere' in the network stack.
Hi Thijs,
well, none of the hosts are exposed to the internet, and I doubt a DDoS would come from the internal LAN. In any case, there are no network spikes when the hosts stop responding.
Some other observations:
The hosts do not enter the "faulty" state all at once; monitoring and logs show a "grey" area in which the services encounter errors, restart (automatically) and respond again, only to fail for good at some point.
I am searching for a similarity between the mentioned hosts that could cause this behaviour. For instance, all of them are managed through a vCenter trial and are monitored by Veeam Monitor as well as a custom Munin/Nagios vCLI Perl script. (Maybe one of these is triggering the memory leak? Honestly out of ideas.)
Does any of the above ring a bell? In the way that your environment is similar?
Our setup involves two identical Dell R210 servers; both experience the same issue (presumably at the same time). We don't use vCenter, only the vSphere Client to manage the hosts directly. No other monitoring tools are currently in use. In fact, when the issue first occurred (a few days after production deployment), the setup was very straightforward. Later on, I enabled SSH access and did some monitoring using esxtop etc. to try to diagnose the problem.
Regarding our little DDoS theory... We did see a lot of incoming traffic on the management network interface, but at the same time we noticed a similar amount of outgoing traffic on the NIC used by one of the VMs... Don't have an explanation for that yet.
As far as we can see now, it looks like a memory leak or some other resource exhaustion (full buffers, or all sockets in use) somewhere in the network stack. No clue about the cause, unfortunately.
Hi 1bitMemory,
Not sure whether it's relevant, but you mentioned you got the following errors:
ping 10.0.0.1
socket() returns -1 (Cannot allocate memory)
ping foo.bar
getaddrinfo() for "foo.bar" failed (Cannot allocate memory)
I get exactly the same; however, it happens intermittently (after a host becomes unreachable). For instance, the first ping succeeds, then one fails, and then two pings succeed again...
Hi Thijs,
have you read the latest patch notes?
ESXi400-201005001, released 27 May 2010
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=102104
ESXi 4.0 hosts might stop responding when interrupts are shared between VMkernel and service console. You might also observe the following additional symptoms:
Network pings to the ESXi hosts might fail.
Sounds a little bit like our problem...
Hi 1bM,
Thanks for the tip! I've applied the patch (there is also a new version of VMware Tools available) to my lab box (which was not having the problem). However, for the production environments I think we'll wait for the CIM patch you mentioned. We have to limit the number of times the servers are brought down.
Cheers,
Thijs
Update:
Last week I applied all available patches (including 4.0 U2, released 2010-06-10) to one of our problematic hosts. I checked five days later: the host is, once again, unreachable despite the patches. Does anybody have any idea what's going on?
Any updates on this issue? I'm getting the same on ESXi 4.1 at the latest patch level; it occurs every two months.
Just for the record: in our environment this was caused by the monitoring system checking the SSH daemon.
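For what it's worth, a check like that can be made much gentler by caching its result instead of opening a fresh TCP connection to sshd on every poll. A hypothetical sketch (the interval and function names are illustrative, not taken from any monitoring product in this thread):

```python
import socket
import time

CHECK_INTERVAL = 600  # seconds between real probes; tune to your tolerance
_last = {"t": 0.0, "ok": False}  # result cache shared across polls


def ssh_banner_ok(host, port=22, timeout=3.0):
    """One short-lived TCP connect that reads the SSH banner, then closes."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            return s.recv(4).startswith(b"SSH")
    except OSError:
        return False


def cached_ssh_check(host, probe=ssh_banner_ok, now=time.monotonic):
    """Return the cached result unless CHECK_INTERVAL has elapsed since the
    last real probe, so the monitored host sees far fewer new connections."""
    t = now()
    if t - _last["t"] >= CHECK_INTERVAL:
        _last["t"] = t
        _last["ok"] = probe(host)
    return _last["ok"]
```

With something like this the monitor opens one socket every ten minutes instead of one per poll, which should keep it from slowly exhausting the host's socket resources.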