OK, I'm running ESXi 4 build 244038. I have the following VMs installed: Ubuntu, Windows Server 2003, 2x Windows Server 2008 R2, and Windows 7 x64.
Randomly throughout the day, the two Server 2008 R2 machines and the Windows 7 machine stop responding. I have been running a continuous ping against all three machines for a couple of days now, and they all stop responding at exactly the same time. While this is happening they cannot be accessed through the vSphere console either. The Server 2003 and Ubuntu machines continue to function normally.
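For anyone wanting to confirm the "exact same time" part from ping logs rather than by eye, here is a rough sketch. It assumes each ping log has already been reduced to (timestamp, replied) pairs; the host names and sample data are made up for illustration and are not from my environment.

```python
def failed_seconds(samples):
    """Return the set of timestamps at which a ping got no reply."""
    return {ts for ts, replied in samples if not replied}


def common_outages(ping_logs):
    """Return the sorted timestamps at which every host failed at once.

    ping_logs maps a host name to a list of (timestamp, replied) pairs.
    """
    failed = [failed_seconds(samples) for samples in ping_logs.values()]
    if not failed:
        return []
    return sorted(set.intersection(*failed))


# Hypothetical data: timestamps are seconds since the start of the test.
example_logs = {
    "win2008r2-a": [(1, True), (2, False), (3, False), (4, True)],
    "win2008r2-b": [(1, True), (2, False), (3, False), (4, True)],
    "win7-x64": [(1, True), (2, False), (3, False), (4, False)],
}

print(common_outages(example_logs))
```

If the shared list is nearly as long as each host's own failure list, the outages are simultaneous, which points at the host rather than at any single guest.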
I've replaced the SVGA driver on the three problem machines, changed the NIC from E1000 to VMXNET 3 (I tried VMXNET 2 as well on the Server 2008 machines), tried with and without VMware Tools installed, uninstalled all AV software, turned off QoS and IPv6, changed IPs, etc.
The machines will usually start responding on their own after a few minutes, but occasionally they lock up completely. When that happens I cannot power off the VMs and must reboot the ESXi server to regain functionality...
Would really appreciate some help on this, as I've been researching and troubleshooting for several days now. If more information is required, please let me know!
Update: If I log into the console and run "vim-cmd vmsvc/get.guestheartbeatStatus" for the problem VMs, they come back as red during the outages. If they recover on their own, the status changes to yellow.
Check the messages/hostd/vpxa logs on the host. I've seen cases where VMs dropped pings because the SAN they were on was having a problem (slow or busy). Also make sure the management, vMotion and VM networks are separated, either physically or with VLANs. Some guest OSes handle hiccups better than others, so don't assume the problem isn't affecting the whole environment just because the VMs don't all react the same way.
Hope that helps
Hanna
Hanna, thanks for the advice! I don't think this was the solution to my problem, but it did point me in a very promising direction. Going through the logs I found several entries like: (Command 0x0 to device "mpx.vmhba0:C0:T0:L0" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.). A quick search on this turned up some useful references.
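In case it helps someone else dig through the same logs, here is a minimal sketch for counting these entries per device. The regular expression is written only against the line format quoted above, and the function name is my own; other builds may log the message differently.

```python
import re
from collections import Counter

# Matches entries shaped like the one quoted above. As far as I understand,
# H, D and P are the host, device and plugin status codes.
SCSI_FAIL = re.compile(
    r'Command (?P<cmd>0x[0-9a-fA-F]+) to device "(?P<device>[^"]+)" failed '
    r'H:(?P<host>0x[0-9a-fA-F]+) D:(?P<dev>0x[0-9a-fA-F]+) '
    r'P:(?P<plugin>0x[0-9a-fA-F]+)'
)


def scsi_failures(lines):
    """Count failed commands per (device, host status) pair."""
    counts = Counter()
    for line in lines:
        match = SCSI_FAIL.search(line)
        if match:
            counts[(match.group("device"), match.group("host"))] += 1
    return counts


sample = [
    '(Command 0x0 to device "mpx.vmhba0:C0:T0:L0" failed H:0x5 D:0x0 P:0x0 '
    'Possible sense data: 0x0 0x0 0x0.)',
    'some unrelated log line',
]

for (device, host_status), count in scsi_failures(sample).items():
    print(device, host_status, count)
```

A device that dominates the counts is the one to look at first; in my case it was the host's CD-ROM device.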
I have all of the machines up and running right now with no problems so far. The three VMs were all set to use vmhba0:C0:T0:L0 on the host as their CD-ROM. I changed the setting so that the drive is not connected at power on, and so far so good. I'll look into applying the firmware update once I see how successful this is!
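To check other VMs for the same setting without clicking through each one, something like the sketch below could scan the text of a .vmx file. This rests on assumptions: I am assuming the usual .vmx keys (deviceType, present, startConnected), that "atapi-cdrom" and "cdrom-raw" mean a host device, and that a missing startConnected key means the drive connects at power on. The sample text is made up.

```python
import re

VMX_LINE = re.compile(r'^\s*([\w:.]+)\s*=\s*"(.*)"\s*$')

# Assumed device types for a CD-ROM backed by a physical host drive.
HOST_CDROM_TYPES = ("atapi-cdrom", "cdrom-raw")


def parse_vmx(text):
    """Parse key = "value" lines into a dict with lowercased keys."""
    settings = {}
    for line in text.splitlines():
        match = VMX_LINE.match(line)
        if match:
            settings[match.group(1).lower()] = match.group(2)
    return settings


def host_cdroms_connected_at_boot(settings):
    """Return device prefixes of host CD-ROMs that connect at power on."""
    risky = []
    suffix = ".devicetype"
    for key, value in settings.items():
        if key.endswith(suffix) and value.lower() in HOST_CDROM_TYPES:
            prefix = key[: -len(suffix)]
            present = settings.get(prefix + ".present", "FALSE").upper() == "TRUE"
            # Assumption: absent startConnected is treated as TRUE.
            start = settings.get(prefix + ".startconnected", "TRUE").upper() == "TRUE"
            if present and start:
                risky.append(prefix)
    return sorted(risky)


# Hypothetical .vmx fragment for illustration only.
sample_vmx = '''
ide1:0.present = "TRUE"
ide1:0.deviceType = "atapi-cdrom"
ide1:0.fileName = "/vmfs/devices/cdrom/mpx.vmhba0:C0:T0:L0"
ide1:0.startConnected = "TRUE"
'''

print(host_cdroms_connected_at_boot(parse_vmx(sample_vmx)))
```

Any prefix it reports is a drive worth switching to "not connected at power on" in the VM settings.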
Again, thanks for your help! I'll leave this open for 24 hours to make sure the problem is resolved and then I'll award you the answer.
Update: Everything continues to run flawlessly! 0% packet loss over a 24-hour period. Thanks again, Hanna, for pointing me in the right direction!