Hardware:
Intel SR2600UR 2U Xeon (Intel S5520UR motherboard)
2 Intel Xeon X5650s
72 GB of RAM (8*6 + 4*6)
SRCSATAWB Raid 4Ch SATA PCIe w/ 8 ports Low Profile - with battery backup
5 WD1003FBYX (RE4 7200 RPM SATA 64 MB Cache Enterprise drives)
The motherboard and RAID card have both been on the HCL since before ESXi 5.
All firmware updates have been performed.
Symptoms/Things I've tried:
At random times, the whole system becomes unreachable over the network, both the physical host and the virtual machines.
If I try to connect with the vSphere client, it will sometimes connect, but it will not accept any commands at all (they don't even show up in the recent tasks list as failed).
I'm unable to RDP, telnet, or ping any of the hosts.
I have some systems running LogMeIn, and they bounce on and offline, but attempting to open a connection to them gives the error "The host went offline" and it never actually opens a connection even though it still shows the host as being online.
It seems to happen mostly at night or in the early morning. There are no backup jobs running during these times, and nothing heavy on network or disk I/O that I know of, but there are an Exchange server and a SharePoint server that do their typical maintenance tasks at night.
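To pin down exactly when the outages start and end, I could run a small reachability probe from a machine outside this box. This is only a rough sketch: the function names, the log file name, and the polling interval are my own placeholders, and it assumes the probing machine has a network path to the ESXi management port (443) or a guest's RDP port (3389).

```python
import socket
import time
from datetime import datetime


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


def status_line(when: datetime, host: str, port: int, up: bool) -> str:
    """Format one timestamped result so outages can be lined up with other logs."""
    state = "UP" if up else "DOWN"
    return f"{when.isoformat(timespec='seconds')} {host}:{port} {state}"


def monitor(targets, interval: int = 60, log_path: str = "reachability.log"):
    """Probe every (host, port) pair forever, appending one line per probe.

    The interval and log_path defaults are arbitrary placeholders.
    """
    while True:
        now = datetime.now()
        with open(log_path, "a", encoding="utf-8") as log:
            for host, port in targets:
                up = is_reachable(host, port)
                log.write(status_line(now, host, port, up) + "\n")
        time.sleep(interval)
```

Calling something like `monitor([("192.0.2.10", 443), ("192.0.2.11", 3389)])` (placeholder addresses) would leave a timestamped record of when the host and a guest stop answering, which I could then compare against the Exchange and SharePoint maintenance windows.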
I haven't always been on-site when this has happened, but the one time I was, even attempting a reboot from the console just hung at the "Restarting" screen for over an hour and a half, so I'm always forced to power cycle the server.
I've tried looking through the different logs on the console, but since I'm not really familiar with them, I can't tell what I should be looking at or for.
This was also a problem when it was an ESXi 4.1 server. I was hoping the upgrade to 5.0 would fix it, but it did not, and it seems to be occurring more frequently (it has happened about 6-8 times so far).
As of now I'm leaning towards networking and could really use some help either pinpointing the problem or just getting a better idea of how and what I can monitor. I was thinking of setting up a syslog server (it would have to be a VM, since this is the only physical box I own in the datacenter), because every time I reboot we seem to lose all of the useful logs, and they start fresh at the restart.
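For the syslog idea, something as small as the following might be enough to keep a copy of the logs off the host. It is a minimal sketch, not a hardened collector: it assumes the ESXi host is configured to forward its logs over UDP to this machine (as far as I know, via the `Syslog.global.logHost` advanced setting), and the listen address, log file name, and function names are placeholders I made up.

```python
import socketserver
from datetime import datetime

# Placeholder settings: UDP 514 is the standard syslog port (binding to it
# usually needs elevated privileges); the log path is arbitrary.
LISTEN_ADDR = ("0.0.0.0", 514)
LOG_PATH = "esxi-syslog.log"

# Severity names in numeric order (0-7), as defined by the syslog protocol.
SEVERITIES = ["emerg", "alert", "crit", "err",
              "warning", "notice", "info", "debug"]


def parse_priority(raw: bytes):
    """Split '<PRI>message' into (facility, severity, message).

    Returns (None, None, text) when no valid priority header is present.
    """
    text = raw.decode("utf-8", errors="replace").rstrip("\r\n\x00")
    if text.startswith("<") and ">" in text[:5]:
        end = text.index(">")
        pri = text[1:end]
        if pri.isdigit():
            value = int(pri)
            # PRI encodes facility * 8 + severity.
            return value // 8, value % 8, text[end + 1:]
    return None, None, text


def format_entry(sender: str, raw: bytes, received: datetime) -> str:
    """Build one log line: receive time, sender address, severity, message."""
    _facility, severity, message = parse_priority(raw)
    level = SEVERITIES[severity] if severity is not None else "unknown"
    return f"{received.isoformat(timespec='seconds')} {sender} [{level}] {message}"


class SyslogHandler(socketserver.BaseRequestHandler):
    def handle(self):
        data = self.request[0]  # for UDP servers, request is (data, socket)
        line = format_entry(self.client_address[0], data, datetime.now())
        # Open and close per datagram so lines reach disk even if this
        # process is killed abruptly.
        with open(LOG_PATH, "a", encoding="utf-8") as log:
            log.write(line + "\n")


def serve(address=LISTEN_ADDR):
    """Receive syslog datagrams until interrupted."""
    with socketserver.UDPServer(address, SyslogHandler) as server:
        server.serve_forever()
```

Running `serve()` on a separate machine would keep the log lines leading up to a hang available after the power cycle. A receiver running as a VM on the same host would go down with it, so a separate box is the safer place for it.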
Any help is appreciated. This is really becoming a huge problem: it took out a vmdk for the SharePoint server today, so now I'm working on fixing that.
So last night I set up a syslog server on a different physical box. The log seems to have a lot of this:
Is this typical?