VMware Cloud Community
Littler888
Contributor

Physical Host Networking Drops Out

Hardware:

Intel SR2600UR 2U Xeon (Intel S5520UR motherboard)

2 Intel Xeon x5650's

72 GB of RAM (8*6 + 4*6)

SRCSATAWB Raid 4Ch SATA PCIe w/ 8 ports Low Profile - with battery backup

5 WD1003FBYX (RE4 7200 RPM SATA 64 MB Cache Enterprise drives)

The motherboard and RAID card have both been on the HCL since before ESXi 5.

All firmware updates have been performed.

Symptoms/Things I've tried:

At random times, the whole system becomes unreachable over the network, both the physical host and the virtual hosts.

If I try to connect with the vSphere client, it will sometimes connect, but it will not accept any commands at all (they don't even show up in the recent tasks list as failed).

I'm unable to RDP, telnet, or ping any of the hosts.

I have some systems running LogMeIn, and they bounce on and offline, but attempting to open a connection to them gives the error "The host went offline" and it never actually opens a connection even though it still shows the host as being online.

It mostly seems to happen at night or in the early morning. There are no backup jobs running during these times, and nothing heavy on network or disk I/O that I know of, but there are an Exchange server and a SharePoint server doing their typical maintenance tasks at night.

I haven't always been on-site when this has happened, but when I was, even using the console to attempt a reboot just hung at the "Restarting" screen for over an hour and a half, so I'm always forced to power cycle the server.

I've tried looking through the different logs on the console, but I'm not really familiar with them, so I can't tell what I should be looking at or for.

This was also a problem when it was an ESXi 4.1 server. I was hoping the upgrade to 5.0 would fix it, but it did not, and it seems to be occurring more frequently (it has happened about 6-8 times so far).

As of now I'm leaning towards networking and could really use some help either pinpointing the problem or just getting a better idea of how and what I can monitor. I was thinking of setting up a syslog server (it would have to be a VM, since this is the only physical box I own in the datacenter), because every time I reboot we seem to lose all of the useful logs and they start fresh at the restart.
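One way to pin down exactly when the dropouts start and end is to probe the host from a separate machine and record every state change. The following is only a minimal sketch under assumptions: it treats "TCP port 443 accepts a connection" as "reachable", and the function names, the 30-second interval, and the log file name are all made up for illustration. Since the vSphere client sometimes still connects during an outage, a successful probe does not prove the host is healthy.

```python
import socket
import time
from datetime import datetime, timezone


def is_reachable(host, port=443, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


def outage_windows(samples):
    """Collapse (timestamp, reachable) samples into (start, end) outage windows.

    An outage still in progress at the last sample is reported with end=None.
    """
    windows = []
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts                      # first failed probe of an outage
        elif ok and start is not None:
            windows.append((start, ts))     # first good probe closes it
            start = None
    if start is not None:
        windows.append((start, None))
    return windows


def watch(host, interval=30, log_path="reachability.log"):
    """Probe host forever, appending one line per UP/DOWN change to log_path."""
    last = None
    while True:
        ok = is_reachable(host)
        if ok != last:
            stamp = datetime.now(timezone.utc).isoformat()
            with open(log_path, "a") as fh:
                fh.write(f"{stamp} {host} {'UP' if ok else 'DOWN'}\n")
            last = ok
        time.sleep(interval)
```

Running `watch("192.0.2.10")` (a placeholder address) on a box outside the ESXi host would leave a timestamped UP/DOWN trail that survives the power cycle and can be lined up against the nightly maintenance windows.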

Any help is appreciated. This is really becoming a huge problem: it took out a vmdk for the SharePoint server today, so now I'm working on fixing that.

1 Reply
Littler888
Contributor

So last night I set up a syslog server on a different physical box. The log seems to have a lot of this:

<166>1 2011-12-13T12:17:33.587Z SPHPV01 Hostd - - - Hostd: [37D37B90 error 'SoapAdapter.HTTPService'] HTTP Transaction failed on stream TCP(local=127.0.0.1:0, peer=127.0.0.1:55967) with error N7Vmacore15SystemExceptionE(Connection reset by peer)
<166>1 2011-12-13T12:17:40.027Z SPHPV01 Hostd - - - Hostd: [379EEB90 verbose 'Cimsvc'] Ticket issued for CIMOM version 1.0, user root
<166>1 2011-12-13T12:19:10.377Z SPHPV01 Hostd - - - Hostd: [384C6B90 verbose 'Cimsvc'] Ticket issued for CIMOM version 1.0, user root
<166>1 2011-12-13T12:20:40.727Z SPHPV01 Hostd - - - Hostd: [37E64B90 verbose 'Cimsvc'] Ticket issued for CIMOM version 1.0, user root
<166>1 2011-12-13T12:20:57.996Z SPHPV01 Hostd - - - Hostd: [38381B90 verbose 'DvsManager'] PersistAllDvsInfo called
<166>1 2011-12-13T12:22:11.074Z SPHPV01 Hostd - - - Hostd: [37D78B90 verbose 'Cimsvc'] Ticket issued for CIMOM version 1.0, user root
<166>1 2011-12-13T12:22:34.019Z SPHPV01 Hostd - - - Hostd: [38041B90 verbose 'Proxysvc Req01075'] New proxy client TCP(local=127.0.0.1:80, peer=127.0.0.1:59177)
<166>1 2011-12-13T12:22:34.020Z SPHPV01 Hostd - - - Hostd: [FFC81B90 info 'Vmomi'] Activation [N5Vmomi10ActivationE:0xa3bfc50] : Invoke done [waitForUpdates] on [vmodl.query.PropertyCollector:ha-property-collector]
<166>1 2011-12-13T12:22:34.021Z SPHPV01 Hostd - - - Hostd: [FFC81B90 verbose 'Vmomi'] Arg version:
<166>1 2011-12-13T12:22:34.021Z SPHPV01 Hostd - - - Hostd: --> "195"
<166>1 2011-12-13T12:22:34.021Z SPHPV01 Hostd - - - Hostd: [FFC81B90 info 'Vmomi'] Throw vmodl.fault.RequestCanceled
<166>1 2011-12-13T12:22:34.021Z SPHPV01 Hostd - - - Hostd: [FFC81B90 info 'Vmomi'] Result:
<166>1 2011-12-13T12:22:34.021Z SPHPV01 Hostd - - - Hostd: --> (vmodl.fault.RequestCanceled) {
<166>1 2011-12-13T12:22:34.021Z SPHPV01 Hostd - - - Hostd: -->    dynamicType = <unset>,
<166>1 2011-12-13T12:22:34.021Z SPHPV01 Hostd - - - Hostd: -->    faultCause = (vmodl.MethodFault) null,
<166>1 2011-12-13T12:22:34.021Z SPHPV01 Hostd - - - Hostd: -->    msg = "",
<166>1 2011-12-13T12:22:34.021Z SPHPV01 Hostd - - - Hostd: --> }

Is this typical?
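To get a feel for what is "typical", it can help to count the collected lines by hostd level and component and see which ones spike before a dropout. Below is a rough sketch, assuming every line follows the layout of the excerpt above; the regex and the function names are illustrative, not an official format definition. Note that every line in the excerpt carries the same syslog priority, <166> (facility 20, severity 6), even the one hostd tags as "error", so the level inside the brackets is the more useful field to group on.

```python
import re
from collections import Counter

# Pattern derived only from the excerpt above; "-->" continuation lines
# carry no [thread level 'component'] tag and are skipped on purpose.
LINE_RE = re.compile(
    r"^<(?P<pri>\d+)>\d+ (?P<ts>\S+) (?P<host>\S+) (?P<app>\S+) \S+ \S+ \S+ "
    r"\S+: \[(?P<thread>[0-9A-F]+) (?P<level>\w+) '(?P<component>[^']+)'\] "
    r"(?P<msg>.*)$"
)


def decode_pri(pri):
    """Split a syslog PRI value into (facility, severity)."""
    return pri // 8, pri % 8


def summarize(lines):
    """Count tagged log lines per (hostd level, component) pair."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            counts[(m.group("level"), m.group("component"))] += 1
    return counts
```

Feeding the saved log through `summarize(open("esxi.log"))` (the file name is a placeholder) and printing `most_common()` for the hours around an outage versus a quiet period would show whether the SoapAdapter errors are background noise or actually cluster around the dropouts.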
