ESX4i Solaris 10 e1000g lockup issue

RHack · ‎05-06-2010

Hi,

we installed 3 guest OSes (2x app server running JBoss, 1x Oracle server) from the same Solaris 10 ISO installation disk and patched them with the latest patch cluster. The host is a DL380 G5 with 2x quad-core 3GHz CPUs, 24GB memory and a full disk set. There is also a Debian server and a Windows 2003 server on the same host. All servers with the exception of the JBoss server run without any network issue. I am running the latest version of ESX4i downloaded from this very site and have installed the VMware Tools successfully.

One app server is currently idle awaiting deployment; the other one we deployed for testing. Within minutes of booting up the test application server the network interface locks up. Snoop on the interface shows incoming packets but no outgoing traffic. Unplumb, plumb, svcadm restart network/physical brings the network interface back to life. However, anywhere between 10 and 30 minutes later the same issue happens again: the interface seizes up and snoop shows only incoming traffic.

We logged a whole session using snoop into a file and there appears to be no traffic issue up until the interface freezes. After freezing, the interface shows as "up" in dladm and shows no collisions or other errors in netstat; however "kstat -m |grep Err" shows 'Rx size error' increasing by one after every freeze. netstat -nr does not show any changes before and after. We set sq_max_size to 0 (as there historically was an issue with the STREAMS QFULL signal on this interface). We fixed the speed to 100FDX. We also cleared the ARP table entries, which then stay empty, presumably because ARP requests are not getting out.
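To confirm which counters move across a freeze, two kstat snapshots can be compared. The following is a minimal Python sketch, assuming the snapshots were saved from `kstat -p` (parseable output: one `module:instance:name:statistic` key and a value per line, separated by a tab). The sample lines and the helper names `parse_kstat` and `increased_counters` are hypothetical and only there for illustration; the real statistic names on a given system may differ.

```python
def parse_kstat(text):
    """Parse `kstat -p`-style lines into {key: int}; non-numeric values are skipped."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.strip().rpartition("\t")
        if not sep:
            continue  # not a key<TAB>value line
        try:
            counters[key] = int(value)
        except ValueError:
            continue  # string-valued statistics are ignored
    return counters


def increased_counters(before, after):
    """Return {key: delta} for every counter that grew between two snapshots."""
    return {
        key: after[key] - old
        for key, old in before.items()
        if key in after and after[key] > old
    }


# Hypothetical snapshots; the key layout is illustrative, not copied from a real system.
BEFORE = "e1000g:0:statistics:Rx Size Error\t3\ne1000g:0:statistics:Rx Error\t0\n"
AFTER = "e1000g:0:statistics:Rx Size Error\t4\ne1000g:0:statistics:Rx Error\t0\n"

if __name__ == "__main__":
    deltas = increased_counters(parse_kstat(BEFORE), parse_kstat(AFTER))
    for key, delta in deltas.items():
        print(f"{key}: +{delta}")
```

Diffing whole snapshots rather than grepping for one name has the advantage that any other counter that changes at the moment of the freeze shows up as well.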

There appears to be nothing else wrong with the server. One aspect though, and this could be a red herring: the run queue length on booting the guest OS is well in excess of 40,000 but drops immediately. At the same time the server appears to be idle and responsive. After a minute the CPU load drops to 0, well before the interface lockup.

Did we miss anything here? Is there a known issue? Are there any diagnostic tricks that we can apply to coax more information out about the problem? Is this interface installed from, or dependent on, the VMware guest tools?

Any help would be appreciated.

RHack · ‎05-07-2010

For unrelated reasons, the VMware host and all guests were moved from 172.22.xx.0/24 to a 10.x.x.x/24 address range. Since the move the issue has not recurred.

Another suggestion I got was to check netstat and count the number of closing TCP connections. However, there has not been another occurrence so far, hence I have not been able to test this.
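The netstat suggestion could be scripted roughly as follows. This is a minimal Python sketch, assuming saved `netstat -an` style output in which the TCP state is the last whitespace-separated field on each connection line; the sample lines (using documentation addresses) and the function name `count_closing_states` are hypothetical.

```python
from collections import Counter

# TCP states in which a connection is in the process of being torn down.
CLOSING_STATES = {
    "FIN_WAIT_1", "FIN_WAIT_2", "CLOSE_WAIT",
    "CLOSING", "LAST_ACK", "TIME_WAIT",
}


def count_closing_states(netstat_output):
    """Count connections per closing state, keyed on the last field of each line."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        if fields and fields[-1] in CLOSING_STATES:
            counts[fields[-1]] += 1
    return counts


# Hypothetical sample; column layout is an assumption for illustration only.
SAMPLE = """\
192.0.2.10.8080  192.0.2.20.51234  64240 0 64240 0 ESTABLISHED
192.0.2.10.8080  192.0.2.20.51235  64240 0 64240 0 TIME_WAIT
192.0.2.10.8080  192.0.2.20.51236  64240 0 64240 0 CLOSE_WAIT
192.0.2.10.8080  192.0.2.20.51237  64240 0 64240 0 TIME_WAIT
"""

if __name__ == "__main__":
    for state, n in sorted(count_closing_states(SAMPLE).items()):
        print(f"{state}: {n}")
```

Running it on output captured shortly before and shortly after a lockup would show whether the number of closing connections changes around the event.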

RHack · ‎05-16-2010

Hi,

more information has come to light, along with a possible solution. The communication was taking place between a Windows Server 2003 guest and the Solaris 10 environment. As it turned out, the Windows Server 2003 Intel interface had large packet offloading turned on. Wireshark then showed that packets going to the interface grew beyond the MTU size. Once I switched off large packet offloading through the interface's advanced properties, Wireshark did not show this problem anymore. Additionally, the Solaris interface has been stable for a few days now.

Hence, I believe that the Solaris 10 interface lockup is caused by bad failure handling of oversized packets in the driver. At the same time the Windows driver appears to have a bug that allows packets larger than the MTU to escape down the virtual wire. Just to be on the safe side, I also disabled other offloading facilities, such as checksum offloading.
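A quick way to check a capture for such packets is to compare each frame length against the MTU plus the Ethernet header. The following is a minimal Python sketch, assuming an MTU of 1500 bytes, untagged Ethernet frames (14-byte header, no VLAN tag, no FCS in the captured length), and a list of frame lengths exported from the capture; the sample lengths and the function name `oversized_frames` are hypothetical.

```python
ETHERNET_HEADER_BYTES = 14  # destination MAC + source MAC + EtherType


def oversized_frames(frame_lengths, mtu=1500):
    """Return (index, length) for every frame longer than MTU + Ethernet header."""
    limit = mtu + ETHERNET_HEADER_BYTES
    return [(i, n) for i, n in enumerate(frame_lengths) if n > limit]


# Hypothetical frame lengths in bytes, as they might be exported from a capture.
SAMPLE_LENGTHS = [60, 1514, 2974, 590, 4434]

if __name__ == "__main__":
    for index, length in oversized_frames(SAMPLE_LENGTHS):
        print(f"frame {index}: {length} bytes exceeds the limit")
```

With the default MTU the limit is 1514 bytes, so a full-sized frame of exactly 1514 bytes is not flagged, while anything longer is.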

In summary, two bugs, one in the Windows Intel driver and one in the Solaris e1000g driver, appear to have joined forces to cause the lockup.

daiwatt · ‎07-27-2010

I've also seen this on 2 different installations in our lab with e1000g interfaces, though it is not easily repeatable on the one system. Using ESXi 4.0.0. This is a fundamental bug that will stop us deploying. Is any fix available?

RHack · ‎07-27-2010

Hi,

we have experienced no further issues with the network interface after switching off large packet offloading on our Windows guest. I can only assume that you need to perform the same step in your environment as well, to ensure that no oversized packets are received on the Solaris interface.

daiwatt · ‎07-30-2010

I think we might have slightly different issues here. This thread is discussing e1000g lockup, whereas we are seeing the entire Solaris 10 VM freeze occasionally when using the snoop command. The only option is then to power cycle the VM.