jmbutler99
Contributor

Networking problem under VMware ESXi (and ESX)

I'm wondering if anyone else has seen something like this. We're running an ESXi server (Update 3; I've also seen this on Update 2 and on ESX) with 8 guest machines. All the guests run Red Hat Enterprise Linux 4 (2.6.9-5.ELsmp). The machines are test nodes, so we need to be able to ssh to them reliably using automated tools. The failure I'm seeing is that every so often an ssh session hangs because the server is trying to contact our NIS server over TCP. The NIS server is a machine on our physical network. Most of the time everything works fine, and even while this particular TCP session is failing to establish, I can initiate other sessions.

From the ssh server perspective, I see it trying to repeatedly send TCP SYN packets to the server:

1231761122.763592 00:0c:29:a9:93:f5 > 00:01:30:ff:ae:80, ethertype IPv4 (0x0800), length 66: IP (tos 0x0, ttl 64, id 24801, offset 0, flags , proto 6, length: 52) pavm1-2.palab.panasas.com.623 > cassoulet.panasas.com.974: S [tcp sum ok] 3711733373:3711733373(0) win 5840 <mss 1460,nop,nop,sackOK,nop,wscale 2>

...

1231761125.760136 00:0c:29:a9:93:f5 > 00:01:30:ff:ae:80, ethertype IPv4 (0x0800), length 66: IP (tos 0x0, ttl 64, id 24803, offset 0, flags , proto 6, length: 52) pavm1-2.palab.panasas.com.623 > cassoulet.panasas.com.974: S [tcp sum ok] 3711733373:3711733373(0) win 5840 <mss 1460,nop,nop,sackOK,nop,wscale 2>

This goes on until sshd fires its authentication timeout (120 seconds). If I increase the auth timeout, it instead times out when Linux gives up on the connection (Linux TCP defaults to 5 SYN retries with an exponential backoff, roughly 180 seconds).
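
For what it's worth, here is roughly where that figure comes from on these guests. I haven't tuned anything, so I'm assuming the stock setting:

# sysctl net.ipv4.tcp_syn_retries
net.ipv4.tcp_syn_retries = 5

With the usual 3-second initial retransmission timeout doubling on every retry, that works out to 3 + 6 + 12 + 24 + 48 + 96 = 189 seconds of waiting in total, which is where the roughly-180-second figure above comes from.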

The failure always takes place when communicating with the NIS server. I've tried creating simple client/server applications that just open and close connections, but I haven't been able to reproduce this problem outside of ssh and NIS.
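
The simple clients I mean look roughly like the sketch below (not the exact code I ran; the host and port arguments are placeholders, and the optional third argument pins the local source port, which requires root for ports below 1024):

#!/usr/bin/tclsh
# Sketch of a simple connect/close test client.
# Usage: repeat-connect <host> <port> ?local-port?
# If a local port is given it is bound with -myport, so you can control
# which source port the SYN goes out on.
set host  [lindex $argv 0]
set port  [lindex $argv 1]
set local [lindex $argv 2]
set i 0
while {1} {
    set cmd [list socket]
    if {$local ne ""} { lappend cmd -myport $local }
    lappend cmd $host $port
    if {[catch {eval $cmd} s]} {
        puts "iteration $i: connect failed: $s"
    } else {
        puts "iteration $i: connected, local end is [fconfigure $s -sockname]"
        close $s
    }
    incr i
    after 1000
}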

I also set up a port mirror on the switch port that the ESX server sits on. I can see that the NIS server is responding; the port mirror shows a SYN-ACK arriving:

1231761122.442195 00:0c:29:a9:93:f5 > 00:01:30:ff:ae:80, ethertype IPv4 (0x0800), length 66: IP (tos 0x0, ttl 64, id 24801, offset 0, flags , proto 6, length: 52) pavm1-2.palab.panasas.com.623 > cassoulet.panasas.com.974: S [tcp sum ok] 3711733373:3711733373(0) win 5840 <mss 1460,nop,nop,sackOK,nop,wscale 2>

1231761122.443194 00:01:30:ff:ae:80 > 00:0c:29:a9:93:f5, ethertype IPv4 (0x0800), length 66: IP (tos 0x0, ttl 61, id 0, offset 0, flags , proto 6, length: 52) cassoulet.panasas.com.974 > pavm1-2.palab.panasas.com.623: S 697782193:697782193(0) ack 3711733374 win 5840 <mss 1460,nop,nop,sackOK,nop,wscale 2>

So I think the packet is being dropped somewhere on the way into the VM, either at the physical network card or in the virtual switch. When I ran the same test on an ESX server, I got the same results; the machine was identical hardware with the same guest machines, but it sat on a different port of the same switch. I did a tcpdump of vswif0 with the vSwitch allowing promiscuous mode and didn't see the response packet from the NIS server, which makes me think the problem is in the physical interface rather than the virtual switch.
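
For reference, the service console capture was along these lines (the exact filter doesn't matter much; I just restricted it to the NIS server's address and the port from the trace above):

# tcpdump -i vswif0 -n -e -S host cassoulet.panasas.com and tcp port 974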

Both machines have Intel e1000 adapters:

# ethtool -i vmnic0
driver: e1000
version: 7.3.15
firmware-version: 0.15-4
bus-info: 04:00.0
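
If it helps, ethtool can also dump the adapter's per-driver statistics counters, which should show whether frames are being dropped at the NIC itself; something like this (exact counter names vary by driver):

# ethtool -S vmnic0 | egrep -i 'drop|miss|err'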

Before I narrowed the problem down to something in the VM, I also tried various combinations of VMware Tools: running the guests without VMware Tools, with open-vm-tools, and forcing the guests to use e1000 virtual devices rather than the default adapters.

If you have a similar setup with ssh keys installed and NIS enabled, you could try the following script:

#!/usr/bin/tclsh
# Repeatedly ssh to a host, timestamp each line of debug output, and flag
# any iteration that takes more than 20 seconds.

if {[llength $argv] == 0} {
    puts "Usage: repeat-ssh <hostname>"
    exit 1
}

set hostname [lindex $argv 0]
set j 0

while {1} {
    puts "Iteration: $j"
    puts "Sshing to $hostname"
    set start_time [clock clicks -milliseconds]
    # run a trivial command over ssh with full debugging, folding stderr
    # into the pipe so the debug messages can be read below
    set f [open "|ssh -vvv $hostname true 2>@1" r]
    while {[gets $f line] >= 0} {
        set curtime [clock clicks -milliseconds]
        puts "$hostname @ [expr {($curtime - $start_time) / 1000.0}]: $line"
    }
    if {[catch {close $f} output]} {
        puts "Error running ssh..."
    }
    set end_time [clock clicks -milliseconds]
    set elapsed_ms [expr {$end_time - $start_time}]
    puts "Ssh took [expr {$elapsed_ms / 1000.0}]"
    if {$elapsed_ms > 20000} {
        puts "SLOW SSH"
        exit 1
    }
    incr j
}

It can simply be invoked as follows:

repeat-ssh <server-hostname>, e.g. repeat-ssh pavm1-2

And it usually fails with the following debug output:

pavm1-2 @ 0.102: debug1: Offering public key: /root/.ssh/identity
pavm1-2 @ 0.102: debug3: send_pubkey_test
pavm1-2 @ 0.102: debug2: we sent a publickey packet, wait for reply
<<< long pause >>>
...

Any help would be appreciated. I've looked through the esxcfg-info -n output, and it doesn't look like any meaningful errors are being reported. Are there other debug or tracing utilities I could use to figure out what's going on?

1 Reply

jmbutler99
Contributor

Just so that everyone else knows, this problem has been solved, and it wasn't a problem with VMware!

The issue was actually with the hardware capturing packets for ports 623 and 664. The physical machines are Supermicro Twins that optionally have IPMI support. Our particular machines don't have IPMI cards, but it seems the hardware still captures these packets. At any rate, the OS needs to know that these ports are unavailable on this NIC. We solved the issue by configuring a dummy inetd service to listen on these ports. We're looking at other ways of disabling this behavior on machines that don't have IPMI. Another alternative may be to use the other physical NIC in the machine (one GbE adapter is probably intended for the management network and the other for everything else...).
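
In case it helps anyone else, below is a sketch of the kind of xinetd entry we mean (RHEL4 uses xinetd rather than classic inetd; the service and file names are just illustrative, and a second identical entry covers port 664). The point is simply that once something in the OS is listening on the port, the kernel will never hand it out as the source port of an outgoing connection, so the BMC firmware can no longer swallow the reply:

# /etc/xinetd.d/dummy-rmcp   (illustrative name; one entry per port)
service dummy-rmcp
{
        type            = UNLISTED
        port            = 623
        protocol        = tcp
        socket_type     = stream
        wait            = no
        user            = nobody
        server          = /bin/false
        disable         = no
}

After restarting xinetd, netstat -tln should show both ports held open locally.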

The above link describes this situation in a bit more detail.
