VMware Cloud Community
chindu
Contributor

Intermittent loss of connectivity to Service Console port on 2 of 4 ESX hosts

Hi,

I've got a bit of an issue with 2 of the 4 ESX servers at one site. The VC server is sending alerts that the host is down.

Incidentally the VC server is on a different site - there's a 100Mb WAN link with no issues and plenty of spare bandwidth.

The alerts are:

<snip>

Target: esx02.domain.name

Old Status: Green

New Status: Red

Current value:

Company Customised Alarm - (State = Not responding OR State = Not responding)

Alarm: Company Customised Alarm

(Intermittent loss of connectivity to Service Console port on 2 of 4 ESX hosts OR Intermittent loss of connectivity to Service Console port on 2 of 4 ESX hosts OR Intermittent loss of connectivity to Service Console port on 2 of 4 ESX hosts OR Intermittent loss of connectivity to Service Console port on 2 of 4 ESX hosts)

Description:

Alarm Company Customised Alarm on esx02.domain.name changed from Green to Red

<snip>

Obviously this isn't right, so I started running continuous pings to each ESX host there, as well as to the switch they're connected to.

What I am seeing is that 2 of the ESX hosts are dropping pings: between 2 and 9 pings fail, then the host comes back.

It's patched into a Cisco 6509 for Production LAN / VMotion and stand-alone switching for iSCSI.

I have cleared the counters on the 6509 and then checked the active Service Console port for errors / flapping from the Cisco end, but found nothing.
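
For reference, the checks on the 6509 were roughly along these lines (Gi3/1 being the port CDP shows as the active SC uplink):

! clear the counters, wait through a few ping-drop episodes, then re-check
clear counters GigabitEthernet3/1
show interface GigabitEthernet3/1
! check for link flaps logged against the port
show logging | include GigabitEthernet3/1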

When I see the pings to the ESX host drop, the interface remains up/up at the Cisco end and no errors are ever logged (I identified the active Service Console port using CDP in the VI Client).

I'm a bit stuck as to where I should look next. The other 2 ESX hosts at the site never drop any pings. All the ESX hosts are plugged into the same Blade on the 6509 (I know - they will be split once we've cleared the physical hosts down and have some spare capacity again).

I would be very grateful if anyone could suggest where I should look next.

Many thanks.

jrenton
Hot Shot

I have seen similar problems and found the issue to be DNS-related. Check the DNS config on each host, and also check DNS itself to see that all the hosts are registered.

You could also update the hosts file on each ESX host and add the IP addresses of all the hosts.
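
Something like this from each host's Service Console would confirm it (hostnames and IPs below are made up):

# Check forward and reverse lookups from the Service Console:
nslookup esx01.domain.name
nslookup 10.x.x.1

# Or pin the names in /etc/hosts on every ESX host, e.g.:
# 10.x.x.1   esx01.domain.name   esx01
# 10.x.x.2   esx02.domain.name   esx02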

chindu
Contributor

Hi,

Thanks for the reply.

I'm running "ping esx01 -t" from my machine and it resolves the IP fine - I'm just getting the ping drops, which then recover.

I've also run a ping -t to the IP address and get the same.

What I see is (for example):

Reply from 10.x.x.1: bytes=32 time=1ms TTL=62   (repeated, then:)
Request timed out.
Request timed out.
Request timed out.
Reply from 10.x.x.1: bytes=32 time=1ms TTL=62   (repeated again)

Lightbulb
Virtuoso

So you are having intermittent packet delivery failures. I would have your network guys check the firewall interface stats, and also check the switch ports (both local and remote) for incrementing errors.

Try continuous ping tests to other assets on the remote network to see if there is a consistent pattern.

chindu
Contributor
Contributor

Hi,

I'm not seeing any dropped packets to any other hosts at the site. I've run continuous pings from a server at that site simultaneously (roughly as sketched after this list) to:

• The IP of the VLAN interface that the ESX hosts are connected to (the physical switch's IP rather than the HSRP IP)

  • A VM in the ESX cluster at the site

  • Another switch IP at the site

• The Production network Service Console of the other 3 ESX hosts at the site.
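
For reference, I kicked these off from a Windows box at the site, roughly like this (target IPs below are placeholders for the list above):

rem Each "start" opens a separate window running a continuous ping.
start ping -t 10.x.x.250
start ping -t 10.x.x.10
start ping -t 10.x.x.251
start ping -t 10.x.x.1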

I'm also the networking guy (and server guy, etc.). The counters on the interface that is the active SC uplink for esx01 were cleared yesterday, and a "show interface" gives:

GigabitEthernet3/1 is up, line protocol is up (connected)

+ Hardware is C6k 1000Mb 802.3, address is xxxx.xxxx.xxxx(bia xxxx.xxxx.xxxx)+

+ Description: *** ESX Hosts ***+

+ MTU 1500 bytes, BW 1000000 Kbit, DLY 10 usec,+

+ reliability 255/255, txload 1/255, rxload 1/255+

+ Encapsulation ARPA, loopback not set+

+ Full-duplex, 1000Mb/s+

+ input flow-control is off, output flow-control is off+

+ Clock mode is auto+

+ ARP type: ARPA, ARP Timeout 04:00:00+

+ Last input never, output 00:00:05, output hang never+

+ Last clearing of "show interface" counters 22:43:49+

+ Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0+

+ Queueing strategy: fifo+

+ Output queue: 0/40 (size/max)+

+ 5 minute input rate 9000 bits/sec, 4 packets/sec+

+ 5 minute output rate 31000 bits/sec, 39 packets/sec+

+ 356414 packets input, 127659641 bytes, 0 no buffer+

+ Received 15 broadcasts (0 multicast)+

+ 0 runts, 0 giants, 0 throttles+

+ 0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored+

+ 0 watchdog, 0 multicast, 0 pause input+

+ 0 input packets with dribble condition detected+

+ 2831396 packets output, 310077528 bytes, 0 underruns+

+ 0 output errors, 0 collisions, 0 interface resets+

+ 0 babbles, 0 late collision, 0 deferred+

+ 0 lost carrier, 0 no carrier, 0 PAUSE output+

+ 0 output buffer failures, 0 output buffers swapped out+

I can't spot anything wrong there.....

I am only getting this problem with the 2 ESX hosts.

Thanks.

Lightbulb
Virtuoso

OK, so probably not the network infrastructure. Are there any differences between the hosts that are having the issue and those that are not? By which I mean NIC type, patch level, etc.

Also, from the affected host you might want to run tcpdump to a file while pinging remotely. Once you have gathered some data you can scp the dump file off to your workstation and view it with Wireshark (or just view the raw data if you are hardcore).

You may also want to tail /var/log/vmkernel and see if there are any unusual entries.
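
Something along these lines, assuming the SC interface is vswif0 (check with ifconfig) and that 10.x.x.100 is your workstation (both placeholders):

# Capture ICMP hitting the Service Console interface to a file:
tcpdump -i vswif0 -w /tmp/sc-ping.pcap icmp
# Ctrl-C once you've caught a few dropped-ping episodes.

# In another session, watch the vmkernel log as the pings drop:
tail -f /var/log/vmkernel

# Then copy the capture to your workstation for Wireshark:
scp /tmp/sc-ping.pcap user@10.x.x.100: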

Question: your VMs at the remote site (which ping fine) - are they attached to the same vSwitch as the SC?

chindu
Contributor

All 4 ESX hosts are identical (Dell PowerEdge 2950). The NICs are 2 x onboard Broadcom NetXtreme II BCM5708 and 2 x Intel Gigabit VT Quad Port Server Adapters.

vSwitch0, which has the Production network Service Console port on it, is what's giving me trouble.

I have checked the vmkernel log and haven't found any networking-related error entries.

I also got someone at the site to watch the NICs for any indication of a problem while the pings were dropping out, but they couldn't see anything.

I've just patched all the cluster members so they're all at the right patch level and identical. Same issue.

Think it might be time to use tcpdump... is it installed with ESX by default?

EDIT: and is there anything specific I need to be aware of to run it?

thanks.

Lightbulb
Virtuoso

Yes, tcpdump is installed in the Service Console.

On a personal note (kind of): I have had nothing but trouble with the HP adapters based on the same chipset as the NetXtreme II (the HP NC373i), though not with ESX implementations so far.

After tcpdump, you might want to evacuate all VMs from that host, set up another SC port group on a vSwitch uplinked to a different pNIC, and see if the issue persists. Just a thought.
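
A rough sketch of that from the Service Console - the vSwitch name, vmnic and IP below are placeholders, so pick a spare pNIC and a free IP on the SC subnet:

# Create a test vSwitch uplinked to a different physical NIC:
esxcfg-vswitch -a vSwitch9
esxcfg-vswitch -L vmnic5 vSwitch9
# Add a Service Console port group and a second SC interface on it:
esxcfg-vswitch -A "SC Test" vSwitch9
esxcfg-vswif -a vswif1 -p "SC Test" -i 10.x.x.99 -n 255.255.255.0
# Then run the continuous ping against 10.x.x.99 and see if it still drops.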

chindu
Contributor

Your last post gave me another idea. I'm removing the NICs from the vSwitch and running them one at a time to see if a particular NIC is playing up.

The 5 NICs for that vSwitch are spread across the hardware: 1 x onboard, plus 2 on each Intel card.
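
The shuffle looks roughly like this from the Service Console (vmnic numbers here are placeholders - check yours with esxcfg-nics -l):

# List physical NICs and the current vSwitch layout:
esxcfg-nics -l
esxcfg-vswitch -l
# Unlink all but one uplink from vSwitch0, e.g.:
esxcfg-vswitch -U vmnic2 vSwitch0
esxcfg-vswitch -U vmnic3 vSwitch0
# ...test with a continuous ping, then swap which NIC stays linked:
esxcfg-vswitch -L vmnic2 vSwitch0
esxcfg-vswitch -U vmnic0 vSwitch0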

Hopefully that will turn something up... will post the results back.

Box293
Enthusiast

How did you fix your problem?

VCP3 & VCP4 32846 VSP4 VTSP4
michaeltucker
Contributor

I am having the same problem. No log entries. Did all the diagnostics.

What did you come up with as a solution?

Thanks
