td3201
Contributor

ESX 4.0.0 - network drops and VMs becoming unavailable on the network

Hello,

I have an intermittent, widespread issue that has been difficult to dig into. We have VMs that occasionally become unavailable to our monitoring solution: they drop off the network or exhibit high round-trip times (RTA). The monitoring host is physical, and we aren't seeing any issues anywhere in the physical network infrastructure all the way up to the ESX hosts. I would like some insight into the virtual switching infrastructure so I can debug that layer; specifically, I'm looking for retransmits or drops there. In parallel, I am checking our storage infrastructure to see if I can correlate high IOPS with these drops, but I'm curious what you guys think.

marcelo_soares
Champion

Good analysis. Are they short outages (like 1-5 seconds)? Does it happen at a fixed interval (like every 5 minutes)? Check /var/log/vmkernel on the ESX hosts and paste here anything you think is unusual.

Marcelo Soares

VMware Certified Professional 310/410

Virtualization Tech Master

Globant Argentina

Consider awarding points for "helpful" and/or "correct" answers.

td3201
Contributor

Great question. Our monitoring solution doesn't poll that frequently, so unfortunately the length of the downtime can't be pinned down to the minute. However, I can tell you that I recently saw high RTA on a VM for at least 30 seconds; it could have been going on longer. And by high RTA, I mean 500 ms on a gigabit network. I am running a very crude side-by-side ping against this VM and the management interface of its host to see if there is any correlation, but nothing yet. I do see occasional jumps in the response times, such as this:

64 bytes from hostname (10.12.1.10): icmp_seq=2664 ttl=128 time=0.366 ms

64 bytes from hostname (10.12.1.10): icmp_seq=2665 ttl=128 time=0.373 ms

64 bytes from hostname (10.12.1.10): icmp_seq=2666 ttl=128 time=0.312 ms

64 bytes from hostname (10.12.1.10): icmp_seq=2667 ttl=128 time=0.327 ms

64 bytes from hostname (10.12.1.10): icmp_seq=2668 ttl=128 time=101 ms

64 bytes from hostname (10.12.1.10): icmp_seq=2669 ttl=128 time=0.297 ms

64 bytes from hostname (10.12.1.10): icmp_seq=2670 ttl=128 time=13.9 ms

64 bytes from hostname (10.12.1.10): icmp_seq=2671 ttl=128 time=0.331 ms

64 bytes from hostname (10.12.1.10): icmp_seq=2672 ttl=128 time=0.403 ms

That is high for an internal data center network. In the short time I've been watching, I see no correlation between high RTA on the VM and on the management interface of its host. Still digging.
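In case it's useful, this is roughly how I'm comparing the two ping captures instead of eyeballing them (just a quick sketch; the file names are made up, and it assumes standard Linux ping output with icmp_seq/time fields and that both pings were started at the same moment so the sequence numbers line up):

#!/usr/bin/env python
# Sketch: flag RTT spikes in two side-by-side ping logs (one against the VM,
# one against its host's management interface) and show which spiked together.
import re

THRESHOLD_MS = 10.0
LINE = re.compile(r'icmp_seq=(\d+) ttl=\d+ time=([\d.]+) ms')

def spikes(path):
    """Return {icmp_seq: rtt_ms} for every reply above the threshold."""
    out = {}
    with open(path) as f:
        for line in f:
            m = LINE.search(line)
            if m and float(m.group(2)) > THRESHOLD_MS:
                out[int(m.group(1))] = float(m.group(2))
    return out

vm = spikes('ping_vm.log')            # hypothetical capture against the VM
host = spikes('ping_host_mgmt.log')   # hypothetical capture against the host mgmt interface

print('VM spikes: %d, host mgmt spikes: %d' % (len(vm), len(host)))
print('icmp_seq values that spiked on both: %s' % sorted(set(vm) & set(host)))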

jcwuerfl
Hot Shot

Have you tried watching %DRPRX in esxtop to see if any packets are getting dropped? Is this a GigE adapter? You could also check the physical switch port and see if there are any errors present there.
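If you'd rather capture it over time than watch esxtop live, something along these lines might work (only a sketch; it parses a batch-mode capture and just matches any column whose name contains "Dropped", since the exact counter names vary between builds):

#!/usr/bin/env python
# Sketch: scan an esxtop batch-mode capture for non-zero dropped-packet counters.
# Capture first with something like:  esxtop -b -d 5 -n 120 > esxtop.csv
import csv

with open('esxtop.csv') as f:
    reader = csv.reader(f)
    header = next(reader)
    drop_cols = [i for i, name in enumerate(header) if 'Dropped' in name]
    for row in reader:
        for i in drop_cols:
            try:
                value = float(row[i])
            except (ValueError, IndexError):
                continue
            if value > 0:
                # row[0] is the sample timestamp
                print('%s  %s = %s' % (row[0], header[i], value))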

Here are a couple of docs that may help.

http://communities.vmware.com/docs/DOC-5240

http://communities.vmware.com/docs/DOC-5500

td3201
Contributor

I didn't know about these metrics. Just watching it here, I don't see anything for this VM:

PORT-ID USED-BY TEAM-PNIC DNAME PKTTX/s MbTX/s PKTRX/s MbRX/s %DRPTX %DRPRX

50331831 27214:hostname vmnic6 vSwitch2 23.65 0.07 21.74 0.01 0.00 0.00

I have been watching the following items pretty close from a physical monitoring host:

-firewall

-esx host (management interface)

-vm guest

-another physical host

The only one that seems to jump to higher RTAs (>10 ms) is the VM guest. Everything else stays under 1 ms for the most part; every once in a great while something will jump to 1.xx ms, but it's rare. Again, pretty crude, but at least I am seeing consistency. So I think I am on the right track by digging into the VM side, but I need to find the metrics that point to the issue.

MartinAmaro
Expert

A very basic and simple check is to make sure your physical ESX host network adapters' speed/duplex settings match the switch ports they are connected to, for example 1000/Full, auto-negotiate on both sides, or 100/Full.
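A rough way to eyeball this across all the uplinks from the service console might be something like the following (just a sketch; it assumes the column layout of esxcfg-nics -l on ESX 4.x and that everything should be running at 1000/Full):

#!/usr/bin/env python
# Sketch: list each physical NIC's link state, speed and duplex from
# "esxcfg-nics -l" and flag anything not running at 1000/Full.
import subprocess

output = subprocess.Popen(['esxcfg-nics', '-l'],
                          stdout=subprocess.PIPE).communicate()[0]
for line in output.decode().splitlines()[1:]:   # skip the header row
    fields = line.split()
    if len(fields) < 6:
        continue
    name, link, speed, duplex = fields[0], fields[3], fields[4], fields[5]
    if speed != '1000Mbps' or duplex != 'Full':
        print('check %s: link=%s, speed=%s, duplex=%s' % (name, link, speed, duplex))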

If you find this or any other answer useful please consider awarding points by marking the answer correct or helpful.
td3201
Contributor

All of my ESX hosts are set to 1000/Full, and that's what they are reporting.

marcelo_soares
Champion

Did you check the vmkernel log? Sometimes problems related to storage access can generate things like this.
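If the log is large, a quick filter like this can pull out the storage-related lines (a sketch only; the path and keywords assume classic ESX 4.x):

#!/usr/bin/env python
# Sketch: pull storage-related messages out of /var/log/vmkernel on an ESX 4.x host.
KEYWORDS = ('WARNING', 'NMP', 'ScsiDeviceIO', 'iscsi_vmk', 'heartbeat')

with open('/var/log/vmkernel') as f:
    for line in f:
        if any(k in line for k in KEYWORDS):
            print(line.rstrip())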

Marcelo Soares

VMware Certified Professional 310/410

Virtualization Tech Master

Globant Argentina

Consider awarding points for "helpful" and/or "correct" answers.

td3201
Contributor

Sorry, I lost track of the fact that you had asked me to look at that.

A few things stand out:

Sep 15 21:01:46 host vmkernel: 169:06:43:13.617 cpu6:4118)Config: 289: "VMOverheadGrowthLimit" = -1, Old Value: 0, (Status: 0x0)

Sep 15 21:02:00 host vmkernel: 169:06:43:26.995 cpu3:4117)Config: 289: "VMOverheadGrowthLimit" = 0, Old Value: -1, (Status: 0x0)

Sep 14 01:17:03 host vmkernel: 167:10:58:36.496 cpu8:711387)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x410001156900) to NMP device "naa.6090a01850b5208033a0841b0000707f" failed on physical path "vmhba33:C1:T1:L0" H:0x5 D:0x0 P:0x0 Possible sense data: 0x2 0x3a 0x1.

Sep 14 01:17:03 host vmkernel: 167:10:58:36.496 cpu8:711387)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe: NMP device "naa.6090a01850b5208033a0841b0000707f" state in doubt; requested fast path state update...

Sep 14 01:17:03 host vmkernel: 167:10:58:36.496 cpu8:711387)ScsiDeviceIO: 747: Command 0x2a to device "naa.6090a01850b5208033a0841b0000707f" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x2 0x3a 0x1.

Sep 14 01:13:34 host vmkernel: 167:10:55:07.861 cpu14:4196)FS3: 6909: Waiting for timed-out heartbeat [HB state abcdef02 offset 3481600 gen 495 stamp 14468102004811 uuid 4bb24c7d-b7679e9e-6e27-00219b9c0a73 jrnl <FB 221479> drv 7.33]

Sep 14 01:06:40 host vmkernel: 167:10:48:13.913 cpu8:4621)WARNING: iscsi_vmk: iscsivmk_TaskMgmtIssue: vmhba33:CH:1 T:1 L:0 : Task mgmt "Abort Task" with itt=0x8790b71 (refITT=0x8790b5f) timed out.

Sep 11 08:19:10 host vmkernel: 164:18:00:53.619 cpu5:4621)WARNING: iscsi_vmk: iscsivmk_StopConnection: vmhba33:CH:0 T:16 CN:0: iSCSI connection is being marked "OFFLINE"

Sep 11 08:19:10 host vmkernel: 164:18:00:53.619 cpu5:4621)WARNING: iscsi_vmk: iscsivmk_StopConnection: Sess

Sep 11 08:19:10 host vmkernel: 164:18:00:53.619 cpu5:4621)WARNING: iscsi_vmk: iscsivmk_StopConnection: Conn

Sep 11 08:19:10 host vmkernel: 164:18:00:53.619 cpu12:4131)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba33:C0:T16:L0 to vmhba33:C1:T16:L0 for device "naa.6090a018a0c2e6c86bc804f96c073000".

Sep 11 08:19:21 host vmkernel: 164:18:01:04.254 cpu11:4107)NMP: nmp_CompleteCommandForPath: Command 0x12 (0x41000105c1c0) to NMP device "mpx.vmhba1:C0:T0:L0" failed on physical path "vmhba1:C0:T0:L0" H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.

Sep 11 08:19:21 host vmkernel: 164:18:01:04.254 cpu11:4107)ScsiDeviceIO: 747: Command 0x12 to device "mpx.vmhba1:C0:T0:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.

As you can see, these are spread over several days, so I am not sure whether they are related. Certainly something to consider, though.
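To see whether these actually line up with the monitoring gaps, I may bucket the warnings by hour with something like this (rough sketch; it assumes the standard "Sep 14 01:17:03 host vmkernel: ..." syslog prefix):

#!/usr/bin/env python
# Sketch: count vmkernel warnings/failures per hour so they can be lined up
# against the times the monitoring system reported the VMs as down.
from collections import defaultdict

counts = defaultdict(int)
with open('/var/log/vmkernel') as f:
    for line in f:
        if 'WARNING' not in line and 'failed' not in line:
            continue
        parts = line.split()
        if len(parts) < 3:
            continue
        month, day, clock = parts[0], parts[1], parts[2]
        counts[(month, day, clock[:2])] += 1   # bucket by hour

# sorted lexically, which is good enough for a quick look at one month of data
for (month, day, hour), n in sorted(counts.items()):
    print('%s %s %s:00  %d suspicious messages' % (month, day, hour, n))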

jcwuerfl
Hot Shot

So you're using iSCSI for storage? Are the uplinks for iSCSI on different physical vSwitches/NICs from your VM network? Could you explain more about how you have your vSwitches, port groups, and uplinks configured for the VM network, iSCSI, VMkernel, VMotion, etc.?

Typically you want your iSCSI on different physical uplink ports from your other traffic; then you also have the option of enabling jumbo frames, which usually helps iSCSI performance. Or is this 10GbE? Are you using VLANs at all?
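If you do go the jumbo frame route, it's worth confirming the vSwitch side actually took the MTU; a rough check could look like this (sketch only; it assumes the column layout of esxcfg-vswitch -l on ESX 4.x):

#!/usr/bin/env python
# Sketch: report the MTU configured on each vSwitch via "esxcfg-vswitch -l".
# Jumbo frames only help if the vSwitch, the vmkernel ports and the physical
# switch are all raised to ~9000; this only surfaces the vSwitch side.
import subprocess

output = subprocess.Popen(['esxcfg-vswitch', '-l'],
                          stdout=subprocess.PIPE).communicate()[0]
for line in output.decode().splitlines():
    fields = line.split()
    # vSwitch rows look roughly like: vSwitch1  64  7  64  1500  vmnic2,vmnic5
    if fields and fields[0].startswith('vSwitch') and len(fields) >= 5:
        uplinks = fields[5] if len(fields) > 5 else '-'
        print('%s MTU=%s uplinks=%s' % (fields[0], fields[4], uplinks))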

td3201
Contributor

Switching gears a bit. Our network architecture on the ESX side is:

-3 vswitches

-2 gigabit nics per vswitch

vswitch0 - vmk0, vswif0

vswitch1 - vswif1 (iscsi service console), vmk1 (iscsi nic), vmk2 (iscsi nic)

vswitch2 - trunk, VLAN-tagged VM port groups

The two NICs for each vswitch are on different physical cards. In other words, for vswitch2 it is an Intel and a Broadcom NIC (igb, bnx2). Both NICs are set to active/active. The failover and load balancing policy is as follows:

Load Balancing: Port ID

Network Failure Detection: Link Status only

Notify Switches: yes

Failback: Yes

No standby or unused adapters.

I didn't set any of these servers up, but it's pretty consistent across the board this way.

jcwuerfl
Hot Shot

If you're seeing no %DRPTX/%DRPRX in esxtop, it may not be a network issue, so keep monitoring that whenever your monitoring server reports the problem. That said, I wonder if there is something to the iSCSI path switchover in the vmkernel log, so perhaps monitor that as well when you see the issue, to check whether it is switching between network cards.

jcwuerfl
Hot Shot

That sounds pretty good for the vSwitch setup. One question on vSwitch2 and the trunk: do those uplink ports have spanning-tree portfast trunk set?

Here is an example:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=100662...

td3201
Contributor

The symptoms we see aren't the same as in that KB article, but I like the thinking. We are using Foundry switches, and the portfast concept doesn't exist there.

td3201
Contributor

I'm switching gears for now. I know for sure that increased VM usage causes intermittent outages for us. I've looked at the performance graphs in vCenter at all levels and don't see anything exciting. I am mostly concerned about storage at this point, so I am looking at read/write latency from within vSphere. We have an iSCSI environment, and average read latencies over the last week have been as high as 30 ms. That doesn't sound too bad on the surface, but I'm curious what you guys think.
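To get a better feel than the vCenter graphs give me, I'm thinking of capturing device latency with esxtop in batch mode and summarizing it, roughly like this (a sketch; esxtop.csv is just a placeholder for the capture file, and the latency counter names are assumed since they vary by build):

#!/usr/bin/env python
# Sketch: summarize per-device latency columns from an esxtop batch capture
# (e.g. esxtop -b -d 5 -n 720 > esxtop.csv) to see how high reads/writes peak.
import csv

with open('esxtop.csv') as f:
    reader = csv.reader(f)
    header = next(reader)
    # counters are usually named something like "Average Driver MilliSec/Command"
    lat_cols = [i for i, name in enumerate(header) if 'MilliSec/Command' in name]
    peaks = {}
    for row in reader:
        for i in lat_cols:
            try:
                peaks[header[i]] = max(peaks.get(header[i], 0.0), float(row[i]))
            except (ValueError, IndexError):
                pass

# show the ten worst counters seen during the capture
for name, peak in sorted(peaks.items(), key=lambda kv: -kv[1])[:10]:
    print('%7.1f ms peak  %s' % (peak, name))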
