Marc_P
Enthusiast

Network Keeps Dropping

Since upgrading to ESX4/vSphere4, the network on some of our 64-bit servers (Windows 2003) seems to drop for a short period of time.

I'm at a complete loss. Does anybody have any ideas?

Accepted Solutions
howie
Enthusiast

How long is the networking stoppage (every 30 minutes)? Do you see anything in /var/log/vmkernel every 30 minutes?

23 Replies
howie
Enthusiast

Can you give more details? For example, how did you test (ping?), how long does the loss last, and which virtual device do you use? Can you also post the output of "esxcfg-nics -l" and "esxcfg-vswitch -l"?

Marc_P
Enthusiast

It only seems to happen on about 4 VMs, which are on different ESX hosts.

We monitor with WhatsUpGold, and the servers are not pingable for about 30 seconds. Then they come back online.

# esxcfg-nics -l

Name    PCI       Driver  Link  Speed     Duplex  MAC Address        MTU   Description
vmnic0  05:00.00  bnx2    Up    1000Mbps  Full    00:1d:09:07:0e:ef  1500  Broadcom Corporation Broadcom NetXtreme II BCM5708 1000Base-T
vmnic1  09:00.00  bnx2    Up    1000Mbps  Full    00:1d:09:07:0e:f1  1500  Broadcom Corporation Broadcom NetXtreme II BCM5708 1000Base-T
vmnic2  0c:00.00  e1000e  Up    1000Mbps  Full    00:15:17:46:d3:1c  1500  Intel Corporation 82571EB Gigabit Ethernet Controller
vmnic3  0c:00.01  e1000e  Up    1000Mbps  Full    00:15:17:46:d3:1d  1500  Intel Corporation 82571EB Gigabit Ethernet Controller
vmnic4  0d:00.00  e1000e  Up    1000Mbps  Full    00:15:17:46:d3:1e  1500  Intel Corporation 82571EB Gigabit Ethernet Controller
vmnic5  0d:00.01  e1000e  Up    1000Mbps  Full    00:15:17:46:d3:1f  1500  Intel Corporation 82571EB Gigabit Ethernet Controller

# esxcfg-vswitch -l

Switch Name  Num Ports  Used Ports  Configured Ports  MTU   Uplinks
vSwitch0     32         12          32                1500  vmnic2,vmnic4,vmnic3,vmnic1,vmnic0

  PortGroup Name   VLAN ID  Used Ports  Uplinks
  VM Network       0        4           vmnic0,vmnic1,vmnic3,vmnic4,vmnic2
  Service Console  0        1           vmnic0,vmnic1,vmnic3,vmnic4,vmnic2
  VMkernel         0        1           vmnic0,vmnic1,vmnic3,vmnic4,vmnic2

Switch Name  Num Ports  Used Ports  Configured Ports  MTU   Uplinks
vSwitch1     64         3           64                1500  vmnic5

  PortGroup Name  VLAN ID  Used Ports  Uplinks
  VMotion         0        1           vmnic5

howie
Enthusiast

Is this happening periodically? If not, when do you observe it? Are there any physical NIC link up/down events around that time?

Marc_P
Enthusiast

It seems to happen every 30 minutes.

Just out of interest, should my vSwitch have more than 24 ports? It used to have 56 in ESX 3.5.

howie
Enthusiast

How long is the networking stoppage (every 30 minutes)? Do you see anything in /var/log/vmkernel every 30 minutes?

Marc_P
Enthusiast

Yes, every 30 to 35 minutes.

I checked the vmkernel log on vmsvr3 and this is what is in there:

May 28 11:09:33 vmsvr3 vmkernel: 1:23:12:18.565 cpu4:4206)WARNING: NMP: nmp_DeviceAttemptFailover: Logical device "naa.60a980004335434f4334506836533974": awaiting fast path state update...

May 28 11:09:34 vmsvr3 vmkernel: 1:23:12:19.564 cpu6:4206)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device "naa.60a980004335434f4334506836533974" - issuing command 0x41000201be40

May 28 11:09:34 vmsvr3 vmkernel: 1:23:12:19.564 cpu6:4206)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device "naa.60a980004335434f4334506836533974" - failed to issue command due to Not found (APD), try again...

May 28 11:09:34 vmsvr3 vmkernel: 1:23:12:19.564 cpu6:4206)WARNING: NMP: nmp_DeviceAttemptFailover: Logical device "naa.60a980004335434f4334506836533974": awaiting fast path state update...

May 28 11:09:35 vmsvr3 vmkernel: 1:23:12:20.565 cpu6:4206)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device "naa.60a980004335434f4334506836533974" - issuing command 0x41000201be40

May 28 11:09:35 vmsvr3 vmkernel: 1:23:12:20.565 cpu6:4206)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device "naa.60a980004335434f4334506836533974" - failed to issue command due to Not found (APD), try again...

May 28 11:09:35 vmsvr3 vmkernel: 1:23:12:20.565 cpu6:4206)WARNING: NMP: nmp_DeviceAttemptFailover: Logical device "naa.60a980004335434f4334506836533974": awaiting fast path state update...

May 28 11:09:36 vmsvr3 vmkernel: 1:23:12:21.564 cpu6:4206)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device "naa.60a980004335434f4334506836533974" - issuing command 0x41000201be40

May 28 11:09:36 vmsvr3 vmkernel: 1:23:12:21.564 cpu6:4206)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device "naa.60a980004335434f4334506836533974" - failed to issue command due to Not found (APD), try again...

May 28 11:09:36 vmsvr3 vmkernel: 1:23:12:21.564 cpu6:4206)WARNING: NMP: nmp_DeviceAttemptFailover: Logical device "naa.60a980004335434f4334506836533974": awaiting fast path state update...

May 28 11:09:37 vmsvr3 vmkernel: 1:23:12:22.563 cpu0:5384)ScsiDeviceIO: 747: Command 0x25 to device "naa.60a980004335434f4334506836533974" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x25 0x0.

May 28 11:09:37 vmsvr3 vmkernel: 1:23:12:22.563 cpu0:5384)WARNING: NMP: nmp_DeviceStartLoop: NMP Device "naa.60a980004335434f4334506836533974" is blocked. Not starting I/O from device.

May 28 11:09:37 vmsvr3 vmkernel: 1:23:12:22.563 cpu4:4109)WARNING: ScsiDeviceIO: 2355: READ CAPACITY on device "naa.60a980004335434f4334506836533974" from Plugin "NMP" failed. Timeout

May 28 11:09:37 vmsvr3 vmkernel: 1:23:12:22.563 cpu4:4109)FSS: 662: Failed to get object f530 28 1 4a1bfe45 1b1db0f0 1d00707f 120f0709 0 0 0 0 0 0 0 :Timeout

May 28 11:09:37 vmsvr3 vmkernel: 1:23:12:22.563 cpu4:4109)WARNING: Fil3: 1918: Failed to reserve volume f530 28 1 4a1bfe45 1b1db0f0 1d00707f 120f0709 0 0 0 0 0 0 0

May 28 11:09:37 vmsvr3 vmkernel: 1:23:12:22.563 cpu4:4109)FSS: 662: Failed to get object f530 28 2 4a1bfe45 1b1db0f0 1d00707f 120f0709 4 1 0 0 0 0 0 :Timeout

May 28 11:09:37 vmsvr3 vmkernel: 1:23:12:22.563 cpu0:4096)VMNIX: VMKFS: 2011: timed out

May 28 11:09:37 vmsvr3 vmkernel: 1:23:12:22.563 cpu0:4096)VMNIX: VMKFS: 2521: status = -110

May 28 11:09:37 vmsvr3 vmkernel: 1:23:12:22.564 cpu6:4206)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world restore device "naa.60a980004335434f4334506836533974" - no more commands to retry

It looks like it is having a problem! Perhaps you can shed some light on it?
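
To see at a glance which device the warnings refer to, an excerpt like the one above can be tallied with a short script. This is only a minimal Python sketch, not a VMware tool: the function name `count_device_errors` and the regular expression are assumptions based solely on the message format shown in this post, and other identifier formats are not handled.

```python
import re
from collections import Counter

# Assumption: device identifiers appear as device "naa.<hex>" in the message
# text, as in the excerpt above. Other identifier formats are ignored.
DEVICE_RE = re.compile(r'[Dd]evice "(naa\.[0-9a-f]+)"')


def count_device_errors(lines):
    """Count how many log lines mention each naa device identifier."""
    counts = Counter()
    for line in lines:
        match = DEVICE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Three lines copied from the excerpt; the VMNIX line names no device.
sample = [
    'May 28 11:09:33 vmsvr3 vmkernel: 1:23:12:18.565 cpu4:4206)WARNING: NMP: nmp_DeviceAttemptFailover: Logical device "naa.60a980004335434f4334506836533974": awaiting fast path state update...',
    'May 28 11:09:37 vmsvr3 vmkernel: 1:23:12:22.563 cpu0:5384)WARNING: NMP: nmp_DeviceStartLoop: NMP Device "naa.60a980004335434f4334506836533974" is blocked. Not starting I/O from device.',
    'May 28 11:09:37 vmsvr3 vmkernel: 1:23:12:22.563 cpu0:4096)VMNIX: VMKFS: 2011: timed out',
]

for device, hits in count_device_errors(sample).items():
    print(device, hits)
```

If every hit points at a single identifier, that device is the first thing to check on the storage side.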

depping
Leadership

The log file indicates SAN issues. Are you using shared storage? If so, is it FC or iSCSI, and which array and model?

Duncan

VMware Communities User Moderator | VCP | VCDX

If you find this information useful, please award points for "correct" or "helpful".

Marc_P
Enthusiast

On 2 of our ESX hosts there was a storage device that was no longer available. I have removed these, which should get rid of the errors. However, could this cause network issues like the ones we are experiencing?

Marc_P
Enthusiast

Yes shared storage, iSCSI - NetApp 3020C

depping
Leadership

I would not expect this to be the cause, but it is what your log files say. Could you attach the full log file? Could you also start a continuous ping and VMotion the VM you are pinging to a different host? What's the result?

Duncan

Marc_P
Enthusiast

Well, since I removed the rogue storage it has been OK for an hour and a half. So I suspect this was what was causing the issue.

Thanks for the help!

tonybunce
Contributor

We are seeing the exact same problem. Details and log files here: http://communities.vmware.com/thread/213710 (created a new thread before I found this one).

I don't think it is just dropping the network; all of the VMs are hung/suspended/waiting during that time period. If I start a ping from a VM I am remotely connected to while the problem is happening, my ssh session locks up, but when it comes back the ping keeps going without ever dropping a packet (so it stops sending pings while the problem is happening). If I ping from a different host to a guest that is experiencing the problem, the guest VM doesn't respond. Also, during that time the clocks on the VMs get all out of whack (I only know this because we monitor the NTP offset, and we are seeing the VM clocks fall behind by as much as 100 seconds).
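
Because the guest clocks stall along with the guests, the outage windows are easier to measure from a machine outside the affected hosts. The following is a minimal Python sketch of that idea, assuming you already have a time-ordered list of `(timestamp_seconds, replied)` pairs from such a machine; the function name `find_outages` and the sample data are hypothetical.

```python
def find_outages(samples):
    """Return (first_failed, last_failed) timestamps for each run of failed pings.

    samples: iterable of (timestamp_seconds, replied) pairs in time order,
    recorded on a host that is not affected by the stall.
    """
    outages = []
    start = None
    last_failed = None
    for timestamp, replied in samples:
        if not replied:
            if start is None:
                start = timestamp  # first missed reply of a new outage
            last_failed = timestamp
        elif start is not None:
            outages.append((start, last_failed))
            start = None
    if start is not None:  # the log ended while the guest was still unreachable
        outages.append((start, last_failed))
    return outages


# Hypothetical one-second ping log: replies stop from t=3 through t=14.
samples = [(t, not 3 <= t <= 14) for t in range(20)]
print(find_outages(samples))
```

Feeding the start times of several outages into a spreadsheet or a second script then shows whether they repeat on a fixed interval.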

paithal
VMware Employee

Marc, these logs show the after-effect of either the network or the storage going down. I would like to know why ESX lost connectivity with the storage. Is it something that you are doing on purpose? I need to look at the logs that contain messages from the "iscsi_vmk" driver (the software iSCSI driver on ESX 4). Could you upload those log files?

pizzingrilli
Contributor

We ran into the same problem after our SAN admin deleted a LUN which was used by ESX 3.5 and 4.0 systems. It looks like vSphere 4.0 tries to contact the LUNs every 30 minutes, and because the LUN status was "dead", ESX 4.0 suspended all VMs for a few seconds (10-12 pings) every 30 minutes.

The problem went away after a simple iSCSI rescan. It seems to be a bug in vSphere 4.0, because ESX 3.5 doesn't have this problem.
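
For anyone trying to confirm the same pattern, a rough check is whether the outage start times are spaced about 30 minutes apart. Below is a minimal Python sketch under stated assumptions: the function names, the 1800-second default period, and the 300-second tolerance (chosen to cover the "30 to 35 minutes" reported earlier in this thread) are all illustrative, and the start times are made up.

```python
def intervals(starts):
    """Seconds between consecutive outage start times."""
    return [later - earlier for earlier, later in zip(starts, starts[1:])]


def looks_periodic(starts, period_s=1800, tolerance_s=300):
    """True if there are at least two intervals and each is within
    tolerance_s of period_s. Both defaults are assumptions."""
    gaps = intervals(starts)
    return len(gaps) >= 2 and all(abs(g - period_s) <= tolerance_s for g in gaps)


# Hypothetical outage start times in seconds, roughly 30 to 35 minutes apart.
starts = [0, 1800, 3900, 5700]
print(intervals(starts))
print(looks_periodic(starts))
```

A regular spacing like this points toward a periodic host-side task rather than random network loss, which matches what the rescan fix suggests.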

jaredo
Contributor

Had the same issue after removing an iSCSI device I thought wasn't in use anymore. It made the VI client inaccessible and I had to rescan from the CLI; then everything came up and the errors went away.

# esxcfg-rescan vmhba33

gveahelpdesk
Contributor

Anybody have any tricks to make 4.0 less picky? Maybe it isn't best practice, but I've normally disabled these LUNs, and eventually through reboots/rescans/etc. the dead ones clear out.

Obviously in 4.0 this is like shooting your server. I'll have to test whether it still freaks out if I disable a LUN and then rescan before this 30 minute freakout period. As stated before, my 3.5 servers were just fine. In fact I rolled back a VM from 4.0 to 3.5 because of this problem.

I'm also using distributed switches, so I'm not sure if that has added something bad.

S

pauliew1978
Enthusiast

This is not a distributed switch config. I have run into this problem twice now and it's pretty scary stuff. If you have a LUN that goes offline, all your VMs in the cluster will freak out every 30 mins.

I am going to log a support call with vmware about it.

Marc_P
Enthusiast

I'd be interested to hear what support have to say....

tonybunce
Contributor

I created a support ticket but unfortunately was unable to commit the resources to track down the problem.

Here is the info they wanted:

Please have your environment running with the iSCSI openfiler target attached. Log into the console as root and issue the following command: vm-support -s -d 3000 Please make note of the time, and then wait a minute or two. At this point, please re-create the issue and let it go for about a half an hour (the time frame reported where the issue repeats). When the command completes (50 minutes after it was originally issued), it will generate a log bundle that I would like you to please upload to our FTP server for further analysis.

In my environment it appears that the problem doesn't start right after the iSCSI target goes away; there is a delay between when it goes down and when the 30min problem interval starts. Since this is on a production system I have to do it after hours, and I haven't had the time to reproduce the problem while running the vm-support command. Plus I don't really like intentionally causing problems in a production environment.
