kathmann
Contributor

Guest network connection lost at random

Hi,

Very recently we started experiencing random network disconnections in our guests, without any apparent reason or pattern, and they are driving me up the wall...

Our setup

We have three ESX 3.5 servers, each connected to the LAN via three vSwitches:

- one vSwitch with a Service Console port group and 2 physical NICs

- one vSwitch with a VMkernel port group and 2 physical NICs

- one vSwitch with several VMnet port groups, one for each VLAN, with 6 physical NICs, one of which is designated the standby adapter

All physical NICs connect to two HP Procurve 5406zl core switches (the five active ports to one switch, the standby port to the second one), on which all the linked ports have the VLANs in use set in tagged mode.

An edited part of the switch config (ports A15 through A18 are connected to one ESX server):

vlan 10

name "PROD"

tagged A15-A18

exit

vlan 20

name "TEST"

tagged A15-A18

exit

vlan 30

name "UITW"

tagged A15-A18

exit

spanning-tree

spanning-tree B8 path-cost 50

spanning-tree Trk1 priority 4

spanning-tree config-name "SITE01"

spanning-tree legacy-path-cost

spanning-tree force-version STP-compatible

The vSwitches are set up as follows:

- Promiscuous Mode: Reject

- MAC Address Changes: Accept

- Forged Transmits: Accept

- Traffic shaping: disabled

- Load Balancing: Route based on the originating virtual port ID

- Network Failover Detection: Link status only

- Notify switches: Yes

- Failback: Yes

The guests have two virtual network adapters: one for our production LAN (VLAN 10) and one for iSCSI access to our NAS (VLAN 106).

Our problem

Since very recently, VMs randomly lose network connectivity. Windows does not show the link as disconnected, but the guest still cannot get traffic in or out to other systems, except to guests that are on the same ESX server (which sort of makes sense, as this traffic never actually touches the physical adapter). The really weird bits are:

- A single VM on one ESX may suddenly have this problem at any time, while the other VMs on the same ESX still work fine

- A single VM may have this problem on one NIC but not the other, or sometimes on both cards at the same time

- Neither Windows nor VMware reports any issues/events/etc.

Does anyone have experience with issues like these? Is this a known issue (I could not find any info on this while searching through the discussions here)?

Any help would be greatly appreciated!

Mark.

Gerrit_Lehr
Commander

Are the VMware Tools up to date and the virtual NIC driver set to vmxnet? I experienced problems like that when using the vlance driver in Windows. Are all VMs affected or only a few?

Kind Regards,

Gerrit Lehr

If you found this or other information useful, please consider awarding points for "Correct" or "Helpful".

kathmann
Contributor

Yes, the VMware Tools are up to date, and the VMs are using the vmxnet drivers. As to the affected systems: it appears to affect systems at random. A VM that works fine today may experience the problems tomorrow, and a system that borked today may run fine for a long while afterwards.

At the moment the quickest 'fix' when a VM disconnects is to log in to the console and disable and re-enable the NIC; this usually fixes the problem (at least temporarily).
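For anyone wanting to script that workaround: the rough command-line equivalent inside the guest should be something like the following (the adapter name is just an example and will differ per VM):

netsh interface set interface name="Local Area Connection" admin=disabled
rem wait a moment, then bring it back up
netsh interface set interface name="Local Area Connection" admin=enabled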

Gerrit_Lehr
Commander

Yeah, that is exactly the same workaround that I used until I figured out the problem.

Maybe some of these suggestions help:

http://communities.vmware.com/thread/193905

Kind Regards,

Gerrit Lehr

If you found this or other information useful, please consider awarding points for "Correct" or "Helpful".

kathmann
Contributor

Thanks for the hints, we'll be testing switching to the pcnet32 driver very soon. Any experiences with that out there?

I've also been reading a lot of KB articles, and one article (1009103) suggested that the problem may lie in the available number of virtual ports on the vSwitch. The output from my esxcfg-vswitch -l looks like this:

Switch Name Num Ports Used Ports Configured Ports MTU Uplinks

vSwitch0 64 26 64 1500 vmnic5,vmnic4,vmnic8,vmnic9,vmnic6,vmnic7

This is for a 56-port vSwitch...should I worry? I think not, as the number of used ports appears to be 26...

kjb007
Immortal

I would also suggest using the Enhanced VMXNET NIC for your virtual machines. The PCnet32 is the Flexible NIC, which, once the VMware Tools are installed, uses the vmxnet driver. The Enhanced vmxnet NIC type is completely virtualized and needs the VMware Tools driver to work. I've found it to be more reliable.

Also, one other thing to check is the memory usage on the service console. This can also cause problems with networking. If you log into the service console, run 'free -m' when you are having problems. By default, you will have 272M of memory allocated to the service console, and this can get used up and then the service console will swap to disk. You can raise this to 800M max, which would help you get around this problem, if you are running into memory issues.
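As a rough illustration of what to look for (your numbers will obviously differ):

free -m
# check the amount of free memory and the swap line; if the service console
# is using a lot of swap, that points at the memory problem described above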

-KjB

VMware vExpert

kathmann
Contributor

Thanks for the tips, but our SC memory is already at 800 MB and all counters look normal (no memory overloading or excessive swapping), so I'm not completely convinced it's a memory issue...

We're looking into the drivers, but that must be done outside of office hours (yay, overtime! ;-} ).

kjb007
Immortal

One other thing that can cause this is a problem with your disk. This may not seem very intuitive, but I'd run some IOmeter tests on the VM itself and see if you are getting high latency. I've run into scenarios where high disk latency causes ping failures when there is excessive I/O going to that disk. Just another thing to check, and you can do so without having to modify any drivers.
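You can also get a quick read on storage latency from the service console with esxtop, roughly like this (column names from memory, so double-check on your version):

esxtop
# press 'd' to switch to the disk view, then watch DAVG/cmd (device latency)
# and KAVG/cmd (kernel latency); sustained values in the tens of milliseconds
# suggest a storage problem rather than a network one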

-KjB

VMware vExpert

kathmann
Contributor

Thanks, we'll definitely try that in our tests! I'll post the results back here as soon as we have some.

Scissor
Virtuoso

Could it be a duplicate MAC address problem?
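One rough way to check from the service console would be to pull the auto-generated MACs out of the .vmx files and look for duplicates, something along these lines (the paths are just an example):

grep -ih "generatedAddress " /vmfs/volumes/*/*/*.vmx | awk -F'"' '{print $2}' | sort | uniq -d
# any MAC printed here appears in more than one .vmx file
# (manually assigned MACs live in ethernetN.address instead, so check those separately)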

mlubinski
Expert

Hi,

Yeah, we encountered the same issue. It looked similar: the customer reported that his VM did not respond to pings. After logging into the VM I could ping some other VMs (via internal IPs), but after a few minutes this also stopped working. The solution was to disable/enable (or repair) the network interface.

I haven't had many of these issues, so I didn't dig into it. But I think there is some kind of VMware bug in there.

If you found this or any other answer useful please consider the use of the Helpful or Correct buttons to award points.
kathmann
Contributor

Update: we have been testing several possible solutions, such as moving a number of the vNICs to a different vSwitch, but still to no avail.

A support call has now been logged with VMware; I will update here as this progresses.

obsidian009
Contributor

Hi -- did you ever resolve the issue with support? We seem to be having a very similar issue and I was curious if you found a fix.

Thx

kathmann
Contributor

Yes, we figured it out.

It was quite basic really: our vSwitch is connected to several physical NICs and uses the default method of load balancing. One of the physical NICs was not functioning properly, although no errors were thrown in any log anywhere. So, when the server distributed the virtual NICs it also linked some vNICs to the faulty pNIC, with the predictable result of no external connectivity. A reset of the vNIC from within the guest would trigger a link to a different pNIC, which solved the issue for that moment.

So: we fixed the faulty NIC, re-enabled it and hey presto: issue solved!
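For anyone running into the same thing: you can at least list the physical NICs and their link state from the service console, though a NIC can be misbehaving while everything still looks normal there (as it did for us):

esxcfg-nics -l
# lists each vmnic with its driver, link state, speed and duplex;
# anything showing "Down" or an unexpected speed/duplex deserves a closer look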

obsidian009
Contributor

Hm...sounds similar to what we're seeing, but not exactly the same. In your case with a faulty pNIC, I presume this was affecting more than one VM on that vSwitch, right? We keep having a specific guest lose its connection while all other guests on the same vSwitch are fine. We're doing the same thing to fix it when it happens though...disable/re-enable the NIC and it instantly comes back. We've tried reinstalling VMware Tools, removing the vNIC and adding a new one with a different MAC, etc.

I was just working with VMware support last night and we left off by trying to use e1000 instead of vmxnet3. It seems stable for now, but we'll see...
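For reference, the adapter type is just a line in the VM's .vmx file, so you can confirm what a guest is actually using, roughly like this (the adapter index will vary):

ethernet0.virtualDev = "vmxnet3"
# change the value to "e1000" (with the VM powered off) to switch the emulated NIC type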

thx

CCarpenter
Contributor

We had this same issue in our 4.0 environment occasionally. After reading this post and looking into it, we noticed that one of the 4 NICs was not sending back any CDP information, and sure enough, all of the afflicted VMs were on that NIC at the time of failure.

All in all, thanks for your past discoveries. It really helped us out.

aryan14in
Enthusiast

What is the ESX hardware? Are these blades with Flex-10?

CCarpenter
Contributor

We are running Dell M710 blades. What keyed us off to the bad NIC is that one of the NICs was not returning any CDP (Cisco Discovery Protocol) info. You can find this by looking at your distributed virtual switch and clicking the small blue "i" next to the physical NIC.

Yes, we do use Flex addressing as well.

aryan14in
Enthusiast

I am assuming you have Broadcom NICs, and the description below is based on that assumption:

There is a known issue with the Flex-10 1/10Gb module and ESX 3.5 U3~U4. Don't worry about CDP info with Flex-10s. The issue occurs when a specific type of TCP packet (larger than the 1500-byte MTU, i.e. jumbo frames) is passed through the vmnics. This fills the NIC buffer, and the NIC is then unable to acknowledge packets. One way to verify this is to look for "crash" in the /var/log/vmkernel file. I am pretty sure your VM performance will be slow as well, because the vmkernel is busy handling bnx crashes.

Verify with VMware that you have the appropriate Broadcom NIC drivers. This is a Broadcom NIC driver issue.
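A rough way to check both things from the service console (log path and module name may differ a bit between driver versions):

grep -i crash /var/log/vmkernel
# look for bnx2/bnx2x crash messages in the vmkernel log
vmkload_mod -l | grep -i bnx
# confirm which Broadcom module is actually loaded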

CCarpenter
Contributor

Yes, we are running Broadcoms, but we are on 4.0, and as of today the NIC is completely dead. We already put in a request for a new mezzanine card. Plus, I imagine if it were a compatibility issue we would have seen it on one of the other 79 hosts that are all running Broadcoms in the same configuration. I also looked through the /var/log files and did not find any crash errors. Granted, we are not running jumbo frames, so that may be why we are not seeing that issue. The only place I have ever seen jumbo frames is on a storage network, and ours is full fibre at this point running on QLogics. I feel confident that it's just a bad NIC (at least in our case).
