allencrawford
Enthusiast

VMs losing network connection

I'm having a strange issue with one of our ESX 3 farms. Basically, when I create a new VM or migrate an existing VM with VMotion, I get intermittent network connectivity between the guest and the physical network.

My configuration:

Switch Name    Num Ports  Used Ports  Configured Ports  Uplinks
vSwitch0       32         5           32                vmnic2,vmnic0

  PortGroup Name   Internal ID  VLAN ID  Used Ports  Uplinks
  Service Console  portgroup0   400      1           vmnic0,vmnic2
  VMkernel         portgroup2   401      1           vmnic0,vmnic2

Switch Name    Num Ports  Used Ports  Configured Ports  Uplinks
vSwitch1       64         6           64                vmnic3,vmnic1

  PortGroup Name   Internal ID  VLAN ID  Used Ports  Uplinks
  VLAN-409         portgroup4   409      0           vmnic1,vmnic3
  VLAN-408         portgroup3   408      0           vmnic1,vmnic3
  VLAN-410         portgroup5   410      3           vmnic1,vmnic3

So, for example, I'll have a VM running on VLAN 410 on host A and it is working fine, but as soon as I migrate it to host B, it is no longer on the network. On one occasion I kept pinging it and didn't touch anything, and eventually it started working again. On other occasions, it never seems to come back on its own. Right now I have one host with three VMs on it, all on VLAN 410. Two of them are working fine, but the third is off the network. Sometimes I can ping the guest's default gateway and the first ping attempt will fail, but subsequent pings will start responding, and then the VM will be back online. Other times I've had to disable/re-enable the NIC within the guest OS (Windows in this case) before it starts responding on the network.

All Virtual Machine port groups are configured the same across the entire DRS cluster (only 4 hosts in this one right now) and all physical switches are configured as identical trunks (I had the network team double-check this for me). We have another cluster in our non-prod environment and it has no such problems. The only difference is that the hosts in that cluster are currently running ESX 3.0.1 while this new cluster is 3.0.2.

Has anyone else run into this problem? Any suggestions? Are there certain dos and don'ts for the physical switches that our network team needs to consider?

We are using HP c-class blade servers (BL465c G1) with the Cisco 3020 switches for the hardware in this environment.

19 Replies

esiebert7625
Immortal

In general it's best to configure gigabit NICs for auto/auto on both the ESX server and the physical switch port. Was this a new install of 3.0.2 or an upgrade from a previous version? You might try uninstalling and reinstalling VMware Tools on the VMs. The doc below has some good network troubleshooting tips in it.

Networking Scenarios & Troubleshooting - http://download3.vmware.com/vmworld/2006/tac9689-b.pdf
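If you want to double-check the speed/duplex from the ESX service console, something like this should do it (a quick sketch; the vmnic names are just examples taken from the config above):

# list each physical NIC with its driver, link state, speed and duplex
esxcfg-nics -l

# return vmnic1 to auto-negotiation if it was forced to a fixed speed
esxcfg-nics -a vmnic1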

admin
Immortal

Hello,

Have you checked the logs on your physical switch?

This often happens when a trunk is in error.
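On the Cisco side, something along these lines would show recent errors and the trunk state (a sketch; the interface name is assumed):

show logging                               ! recent switch log messages
show interfaces GigabitEthernet0/1 trunk   ! trunking mode, encapsulation and active VLANs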

allencrawford
Enthusiast

This is a fresh install of 3.0.2. The VM I'm using to test also has a fresh install of VMware Tools (and it was a brand new OS).

allencrawford
Enthusiast

I had the network guy check things out and he did not report any errors. Not to mention, other VMs work through the same virtual switch/trunk.

admin
Immortal

In that case, check the trunk configuration. The load-balancing algorithm on the virtual switch must match what the physical switch expects.
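For instance, if the vSwitch were set to "Route based on IP hash", the physical ports would have to be bundled into a matching static EtherChannel, roughly like this (an illustrative sketch only; the interface and channel numbers are assumptions, not from this thread):

interface Port-channel1
 switchport trunk encapsulation dot1q
 switchport mode trunk
!
interface GigabitEthernet0/1
 ! IP-hash teaming needs a static channel ("mode on"), not PAgP/LACP negotiation
 channel-group 1 mode on

With the default ESX policy, "Route based on originating virtual port ID", no channel configuration is needed on the switch.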

allencrawford
Enthusiast

This is how the physical switch is configured. What should I be checking on the virtual switch?

interface GigabitEthernet0/1
 switchport trunk encapsulation dot1q
 switchport trunk allowed vlan 400-410
 switchport mode trunk
 switchport nonegotiate
 speed 1000
 spanning-tree portfast
 spanning-tree bpduguard enable
end

esiebert7625
Immortal

These two guides have a lot of info on setting up your physical switches...

VMware ESX Server 3 802.1Q VLAN Solutions - http://www.vmware.com/pdf/esx3_vlan_wp.pdf

ESX Server, NIC Teaming and VLAN Trunking - http://blog.scottlowe.org/2006/12/04/esx-server-nic-teaming-and-vlan-trunking/

Also this...

http://virtrix.blogspot.com/2006/11/vmware-switch-load-balancing.html

http://www.cisco.com/univercd/cc/td/doc/solution/vmware.pdf

admin
Immortal

What is your negotiation method on the ESX host?

allencrawford
Enthusiast

Currently "auto", just like my Dev environment where this problem does not exist.

admin
Immortal

Change that setting to "Route based on source MAC hash", which works with the default negotiation behavior of a Cisco switch.

n00dles
Enthusiast

Try ticking the "Notify Switches" box in the actual port group settings. Even if it's already set to Yes at the vSwitch level and should therefore flow down, I just came across the same problem and it was all I could think of. (Notify Switches makes the host send out RARP frames after a VMotion so the physical switches can update their MAC tables.) The problem seems to be fixed for me now, although it doesn't happen 100% of the time so it's hard to say whether it made any difference. It only appears to happen when you VMotion something onto a host that has no other VMs running on it, and hasn't had anything running on it since it booted. That is, the state an ESX box is in immediately after patching :)

Give it a try and if that solves it for you (or anyone else reading this), log a bug report with VMware so we can get it looked at - the more bug reports, the better :)

Paul_B1
Hot Shot

We had/have this same problem!

VMware support indicated that it's a problem with how their driver talks to the physical NICs on the server. The only fix is to upgrade to 3.0.2. We have only seen this on one of our servers (we have 12 of the same model/NIC etc.).

We have found that when it loses connection, just VMotioning it to another box in the cluster re-enables the vNIC and it can talk fine.

allencrawford
Enthusiast

Well, it happens to me regardless of whether the destination host has any VMs on it. And it actually happens sometimes when I first turn the VM on. However, I have three port groups on my virtual switch, so maybe it only happens when the destination host *and* port group are empty. Will have to play with that. I will also try ticking the "Notify Switches" option and selecting "Yes" at the port group level and post back if I have any results. Though if it works, I'm going to be frustrated, because we didn't have this problem when creating our first dev environment, which has identical hardware and configurations. That, or maybe we just added VMs to it too fast and I didn't notice...

allencrawford
Enthusiast

Is your configuration similar to mine? I.e., are you (and n00dles, for that matter) using VLAN tagging with your physical NICs trunked to the physical switches?

Also, it is not fixed in 3.0.2. That's what I was running when I came across this problem. I have since downgraded to 3.0.1 at the same patch level as the environment where I don't have the problem, and it still exists.

allencrawford
Enthusiast

This did NOT work. I ticked the box on all three port groups as well as the VMkernel. In addition, my hosts have not just been rebooted or anything and have other VMs on them, so this seems to be a different issue.

n00dles
Enthusiast

Thanks for trying it out, Allen. Based on that I'll probably log a case with VMware; will report back if it gets resolved!

murreyaw
Enthusiast

Sounds like a MAC table issue, or possibly a NIC firmware issue.
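If it is the MAC table, the switch side can be checked with something like this (a sketch; the MAC address below is a placeholder, not from this thread):

show mac-address-table address 0050.56ab.cdef   ! where the switch last learned the VM's MAC
clear mac-address-table dynamic                 ! flush stale dynamic entries if it points at the wrong port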

Vandalay
Enthusiast

Had a similar problem with HP P-class blades. Look at the thread here...

http://www.vmware.com/community/thread.jspa?threadID=98740

There is a known bug with the NICs in the p-class blades, may apply to c-class too. Upgrading to 3.0.2 is the only solution.

Also, looking at your port config, shouldn't it be:

spanning-tree portfast trunk

(Plain portfast is ignored on a port in trunk mode unless the trunk keyword is added.)

allencrawford
Enthusiast

Sorry for the delay on this, but it turns out the problem was some misconfigured VLANs on one of our distribution switches, not on the blade switches. Basically, the VLAN was disabled (shut down) because whoever created it never enabled it. So it was enabled on one distribution switch and disabled on another, which explains the randomness of this problem. Thanks to all who tried to help.
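For anyone who hits something similar, the VLAN state can be verified on each switch with something like this (a sketch; VLAN 410 is taken from the config earlier in the thread, and VLAN shutdown support varies by platform):

show vlan id 410   ! status should read "active", not "act/lshut" or "suspended"
!
configure terminal
 vlan 410
  no shutdown      ! re-enable the VLAN if it was left shut down
end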
