I'm having a strange issue with one of our ESX 3 farms. Basically, when creating new VMs or migrating an already-built VM with VMotion, I'm getting intermittent network connectivity between the guest and the physical network.
My configuration:
Switch Name    Num Ports  Used Ports  Configured Ports  Uplinks
vSwitch0       32         5           32                vmnic2,vmnic0

  PortGroup Name    Internal ID  VLAN ID  Used Ports  Uplinks
  Service Console   portgroup0   400      1           vmnic0,vmnic2
  VMkernel          portgroup2   401      1           vmnic0,vmnic2

Switch Name    Num Ports  Used Ports  Configured Ports  Uplinks
vSwitch1       64         6           64                vmnic3,vmnic1

  PortGroup Name    Internal ID  VLAN ID  Used Ports  Uplinks
  VLAN-409          portgroup4   409      0           vmnic1,vmnic3
  VLAN-408          portgroup3   408      0           vmnic1,vmnic3
  VLAN-410          portgroup5   410      3           vmnic1,vmnic3
So, for example, I'll have a VM running on VLAN 410 on host A and it is working fine, but as soon as I migrate it to host B, it is no longer on the network. On one occasion I kept pinging it, didn't touch anything, and eventually it started working again. On other occasions, it never seems to come back on its own. Right now I have one host with three VMs on it, all on VLAN 410. Two of them are working fine, but the third is off the network.

Sometimes I can ping the guest's default gateway and the first ping attempt will fail, but subsequent pings will start responding, and then the VM will be back online. Other times I've had to disable/re-enable the NIC within the guest OS (Windows in this case) before it starts responding on the network.

All Virtual Machine port groups are configured the same across the entire DRS cluster (only 4 hosts on this one right now), and all physical switch ports are configured as identical trunks (I had the network team double-check this for me). We have another cluster in our non-prod environment that has no such problems. The only difference is that the hosts in that cluster are currently running ESX 3.0.1, while this new cluster is on 3.0.2.
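As a stopgap, the disable/re-enable step can be done from a command prompt inside the Windows guest instead of clicking through the GUI. A sketch, assuming the guest's connection is named "Local Area Connection" (check the actual name with netsh interface show interface):

netsh interface set interface "Local Area Connection" admin=disable
netsh interface set interface "Local Area Connection" admin=enable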
Has anyone else run into this problem? Any suggestions? Are there certain dos and don'ts for the physical switches that our network team needs to consider?
We are using HP c-class blade servers (BL465c G1) with the Cisco 3020 switches for the hardware in this environment.
In general it's best to configure gigabit NICs for auto/auto on both the ESX server and the physical switch port. Was this a new install of 3.0.2 or an upgrade from a previous version? You might also try uninstalling and reinstalling VMware Tools on the VMs. The doc below has some good network troubleshooting tips in it.
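For reference, the current negotiation settings can be checked (and reset) from the ESX service console; a sketch using the standard esxcfg tools, where vmnic2 is just an example name:

esxcfg-nics -l           (lists each vmnic with its driver, link state, and configured speed/duplex)
esxcfg-nics -a vmnic2    (puts a single vmnic back to auto-negotiate)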
Networking Scenarios & Troubleshooting - http://download3.vmware.com/vmworld/2006/tac9689-b.pdf
Hello,
Have you checked the logs on your physical switch?
This often happens when a trunk is in an error state.
This is a fresh install of 3.0.2. The VM I'm using to test also has a fresh install of VMware Tools (and it was a brand new OS).
I had the network guy check things out and he did not report any errors. Not to mention, other VMs work through the same virtual switch/trunk.
In that case, check the trunk configuration. The load-balancing algorithm on the virtual switch must match the one on the physical switch.
This is how the physical switch is configured. What should I be checking on the virtual switch?
interface GigabitEthernet0/1
switchport trunk encapsulation dot1q
switchport trunk allowed vlan 400-410
switchport mode trunk
switchport nonegotiate
speed 1000
spanning-tree portfast
spanning-tree bpduguard enable
end
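On the ESX side, the equivalent information (port groups, VLAN IDs, uplinks) can be dumped from the service console; a sketch using the standard esxcfg tools:

esxcfg-vswitch -l    (lists every vSwitch with its port groups, VLAN IDs, and uplink vmnics)

The load-balancing policy itself isn't in that output; it's set per vSwitch (or overridden per port group) in the VI Client under Properties > NIC Teaming.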
These two guides have a lot of info on setting up your physical switches...
VMware ESX Server 3 802.1Q VLAN Solutions - http://www.vmware.com/pdf/esx3_vlan_wp.pdf
ESX Server, NIC Teaming and VLAN Trunking - http://blog.scottlowe.org/2006/12/04/esx-server-nic-teaming-and-vlan-trunking/
Also this...
http://virtrix.blogspot.com/2006/11/vmware-switch-load-balancing.html
http://www.cisco.com/univercd/cc/td/doc/solution/vmware.pdf
What is your negotiation method on the ESX host?
Currently "auto", just like my Dev environment where this problem does not exist.
Change this setting to "Route based on source MAC hash" to match the default negotiation protocol on the Cisco switch.
Try ticking the 'Notify Switches' box in the actual port group settings. Even if it's already set to yes at the vSwitch level and therefore should flow down, I just came across the same problem and it was all I could think of. The problem seems to be fixed for me now, although it doesn't happen 100% of the time, so it's hard to say whether the change made any difference. It only appears to happen when you VMotion something onto a host that has no other VMs running on it and hasn't had anything running on it since it booted. That is, the state an ESX box is in immediately after patching.
Give it a try, and if that solves it for you (or anyone else reading this), log a bug report with VMware so we can get it looked at. The more bug reports, the better.
We had/have this same problem!
VMware support indicated that it's a problem with how their driver is talking to the physical NICs on the server. The only fix is to upgrade to 3.0.2. We have only seen this on one of our servers (we have 12 with the same model, NICs, etc.).
We have found that when it loses connection, just VMotioning it to another box in the cluster re-enables the vNIC and it can talk fine.
Well, it happens to me regardless of whether the destination host has any VMs on it. And it actually happens sometimes when I first power the VM on. However, I have three port groups on my virtual switch, so maybe it only happens when the destination host *and* port group are empty. Will have to play with that. I will also try ticking the "Notify Switches" option and selecting "Yes" at the port group level, and post back if I have any results. Though if it works, I'm going to be frustrated, because we didn't have this problem when creating our first dev environment, which has identical hardware and configuration. That, or maybe we just added VMs to it too fast and I didn't notice...
Is your configuration similar to mine? I.e., are you (and n00dles, for that matter) using VLAN tagging with your physical NICs trunked to the physical switches?
Also, it is not fixed in 3.0.2. That's what I was running when I came across this problem. I have since downgraded to 3.0.1 at the same patch level as the environment where I don't have the problem, and it still exists.
This did NOT work. I ticked the box on all three port groups as well as on the VMkernel port group. In addition, my hosts have not just been rebooted and they have other VMs on them, so this seems to be a different issue.
Thanks for trying it out, Allen. Based on that, I'll probably log a case with VMware; will report back if it gets resolved!
Sounds like a MAC table issue, or possibly a NIC firmware issue.
Had a similar problem with HP P-class blades. Look at the thread here...
http://www.vmware.com/community/thread.jspa?threadID=98740
There is a known bug with the NICs in the P-class blades, which may apply to the c-class too. Upgrading to 3.0.2 is the only solution.
Also, looking at your port config, shouldn't it be:
spanning-tree portfast trunk
Plain "spanning-tree portfast" has no effect on a port in trunk mode; the "trunk" keyword is required to enable PortFast on trunk ports.
Sorry for the delay on this, but it turns out that the problem was some misconfigured VLANs on one of our distribution switches, not on the blade switches. Basically, the VLAN was disabled (shutdown) because whoever created it never enabled it. So it was enabled on one distribution switch and disabled on another, which explains the randomness of this problem. Thanks to all who tried to help.