Dr_Virt
Enthusiast

Strange new issue

Environment: Mixed - ESX 3.5 and vSphere 4

Management: Mixed - VirtualCenter 2.5 and vCenter 4

Issue: We have recently begun to see a network issue within our ESX hosts. While patching one of my clusters (3 hosts), I placed a host in maintenance mode and verified that DRS moved the guests. I then began receiving alerts that 4 of the 14 migrated guests were unavailable. I opened the console on these VMs and everything appeared to be working correctly from inside the guest, but pinging the guests from outside got no response.

Resolution: I have found that simply disabling the vNIC and re-enabling it resolves the issue.
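
In case anyone wants to script that workaround instead of clicking through the UI, here is a rough pyVmomi sketch of toggling a VM's vNICs from vCenter. It's an illustration only, not a tested tool; the vCenter address, credentials, and the VM name "affected-vm" are all placeholders.

    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVim.task import WaitForTask
    from pyVmomi import vim

    # Placeholder connection details -- substitute your own vCenter and credentials.
    si = SmartConnect(host="vcenter.example.com", user="administrator", pwd="secret",
                      sslContext=ssl._create_unverified_context())

    def find_vm(content, name):
        # Walk the inventory for a VM by display name.
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.VirtualMachine], True)
        try:
            return next(vm for vm in view.view if vm.name == name)
        finally:
            view.Destroy()

    def set_nics_connected(vm, connected):
        # Edit every virtual NIC's connectable state in one reconfigure task.
        changes = []
        for dev in vm.config.hardware.device:
            if isinstance(dev, vim.vm.device.VirtualEthernetCard):
                dev.connectable.connected = connected
                spec = vim.vm.device.VirtualDeviceSpec()
                spec.operation = vim.vm.device.VirtualDeviceSpec.Operation.edit
                spec.device = dev
                changes.append(spec)
        WaitForTask(vm.ReconfigVM_Task(vim.vm.ConfigSpec(deviceChange=changes)))

    vm = find_vm(si.RetrieveContent(), "affected-vm")  # hypothetical VM name
    set_nics_connected(vm, False)  # same effect as unticking "Connected" on the vNIC
    set_nics_connected(vm, True)   # ...and re-enabling it
    Disconnect(si)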

Has anyone seen an issue like this, where vMotion causes the vNIC to fail? I have verified that the vSwitches are set to notify switches and that PortFast is properly configured on the physical switch ports. Any ideas? This is causing some here to lose faith in vMotion's benefits.
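
For reference, this is roughly how the Notify Switches setting can be checked across all hosts at once (a pyVmomi sketch, assuming the "si" connection from the earlier snippet; the PortFast side still has to be verified on the physical switch itself):

    from pyVmomi import vim

    # Assumes an existing ServiceInstance "si" (see the earlier connection sketch).
    content = si.RetrieveContent()
    hosts = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)

    for host in hosts.view:
        for vsw in host.config.network.vswitch:
            teaming = vsw.spec.policy.nicTeaming
            print(host.name, vsw.name, "notifySwitches =", teaming.notifySwitches)
        for pg in host.config.network.portgroup:
            # A port group with no teaming override inherits the vSwitch setting.
            teaming = pg.spec.policy.nicTeaming
            override = (teaming.notifySwitches
                        if teaming and teaming.notifySwitches is not None
                        else "(inherited)")
            print(host.name, pg.spec.name, "notifySwitches =", override)
    hosts.Destroy()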

3 Replies
RParker
Immortal

If the vSwitches and port groups are not EXACTLY the same name/configuration on ALL the hosts, you may get this issue. You normally would not have to check this, but I have seen it happen when there is a slight difference in the settings between the switches.

And since you are updating: if you move the VMs to a host that ISN'T patched yet, the patch is probably what corrects this error, so once everything is patched you should not see this anymore. A quick way to rule out naming mismatches is sketched below.
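
Something like this pyVmomi sketch can diff the port group names across all hosts in a cluster (again assuming the "si" connection from above; "Prod-Cluster" is a made-up name):

    from pyVmomi import vim

    # Assumes an existing ServiceInstance "si"; "Prod-Cluster" is a placeholder.
    content = si.RetrieveContent()
    clusters = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    cluster = next(c for c in clusters.view if c.name == "Prod-Cluster")
    clusters.Destroy()

    # Map each host to the set of port group names it carries.
    names = {h.name: {pg.spec.name for pg in h.config.network.portgroup}
             for h in cluster.host}
    common = set.intersection(*names.values())
    for host, pgs in names.items():
        extra = pgs - common
        if extra:
            print(host, "has port groups the others lack:", sorted(extra))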

danm66
Expert

Make sure the Notify Switches option is set to Yes on your VM port groups, not just on the vSwitch.
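
If you need to force that in bulk, a rough (untested) pyVmomi sketch, assuming a vim.HostSystem object "host" already retrieved as in the earlier snippets:

    from pyVmomi import vim

    # Assumes "host" is a vim.HostSystem already in hand (see earlier sketches).
    net_sys = host.configManager.networkSystem
    for pg in host.config.network.portgroup:
        spec = pg.spec
        if spec.policy.nicTeaming is None:
            spec.policy.nicTeaming = vim.host.NetworkPolicy.NicTeamingPolicy()
        # Set an explicit "yes" rather than relying on vSwitch inheritance.
        spec.policy.nicTeaming.notifySwitches = True
        net_sys.UpdatePortGroup(pgName=spec.name, portgrp=spec)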

hicksj
Virtuoso

We have seen the same issue while patching ESX 3.5 U3 to 3.5 U5.

As we bring down each host after the updates, we're moving the hosts to a new switching environment (which is tied to our legacy core network). Both during maintenance mode migrations and standard DRS migrations, we're periodically seeing VMs go offline. The weird thing is, the only VMs that ever get disconnected are those whose VLAN is non-routed at the core switch (Cisco VSS configuration). They all route through a physical firewall.

Generally, migrating to an alternate host restores connectivity. And then, we can turn around and migrate the VM back to the host on which it originally failed... and it works just fine.

Can you share additional upgrade and infrastructure details? I wonder if we're in similar situations here. We completed all host patching and migrations last week but still see a few firewalled systems drop off. Yesterday I moved the firewalls over to the new infrastructure, and last night we still had one VM fail after DRS migrated it. I had hoped it was something funky where the switch notifications weren't flowing across the LACP link between the old and new cores, but that can't be the problem now.
