VMware Cloud Community
Digitalman
Contributor

network "dvswitch" not accessable on shared cluster and dvSwitch

Hello!

I have a cluster of 6 hosts running ESXi 5.5, build 1331820. All are joined to a v5.5 dvSwitch named "dvSwitch-Nexus".

When I attempt to vMotion a VM from one host to another, I receive the message "Currently connected network interface 'Network adapter 1' uses network 'dvSwitch-Nexus', which is not accessible."

When I searched on this error, the results I found all referenced migrations between different dvSwitches. In this case, however, the hosts and the VM in question are all attached to the same dvSwitch.

Now, I have vCenter Operations running, and it's reporting that my hosts have lost network redundancy. But all of them have dual NICs online and operational. I've taken each NIC out of the active order individually, triggered the alert on the host itself, then re-added it to make it go away. But the vCOPS error remains. Not sure if this is related.
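
For reference, here's roughly how I've been checking the physical NICs and the host's view of the dvSwitch from the ESXi shell on each host (just standard esxcli, nothing exotic):

esxcli network nic list
esxcli network vswitch dvs vmware list

The first shows both vmnics Up, and the second shows dvSwitch-Nexus with both uplinks attached, which is why the redundancy alarm has me puzzled.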

The only other thing I can think of that might be related to this issue is the LAG I originally set up. My hosts are Cisco UCS B200 M3s, and after some reading and consultation with our Cisco engineer, we removed the LAGs and migrated the interfaces back to standard uplinks. The vCOPS error seems to have begun soon after that migration, once the LAG was removed. But even with two active uplinks and everything looking visible, I still can't vMotion across my dvSwitch.
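
In case it's useful, the host-side view after the LAG removal looks sane to me; for example:

esxcfg-vswitch -l

at least shows dvSwitch-Nexus with both vmnics assigned to DVUplink ports on every host (port numbers differ per host, of course).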

My next step would be to remove the hosts from the dvSwitch and re-add them, but I have VMs running on all of the hosts, and because I can't vMotion them off, removing the hosts will be difficult. I can create standard portgroups, migrate the VMs to those, and move the uplinks over to standard switches, but before I start down that path I thought I'd consult you all in case there's something I missed.
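
If I do go down that path, my rough plan from the ESXi shell on each host would be something along these lines (vSwitch1, VM-Temp, the VLAN ID, and vmnic1 are just placeholders; I'd move one uplink at a time so the host stays reachable):

esxcli network vswitch standard add --vswitch-name=vSwitch1
esxcli network vswitch standard uplink add --uplink-name=vmnic1 --vswitch-name=vSwitch1
esxcli network vswitch standard portgroup add --portgroup-name=VM-Temp --vswitch-name=vSwitch1
esxcli network vswitch standard portgroup set --portgroup-name=VM-Temp --vlan-id=<VM VLAN>

with the actual VM network adapter reassignment done afterwards in the vSphere Client, since esxcli doesn't touch the VM side.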

Any help or suggestions would be great, and I'm happy to answer any further questions. Many thanks!

grasshopper
Virtuoso

I've seen this one before.  I don't recall how I fixed it, but I'm sure we can muscle through it.  To start with, you can confirm that your DVS UUID does in fact match that of the VM in question by performing the following:

Get the UUID of the DVS (1000v):

vemcmd show card | grep uuid

From an example VM, cd into its directory and perform the following:

cat ./vmware*.log | grep dvs.switchId

You should see that the UUIDs match as suspected.  When I observed this issue, mine matched as well.  BTW, I like the way you think on removing and re-adding the hosts (and the final kill move of going VSS temporarily).  You might also consider setting DRS to manual to avoid the errors for now.
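
As a quick cross-check while you're in the VM's directory, the same switchId should also be sitting in the .vmx for any dvSwitch-attached vNIC (ethernet0, ethernet1, etc.):

~ # grep -i dvs.switchId *.vmx

If that disagrees with what the live vmware*.log shows, that's worth noting before you go any further.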

Anyway, next step would be to confirm the health of the VEM to VSM communication.  You can do this by performing the following:

1.  Get the Primary VSM's MAC address from your network guy (or run 'vemcmd show card | grep MAC' on an example VMHost)
2.  Run 'vem-health check <VSM MAC Address>' on each host in the cluster and review the output.  If there are communication issues on the control VLAN, for example, it will tell you.  A good response is "The VEM-VSM connectivity seems to be fine"; a bad response is "VSM heartbeats are not reaching the VEM", as shown below:

~ # vemcmd show card | grep 'Primary VSM MAC'

Primary VSM MAC : 00:50:56:bc:00:19

~ #

~ # vem-health check 00:50:56:bc:00:19
VSM Control MAC address: 00:50:56:bc:00:19
Control VLAN: 91
DPA MAC: 00:02:3d:40:01:03

VSM heartbeats are not reaching the VEM.
More than one uplink cannot be in the same VLAN without forming a port channel (PC).
Recommended action:
Make sure that your uplink configuration is correct.
If your uplinks are using a PC profile, check if the VEM's
upstream switch has learned the VSM's Control MAC.
~ #

As always, you should grab a vm-support and a 'vem-support all' in case you need to dig deeper.  Another option you could consider is to simply initiate a VSM failover and see if things clear up.  Make sure you have a CDP report handy and that your network folks have confirmed upstream is clean.
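
Rough order I'd grab things in, so the evidence is on hand before any failover (bundle names and paths vary by build):

~ # vm-support
~ # vem-support all
~ # vemcmd show port vlans

The last one is just a quick look at which VLANs the VEM thinks each uplink is carrying, which is handy if the health check complains about the control VLAN.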

Digitalman
Contributor

Grasshopper,

First off, many thanks for your informative reply. I greatly appreciate it!

I got impatient a few days ago (before I saw your reply) and removed the hosts from the dvSwitch, migrating everything off to standard switches. Everything works fine now, and I can rule out actual network connectivity, since the vNICs I presented in Cisco UCS were constructed from the same template, standard and distributed switch alike.

One thing a coworker suggested, though, is that what happened was perhaps due to a mixup we had early on. We're new to running VMware in a Cisco UCS environment, and we'd originally configured our dvSwitch uplinks to use a LAG. However, after seeing varying degrees of packet loss, and some discussion with our Cisco engineer, we migrated off the LAG and back onto standard uplinks (The KB article, for the curious).

Now, I migrated off the LAG the same way I migrated onto it: moved the standard uplinks into standby, unassigned/reassigned the uplinks from the LAG to the standard ports, and then removed the LAG from the group. My theory is that something may have been left behind (even though we deleted the LAG), and perhaps that is the cause of the reported loss of connectivity.

The good news is that I can test this theory now that all of our VMs are sitting on standard switches, so I'm going to do that over the next couple of days. I'll recreate the dvSwitch, add a test VM, move it onto a LAG, test, then move it back off, and see if I can recreate the error. If so, then we'll know something's up with getting off a LAG, and I'll perform the vem-health check you suggested.
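
When I run that test, I plan to capture the host-side view before and after each step (LAG on, LAG off) with a couple of read-only vemcmd checks, so if the error comes back I can see whether anything lingers on the VEM:

vemcmd show card
vemcmd show port

(I'm mainly interested in the uplink ports' state and port channel membership; exact output will obviously vary per host.)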

Thanks for your help, and I'll keep you all posted. 😃

~Chris
