Contributor

HA disconnected VM network?

Over the weekend, we had a serious power-related problem in our data center, which caused a few of our ESX servers to shut down. Fortunately, HA kicked in, brought all those VMs over to the unaffected ESX servers, and started them back up. Unfortunately, it disabled the network on those VMs for some reason. Specifically, we noticed that after the VMs were started, our monitoring software was still showing them as offline. Upon investigation, we found that every VM that had been HA'd to a different host had its network adapter disconnected (in the VM settings, when you select the network adapter, the top box saying "Connected" was unchecked). For a few VMs, that wouldn't be a huge problem, but with the 70-80 VMs that failed over this weekend, it became a huge ordeal to figure out which ones were working and which weren't...

Does anyone have any idea at all about how that checkbox was unchecked, and how to prevent that from happening in the future?

Commander

Can you place two of the hosts in maintenance mode and move all of the VMs off, then move these hosts to a new cluster and create a standard vSwitch on them? Then test HA on these hosts. When you are done testing, move the hosts back to the original cluster and add the vDS back to them.

Contributor

:) You read my mind... I've just started doing that. Thanks anyway.

Mick

Commander

Let us know what the results are.

Contributor

Hi,

Just a bit of an update.......

I've removed 6 hosts and moved them to their own cluster. On each of those I configured standard switches, performed the same tests, and found that the problem disappears: the VMs return happily with network connectivity. I then created a new vDS for this cluster and migrated my hosts over to it. After repeating the tests (minus the Nexus), the same problem still occurs, and the same process brings back network connectivity: reselect the portgroup and recheck the "Connected" box.

You may have already tried this and discovered the same but I thought I'd just let you know.

Mick

Contributor

I'm afraid I've been overwhelmed with separate issues and haven't had the time to test our environment. However, I was contacted via private message by someone from VMware asking if I had submitted an SR, and if so, what the number was, so she could consolidate them. I was told that they have reproduced the error, and wanted to know if I'd be willing to test a patch if/when they come up with one. Since we only have 5 hosts with the problem (at the moment), we don't have extra ones to use for testing, so I declined, after hearing that they have others who are willing to test. That was sometime late last week, so hopefully we'll have a patch soon.

Contributor

Well, it's good to hear there is progress. I was contacted as well for all the same details, except there was no offer of a test patch. It probably would have been good for us, as at the moment we have the luxury of a non-production system for this purpose. Oh well, I guess we'll just wait and see.

Thanks for the update.

Enthusiast

I'm experiencing exactly the same issue, in a production environment though, so it's a lot of pain, especially since vCenter is a VM: when it goes down I need to do some magic and recreate a standard switch just so vCenter starts and I can manage the DVS again. We are using DVS with ESXi 4, a fresh environment, hardware version 7, vmxnet3, latest patches. As mentioned, the issue is related to DVS; if we switch to a standard switch everything works, but I would really like to avoid using a standard switch. I have a pending SR, no answer yet.

Contributor

Hey Bisti,

Sit tight on this. We've been told the fix will appear in vCenter Update 1, which from what I've been told is due "about now"; what I've read says the 19th of Nov, which is today. I'm in Aus, so I'm waiting for the US to come online and hopefully put it up for download. Fingers crossed.

Contributor

I'm having a nearly identical issue and am wondering if any progress was made on your problem. We too had some failures of hosts in our production environment and had the downed guests come up on other hosts (through DRS) with their networks disconnected. We however are not able to connect them with the drop box trick. We are forced to disconnect or shut down other VMs in order to connect those that exhibit the problem, or of course we can migrate the disconnected vms to other esx hosts that are not as heavily loaded and connect them there. Problem is that we typically get around 25 or so guests on a host before we see the problem, and if we have more than two hosts fail our average number of guests per host would exceed 25 - meaning we'd have to have some guests down until a host was restored.

The limit of roughly 25 guests on an ESX server makes little sense to us, as we're not even close to loading up the resources on each physical server (4x quad-core boxes with 128 GB of RAM). As a test I took some hardware we have in a disaster recovery datacenter and tried reproducing the problem. There I can load an ESX host to 59 VMs before I see the problem occur on the 60th one that is powered up, and I have to power another one down or disconnect it to get any more to connect. Interestingly, this odd limit still exists even though these are beefy boxes that are nowhere near capacity on CPU/RAM (well under half capacity according to vCenter).

We are not using DVS, as we just upgraded to vSphere 4 U1 from 3.5 and still have our switch configuration from that. Whether VMware Tools is running, and which version, did not affect my test environment; it seems to disconnect the 60th VM no matter what. All VMs in my test environment are hardware version 7 because I just created them, though some in my production environment are version 4. My test environment has two physical uplinks, and my production has three.

Commander

How many ports are your vSwitches configured to have? You may be overloading them, in which case you need to increase the number of ports.

Contributor

OMG, I bet that's it. We never saw the problem before the upgrade because the limit used to be 56 ports, and now it is 24 after upgrading from CD. Over in the DR environment, the limit was still 56 because we upgraded with Update Manager and it kept the settings. I had no idea that setting was even there. Thanks for the help.

VMware Employee

It's a commonly made mistake. You will need to configure your vSwitches for the worst-case scenario!
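As a rough illustration of what sizing for the worst case could look like, here is a minimal Python sketch. It assumes the surviving hosts share the restarted VMs evenly (HA does not guarantee that), and the function names, the `other_ports` parameter (for anything else on the vSwitch that takes a port), and the example numbers are all hypothetical rather than taken from this thread.

```python
import math


def worst_case_ports(vm_nics, hosts, failed_hosts, other_ports=0):
    """Estimate the ports one surviving host's vSwitch needs after failures.

    vm_nics      -- total VM network adapters in the cluster on this network
    hosts        -- number of hosts in the cluster
    failed_hosts -- number of hosts assumed to fail at the same time
    other_ports  -- assumed extra ports used by non-VM connections
    """
    surviving = hosts - failed_hosts
    if surviving <= 0:
        raise ValueError("at least one host must survive")
    # Assumption: restarted VMs spread evenly, so round the share up.
    return math.ceil(vm_nics / surviving) + other_ports


def has_enough_ports(configured_ports, vm_nics, hosts, failed_hosts, other_ports=0):
    """Return True if the configured port count covers the worst case."""
    needed = worst_case_ports(vm_nics, hosts, failed_hosts, other_ports)
    return configured_ports >= needed


if __name__ == "__main__":
    # Hypothetical cluster: 150 VM NICs, 8 hosts, 2 hosts lost.
    print(worst_case_ports(vm_nics=150, hosts=8, failed_hosts=2))
    print(has_enough_ports(24, vm_nics=150, hosts=8, failed_hosts=2))
    print(has_enough_ports(56, vm_nics=150, hosts=8, failed_hosts=2))
```

With those made-up numbers, each of the six surviving hosts would need at least 25 ports, so a 24-port vSwitch would leave a VM disconnected while a 56-port one would not. Treat the result as a lower bound and add headroom, since a real failover is rarely this even.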



Duncan

VMware Communities User Moderator | VCP | VCDX

-


Now available: Paper - vSphere 4.0 Quick Start Guide (via amazon.com): http://www.amazon.com/gp/product/1439263450?ie=UTF8&tag=yellowbricks-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=1439263450 | PDF (via lulu.com): http://www.lulu.com/product/download/vsphere-40-quick-start-guide/6169778

Blogging: http://www.yellow-bricks.com | Twitter: http://www.twitter.com/DuncanYB

Contributor

Hi Chamon,

Thank you. Your answer fixed the problem on our infrastructure.

Regards,

Regis

Enthusiast

Hey

This is much the same as what happened in our environment.

We had a power outage across the whole datacenter at one of our sites, and HA disconnected the VMs' network adapters after everything was powered back on.

The "Connected" checkbox in the adapter settings was unchecked.

What was the root cause that leads to the "Connected" box being unchecked in the adapter settings on all VMs?

We are using a standard vSwitch on 2 aggregated 10 Gb vmnics. We are on vSphere 5.

Enthusiast

Make sure the number of available ports on the vSwitch has not been changed from the default; if it has, change it back to 120.
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=100910...

some1 faced a similar problem here
http://communities.vmware.com/thread/307047

---------------------- Gajendra D Ambi [pardon my chat lingo]
Enthusiast

It's at the default of 120 ports.
