VMware Cloud Community
omatsei1
Contributor

HA disconnected VM network?

Over the weekend, we had a serious power-related problem in our data center, which caused a few of our ESX servers to shut down. Fortunately, HA kicked in, moved all of those VMs over to the unaffected ESX servers, and started them back up. Unfortunately, it disabled the network on those VMs for some reason. Specifically, after the VMs were started, our monitoring software was still showing them as offline. Upon investigation, we found that every VM that HA had restarted on a different host had its network adapter disconnected (in the VM settings, when you select the network adapter, the "Connected" checkbox at the top was unchecked). For a few VMs, that wouldn't be a big problem, but with the 70-80 VMs that failed over this weekend, it became a huge ordeal to figure out which ones were working and which weren't...

Does anyone have any idea how that checkbox got unchecked, and how to prevent it from happening in the future?

56 Replies
admin
Immortal

Were the vm networks needed by the vms available on all the hosts?

omatsei1
Contributor

The VMs are only using 2 separate networks, but yes, those networks were available on all the hosts.

Also, ALL the VMs moved by HA had their networks disconnected... not just those moved to one or two specific hosts.

jpwalsh
Contributor

Were the networks available when the power outage happened? Also, are all the networks labeled the same on every server? Was it only migrated guests that were affected? There is also a way to write a PowerShell script to enable this checkbox for you, so you don't have to do it manually.

omatsei1
Contributor

Yes, all the networks were available when the outage occurred. All the servers are configured identically (via host profiles) and use distributed virtual switches.

Yes, it only affected the guests that were migrated, not the ones already running on the unaffected hosts.

I was considering writing a PowerShell script to re-check that box, but I don't think it should be an issue in the first place. If it happens again, I'll have no choice (I'm certainly not going through 100 VMs checking for network connectivity again), but I don't think it should have happened at all. Also, a PowerShell script would reconnect the network adapters, but it wouldn't fix other problems that lingered as a result of the network being disconnected. For example, on Red Hat Enterprise Linux 5, apparently Apache doesn't start if the network isn't connected. That's one example of the problems we've been facing this weekend and this morning.
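For the "which of the 100 VMs are actually broken?" part, here is a minimal sketch of the triage logic. It is Python rather than PowerShell, and it assumes you have already exported each VM's NIC state (the "Connected" and "Connect at power on" checkboxes) into simple records; the `Nic`/`VmRecord` types, field names, and host names are all hypothetical, not a vSphere or PowerCLI API. It only identifies affected VMs; it does not reconnect anything.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Nic:
    # Hypothetical record of one virtual NIC's two checkboxes in the VM settings
    label: str
    connected: bool
    connect_at_power_on: bool


@dataclass
class VmRecord:
    # Hypothetical inventory row: VM name, the host it is running on, its NICs
    name: str
    host: str
    nics: List[Nic] = field(default_factory=list)


def find_disconnected(vms: List[VmRecord]) -> Dict[str, List[str]]:
    """Map each VM name to the labels of its NICs whose "Connected" box is unchecked."""
    result: Dict[str, List[str]] = {}
    for vm in vms:
        bad = [nic.label for nic in vm.nics if not nic.connected]
        if bad:
            result[vm.name] = bad
    return result


def symptom_matches(vms: List[VmRecord]) -> List[str]:
    """VMs showing the symptom in this thread: "Connect at power on" checked but not connected."""
    return [
        vm.name
        for vm in vms
        if any(nic.connect_at_power_on and not nic.connected for nic in vm.nics)
    ]


if __name__ == "__main__":
    inventory = [
        VmRecord("web01", "esx4", [Nic("Network adapter 1", False, True)]),
        VmRecord("db01", "esx1", [Nic("Network adapter 1", True, True)]),
    ]
    print(find_disconnected(inventory))  # {'web01': ['Network adapter 1']}
    print(symptom_matches(inventory))    # ['web01']
```

Feeding that list into an actual reconnect step would still be a separate job, and it wouldn't help services like Apache that had already failed at boot.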

Chamon
Commander

Maybe a stupid question, but... was the "Connect at power on" checkbox checked on them?

And if you manually reboot the VM on the same host, does this happen as well?

Message was edited by: Chamon

omatsei1
Contributor

At this point, nothing is a stupid question. :)

Yes, the "Connect at Power On" was checked.

Chamon
Commander

What are your cluster HA settings? Are your VMs set to power on after a failure?

omatsei1
Contributor

The full set of options in our HA settings is:

Enable Host Monitoring

Allow VMs to be powered on even if they violate availability constraints

No advanced options set

VM restart priority: Medium

Host Isolation Response: Power off

Disable VM Monitoring

omatsei1
Contributor

I should also clarify that we have 5 hosts, all running at around 30% usage. 3 of them dropped offline while 2 of them stayed online. With all the VMs running on 2 hosts, those hosts were at around 75-80% capacity, so there were still (relatively) plenty of resources available, even after 3 hosts died.
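As a quick sanity check on those numbers, here is a sketch of the arithmetic. It assumes all 5 hosts have identical capacity and that the load redistributes evenly across the survivors, and it ignores HA admission control and per-VM overhead. Under those assumptions, expected per-host utilization is just total demand divided by the number of surviving hosts.

```python
def surviving_host_load(total_hosts: int, avg_utilization: float, surviving_hosts: int) -> float:
    """Expected per-host utilization after failover, assuming identical hosts and even redistribution."""
    if not 0 < surviving_hosts <= total_hosts:
        raise ValueError("surviving_hosts must be between 1 and total_hosts")
    # Total demand measured in "host-equivalents" (5 hosts x 0.30 = 1.5 hosts' worth of load)
    demand = total_hosts * avg_utilization
    return demand / surviving_hosts


if __name__ == "__main__":
    load = surviving_host_load(5, 0.30, 2)
    print(f"{load:.0%}")  # 75%
```

The 75% estimate is consistent with the 75-80% observed, which supports the point that the two remaining hosts still had headroom.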

Chamon
Commander

Sorry, they booted; they just didn't connect to the network.

Chamon
Commander

When you reboot one of the VMs, does the vNIC start connected? Can you vMotion without any warnings? If you get warnings, what are they?

omatsei1
Contributor

I tried rebooting one of the VMs before reconnecting the vNIC, but the vNIC remained disconnected after the reboot. After I manually reconnected the vNIC, it stayed connected through subsequent reboots, as it should.

I can vMotion to any of the hosts with no errors or warnings at all.

Chamon
Commander

Are there any errors in the HA logs?

omatsei1
Contributor

Where can I find the HA logs?

Chamon
Commander

/var/log/vmware/aam/

omatsei1
Contributor

I checked 2 of the hosts, one that was unaffected by the power outage and one that went down, and I don't see any errors... but I'm not sure what I should be looking for. There are tons of files in there, and none of them seems to have very useful information to my untrained eye.
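For anyone else digging through /var/log/vmware/aam/ without knowing what to look for, a crude first pass is to scan every file under it for error-like lines. Below is a minimal Python sketch of that. The keyword pattern is a guess, not a documented HA log format, so treat any hits only as starting points to compare against the time the VMs were restarted.

```python
import os
import re
from typing import Iterator, Tuple

# Hypothetical keyword list; adjust to whatever the HA agent actually writes
PATTERN = re.compile(r"\b(error|fail(ed|ure)?|warn(ing)?)\b", re.IGNORECASE)


def scan_logs(root: str = "/var/log/vmware/aam/") -> Iterator[Tuple[str, int, str]]:
    """Yield (path, line_number, line) for every line under root that matches PATTERN."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", errors="replace") as fh:
                    for lineno, line in enumerate(fh, start=1):
                        if PATTERN.search(line):
                            yield path, lineno, line.rstrip("\n")
            except OSError:
                continue  # unreadable file (permissions, rotated away); skip it


if __name__ == "__main__":
    for path, lineno, line in scan_logs():
        print(f"{path}:{lineno}: {line}")
```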

Chamon
Commander

From vCenter, at the cluster, ESX host, and VM levels, is there anything unusual listed under tasks and events at the time the VMs were restarted?

enettech
Contributor

Hi,

Glad I saw this post; we are experiencing the exact same problem.

I'm in the process of building our new vSphere environment and have been carrying out some HA testing. I've been giving hosts a hard power-off, and once the VM has moved to a new host and booted, it comes up with no networking and the "Connected" checkbox unticked. When I recheck it and click OK, I get the attached error. To work around this, we have to select a different network under "Network Label", reselect the original one, and then click OK; after that, the setting sticks.

I'm lost.....

howie
Enthusiast

Are you using vDS (distributed switch) or standard vswitch? I'm not implying it will necessarily make any difference for you, but wanted to narrow down the possible issues.
