Over the weekend, we had a serious power-related problem in our data center, which caused a few of our ESX servers to shut down. Fortunately, HA kicked in, brought all those VMs over to the unaffected ESX servers, and started them back up. Unfortunately, it disabled the network on those VMs for some reason. Specifically, we noticed that after the VMs were started, our monitoring software was still showing them as offline. Upon investigation, we found that every VM that had been HA'd to a different host had its network adapter disconnected (in the VM settings, when you select the network adapter, the top box saying "Connected" was unchecked). For a few VMs, that wouldn't be a huge problem, but with the 70-80 VMs that failed over this weekend, it became a huge ordeal to figure out which ones were working and which weren't...
Does anyone have any idea at all about how that checkbox was unchecked, and how to prevent that from happening in the future?
Were the VM networks needed by the VMs available on all the hosts?
The VMs are only using 2 separate networks, but yes, those networks were available on all the hosts.
Also, ALL the VMs moved by HA had their networks disconnected... not just those moved to one or two specific hosts.
Were the networks available when the power outage happened? Are all the networks labeled the same on all the servers? Was it only the migrated guests that were affected? You could also write a PowerShell script to re-enable this checkbox so you don't have to do it manually.
Yes, all the networks were available when the outage occurred. All the servers are configured identically (via host profiles) and using distributed virtual switches.
Yes, it only affected the guests that were migrated, not the ones already running on the unaffected hosts.
I was considering writing a PowerShell script to re-check that box, but I don't think it should be an issue in the first place. If it happens again, I'll have no choice (I'm certainly not going through 100 VMs checking for network connectivity again), but I don't think it should have happened at all. Also, if I did run a PowerShell script, it would re-connect the network adapters, but it wouldn't fix other problems that lingered as a result of the network being disconnected. For example, on Red Hat Enterprise 5, apparently if the network isn't connected, Apache doesn't start. That's one example of the problems we've been facing this weekend and this morning.
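For the triage part (working out which guests actually have network connectivity without logging into each one), here is a minimal sketch in Python. It only tests whether a TCP port answers; it does not touch the "Connected" checkbox or talk to vCenter. The function names `is_reachable` and `triage`, and the idea of feeding it a list of (name, address, port) tuples, are assumptions for illustration and not anything from the thread.

```python
import socket


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def triage(targets, timeout=2.0):
    """Split (name, host, port) tuples into lists of reachable and unreachable names."""
    up, down = [], []
    for name, host, port in targets:
        if is_reachable(host, port, timeout):
            up.append(name)
        else:
            down.append(name)
    return up, down
```

Calling `triage` with a hypothetical list such as `[("web01", "192.0.2.10", 22)]` returns the names that answered and the names that did not. A guest that answers on the port may still have services that failed to start while the NIC was disconnected (the Apache case above), so this narrows the list rather than clearing a VM completely.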
Maybe a stupid question, but... was the "Connect at Power On" check box checked on them?
And if you manually reboot the VM on the same host, does this happen then as well?
At this point, nothing is a stupid question.
Yes, the "Connect at Power On" was checked.
How are your cluster HA settings set? Are your VMs set to power on after a failure here?
The full options we have set in the HA options are:
Enable Host Monitoring
Allow VMs to be powered on even if they violate availability constraints
No advanced options set
VM restart priority: Medium
Host Isolation Response: Power off
Disable VM Monitoring
I should also clarify that we have 5 hosts, all at around 30% usage. 3 of them dropped offline while 2 of them stayed online. With all the VMs running on 2 hosts, those hosts were running at around 75-80% capacity, so there were still (relatively) plenty of resources available, even after 3 hosts died.
Sorry, they booted; they just didn't connect to the network.
When you reboot one of the VMs, does the vNIC start connected? Can you vMotion without any warnings? If you get warnings, what are they?
I tried rebooting one of the VMs before re-connecting the vNIC, but the vNIC remained disconnected after the reboot. Subsequent reboots after I manually re-connected the vNIC kept it connected like it should.
I can vMotion to any of the hosts with no errors or warnings at all.
Are there any errors in the HA logs?
Where can I find the HA logs?
/var/log/vmware/aam/
I checked 2 of the boxes, one that was unaffected by the power outage, and one that was taken down, and don't see any errors... but I'm not sure what I should be looking for... There are tons of files in there, and none of them seems to have very useful information to my untrained eye.
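One way to cut down the noise in a directory full of log files is to print only the lines that contain common trouble words. The sketch below is a rough filter, not an HA diagnostic: the keyword list is a guess rather than an official set of HA error strings, `scan_logs` is a hypothetical helper name, and it assumes the log directory has been copied to a machine with Python 3.

```python
import os
import re

# Assumed keywords; adjust to taste. This is not an official list of HA messages.
PATTERN = re.compile(r"error|fail|warn|timeout", re.IGNORECASE)


def scan_logs(log_dir, pattern=PATTERN):
    """Yield (path, line_number, text) for every line under log_dir matching pattern."""
    for root, _dirs, files in os.walk(log_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            try:
                # errors="replace" keeps the scan going if a file has odd bytes.
                with open(path, "r", errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        if pattern.search(line):
                            yield path, number, line.rstrip("\n")
            except OSError:
                # Unreadable file (permissions, vanished during scan); skip it.
                continue
```

Looping over `scan_logs("/path/to/copied/aam")` and printing each tuple gives a much shorter list to read. Comparing the output from an affected host against an unaffected one, around the time of the outage, is probably more telling than any single line.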
In vCenter, at the cluster, ESX host, and VM levels, is there anything unusual listed under Tasks & Events around the time the VMs were restarted?
Hi,
Glad I saw this post; we are experiencing the exact same problem.
I'm in the process of building our new vSphere environment and have been carrying out some HA testing. I've been giving hosts a hard power off, and once the VM has been restarted on a new host and booted, it comes up with no networking and the "Connected" check box is unticked. When I recheck it and hit OK, I see the attached error. To work around this, we have to select a different network under "Network Label" and then reselect the original; if we then choose OK, the setting sticks.
I'm lost.....
Are you using a vDS (distributed switch) or a standard vSwitch? I'm not implying it will necessarily make any difference for you, but I wanted to narrow down the possible issues.