VMware Cloud Community
omatsei1
Contributor

HA disconnected VM network?

Over the weekend, we had a serious power-related problem in our data center, which caused a few of our ESX servers to shut down. Fortunately, HA kicked in, brought all those VM's over to the unaffected ESX servers, and started them back up. Unfortunately, it disabled the network on those VM's for some reason. Specifically, we noticed that after the VM's were started, our monitoring software was still showing them as offline. Upon investigation, we found that every VM that had been HA'd to a different host had its network adapter disconnected (in the VM settings, when you select the network adapter, the "Connected" checkbox at the top was unchecked). For a few VM's, that wouldn't be a huge problem, but with the 70-80 VM's that failed over this weekend, it became a huge ordeal to figure out which ones were working and which weren't...

Does anyone have any idea at all about how that checkbox was unchecked, and how to prevent that from happening in the future?
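
For anyone facing the same triage across dozens of VM's, here is a minimal sketch of how the "which ones are disconnected" check could be scripted. It assumes you can export the adapter state to CSV by some means; the column names (`vm`, `adapter`, `connected`) and the VM names are hypothetical and not taken from any VMware tool.

```python
import csv
import io


def disconnected_adapters(csv_text):
    """Return (vm, adapter) pairs whose 'connected' column is not true.

    The CSV layout (vm, adapter, connected) is an assumed export format,
    not something a VMware product is known to produce.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    hits = []
    for row in reader:
        if row["connected"].strip().lower() != "true":
            hits.append((row["vm"], row["adapter"]))
    return hits


# Hypothetical export with made-up VM names.
EXAMPLE = """vm,adapter,connected
web01,Network adapter 1,false
db01,Network adapter 1,true
"""

for vm, adapter in disconnected_adapters(EXAMPLE):
    print(f"{vm}: {adapter} is disconnected")
```

This only reports the state; it does not reconnect anything.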

56 Replies
omatsei1
Contributor

That's EXACTLY the problem we're having. Somehow I forgot to mention that simply checking the "Connected" box doesn't fix it... we have to select the network also (originally I was doing the same thing by selecting a different network, then changing it back, but I realized that if you drop down the menu and select the exact same thing, it'll work).

All of our VM's are using distributed switches, so that's potentially a common element.

Chamon
Commander

Could it be an issue with the VMXNET3 adapter? Is the hardware on the ESX hosts exactly the same? Specifically the pNICs?

omatsei1
Contributor

All of my physical hosts are exactly the same, racked right on top of each other, with all the same networks plugged into the same physical ports. My hardware is Dell PowerEdge R610s.

I've come up with 2 theories. First, I've renamed the dvswitch from the default name (which is "dvswitch"). I'd imagine not a huge number of people require multiple dvswitches, so renaming them might not be a common practice. Second, we did upgrade from ESX 3, so most of our virtual hardware for the VM's is version 4 (version 7 is vSphere-specific). We just completed the upgrade about 2-3 weeks ago and haven't had a chance to upgrade the virtual hardware, since I imagine that requires a reboot of the VM. However, again, if others either didn't upgrade and built completely new VM's, or already upgraded the virtual hardware, I could see that too potentially being an issue.

NTurnbull
Expert

Hi, would be interesting to see if there is anything in the virtual machine log file. If you browse the datastore the vm is on you'll see the log text files - Anything in there that looks out of place when the vm restarted on the other host?

Thanks,

Neil

omatsei1
Contributor

Good call! I checked the log of one of the VM's, and it says:

Oct 24 08:23:36.588: vcpu-0| vmm32 initialized: Releasebuild-171294. cflags: 0x00000002.00001000.41808000.00500003

Oct 24 08:23:36.590: vcpu-0| Msg_Post: Error

Oct 24 08:23:36.590: vcpu-0| [msg.ethernet.vlance.connectFailed] Failed to connect ethernet0.

That's not terribly helpful, granted, but at least it recognizes that it failed to connect ethernet0.

Chamon
Commander

I would say that you need to upgrade the virtual hardware and VMware tools.

NTurnbull
Expert

OK, at least we've now got a timestamp to work with, so we can correlate the other logs. Around that time, what have you got in /var/log/vmkernel? And on the vCenter Server, is there anything in the vpxd logs about reconfiguring virtual machines?

Thanks,

Neil
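
As a rough illustration of lining up other logs against that timestamp, the sketch below keeps only the lines that fall within a window around a chosen moment. It assumes each line starts with a "Mon DD HH:MM:SS" stamp, optionally with fractional seconds as in the VM log excerpt above; the year is not in the log, so it is passed in as an assumption, and logs with other timestamp layouts would need their own parser. The function names are made up for this example.

```python
import re
from datetime import datetime, timedelta

# Leading "Oct 24 08:23:36" with an optional ".588" fraction.
STAMP = re.compile(
    r"^([A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})(?:\.(\d{1,6}))?"
)


def parse_stamp(line, year):
    """Parse the leading timestamp of a log line, or return None."""
    match = STAMP.match(line)
    if not match:
        return None
    base = " ".join(match.group(1).split())  # collapse padded day spacing
    stamp = datetime.strptime(f"{year} {base}", "%Y %b %d %H:%M:%S")
    fraction = match.group(2)
    if fraction:
        # Right-pad so ".588" becomes 588000 microseconds.
        stamp += timedelta(microseconds=int(fraction.ljust(6, "0")))
    return stamp


def lines_near(lines, when, window_seconds=60, year=2009):
    """Return the lines stamped within window_seconds of 'when'."""
    window = timedelta(seconds=window_seconds)
    nearby = []
    for line in lines:
        stamp = parse_stamp(line, year)
        if stamp is not None and abs(stamp - when) <= window:
            nearby.append(line)
    return nearby


# Usage with two lines quoted earlier in the thread.
vm_log = [
    "Oct 24 08:23:36.588: vcpu-0| vmm32 initialized: Releasebuild-171294.",
    "Oct 24 08:23:36.590: vcpu-0| Msg_Post: Error",
]
failure_time = datetime(2009, 10, 24, 8, 23, 36, 590000)
for line in lines_near(vm_log, failure_time, window_seconds=5):
    print(line)
```

Running the same filter over each log in turn gives a short list of lines to read side by side.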

enettech
Contributor

Hi Everyone,

We are certainly talking about the same thing. Sorry to throw a spanner in the works, but we have the latest VMware Tools installed and are running VM hardware version 7. This is also a fresh install, not an upgrade, and all hardware is identical. Good to see it's not just us, but bad for others I suppose.

I also forgot to mention we're running Cisco Nexus 1000v's. Is anyone else running them, by chance, and having the same problems?

Cheers,

Mick

omatsei1
Contributor

I should clarify about what I mean by "upgrade". We did fresh installs of ESX 4, then vmotioned our VM's over from ESX 3.0.2 to ESX 4 (we skipped 3.5). Almost all the VM's are still running the old vmware-tools from 3.0.2 also.

We're not running the Nexus 1000V... Just distributed virtual switches.

I submitted a service request with VMware, and will talk to them today. Hopefully they'll have some insight. I'll update as I know more...

NTurnbull
Expert

Interesting. Your screenshot showed that you're running the new guest hardware version and VMXNET3, which your old tools will not know about. There should have been a warning when you upgraded the guest hardware version, telling you that the tools were not the latest and asking whether you wanted to cancel or continue with the upgrade.

Thanks,

Neil

omatsei1
Contributor

That was enettech's screenshot... I haven't posted any yet, but the network adapter type listed in mine is "Flexible" with Virtual Machine Version 4.

Chamon
Commander

I am sure you can't do this, but it would be great to see whether the issue remains if you move to standard vSwitches. Please keep us posted as to what you hear from VMware.

admin
Immortal

Were your VMs in a steady state configuration? I.e., did the failover happen long (on the order of tens of minutes) after you either provisioned the VMs or connected them to a new portgroup, or something along those lines?

omatsei1
Contributor

The vast majority of the VM's that failed over were in operation for at least 2 months before upgrading to vSphere, and about 3 weeks afterwards, before this failover event. Most of them have been around and functioning for 6-12 months under ESX 3.

admin
Immortal

What types were the portgroups in use by the VMs that ended up getting disconnected after HA failover?

We have three types of portgroups (static, dynamic and ephemeral). Were these VMs using dynamic or ephemeral portgroups on the vDS?

One way to check: from the VM Summary tab you can see which portgroups the VM uses, and then check each portgroup's type under "portgroup -> Edit properties" in the VI Client.

Chamon
Commander

Are the hosts ESX classic or ESXi?

enettech
Contributor

Sorry, I'm not sure who's asking questions of me, but I'll answer what I can :)

We're using ESX "classic", not ESXi.

All our VM's are relatively new, i.e. no production stuff, just built for test purposes, but they have been deployed and connected to the network for several weeks now.

I've got an SR in with VMware as well and have actually sent the tech a link to this thread, so he can see others are seeing the same problem. When I hear something I'll keep you all posted.

Mick

enettech
Contributor

Bit of an update from more testing......

Changing the NIC type makes no difference: E1000, vmxnet2 and vmxnet3 all get the same result.

When checking out the VM log at boot time on the new host I get this.....

Oct 29 11:52:30.370: vcpu-0| VMXNET3 user: failed to connect Ethernet0 to DV Port 1431.

Oct 29 11:52:30.370: vcpu-0| Msg_Post: Warning

Oct 29 11:52:30.370: vcpu-0| [msg.device.startdisconnected] Virtual device Ethernet0 will start disconnected.

Not very descriptive I know but I thought this may help others.

Cheers,

Mick
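
Since both log excerpts in this thread name the adapter that failed to connect, the VM log files could be scanned for those messages to build a list of affected VM's. The sketch below is one way to do that, assuming the message wording matches what was quoted here; other builds or adapter types may log it differently. The function names and VM names are hypothetical.

```python
import re

# Matches the two wordings quoted in this thread:
#   "... failed to connect Ethernet0 ..."
#   "Virtual device Ethernet0 will start disconnected."
NIC_FAILURE = re.compile(
    r"failed to connect (ethernet\d+)"
    r"|virtual device (ethernet\d+) will start disconnected",
    re.IGNORECASE,
)


def failed_adapters(log_lines):
    """Return the sorted adapter names that reported a connect failure."""
    found = set()
    for line in log_lines:
        match = NIC_FAILURE.search(line)
        if match:
            found.add((match.group(1) or match.group(2)).lower())
    return sorted(found)


def affected_vms(logs_by_vm):
    """Map each VM name to its failed adapters, skipping healthy VM's."""
    report = {}
    for vm_name, lines in logs_by_vm.items():
        adapters = failed_adapters(lines)
        if adapters:
            report[vm_name] = adapters
    return report


# Usage: the VM names are made up; the log lines come from this thread.
logs = {
    "test-vm-a": [
        "Oct 29 11:52:30.370: vcpu-0| VMXNET3 user: failed to connect "
        "Ethernet0 to DV Port 1431.",
    ],
    "test-vm-b": [
        "Oct 24 08:23:36.588: vcpu-0| vmm32 initialized: Releasebuild-171294.",
    ],
}
for vm_name, adapters in affected_vms(logs).items():
    print(vm_name, adapters)
```

In practice the lines would be read from each VM's log file on the datastore rather than typed in.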

Chamon
Commander

Can you test it with a standard vswitch?

enettech
Contributor

Hmm, VMware have asked the same. To do it I'd have to pretty much destroy all the networking on my 10 hosts. If I could have a "mini" HA cluster within the one I currently have and only change this on 2 hosts, it'd be fine, but creating vSwitches and migrating all networking on 10 hosts is going to be a nightmare. I was hoping VMware could replicate the issue for me :)

Mick
