wkucardinal
Contributor

Random virtual machines lose network after vMotion or Storage vMotion

Hi all -

I have a couple of tickets open with VMware and our SAN vendor, EqualLogic, on this issue.  Since configuring our production and DMZ clusters we have noticed that virtual machines will sometimes drop network connectivity after a successful vMotion or Storage vMotion.  Occasionally, though far less frequently, virtual machines will also spontaneously lose network overnight; this has only happened a few times.  The strange thing is that other guests on the VM host are fine - they do not lose network at all.  In fact, I can fail over 3 virtual machines from one host to another, and 2 of the 3 may fail over correctly while one loses network.

The workaround?  Simply "disconnect" the virtual NIC and "reconnect" it, and the VM will start returning packets.  I can also fail the troubled VM back over to the prior host and it will regain network.  I can reboot it and it will regain network.  I can reinstall the virtual adapter completely, and it will regain network.
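
In case it helps, here is a rough pyVmomi sketch of that disconnect/reconnect toggle, so it can be done without clicking through the client.  This is only a sketch: the vCenter address, credentials and VM name below are placeholders, and it assumes the pyVmomi SDK is installed and the account is allowed to reconfigure VMs.

# Rough sketch (placeholder vCenter/credentials/VM name; requires pyVmomi).
# Toggles the connected state of every virtual NIC on one VM - the same
# "disconnect / reconnect" workaround described above, just scripted.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

def set_vnic_connected(vm, connected):
    """Reconfigure each virtual NIC on the VM to the requested connected state."""
    for dev in vm.config.hardware.device:
        if isinstance(dev, vim.vm.device.VirtualEthernetCard):
            dev.connectable.connected = connected
            change = vim.vm.device.VirtualDeviceSpec(
                operation=vim.vm.device.VirtualDeviceSpec.Operation.edit,
                device=dev)
            WaitForTask(vm.ReconfigVM_Task(spec=vim.vm.ConfigSpec(deviceChange=[change])))

ctx = ssl._create_unverified_context()            # lab use only
si = SmartConnect(host="vcenter.example.local",   # placeholder
                  user="admin", pwd="secret", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    vm = next(v for v in view.view if v.name == "troubled-vm")   # placeholder VM name
    set_vnic_connected(vm, False)   # "disconnect"
    set_vnic_connected(vm, True)    # "reconnect"
finally:
    Disconnect(si)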

VMware saw a bunch of SAN errors in our log files, so we updated our SAN firmware to the latest version.  That seems to have cleared those errors, but we still have the issue.  Here are some of the specs - all environments are virtually identical except for memory:

PowerEdge R810s

Broadcom 5709 NICs

EqualLogic SAN running 5.0.5 F/W

We are using jumbo frames.  ESXi is fully patched.  I have not seen a pattern as to whether only certain guest operating systems lose network, but we are primarily a Windows environment.

When a virtual machine loses network, we cannot:

  • ping to it
  • ping from it
  • ping from it to virtual machines on the same host or vSwitch
  • ping outside our network
  • resolve DNS, etc.
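
For reference, a quick Python sketch like the one below can sweep a list of VM addresses after a vMotion and flag the ones that stop answering.  The IPs are examples only, and it assumes a Linux machine where the standard ping command is available.

# Quick sketch: ping a list of VM addresses after a vMotion and report
# which ones stopped answering. The addresses are examples only; assumes
# a Linux host where "ping -c 1 -W 1 <ip>" is available.
import subprocess

VMS = ["10.0.10.21", "10.0.10.22", "10.0.10.23"]   # example addresses

def reachable(ip):
    """Return True if a single ping gets an answer within one second."""
    result = subprocess.run(["ping", "-c", "1", "-W", "1", ip],
                            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0

for ip in VMS:
    print("%-15s %s" % (ip, "ok" if reachable(ip) else "NO RESPONSE"))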

I have followed certain VMware KBs to no success, including:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=100383...
http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&externalId=1002811 (Port Security is not enabled)

-All VMware tools have been updated to the latest correct version and match the ESXi host
-Logged onto the ESXi service console, I cannot ping the troubled VM by host name or by IP address, but I can ping OTHER virtual machines not experiencing the issue.  I can also ping external addresses from the service console.
-Logged into the troubled VM itself, I cannot ping other VMs, I cannot resolve host names, I cannot ping by IP.  The VM CAN ping itself by IP but not by hostname.  I cannot ping other VMs on the same virtual switch or network by either IP or host name.  I cannot ping the management network vSwitch.
-All vSwitches are configured identically and named the same.
-Notify switches is set to yes
-There are plenty of available virtual ports
-We have tried both E1000 and VMXNET virtual adapters with no difference.
-All adapters are configured to auto-negotiate, but we have tried forcing specific speeds as well, with no difference
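
For anyone who wants to compare the host network settings programmatically rather than by eye, a rough pyVmomi sketch along these lines will dump every host's standard portgroups (name, VLAN ID, vSwitch and any overridden active uplinks).  The vCenter address and credentials are placeholders.

# Rough sketch: list each ESXi host's standard-vSwitch portgroups so that
# per-host differences (VLAN ID, vSwitch, active uplinks) stand out.
# The vCenter address and credentials are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()            # lab use only
si = SmartConnect(host="vcenter.example.local",   # placeholder
                  user="admin", pwd="secret", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    for host in view.view:
        print("Host:", host.name)
        for pg in host.config.network.portgroup:
            policy = pg.spec.policy
            nic_order = policy.nicTeaming.nicOrder if policy and policy.nicTeaming else None
            active = nic_order.activeNic if nic_order else "(inherited from vSwitch)"
            print("  portgroup=%-20s vlan=%-4s vswitch=%-10s active=%s"
                  % (pg.spec.name, pg.spec.vlanId, pg.spec.vswitchName, active))
finally:
    Disconnect(si)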

I do appreciate your help.  I am having trouble getting anywhere on this issue with the vendors.

wkucardinal
Contributor

We have enabled portfast on all uplinks and the problem is still present.  At this point I am at a loss.

We also have portfast enabled on all SAN connections.  However, in our DR environment we do not have portfast enabled anywhere, yet we do not see this issue over there.

ats0401
Enthusiast

hmm.

Can you take out the usernames/passwords, IP addresses, etc., and post the network switch config?

Also, with the switch console open, when you do the vMotions and trigger the problem, are there errors coming across the console?

Can you upload the switch error log from a time during when the problem happens?

Without these, it will be tough to troubleshoot.

Also, have you checked the vCenter/ESXi logs?

wtfmatt
Enthusiast

This may be shortsighted, but I thought I'd mention it in case anyone else has noticed it as well: I wouldn't be surprised if the Broadcom NICs are the culprit.

We have had nothing but problems with Broadcom cards.  Our SAN vendor (Cybernetics) no longer supports Broadcom NICs in any capacity, and I've also heard of Broadcom cards causing similar issues with other products (random connectivity drops, packet loss).

In addition to the internal problems I had with our SAN dropping connectivity randomly overnight, we have had 2 of our clients run into the same issue, all running Broadcoms.  Granted, these have all been with Cybernetics SANs, and Cybernetics (in my opinion) is the Kia of the SAN world, but ever since we swapped out our Broadcom cards for Intels we have had no issues.

I'm not familiar with EqualLogic, but it might be worth asking them if they've seen any recurring issues with Broadcom cards and their devices.

wkucardinal
Contributor

What does that say about Kia? :)

I think we may have solved this issue in our production cluster, but I'm still testing.  We are still seeing it in our DMZ cluster, however, and we're about to set up monitoring to see what's going on.

The issue we found in our production cluster is that one VM uplink port, on only 1 of the 3 hosts, was assigned to native VLAN 35, which is our VLAN for iSCSI traffic.  Once we corrected this setting (it should have been a different VLAN), we were able to vMotion to all 3 hosts in the cluster without losing more than a packet here and there.  What I don't understand is how one port on one host can cause lost networking for vMotions throughout that 3-host cluster - our issue has not been isolated to the one host with the misconfigured switch port.

Now, the one thing about our DMZ cluster switches is that they are dumbed down.  We basically have no port configuration on those switches because it is a very small DMZ cluster.  Could this cause a problem?  Is there a minimum configuration required for reliable service?

I will say that I have suspected the Broadcom NICs in the past, but since we use them pretty heavily and really haven't seen any issues elsewhere, I could never zero in on that.

ats0401
Enthusiast

Glad you got the production cluster fixed. I think that kind of misconfiguration could definitely cause this type of issue.

For the storage switch, I would recommend enabling portfast (or disabling STP), matching the MTU across the network (9000 for jumbo frames), and turning flow control receive on.
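
On the ESXi side, the vSwitch and vmkernel MTU values can be double-checked with a short pyVmomi sketch like the one below; the vCenter address and credentials are placeholders, and the physical switch ports still have to be verified on the switch itself.

# Rough sketch: print the MTU of each standard vSwitch and vmkernel port
# on every host, to confirm jumbo frames (MTU 9000) are set consistently
# on the ESXi side. vCenter address and credentials are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()            # lab use only
si = SmartConnect(host="vcenter.example.local",   # placeholder
                  user="admin", pwd="secret", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    for host in view.view:
        net = host.config.network
        for vsw in net.vswitch:
            print("%s  vSwitch %-12s MTU=%s" % (host.name, vsw.name, vsw.mtu))
        for vnic in net.vnic:              # vmkernel ports (mgmt, vMotion, iSCSI)
            print("%s  vmk     %-12s MTU=%s" % (host.name, vnic.device, vnic.spec.mtu))
finally:
    Disconnect(si)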

wkucardinal
Contributor

What do you think about the DMZ cluster?  We are using Catalyst 2950 switches on that side, running at only 100 Mbps on the network ports.  There is basically no configuration on any of the uplink ports and they are assigned VLAN 1.  I am told by our network guys that VLAN 1 is technically shut down and does not forward that traffic off the switch since it is our DMZ.  Does this make any sense?  All ports are set up for access mode, so it is more or less a dumb switch, acting like a hub in some ways.

We see VMs lose network on this side after a vMotion, almost at will.  We tried enabling portfast, but that didn't help.

ats0401
Enthusiast

I am not sure what else could be causing this - do you have another switch you could temporarily swap out and try for the network side?

Only 100 Mbps? vMotion should have a 1 Gb interface. I'm not sure it's the problem, but vMotion requires a lot more than this.

From Duncan's blog -

  • An IP network with a minimum bandwidth of 622 Mbps is required.
  • The maximum latency between the two VMware vSphere servers cannot exceed 5 milliseconds (ms).

Edit: that is for long-distance vMotion.

For regular vMotion, the document I found says it requires 1 Gb:

http://pubs.vmware.com/vsp40/wwhelp/wwhimpl/js/html/wwhelp.htm#href=admin/c_vmotion_networking_requi...

wkucardinal
Contributor

We have discussed trying to swap out the switch, and that may be what we have to try. The vMotion interface is 1 Gb, but the pure business network interface is only 100 Mbps.

wkucardinal
Contributor

Well, I thought we had this solved in our production cluster, but it reared its ugly head again this past week.  At this point I am out of ideas.  Does anyone have any suggestions not already posted in this thread?

wkucardinal
Contributor

Still having the issue... :(

rickardnobel
Champion

wkucardinal wrote:

Still having the issue... :(

Sometimes it can be useful to really verify that all VMNIC uplinks work for all VLANs. One way to do this is to create a new portgroup on the vSwitch used by your VMs on the first host, put one test VM on this portgroup, then go into the NIC teaming policy of the new portgroup and select "Override switch failover order".

Then move all VMNICs except one down to Unused, so only one VMNIC is active. Set the portgroup VLAN setting to one of the production VLANs and see if you can ping some expected addresses. If this works, move the working VMNIC down to Unused and move another one up to Active. Try again, and repeat for all VMNICs. If everything works, you have verified that the VLAN configuration and other settings are correct on the physical switch ports facing this host.

If you have multiple VLANs, repeat the process for the other production VLANs. Then repeat the process on the other hosts.

While this might take a while, it will verify whether everything is correctly configured on the physical switches. When a vMotion takes place the VM gets a new Port ID and is assigned a new outgoing VMNIC. If there is a configuration error on one or several physical switch ports, the failures can seem random, yet perhaps always happen on VLAN x over VMNIC y. Since the Port ID policy you are using will in effect spread the VMs randomly over the VMNICs, these problems can be hard to diagnose. (Disconnecting the virtual machine's vNIC gives the VM a new Port ID, which moves it to a new outgoing VMNIC - which might seem to solve the problem.)
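
If you want to script the uplink rotation, a rough pyVmomi sketch like the one below can handle the portgroup side: it creates a test portgroup on the vSwitch and rewrites the failover order so exactly one vmnic is active at a time, pausing so the ping test from the test VM can be done by hand. The host name, vSwitch name, VLAN ID and portgroup name are placeholders.

# Rough sketch of the per-uplink test above: create a test portgroup on a
# vSwitch and override its failover order so exactly one vmnic is active
# at a time. Host, vSwitch, VLAN and portgroup names are placeholders; the
# ping test from the test VM is still done by hand at each pause.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

HOST_NAME = "esx01.example.local"   # placeholder
VSWITCH   = "vSwitch0"              # placeholder
TEST_PG   = "uplink-test"           # placeholder
TEST_VLAN = 100                     # placeholder production VLAN

def pg_spec(active_nic):
    """Portgroup spec with the failover order overridden to a single active NIC."""
    spec = vim.host.PortGroup.Specification()
    spec.name = TEST_PG
    spec.vlanId = TEST_VLAN
    spec.vswitchName = VSWITCH
    spec.policy = vim.host.NetworkPolicy(
        nicTeaming=vim.host.NetworkPolicy.NicTeamingPolicy(
            nicOrder=vim.host.NetworkPolicy.NicOrderPolicy(activeNic=[active_nic])))
    return spec

ctx = ssl._create_unverified_context()            # lab use only
si = SmartConnect(host="vcenter.example.local",   # placeholder
                  user="admin", pwd="secret", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    host = next(h for h in view.view if h.name == HOST_NAME)
    netsys = host.configManager.networkSystem
    vsw = next(v for v in host.config.network.vswitch if v.name == VSWITCH)

    order = vsw.spec.policy.nicTeaming.nicOrder
    uplinks = list(order.activeNic or []) + list(order.standbyNic or [])
    netsys.AddPortGroup(portgrp=pg_spec(uplinks[0]))
    try:
        for nic in uplinks:
            netsys.UpdatePortGroup(pgName=TEST_PG, portgrp=pg_spec(nic))
            input("Only %s is active on %s - ping from the test VM, then press Enter..."
                  % (nic, TEST_PG))
    finally:
        netsys.RemovePortGroup(pgName=TEST_PG)
finally:
    Disconnect(si)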

My VMware blog: www.rickardnobel.se
wkucardinal
Contributor

Rickard -

That is an obvious suggestion and I can't believe we didn't try it months ago.  Doing what you described, we were able to identify one bad port on one of our 3 hosts and have fixed that port config.  We will continue testing.  We also have the issue in a separate DMZ cluster, but we haven't been able to troubleshoot there yet.  Once we are done with testing I will report back.

Thanks for the help.

Brandon

rickardnobel
Champion

Glad to hear you found one configuration error already! Report back when you have completed the testing.

My VMware blog: www.rickardnobel.se
wkucardinal
Contributor

I found a similar misconfiguration in our DMZ and made the correction there.  We are going to simulate failures in our cluster tonight to see if this is truly resolved.  Typically we will see this problem if we have a host go down.

wkucardinal
Contributor

Rickard's suggestion was very helpful.  Using it, we were able to isolate more than one bad port in each of our troubled environments and correct them.  I believe our "dropped network" issues have been resolved after testing.  If not, I'll be back. :)

rickardnobel
Champion

Very nice to hear that you finally got this long-standing problem solved. Good luck!

My VMware blog: www.rickardnobel.se