VMware Cloud Community
vyking56
Contributor

vSphere 4 environment - Guests randomly losing network connectivity

Hi All,

I have inherited a vSphere 4 environment in which, every few weeks, virtual guests lose their network connectivity, seemingly at random.

The environment has 8 hosts running ESXi 4.0.0, with builds ranging from 162856 to 261974. They are configured for high availability.

Until now, virtual guests (Server 2003, 2008, Linux) have occasionally "lost" their network connectivity. Restarting the guest OS hasn't rectified the fault; we've had to either vMotion them to a different host or power-reset the VM before the NIC becomes responsive again.

Over this weekend we lost approx 30 virtual machines, so the problem is becoming much more serious.

The guests seem to fail completely at random, with no common hardware between them: host, storage, physical network switch, etc.

I have even vMotioned a guest from host#1 to host#2 (which got it working again) and then vMotioned it back to host#1, where it continues to run quite happily.

Besides tearing my hair out, I'm at a complete loss to explain how and why this is happening. Has anyone experienced problems like this before in an ESXi 4 environment?

Cheers,

Gareth

7 Replies
admin
Immortal

Welcome to VMware Communities. Your thread was moved to the ESXi4 forum.

The forum where it was originally posted is for questions about use of the forums and account problems only.

---

Alex Maier

VMware Communities Manager

Jack10808
Enthusiast

I am having the same problem, but I run vSphere ESX 4.1, and it has been happening to me since version 3.5.

I have trunk ports going to each server for the data ports and manage everything with VLAN tagging.

DRS is a threat to my production environment because when it migrates a machine there is a 50/50 chance that it will lose its connection to the network.

I manage multiple VMware environments and have worked with VMware at multiple companies, and I have never seen anything like it.

Any help from the community will be appreciated.

vyking56
Contributor

In the interim I have disabled "automatic DRS", and the environment has been stable since.

Interestingly enough, when a VM loses connectivity to the LAN it's still able to ping other VMs on the same vSwitch.

Sounds like we have a very similar problem here, Jack10808.
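
For anyone who wants to record that symptom consistently the next time a guest drops off, here is a rough Python sketch of the check. It is only an illustration: the function names, the result labels and the interpretation of each outcome are my own assumptions, and the `ping` flags are the Linux ones, so adjust them for other guests.

```python
import subprocess


def ping_once(host, timeout_s=2):
    """Return True if a single ICMP echo to host succeeds (Linux ping flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def triage(same_vswitch_peer, lan_target, ping=ping_once):
    """Classify a connectivity fault from inside the affected guest.

    The labels are hypothetical hints, not a diagnosis:
      "ok"                   - both targets answer
      "uplink-or-physical"   - the vSwitch forwards locally, but traffic
                               does not get out to (or back from) the LAN
      "vnic-or-vswitch-port" - nothing answers, so look at the guest NIC
                               or its vSwitch port first
      "inconclusive"         - LAN answers but the local peer does not
    """
    local_ok = ping(same_vswitch_peer)
    lan_ok = ping(lan_target)
    if local_ok and lan_ok:
        return "ok"
    if local_ok:
        return "uplink-or-physical"
    if not lan_ok:
        return "vnic-or-vswitch-port"
    return "inconclusive"
```

Run from an affected guest, `triage("<another VM on the same vSwitch>", "<default gateway>")` would return `"uplink-or-physical"` for the behaviour described above.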

a_p_
Leadership

This sounds like a configuration issue on the physical switch ports. Maybe spanning tree or port security is enabled on one or more of the ports?

Can you post the current configuration of one of the switch ports?

André

MShestakov
Contributor

Good day, everybody!

I registered to start a separate topic, but it seems I'm not alone with my problem.

Looks like I'm having the same issue as the topic starter.

First of all - our configuration:

We have 6 ESXi 4.1 hosts based on SunFire X4600 servers and managed by one vCenter. The network is configured to use a VSM switch. The physical ports on the ESXi hosts are trunked, with multiple port groups with different VLANs configured on the VSM by our network gurus. Storage is delivered over FC. Two hundred VMs and growing. Nothing special.

The problem occurred two days ago.

DRS initiated vMotions from one host to another and put the unused ESXi hosts into standby mode. vMotion did its job, but several VMs in different VLANs lost their connectivity. These VMs could ping nothing except themselves.

We tried reinstalling VMware Tools, disabling and re-enabling the network interface from the operating system (Win2k3 and Ubuntu 10.04), and removing and re-adding the network interface from vCenter with different drivers and attaching it to different VLANs. Somehow this shamanic dancing restored connectivity.

But last night several other VMs lost their connectivity during another vMotion initiated by DRS, with the same symptoms as the day before. I decided to migrate them (change host) back to their "old place" (before the nightly vMotion), and the network came back!

A full day of googling did not turn up any useful suggestions about this kind of trouble.

We have no port security; the network guys checked the VSM and said "it's up and running, no errors"; there were no spanning tree events, no duplicate MACs, nothing unusual. Also, nobody has done any serious reconfiguration of the VSM or ESXi since the last upgrade to version 4.1 in July.

I took one test VM and started migrating it from one ESXi host to another. After the third migration the VM lost its network, like the others before. When I migrated it "one step back", the network became available again.

For now everything is working, but I had to disable DRS.

We have been using vSphere for less than a year, and this is the first major problem we can't resolve.

Any ideas what could have happened to a well-configured, running platform? )

AntonVZhbankov
Immortal

MShestakov, I suppose it would be more comfortable for you to discuss your problem here in Russian.

And since you have a large number of VMs, it looks like you've simply run out of vSwitch capacity. I ran into a problem possibly related to yours some time ago:

http://blog.vadmin.ru/2009/01/blog-post_26.html

http://blog.vadmin.ru/2009/01/vswitch.html

Reconfigure the vSwitches to be a little bigger, 120+ ports per vSwitch, reboot the hosts and try to vMotion the VMs again.
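
To see whether a standard vSwitch is actually close to its port limit before resizing it, you can compare the "Num Ports" and "Used Ports" columns printed by `esxcfg-vswitch -l` on each host. The Python sketch below shows one way to do that offline from captured output. Treat it as an illustration under assumptions: the function names, the `min_free` threshold and the sample listing are made up, the column order is assumed to match the sample (check it against your own output), and switch names are assumed to contain no spaces.

```python
def vswitch_port_usage(listing):
    """Parse captured `esxcfg-vswitch -l` text into {switch_name: (used, total)}.

    Assumes each "Switch Name ..." header line is followed by one data row
    whose first three fields are: name, Num Ports, Used Ports.
    """
    usage = {}
    expect_row = False
    for line in listing.splitlines():
        if line.startswith("Switch Name"):
            expect_row = True
            continue
        if expect_row and line.strip():
            parts = line.split()
            usage[parts[0]] = (int(parts[2]), int(parts[1]))
            expect_row = False
    return usage


def nearly_full(usage, min_free=8):
    """Return the names of vSwitches with fewer than min_free unused ports."""
    return sorted(
        name for name, (used, total) in usage.items() if total - used < min_free
    )


# Made-up sample listing, for illustration only (not taken from this thread).
SAMPLE = """\
Switch Name      Num Ports   Used Ports  Configured Ports  MTU     Uplinks
vSwitch0         64          61          64                1500    vmnic0

  PortGroup Name        VLAN ID  Used Ports  Uplinks
  VM Network            0        55          vmnic0

Switch Name      Num Ports   Used Ports  Configured Ports  MTU     Uplinks
vSwitch1         128         20          128               1500    vmnic1
"""

if __name__ == "__main__":
    for name in nearly_full(vswitch_port_usage(SAMPLE)):
        print(name, "is close to its port limit")
```

A vMotion target host whose vSwitch shows up in this list would be a candidate for the resize described above.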


---

MCITP: SA+VA, VCP 3/4, VMware vExpert

http://blog.vadmin.ru

EMCCAe, HPE ASE, MCITP: SA+VA, VCP 3/4/5, VMware vExpert XO (14 stars)
VMUG Russia Leader
http://t.me/beerpanda
AntonVZhbankov
Immortal

vyking56, the number of ports per vSwitch may be related to your problem too.