crimsonbox
Contributor

2 hosts, same configuration, but VMs on one host have intermittent network

Dear all,

I have two hosts (A and B), both running ESXi 4 with the same configuration and connected to the same switch. Both are in the same subnet.

However, host A has an issue where some VMs have network connectivity and some don't (they cannot ping the gateway). As a test, I migrated the VMs that had no network to host B, and they were immediately able to ping the gateway. Just for your reference, the VMs are all assigned static IP addresses, and the VMs on host B have no problems at all, i.e. all of them can ping the gateway.

I SSHed into host A and, surprisingly, I was able to ping the VMs that couldn't ping the gateway. These same VMs also had no problem pinging host A.

Right now I am lost, because I am unable to figure out how, of two hosts with the same configuration, one works fine while the other doesn't. What's worse, the VMs on host A are important VMs, and I need to get this fixed by the end of the weekend; otherwise I will have to face some very angry customers :(

Any help is much appreciated.

11 Replies
wtfmatt
Enthusiast

Is the hardware identical? Specifically, the same physical NICs (brand/model, etc.)?

- Have you double-checked the speed/duplex settings (see the command below)?

- Have you tried plugging the bad host into another port on the switch?

- Firewall settings?

Can you post screenshots of the network configurations for both hosts? Sometimes it helps to have fresh eyes look at it.
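From the ESX service console (or Tech Support Mode on ESXi), speed/duplex can be checked per physical NIC with the standard esxcfg tool, e.g.:

    esxcfg-nics -l    # lists each vmnic with its driver, link state, speed, and duplex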

idle-jam
Immortal

Can you change the virtual machine network to another vmnic? It could be a bad NIC, cable, or switch port.
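A minimal sketch of swapping the uplink from the console, assuming the VM network is on vSwitch0 and vmnic1 is unused (names are examples):

    esxcfg-vswitch -U vmnic0 vSwitch0    # unlink the suspect physical NIC
    esxcfg-vswitch -L vmnic1 vSwitch0    # link a different physical NIC as the uplink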

a_p_
Leadership

Assuming you use a managed switch, the issue you see is most likely caused by the physical switch port configuration. Either you did not configure "spanning-tree portfast", or - what I think is the reason - the ports are configured for port security. Make sure the ports are set to "mode access" (or the "Desktop" macro is disabled) to allow multiple MAC addresses on the port.
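For example, if it is a Cisco IOS switch, a plain access-port configuration for an ESXi uplink would look something like this (interface name and VLAN number are placeholders):

    interface GigabitEthernet0/1
     switchport mode access
     switchport access vlan 10
     spanning-tree portfast
     no switchport port-security    ! port security limits how many MAC addresses the port accepts

"show port-security interface GigabitEthernet0/1" will tell you whether port security is active on the port.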

André

crimsonbox
Contributor

I switched the bad host to another port on the switch, and overall things remain the same.

Before I switched the port, VM A was working fine, but after switching, VM A stopped working.

I also noticed that VM B, which was not working previously, is now working after the port change.

I ran arp on the VMs that are not working, and the HWaddress shows as "incomplete". Can someone explain to me what this means?
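For reference, this is roughly what the output looks like on one of the affected (Linux) VMs; the gateway IP and interface name here are just placeholders:

    $ arp -n
    Address        HWtype  HWaddress     Flags Mask  Iface
    192.168.0.1            (incomplete)              eth0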

a_p_
Leadership

To me this definitely sounds like a port security issue. Can you please provide some information on your physical switch? Which vendor/model is it? It would also be helpful if you could provide the configuration of the ports used by your ESXi hosts.

You can find a sample switch/port configuration at http://kb.vmware.com/kb/1004127

André

AndyKB
Contributor

A bit of info on your hardware (switch, NICs, and VLANs/trunks) would really help :)

Some screenshots of the NIC configs on the ESXi boxes would also help!

crimsonbox
Contributor

Hi all,

Unfortunately, I am unable to provide any information on the physical switch used, as it is managed by the field services team in my company. I tried to look through the server rack but couldn't clearly see the switch vendor. I will try to get this information as soon as possible.

My lead for this project, however, suspects that the issue may not be with the physical switch, as we can still ping the gateway from the ESXi host.

He's suggesting a short test where we plug another network cable from the switch into the second port on the ESXi host, create a new vSwitch on that uplink, and move one of the VMs (which cannot ping the gateway) onto this new vSwitch.

Any comments on this? And also, how can I move one of these VMs to the new vSwitch?

crimsonbox
Contributor

Just to update my earlier post: we did the test where we plugged a new cable from one of the free ports on the switch into the second network port on the ESXi host. I also managed to figure out how to change the VM to use the second vSwitch I created, and amazingly, the VM that could not ping the gateway immediately started working fine.
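In case it helps anyone else, this is roughly what I ran on the host (the vSwitch and port group names are my own), after which I pointed the VM's network adapter at the new port group via Edit Settings in the vSphere Client:

    esxcfg-vswitch -a vSwitch1                  # create the new vSwitch
    esxcfg-vswitch -L vmnic1 vSwitch1           # link the second physical NIC as its uplink
    esxcfg-vswitch -A "VM Network 2" vSwitch1   # add a port group for the VMs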

But right now the issue is still puzzling: why does this happen? Does anyone have any idea what could be wrong?

alvinswim
Hot Shot

crimsonbox,

We're actually experiencing the same thing, and we've seen this behaviour for the last few months. At first we thought it was a glitch in 4.1, so we upgraded every host we have to Update 1 in late April. Everything was fine for a while, then it came back.

Here's a little background: we've run ESX 3.5u3 and 4.0 on the same blade servers over the last 2-3 years. Same setup, no changes, and everything worked fine. We upgraded to ESX 4.1 in November/December and didn't have any issues until late March, when we started seeing VMs losing connectivity.

First we saw VMs losing all network after a vMotion; that was fixed by vMotioning to another host, and when that didn't work, we'd flap (disconnect/reconnect) the VM's network interface and it would return to life.

Then we started seeing exactly what crimsonbox was seeing. VMs on the same network segment - on the same blade, a different blade in the same chassis, or a different blade in a different chassis - lose network. The VM couldn't see SOME guests on the same network segment, but it could see and communicate outside the segment (e.g. through a VPN tunnel to our office via RDP, or to a DB server on the DB network segment), and it would be able to ping a few other VMs on the same network but not all. Sometimes flapping the VM's network interface brought it back; sometimes moving it to another chassis/server worked. The most drastic step was to fully reboot the VM, but sometimes even that wouldn't work.

The problem here is that we're not seeing any pattern. For the time being we've suspended DRS vMotion, so that we don't get any network handoffs or changes in VM network state, but it still happens.

We've even disabled physical host NICs (removing them from the vSwitch), and that alleviated the problem; sometimes we'd take vmnic0 out and traffic would return to normal, sometimes we'd have to swap over to vmnic1. However, we don't want to run each vSwitch on a single host NIC.

We've also replaced all our blade-switch to core Cisco-switch connections with brand-new cables...

We've checked and double-checked our blade switch configs as well as our core switch configs, but we can't see why this is happening, because we've had this exact same setup for the last 2-3 years.

About the hardware: we have 2 x Dell M1000 blade chassis, both with dual Dell M6220 blade switches, connected to two Cisco 2960G core switches.

The VM hosts are 6 x M600 blades and 2 x M610 blades, with dual fabrics for the VM network and the iSCSI network.

We have 3 VLAN segments for our VM networks, and our vSwitches have 24 or 48 ports. We suspected we could have been running out of vSwitch ports, but esxcfg-vswitch doesn't seem to show that.
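For reference, this is the kind of output we're going by (trimmed, numbers illustrative); Used Ports stays well below Configured Ports:

    # esxcfg-vswitch -l
    Switch Name    Num Ports   Used Ports  Configured Ports  MTU   Uplinks
    vSwitch0       64          18          48                1500  vmnic0,vmnic1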

Our vSwitches are set to Port ID load balancing. Our blade switches are connected to the core switches via port channel (2 ports per blade switch).

At this point it looks like it's a VMware thing, or maybe a combination of Dell/Cisco/VMware... But now that I know someone else is experiencing this, I tend to want to look towards VMware...

Any help is appreciated, and if you all need more information, I'll be glad to post it up.

thanks ahead of time

alvin

alvinswim
Hot Shot

Bump. Any ideas?

An update on our situation: we found a bunch of machines from one of our corp branches with NICs just hanging out doing nothing, sitting on the default 169.xxx.xxx.xxx addresses that Windows assigns when nothing is attached. We promptly went into each machine and removed those NICs.

Everything seems to be OK now... but I am skeptical that it's something as simple as that. 12-13 un-IP'd NICs periodically broadcasting for a DHCP server and flooding the entire VLAN (the same VLAN we are having issues with) with broadcast traffic... could that cause the vSwitch to act funky across the entire cluster?
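One way to test the theory would be to watch for DHCP broadcasts from a Linux VM on that VLAN, something like this (interface name is an example):

    tcpdump -n -i eth0 'udp port 67 or udp port 68'    # DHCP discover/request traffic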

Maybe it's overloading the vSwitch? Are there any logs for the vSwitch anywhere? To me it looks like a black box.

Anyway, any suggestions would be very much appreciated.

DSTAVERT
Immortal

You should create your own post and provide all of your own details. You are not just adding information to help someone resolve their problem, so responses may not be appropriate for both. Reference this post if you like.

-- David -- VMware Communities Moderator