rsarran
Contributor

VM cannot ping Host and vice versa

This is a very puzzling problem.  VMware support and Dell have both been trying to figure it out, so I am throwing this out to the community to see if anyone else has experienced this issue and may have a solution.  I have 3 identical Dell R720 servers.  2 work with no issues, but 1 (call it VM8) has been giving me problems since day 1.  Dell checked the hardware today and had me update the BIOS, firmware and drivers on VM8, which did not resolve the issue.  VMware technicians have checked every network setting over the past several weeks and so far cannot find the cause.

VM8 has ESXi 5.5.0 installed.  The server has 2 NICs with 4 ports each.  The current configuration is vmnics 0-3 connected to our LAN, 4-5 to our DMZ and 6-7 to our SAN (iSCSI).  HA goes up and down because VM8 loses connectivity to our isolation address (the gateway).

VM8 (management network IP 172.20.100.9) has only 1 VM (172.20.100.40), on the same subnet (255.255.255.0).  .9 times out pinging .40 using vmkping.  When I ping .9 from .40, the first packet gets a quick reply, then all following packets time out.  According to VMware, a ping between a VM and its own host does not go out through the physical NIC to the physical switch.  Everything stays internal to the vmnic and vSwitch.  When I ping my gateway (172.20.100.1), the ping is successful.  When I ping .9 from my workstation, the first packet times out, then the following packets get a reply.  It's the exact opposite of pinging from the VM.
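
Since the host and the VM are described as being on the same subnet, traffic between them should be delivered directly on the local segment (resolved by ARP) rather than routed through the gateway. A minimal sketch of that check, using Python's standard `ipaddress` module; the function name `same_subnet` is only illustrative:

```python
import ipaddress


def same_subnet(ip_a: str, ip_b: str, netmask: str) -> bool:
    """Return True if both addresses fall in the same network for the given mask."""
    # strict=False lets us pass a host address and have it reduced to its network.
    net_a = ipaddress.ip_network(f"{ip_a}/{netmask}", strict=False)
    net_b = ipaddress.ip_network(f"{ip_b}/{netmask}", strict=False)
    return net_a == net_b


# Addresses taken from the post: host .9, VM .40, mask 255.255.255.0
print(same_subnet("172.20.100.9", "172.20.100.40", "255.255.255.0"))
```

If this returns True, neither side needs the gateway to reach the other, so anything that answers ARP for either address on that segment can interfere with the exchange.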

Here's a better breakdown:

.9 VM8 Host

.40 VM on VM8 host

.1 Gateway

.122 workstation on LAN

.25 vRanger (physical server on LAN)

Ping

.9 to .40 (100% packet loss)

.40 to .9  (75% packet loss)  first packet gets reply, next 3 timeout

.9 to .122 (0 packet loss) good ping

.122 to .9 (0 packet loss) good ping

.9 to .25 (75% loss) vmkping does not display each packet as it is sent.  But from other results, I can safely assume first packet times out.

.25 to .9 (75% loss) first packet timed out, following 3 got a reply

.40 to .122 (0 packet loss) good ping

.122 to .40 (100% packet loss)

All 3 can ping .1, but about every 20 minutes on VM8 I get a "vSphere HA agent on this host could not reach the isolation address 172.20.100.1" message.
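
To make a breakdown like the one above easier to compare across runs, the per-packet results can be reduced to a loss percentage plus a short pattern label. This is only a sketch: the function name `summarize` and the pattern labels are made up for illustration, and the input is a plain list of True/False values (reply or timeout) that you would fill in by hand from the ping output.

```python
from typing import List


def summarize(replies: List[bool]) -> str:
    """Summarize per-packet ping results (True = reply, False = timeout)."""
    sent = len(replies)
    if sent == 0:
        raise ValueError("no packets were sent")
    lost = replies.count(False)
    loss_pct = round(100 * lost / sent)

    if lost == 0:
        pattern = "all packets replied"
    elif lost == sent:
        pattern = "all packets timed out"
    elif replies[0] and not any(replies[1:]):
        pattern = "only the first packet replied"
    elif not replies[0] and all(replies[1:]):
        pattern = "only the first packet timed out"
    else:
        pattern = "mixed"
    return f"{loss_pct}% loss, {pattern}"


# .40 to .9 from the breakdown: first packet replies, next 3 time out
print(summarize([True, False, False, False]))
# .9 to .40 from the breakdown: nothing comes back
print(summarize([False, False, False, False]))
```

A pattern where only the first packet behaves differently from the rest is worth noting on its own, because it points at something changing after the first exchange (for example an address resolution entry) rather than at a link that is simply down.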

Also, throughout the day I get the message "vSphere HA agent on this host cannot reach some of the management network addresses of other hosts, and HA may not be able to restart VMs if a host failure appears."  I have come to work in the morning and found that all of my VMs on VM8 have migrated to my other 2 hosts.  My backups don't work on VMs on VM8.  I use vRanger, and when I ping VM8 from vRanger (a physical server), the first packet times out and the following packets get a reply.  So, when vRanger goes to back up my VMs, it fails because of the initial packet loss.

These are the things I have tried already.  I tested each physical NIC individually.  I removed every port on both NICs to try to isolate a specific port.  All 4 vmnics are active adapters in the Management Network Properties NIC Teaming, and I moved each vmnic individually to unused to test each port.  I have replaced the Cat6 cables.  I have used different Dell switches and different ports on the switch.  I even swapped in the switch ports that another host used, ruling out a switch port configuration issue.  Also, port security is disabled on the ports.  I upgraded ESXi 5.5.0 to a newer build.  There's a known issue with the tg3 driver, so I upgraded it to the latest version, which does not have the problem.  I also used the tg3 workaround of disabling NetQueue.  And we do not use VLANs.  Dell tech support states that it is not a hardware issue and believes it is a Layer 2 issue, but is not sure where.  Basically, it's either an internal problem (meaning strictly on VM8) with the vSwitches or vmnics, or it's a hardware gremlin in our Dell R720 box.

Dell's final recommendation is to blow away ESXi on the server and install a clean copy.  This is extremely frustrating and I am running out of ideas.

Thanks in advance.

Accepted Solutions
joshopper
Hot Shot

Any chance you have a duplicate IP on your network?

8 Replies
Kahonu84
Hot Shot

Assuming the names of ESX1 and ESX2 for the working hosts and ESX3 for the one in question... Can you connect ESX3 to either ESX1 or ESX2's physical networking? If the problem goes away, the problem is in the networking. If the problem remains, the problem is with the host itself.

rsarran
Contributor

If you are asking whether I swapped the physical cables from ESX3 with those of ESX2 or ESX1 to see if ESX3 would work on their physical network, then yes, I have.  Today I had another one of our support vendors comb over ESX3, and they are also confused.

What it is coming down to is the vSwitch on ESX3.  For example, the management network IP address for ESX3 is 172.20.100.9.  I have a VM on ESX3 with 172.20.100.10.  My desktop computer is 172.20.100.11.

172.20.100.10 is in the same vSwitch as the management network (.9).  They cannot ping each other.  I can ping .9 from my computer, but not .10.  And I can ping the outside world from .9.  That tells me that the physical network, including the NIC on ESX3, is working.  So, the issue has to be with the traffic within the vSwitch.  I have recreated the vSwitch several times.  Tomorrow, I plan on reinstalling ESXi 5.5.0 Update 3b on ESX3.  Hopefully, fingers crossed, this will fix the issue.  I will post my results.

BoneTrader
Enthusiast

I would recheck the following:

-> wiring (also if possible maybe try different switchports)

-> Firmware Version of the NICs

else:

-> evacuate the host

-> scrub the disks

-> fresh install

rsarran
Contributor

I have been working with VMware, Dell and another vendor.  It is not a physical network issue.  I can ping the host and the VM on the host from my computer.  From the host, I can ping out through the LAN (my computer, DNS server, gateway, Google, etc.).  From the VM, I can ping out to everything on the LAN.  The problem is in the vSwitch.  The VMs cannot ping the management network address (the first packet gets a reply and the rest time out).  And the management network address gets the same results when pinging the VMs on the vSwitch.

The BIOS, firmware and drivers were updated on the server several days ago. The vSwitch has been deleted and recreated.  I am starting to recreate it one last time before reinstalling ESXi 5.5.0.

joshopper
Hot Shot

Any chance you have a duplicate IP on your network?

rsarran
Contributor

There are no duplicate IP addresses.  The issue is within vSwitch0.  The Management Network is located on vSwitch0.  If I migrate a VM to the host, it is on the LAN and gets put in vSwitch0.  Pinging between the 2 does not work.  The first packet, and only the first packet, gets a reply.  The rest time out.  According to VMware, when you ping within the vSwitch, the traffic does not go out to the physical network.  Last week, Dell had me update the BIOS, firmware and drivers.  It did not help.  This morning, I upgraded to the latest build of ESXi.  I still had the issue.  After that, I got rid of the OS on the server and did a clean installation.  I still have the issue.  Tomorrow, I plan on wiping the server again and reinstalling it with a different host name and different IP addresses.  This would eliminate any possibility of duplicate IP addresses or anything else that could be conflicting with the host.

joshopper
Hot Shot

Did you run a tracert to confirm the traffic isn't leaving the host?

rsarran
Contributor

It appears that the issue has been resolved.  I want to thank joshopper.  When I first started working on this issue, someone mentioned checking for duplicate IP addresses.  I did check and did not find any.  When joshopper mentioned it again, I decided to get a network scanner.  I downloaded a trial version of a good one.  It went through and did an extensive ARP scan.  The report showed a SonicWall reporting the same IP address as my host.  This is a new SonicWall that is configured with a totally different IP address, so I'm not sure why it is reporting that it has the same IP as my host.  I don't know why my predecessor never tried this, or why I never tried it since I started here 6 months ago, but I changed the IP address of the host to a known clean IP address and the issues are gone.  The only thing I can think of is that there must be an ARP table somewhere with that address.  It makes sense how a duplicate IP address could cause this issue.  I'll let the network tech try to find out why his SonicWall is screwed up.
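
For anyone who lands here with the same symptoms, the check that finally exposed the problem boils down to looking for one IP address that is answered by more than one MAC address. A minimal sketch of that comparison, assuming you already have (IP, MAC) pairs exported from whatever ARP scanner you use; the function name `find_duplicate_ips` and the MAC addresses below are made up for illustration and are not from the actual scan:

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def find_duplicate_ips(observations: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Return every IP that was claimed by more than one distinct MAC address."""
    macs_by_ip: Dict[str, Set[str]] = defaultdict(set)
    for ip, mac in observations:
        # Normalize case so the same MAC written two ways is not counted twice.
        macs_by_ip[ip].add(mac.lower())
    return {ip: macs for ip, macs in macs_by_ip.items() if len(macs) > 1}


# Hypothetical ARP scan results; the MAC addresses are placeholders.
scan = [
    ("172.20.100.9", "02:00:00:00:00:09"),  # the ESXi management interface
    ("172.20.100.9", "02:00:00:00:00:AA"),  # a second device answering for .9
    ("172.20.100.1", "02:00:00:00:00:01"),  # gateway, single owner
]

for ip, macs in find_duplicate_ips(scan).items():
    print(f"{ip} is claimed by {len(macs)} MACs: {sorted(macs)}")
```

Two owners for one address would fit the "first packet only" behavior described in this thread, since neighbors can end up caching whichever MAC answered most recently and then send the following packets to the wrong device.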