VMware Cloud Community
Love4B
Contributor

Bouts of packet loss to management address but not VMs

Hi all,

I recently got VMware ESXi 3.5.0 build 207095 working on a desktop dinosaur (Brookdale, P4 1.4GHz, 1.5GB PC133), and it actually worked surprisingly well (given reasonable expectations).  Encouraged by that, and having another, somewhat more muscular desktop dino (Brookdale-GE, P4 2.6GHz, 533MHz FSB, 2GB DDR) sitting underutilized, I decided to transfer the setup to the latter so I could get more out of the "server" and reclaim the former as a client.  That's where I ran into a lot of problems; ironically, the faster machine is giving me far more trouble than the slower one!

(And yes, I realize I should ideally buy a real server to run ESXi on, but my budget is virtually non-existent, and with so few clients and so few VMs (three at most), the dinosaur should be able to get the job done; in fact, the slower one came pretty close.)

My present problem is that I frequently get disconnected from the host in the VI Client.  (Such disconnections were fairly rare when the same ESXi install was running on the slower system.)  If I ping the host's management address, I see periods where it is fine, but also long periods (often 30 seconds or more) of 100% packet loss.  This is mildly correlated with workload in the VMs, but even when they are idle, or powered off entirely, I see streaks of 100% packet loss to the management address.

By contrast, I can ping the VMs themselves with no packet loss even under load.  If I run continuous pings side-by-side, I will see frequent periods where the management address is dropping all packets but the VMs are responding fine.  Even in these cases, response time from the VMs is typically <1ms (with occasional packets more significantly delayed but still getting through).

I don't suspect a physical network issue, as the VMs are using the same NIC as the management address, and don't suffer any connectivity problems.

If I access the support console on the host, I can ping the client with no packet loss, and in fact, if I begin pinging the client from the host, it often restores ping response in the other direction.  If I use "ping -c 0 #.#.#.#" in the support console to continually ping the client, with the client likewise pinging the management address, the client will still get sporadic dropped packets, but not long runs of 100% loss.
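For anyone wanting to reproduce the test, this is roughly what I'm running.  The addresses below are just placeholders for my client, the host's management address, and a VM respectively; substitute your own.

On the Windows client, in two separate command windows (the first typically shows the long runs of "Request timed out", the second keeps answering in <1ms):

    ping -t 192.168.0.20
    ping -t 192.168.0.30

In the ESXi support console (busybox ping; a count of 0 keeps it running indefinitely), pinging back at the Windows client:

    ping -c 0 192.168.0.10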

I've also tried continually pinging the management address from within a VM - easy to monitor over SSH to the VM, which stays responsive.  The result is that the management address continues to respond to pings from within the VM, with essentially no packet loss, even while it is on a streak of not responding to pings from outside the machine (for 30 sec. or more).  Most such VM-to-management pings are answered in <1ms, though occasionally they take as long as 2 or 3 seconds.  (I'm not sure whether the Windows ping I'm running from outside would count a 2-3 second delay as a lost packet.)

Another issue (not sure if it's related) is that copying files via CIFS into a lone VM running Solaris 11 Express achieves considerably lower overall throughput on the faster system than it did on the slower one - less than half, in fact.  Guest remote desktop responsiveness seems better, though.  However, I haven't tested this in a controlled manner.

In all cases, all machines involved are on the same subnet, separated only by a small 10/100Mbps switch (D-Link DSS 5+) with only mild traffic to other ports.

I do see a line repeated frequently in the logs, which looks related:

Hostd: [<timestamp> 'App' 131081 error] Failed to send response to the client: Broken pipe

Some background:  In order to get networking up on the faster dino, I first had to abandon the on-board NIC (not on the HCL, and it didn't work).  Second, the NIC that had been running fine in the slower machine (also not on the HCL) developed a strange problem in the faster machine: everything would be fine for a consistent 10 minutes, then networking (management and VMs alike) would go completely dead, and the box had to be rebooted to restore connectivity for another 10 minutes.  Networking didn't work at all if I popped in an AGP video card instead of using the integrated VGA.  (Bizarre!)  I looked for a BIOS update, but IEX423EM was later than any I found on-line.  Anyhow, the catastrophic networking failures went away when I switched to a 3Com 3C905.

I'll be very appreciative if anyone can suggest what might be going on here (re. the frequent management port outages) or what else I might look at.

Many thanks,
Kevin

1 Reply
Love4B
Contributor

Hi,

Well, I found out what was going on (although not why).

In switching to the 3Com NIC, I had taken it from another machine on the network and replaced it there with the IBM NIC that had been in the VMware box (the one that wasn't working right, for reasons unknown).  Eventually I started getting weird networking problems on that other box as well, which led to the explanation.  It turns out that, for some unknown reason, VMware had remembered the MAC address of the IBM NIC and was still using it with the 3Com installed, which meant two machines on the network with the same MAC - obviously trouble.

The VMs were unaffected since they don't use the host's hardware MAC address, although apparently the management interface does (contrary to what some threads in the forums claim).  Management connectivity was temporarily restored by pinging out from the support console because this updated the MAC forwarding table in the physical switch, pulling the shared MAC address back to the VMware box's switch port.  I confirmed all this with several tests, including dumping the clients' ARP caches.
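For anyone hitting something similar, this is roughly the sort of checking I did on the client side, plus what I'd suggest looking at on the host side.  Addresses are placeholders, and the esxcfg commands are from memory, so treat this as a sketch rather than gospel.

On a Windows client, dump the ARP cache and look for the same MAC showing up against two different IPs; you can clear a suspect entry and re-test:

    arp -a
    arp -d 192.168.0.20

In the ESXi support console, compare the MAC of the management VMkernel interface against the physical NICs (if I recall correctly, esxcfg-vmknic -l lists the vmk interface with its MAC, and esxcfg-nics -l lists the physical adapters); a management MAC that matches a card no longer in the box is the smoking gun:

    esxcfg-nics -l
    esxcfg-vmknic -l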

I have no idea why VMware clung to the old card's MAC address.  It did show the new card in the local console's menus, but listed it as "disconnected".  It also showed as "down" in the VI Client, although it obviously wasn't down - otherwise I couldn't have connected at all, since it's the only NIC in the box (not counting the disabled on-board NIC, which isn't hooked up anyway).  Must be VMware bugs.

I also couldn't figure out how to correct the issue short of reinstalling ESXi, which is what I ended up doing.  Now it works fine, although oddly enough it still shows the 3Com NIC as "down" even though it obviously isn't.
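If anyone knows a proper fix, I'd still be curious.  My guess (unverified) is that the stale MAC was recorded in the host configuration file /etc/vmware/esx.conf when the management interface was first created.  Something like the following from the support console should show whether an old MAC is baked into the config - purely a diagnostic, and the exact entry names may differ on ESXi 3.5:

    grep -i mac /etc/vmware/esx.conf

I wouldn't hand-edit that file without a backup, though.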

Cheers,

Kevin
