VMware Cloud Community
saltnz
Contributor

Random network issues on guests on certain host

Hi, we have a vCenter 4 infrastructure. Our largest cluster, 4 ESXi 4 servers with distributed virtual switches (vDS), sometimes has random network outages on the guests of one particular host. The guests become unreachable at unpredictable times and for unpredictable durations. All guests on that host are affected.

We have to put the host into maintenance mode and reboot it. That seems to fix the problem until it resurfaces a couple of weeks later.

Because of the randomness it is very hard to troubleshoot. I have no idea what is going on. No errors are reported in vCenter; I am only aware of it because of 3rd-party monitoring and reports from users of services not running.

Any ideas on what is going on or how I could start to diagnose this? I have been on the switches while the condition was happening, and the interfaces report as up.
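One way to start gathering evidence is to log, with timestamps, which hosts have lost all of their guests at once. Below is a minimal Python sketch of that idea. It is not something used in this thread: the guest addresses, host names, and the Linux-style `ping` flags are all assumptions, and the guest-to-host mapping has to be maintained by hand or exported from vCenter.

```python
from collections import defaultdict
from datetime import datetime
import subprocess


def ping_once(address, timeout_s=1):
    """Return True if a single ICMP echo succeeds.

    Assumes a Linux-style ping (-c count, -W timeout in seconds).
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def hosts_fully_down(guest_to_host, probe=ping_once):
    """Return the hosts on which every listed guest failed the probe."""
    status = defaultdict(list)
    for guest, host in guest_to_host.items():
        status[host].append(probe(guest))
    return sorted(h for h, results in status.items() if not any(results))


def log_line(down_hosts, now=None):
    """Format one timestamped log line for the current check."""
    stamp = (now or datetime.now()).isoformat(timespec="seconds")
    names = ", ".join(down_hosts) or "none"
    return f"{stamp} hosts with all guests unreachable: {names}"


if __name__ == "__main__":
    # Hypothetical inventory: guest IP -> ESXi host it currently runs on.
    inventory = {
        "192.0.2.11": "esx1",
        "192.0.2.12": "esx1",
        "192.0.2.21": "esx2",
    }
    print(log_line(hosts_fully_down(inventory)))
```

Run from cron every minute or so, the log gives start and end times for each outage, which can then be compared against the physical switch logs and the host's vmkernel log.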

10 Replies
andrewisett
Contributor

Make sure that the guests don't somehow have the same MAC addresses. I had a couple of systems that seemed to work sometimes, especially after a reboot of one or the other server, but then randomly couldn't access each other. In the settings of the machines, check the virtual MAC addresses to ensure they are not the same for both systems.

If they are, just change the last 6 digits to a value that is acceptable in VMware's MAC address range (for manually assigned addresses that is 00:50:56:00:00:00 through 00:50:56:3F:FF:FF). I suggest 00:00:01 or something similar.
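Checking this by hand gets tedious with many guests, so here is a minimal Python sketch that scans .vmx files for MAC addresses shared by more than one file. It assumes the addresses are stored under the usual `ethernetN.generatedAddress` or `ethernetN.address` keys, and the datastore path in the example is hypothetical.

```python
import re
from collections import defaultdict
from pathlib import Path

# Matches lines such as:
#   ethernet0.generatedAddress = "00:50:56:ab:cd:ef"
#   ethernet0.address = "00:50:56:00:00:01"
# but not ethernet0.addressType or ethernet0.generatedAddressOffset.
MAC_LINE = re.compile(
    r'^\s*ethernet\d+\.(?:generatedAddress|address)\s*=\s*"([0-9A-Fa-f:]{17})"',
    re.MULTILINE,
)


def macs_in_vmx(text):
    """Return the set of MAC addresses (lower-cased) found in one .vmx file."""
    return {mac.lower() for mac in MAC_LINE.findall(text)}


def find_duplicate_macs(vmx_texts):
    """Map each MAC used by more than one file to the sorted list of files.

    vmx_texts maps a label (for example a file path) to the file contents.
    """
    owners = defaultdict(list)
    for label, text in vmx_texts.items():
        for mac in macs_in_vmx(text):
            owners[mac].append(label)
    return {mac: sorted(labels) for mac, labels in owners.items() if len(labels) > 1}


def scan_directory(root):
    """Read every .vmx file below root and report duplicated MACs."""
    texts = {
        str(path): path.read_text(errors="replace")
        for path in Path(root).rglob("*.vmx")
    }
    return find_duplicate_macs(texts)


if __name__ == "__main__":
    # Hypothetical mount point for a copy of the datastore.
    for mac, files in scan_directory("/mnt/datastore1").items():
        print(mac, "is used by:", ", ".join(files))
```

Comparing whole addresses across all files avoids the false sense of safety you can get from grepping for only a few digits of one guest's MAC.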

When you do things right, people won't be sure you've done anything at all.
saltnz
Contributor

I ran a grep over the vmx files for the last 4 digits of the MAC address of one guest that I was having a problem with, and it only came up with the single entry where I would expect it. I would need to wait for the condition to recur before I could follow that up with all the guests. I am under a bit of pressure to resolve the issue, so when it happened today I just proceeded with the fix and only took note of one guest on that host.

I would have thought DRS would have pointed that one out; it seems to pick out everything else.

Thanks for the suggestion, but I will have to wait to gather more info before continuing that lead.

Any other ideas?

AureusStone
Expert

Sounds exactly like a physical NIC issue. Replace the NIC you are using for your guests. Otherwise, if you have redundancy, you can disable the NIC from vCenter and see if that fixes it.

sabya1232003
Enthusiast

Please have a look at the dvSwitch configuration and the vNIC settings of the VMs (MAC address, speed/duplex), and also check whether VMware Tools is running.

Samikouk
Contributor

If you have a team of NICs, I would remove a single NIC's association with the dvSwitch for a specified period of time and see if you lose connectivity.

It's all pointing to a faulty NIC.

Also, you could monitor the status of your NICs using esxtop. You can run it via your vMA using the following command: resxtop --server server.name

Switch to the network view (press n) and add the field PNIC = Physical Nic Properties, which shows you whether the NIC is physically up and also lets you watch the flow of traffic.

AlexNG_
Enthusiast

Hi saltnz

Just wondering, do you have NIC teaming? Apart from what the others said, can you check your switch logs? Do you have Spanning Tree Protocol enabled? If yes, then either disable it or enable Rapid Spanning Tree. You could take a look here:

http://kb.vmware.com/kb/1004074

AlexNG

If you find this information useful, please award points for "correct" / "helpful".

saltnz
Contributor

Thanks for all those tips. I guess I need to wait till the condition comes back.

We are not using teaming; we just have 2 NICs for the troubled DVS. No spanning tree is enabled, but I have forwarded that networking URL on to our Cisco guy. We have had issues with the physical switches and have had to make alterations in the past. We have had a particular outstanding issue (logged with our network vendor) where we lose an uplink and you have to go to the switch and issue "shutdown" then "no shutdown" on the interface, which is why I have been looking there. That is always the first thing I check with networking issues, and I will try some of those other tips next time, but the switch tells me the interfaces are up and so does vCenter.

mkzero
Contributor

Hi All,

We have the same issue.

Were you able to solve the problem?

Shakaal
Hot Shot

Hi,

I would suggest checking the CPU utilization on the host at the time of the issue; a shortage of CPU resources could be causing the problem. Run the "esxtop" command and check the %RDY field. The value should not be more than 5; if it is, there is an issue with CPU resources. Network packets need CPU for processing, so I would ask you to check this.
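If you only have the vCenter performance charts rather than esxtop at the moment of the outage, the CPU ready counter there is a summation in milliseconds, not a percentage. A minimal Python sketch of the usual conversion follows, assuming the 20-second interval of the realtime chart; the function names are made up for illustration, and the threshold of 5 is simply the one suggested above.

```python
def ready_percent(ready_ms, interval_s=20):
    """Convert a CPU ready summation (ms) into a percentage of the sample interval.

    interval_s=20 assumes the realtime performance chart; other chart
    ranges use longer intervals.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return ready_ms / (interval_s * 1000.0) * 100.0


def exceeds_threshold(ready_ms, interval_s=20, threshold_pct=5.0):
    """Return True if the converted ready time is above threshold_pct."""
    return ready_percent(ready_ms, interval_s) > threshold_pct


if __name__ == "__main__":
    # Hypothetical sample: 2000 ms of ready time in one realtime interval.
    print(f"{ready_percent(2000):.1f}% ready")
```

Note that this is a different measurement from overall host CPU utilization, so a lightly loaded host does not by itself rule out a high ready value for an individual VM.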

If CPU is fine, check the /var/log/vmkernel file and see if there are any messages related to storage.

Apart from the VMs losing the network, are you able to access the VMs' consoles using the VI Client?

I would also like to know how the issue gets resolved when it happens.

Regards

mkzero
Contributor

hello,

Host utilization was under 5%, so I don't think that's a CPU problem.

I will have a look at the kernel logs, but all 6 ESXi 4.1 Update 1 hosts in the cluster have the same NFS datastores.

The VM console is OK, and so are the VMs themselves, but they are not pingable from outside.

@resolve issue: move the VMs via vMotion (yes, vMotion works) and reboot the host.
