VMware Cloud Community
rreynol
Enthusiast

VMotion - Inconsistent ping loss

One of our users reported network disruption whenever we migrated their three VMs (Red Hat 5 64-bit, 8192 MB RAM, 4 vCPUs). Using test VMs of the same size, we were able to reproduce the problem: 20% to 30% ping loss during the migration, whether pinging from inside or outside their subnet.
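For anyone who wants to compare numbers, here is a minimal sketch of how the loss figure can be pulled out of a ping run. It assumes the summary line has the common "N packets transmitted, M received" shape; the function name `packet_loss_percent` and the sample line are made up for illustration and are not output from our tests.

```python
import re


def packet_loss_percent(summary: str) -> float:
    """Compute the loss percentage from a ping summary line.

    Assumes the summary contains text such as
    '100 packets transmitted, 75 received' (the word 'packets'
    before 'received' is optional).
    """
    match = re.search(
        r"(\d+) packets transmitted, (\d+) (?:packets )?received", summary
    )
    if match is None:
        raise ValueError("no ping summary line found")
    sent, received = int(match.group(1)), int(match.group(2))
    if sent == 0:
        raise ValueError("no packets were transmitted")
    return 100.0 * (sent - received) / sent


# Illustrative input only, not a captured log.
sample = "100 packets transmitted, 75 received, 25% packet loss, time 99123ms"
print(packet_loss_percent(sample))
```

Running the ping for the whole duration of the migration and recording this figure for each attempt makes it easier to compare runs.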

Working with VMware support, we set up one crossover cable for the VMotion traffic and another for the IP traffic between two ESX hosts (HP ProLiant DL585 G5, 4-way quad-core, 128 GB RAM). This completely eliminates the one switch that was involved in the original setup. We created new vSwitches and port groups for the crossover cables.

The problem persists. Our other clusters have even larger RH5 VMs that do not exhibit this problem, so our best guess at this time is that something is awry within the VM itself.

Just curious if anyone else has ever seen behavior like this.

Message was edited by: rreynol

I should add that this is VC 2.5 Update 2 and VI 3.5 Update 2, and VMware Tools shows as up to date.

1 Reply
rreynol
Enthusiast

We more or less have a "resolution" on this problem. After the VMware network engineer determined it was not a network issue, he consulted a research engineer on the system side, and they concluded that this is most likely a memory issue.

Three other support cases have been opened with the same symptoms as ours: a VM with a large amount of memory stops responding on the network during a VMotion. VMware's investigation traced the problem to VMs that use memory pages larger than the default 4K size. During a VMotion the VM loses contact with those large pages, so the operating system has to re-establish them. That work is given the highest OS priority, so the VM appears to pause while it completes, which can take several minutes.

As a mitigation, VMware provided a setting for the .vmx file to deal with these large pages:

1. Shut down the VM

2. Edit the .vmx file and add this line: monitor_control.disable_mmu_largepages = "TRUE"

3. Start the VM
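
If you have several VMs to change, step 2 above can be scripted. The following is only a sketch of the text edit, under the assumption that the .vmx file is a plain list of `key = "value"` lines; the helper name `set_vmx_option` is hypothetical and not a VMware tool. It works on the file contents as a string, so reading and writing the actual file (with the VM shut down, per step 1) is left to you.

```python
def set_vmx_option(vmx_text: str, key: str, value: str) -> str:
    """Return vmx_text with `key = "value"` present exactly once.

    Assumption: each setting is on its own line in the form key = "value".
    An existing line for the same key is replaced; otherwise the
    setting is appended at the end.
    """
    new_line = f'{key} = "{value}"'
    out = []
    replaced = False
    for line in vmx_text.splitlines():
        name = line.split("=", 1)[0].strip()
        if "=" in line and name == key:
            if not replaced:
                out.append(new_line)
                replaced = True
            # Any further lines for the same key are dropped.
            continue
        out.append(line)
    if not replaced:
        out.append(new_line)
    return "\n".join(out) + "\n"


# Hypothetical .vmx fragment, for illustration only.
original = 'memsize = "8192"\nnumvcpus = "4"\n'
print(set_vmx_option(original, "monitor_control.disable_mmu_largepages", "TRUE"))
```

Applying the function a second time leaves the text unchanged, so it is safe to rerun against a file that already has the setting.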

They originally had me edit the .vmx file and then reboot, but the change did not persist: editing the file while the VM is running causes a handshake error on the .vmx file, and the change is removed. I have seen this before. Virtual Center caches the contents of the .vmx file and does not cope well when you edit the file directly while the VM is running.

After I made the above changes the number of ping losses did drop by about half but it is still higher than we would like.

VMware considers this a known issue and plans to fix it in ESX 4.
