Hello,
We have virtualized some Windows domain controllers on ESX 2.5.3.
Since then, clients sometimes experience long logon delays.
Packet analysis shows that the virtual servers are reachable but respond with very small packets of about 60 bytes (see the trace below).
Has anyone seen this problem, or does anyone have an idea what could be wrong?
*********************
16:32:42.805730 IP server.aaa.com.445 > client.aaa.com.1053: P 9116:9182(66) ack 8695 win 63547
16:32:42.805730 IP client.aaa.com.1053 > server.aaa.com.445: P 8695:8758(63) ack 9182 win 63452
16:32:42.821355 IP server.aaa.com.445 > client.aaa.com.1053: P 9182:9248(66) ack 8758 win 63484
16:32:42.821355 IP client.aaa.com.1053 > server.aaa.com.445: P 8758:8821(63) ack 9248 win 63386
16:32:42.821355 IP server.aaa.com.445 > client.aaa.com.1053: P 9248:9314(66) ack 8821 win 63421
16:32:42.821355 IP client.aaa.com.1053 > server.aaa.com.445: P 8821:8884(63) ack 9314 win 63320
16:32:42.836980 IP server.aaa.com.445 > client.aaa.com.1053: P 9314:9380(66) ack 8884 win 63358
16:32:42.836980 IP client.aaa.com.1053 > server.aaa.com.445: P 8884:8947(63) ack 9380 win 63254
16:32:42.852605 IP server.aaa.com.445 > client.aaa.com.1053: P 9380:9446(66) ack 8947 win 63295
16:32:42.852605 IP client.aaa.com.1053 > server.aaa.com.445: P 8947:9010(63) ack 9446 win 63188
16:32:42.852605 IP server.aaa.com.445 > client.aaa.com.1053: P 9446:9512(66) ack 9010 win 63232
16:32:42.852605 IP client.aaa.com.1053 > server.aaa.com.445: P 9010:9073(63) ack 9512 win 63122
16:32:42.868230 IP server.aaa.com.445 > client.aaa.com.1053: P 9512:9578(66) ack 9073 win 63169
16:32:42.868230 IP client.aaa.com.1053 > server.aaa.com.445: P 9073:9136(63) ack 9578 win 63056
16:32:42.868230 IP server.aaa.com.445 > client.aaa.com.1053: P 9578:9644(66) ack 9136 win 63106
16:32:42.868230 IP client.aaa.com.1053 > server.aaa.com.445: P 9136:9199(63) ack 9644 win 64512
16:32:42.883855 IP server.aaa.com.445 > client.aaa.com.1053: P 9644:9710(66) ack 9199 win 63043
16:32:42.883855 IP client.aaa.com.1053 > server.aaa.com.445: P 9199:9262(63) ack 9710 win 64446
16:32:42.883855 IP server.aaa.com.445 > client.aaa.com.1053: P 9710:9776(66) ack 9262 win 62980
16:32:42.883855 IP client.aaa.com.1053 > server.aaa.com.445: P 9262:9325(63) ack 9776 win 64380
16:32:42.899480 IP server.aaa.com.445 > client.aaa.com.1053: P 9776:9842(66) ack 9325 win 62917
16:32:42.899480 IP client.aaa.com.1053 > server.aaa.com.445: P 9325:9388(63) ack 9842 win 64314
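For anyone who wants to check a longer capture for the same pattern, here is a minimal Python sketch that tallies TCP payload sizes per sender from tcpdump text output in the format shown above. The function name payload_sizes and the regular expression are my own, written against these sample lines only; other tcpdump versions or options may print a different layout.

```python
import re
from collections import Counter

# Matches the tcpdump text format shown in the trace, e.g.
# "16:32:42.805730 IP a.445 > b.1053: P 9116:9182(66) ack 8695 win 63547"
# Assumption: flags are a single token and the payload length is in parentheses.
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}:\d{2}:\d{2}\.\d+) IP (?P<src>\S+) > (?P<dst>\S+): "
    r"\S+ \d+:\d+\((?P<length>\d+)\)"
)


def payload_sizes(lines):
    """Count how often each (sender, payload length) pair occurs."""
    sizes = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:  # lines that do not fit the pattern are skipped
            sizes[(match.group("src"), int(match.group("length")))] += 1
    return sizes


sample = [
    "16:32:42.805730 IP server.aaa.com.445 > client.aaa.com.1053: P 9116:9182(66) ack 8695 win 63547",
    "16:32:42.805730 IP client.aaa.com.1053 > server.aaa.com.445: P 8695:8758(63) ack 9182 win 63452",
    "16:32:42.821355 IP server.aaa.com.445 > client.aaa.com.1053: P 9182:9248(66) ack 8758 win 63484",
]

for (src, length), count in sorted(payload_sizes(sample).items()):
    print(f"{src}: {count} segment(s) with {length}-byte payload")
```

A capture dominated by one or two tiny payload sizes in strict request/response alternation, as in the trace above, points at a chatty application-level exchange rather than at bulk transfer.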
Since this is ESX 2.x, which VM NIC driver are you using: VMXNET or VLANCE? Do you have VMware Tools installed on each of the virtual machines? And do you have gigabit connections for your virtual machines?
Michael
Yes, it is VMXNET with 1 Gbit virtual cards.
Hello,
What happens if you switch to PCNET32? This is the first step in the analysis, as the vmxnet driver makes assumptions about the networking that may not be valid in your case. Run the same test using this option and compare the results. If the results are similar, then it makes no difference which you use.
Next, go to the service console (SC) and, while your test is running, run esxtop in batch mode to capture the vmnic information. What are the packet and byte transfer rates? Are you hitting the limits?
Are all these VMs on the same vSwitch?
Are you using load balancing or failover on your vSwitches?
Best regards,
Edward
Thank you for the suggestions. We will see what we can do under our conditions.
VMs are on different switches and we are not using load balancing/failover features.
The problem occurs on production servers and only during periods of higher load, such as every morning when everybody logs on and downloads their GPOs and logon scripts.
We have not been able to reproduce it in our lab nor in production under lower load conditions.
As the problem affects many users at the same time and is not reproducible on demand, we cannot experiment much with different options in production without a clear troubleshooting plan.
So far we have captured the network traces showing the small packets.
We have also measured the performance counters inside VMs.
When we increase the load and the problem happens:
CPU peaks at 30%
Network load is around 5-8 Mbit/s
so neither looks too bad.
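One caveat on the figures above: a modest bit rate can still mean a high packet rate when the packets are tiny. A rough Python sketch of the arithmetic follows. It assumes, purely for illustration, that all of the measured load consists of segments like those in the trace (66-byte payload) and that each frame carries 54 bytes of Ethernet, IP and TCP headers with no options or VLAN tag; the real traffic mix was not measured this way.

```python
# Assumed per-frame overhead: Ethernet (14) + IPv4 (20) + TCP (20) bytes,
# no VLAN tag, no IP/TCP options, preamble and FCS ignored.
ETH_IP_TCP_HEADERS = 14 + 20 + 20


def packets_per_second(load_mbit_s, payload_bytes):
    """Packets per second needed to produce the given load at one payload size."""
    frame_bytes = payload_bytes + ETH_IP_TCP_HEADERS
    return load_mbit_s * 1_000_000 / 8 / frame_bytes


for load in (5, 8):  # the measured range in Mbit/s
    small = packets_per_second(load, 66)    # payload size seen in the trace
    full = packets_per_second(load, 1460)   # full-size segment on standard Ethernet
    print(f"{load} Mbit/s: ~{small:.0f} pps at 66-byte payloads, "
          f"~{full:.0f} pps at 1460-byte payloads")
```

Under these assumptions the same 5-8 Mbit/s corresponds to roughly an order of magnitude more packets per second than it would with full-size segments, so the packet rate is worth measuring alongside the byte rate.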
I would like to know, in principle, if this can happen because of virtualization. Has anyone seen something similar?
We have a large (15+ pgs.) thread going about network performance:
http://www.vmware.com/community/thread.jspa?threadID=77227&start=0&tstart=75
Most of the troubles that we are looking at are for ESX 3.0.1. Aside from the questions already asked, here are mine:
1. What are the virtual hw specs of your converted Domain Controllers?
2. What HW are you using for your ESX Hosts?
3. (asked before) Have you installed the VMware Tools on these converted servers?
This almost sounds like your guest machines are having Windows-type issues, not VMware ones.
After some additional analysis of the network traffic, we concluded that the problem is related to a Windows 2003 bug in which the server does not grant (for whatever reason) an opportunistic lock (oplock) to clients trying to read shared files.
Without the oplock, clients avoid buffering and read the file byte by byte(!!!), which results in a h-u-u-u-ge number of tiny packets going back and forth.
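To give a feel for the scale, here is a small Python sketch that counts request/response round trips needed to read one file at different read sizes. The 100 KiB file size and the 4 KiB and 64 KiB buffered read sizes are hypothetical values picked for illustration; they are not taken from our traces.

```python
import math


def round_trips(file_bytes, bytes_per_read):
    """Number of read requests needed to fetch the whole file."""
    return math.ceil(file_bytes / bytes_per_read)


script_size = 100 * 1024  # hypothetical 100 KiB logon script

# 1 byte per read models the unbuffered case described above;
# 4096 and 65536 are assumed buffered read sizes for comparison.
for read_size in (1, 4096, 65536):
    trips = round_trips(script_size, read_size)
    print(f"{read_size:>5} bytes per read: {trips} round trips")
```

Each round trip costs at least one network latency, so with hundreds of clients logging on at the same time the unbuffered case adds up quickly.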
Solution: http://support.microsoft.com/kb/319440
Thank you all for your advice.