VMware Cloud Community
MarcoKohler
Contributor

Windows 2000 Terminal Server freezes after 10 RDP sessions on ESXi 5

We have 4 old Windows 2000 terminal servers with Citrix installed. They have worked very well on ESX for years. Since I upgraded (fresh ESXi 5 installation), these servers work very well for about 30 minutes, then the System process takes more CPU every second. Suddenly the server freezes for a few seconds. Two minutes later, the server does not respond in the GUI; network connectivity is OK (only ping tested). I have tried a lot of settings, with no luck at all.

Only disabling acceleration helps: the server no longer freezes, but no one is able to log on to the server because of some DLL errors. Enabling acceleration again freezes the server again within seconds. Then I moved the server with vMotion to an ESXi 4.1 server, and everything works well!

I made a clone of this server, and it is now running on the new ESXi 5 server. Without any Citrix users logged on, I cannot reproduce the error.

Does anyone have an idea?

Best

marco

45 Replies
agesen
VMware Employee

Apologies for noticing this discussion a bit late.

It is not entirely clear what is going on here yet, but we are
looking at one possible theory: that changes that were made
between 4.1 and 5.0 to solve a correctness bug related to
coherency of translated code in some cases (for some workloads)
cause an unacceptable slowdown.

I cannot say with certainty whether the cases reported in this
discussion are related, but they might be. If they are, one
possible way to overcome most of the slowdown might be to switch
from BT to either HWMMU or HV monitor mode.

[General background. We have three ways of running guests. These
so-called "monitor modes" can be selected in the UI. We generally
refer to them as BT (for binary translation with software MMU),
HV (for hardware virtualization, i.e., VT-x or AMD-V with software
MMU) and HWMMU (i.e., VT-x with EPT or AMD-V with RVI). You can
find more details here:
  http://www.vmware.com/resources/techresources/10036].

So my suggestion/ask is: if you can confirm that the cases where
this problem is seen are using BT mode to run the guest, please
try either HWMMU or HV mode and let us know if things improve?
This, of course, assumes that you are running on a sufficiently
new physical CPU for these modes to be available.

How to determine actual monitor mode on ESX 5:


   grep "MONITOR MODE" vmware.log

You should see lines like this:

2011-11-21T10:14:07.154Z| vmx| MONITOR MODE: allowed modes          : BT HV HWMMU
2011-11-21T10:14:07.154Z| vmx| MONITOR MODE: user requested modes   : BT HV HWMMU
2011-11-21T10:14:07.154Z| vmx| MONITOR MODE: guestOS preferred modes: BT HWMMU HV
2011-11-21T10:14:07.154Z| vmx| MONITOR MODE: filtered list          : BT HWMMU HV

The first mode listed in the "filtered list" is the mode actually used.

You can also see which modes the host is capable of supporting
("allowed modes").
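
For anyone checking many VMs, the log check above can be scripted. The following is a minimal Python sketch, assuming the log lines look exactly like the sample above; the function names `parse_monitor_modes` and `effective_mode` are made up for illustration and are not part of any VMware tool.

```python
import re

# Assumed line layout, taken from the sample above:
#   <timestamp>| vmx| MONITOR MODE: <label> : <space-separated modes>
_LINE = re.compile(r"MONITOR MODE:\s*(?P<label>[^:]+?)\s*:\s*(?P<modes>.*)$")


def parse_monitor_modes(log_text):
    """Return {label: [modes, ...]} for every MONITOR MODE line in log_text."""
    result = {}
    for line in log_text.splitlines():
        match = _LINE.search(line)
        if match:
            result[match.group("label")] = match.group("modes").split()
    return result


def effective_mode(log_text):
    """Return the first entry of the 'filtered list', or None if it is absent."""
    modes = parse_monitor_modes(log_text).get("filtered list", [])
    return modes[0] if modes else None


# The four sample lines quoted above.
sample = """\
2011-11-21T10:14:07.154Z| vmx| MONITOR MODE: allowed modes          : BT HV HWMMU
2011-11-21T10:14:07.154Z| vmx| MONITOR MODE: user requested modes   : BT HV HWMMU
2011-11-21T10:14:07.154Z| vmx| MONITOR MODE: guestOS preferred modes: BT HWMMU HV
2011-11-21T10:14:07.154Z| vmx| MONITOR MODE: filtered list          : BT HWMMU HV
"""

print(effective_mode(sample))  # prints "BT"
```

In practice you would pass in the contents of the VM's vmware.log instead of `sample`.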

Sincerely,
Ole

vMario156
Expert

Hi Ole,

Here is my feedback on your suggestion:

I also had the same issue as the others posting in this thread.

I switched the MMU setting from Automatic to HWMMU and it seems to be working fine. I have now tested this setting for around 1 week with some VMs, and I am going to migrate all our Win2k Server VMs to HW8 and an ESXi 5 host during the next weeks.

After this I can update the last ESXi 4 server to version 5.

So thank you very much for this hint!

Another thing I noticed (which also seems to apply to version 4):

If I run the Win2k VMs with the default "CPU-driver" ACPI-Multiprocessor, the CPU utilization reported by the guest OS and by vCenter/ESXi doesn't match at all.

For example, the VM had a CPU load of 20% but the vCenter stats showed a load of 80%.

I solved this by switching from ACPI-Multiprocessor to Standard-Processor. Even though Microsoft doesn't recommend this, I haven't had any problems. After a reboot the VM booted up normally.

Kind regards and a happy new year to all

Mario

Blog: http://vKnowledge.net

Virtualinfra
Commander

Take a look at this KB as well; it might be helpful.

http://kb.vmware.com/kb/1077

Thanks & Regards Dharshan S VCP 4.0,VTSP 5.0, VCP 5.0

WearsABunchofHa
Contributor

I see you didn't reply for quite some time, so I'm going to assume that was a positive thing.

I reported this case some time ago (exact same scenario) and opened a VMware case. As of yet, everything short of reverting and converting back to HWv7 on ESXi 4 was useless. This completely breaks my quadruple redundancy HA design, and is really annoying.

I'm doing the same to 1 of my test servers (HWMMU from Automatic). I'm going to leave it as HWv7 for now. I will know within 48 hours if the issue is gone. As mentioned, the issue tends to show up after about 30 minutes of usage, but it is hit or miss depending on the number of sessions on the TS.

I will report back ASAP if this resolves for HWv7, and I will reply back after moving to HWv8 as well.

ESXi 5.0 was built from scratch and the VMs were never converted; they were always built from scratch as well.

Thanks so much (to everyone in the forum) if this resolves.

KevinSandbek
Contributor

None of the ideas thus far have worked. VMware has been completely silent on the issue, which is amazing to me.

WearsABunchofHa
Contributor

My case is also in a runaround spiral. Collect logs, collect logs, collect logs. They did get back to me last night, but the request is once again to collect logs (the same ones I have already collected and uploaded).

From the cloud

coronaXPP
Contributor

I can confirm that changing the MMU settings to HWMMU fixes the freezing bug I have experienced.  Thanks.

WearsABunchofHa
Contributor

Thanks for that. I will see how today goes. To clarify, changing the MMU settings solves the problem on esxi 5?

Thanks

From the cloud

coronaXPP
Contributor

Correct.  VMTools and hardware version were also upgraded.  However, when I do Power -> Shutdown Guest, the VM will not power off.  Instead, it will show "It is Now Safe to Turn Off Your Computer".

KevinSandbek
Contributor

@Wearsmanyhats - Those settings didn't work for me. My VM's CPU steadily goes to 100% after more than 2 or 3 users connect to the TS, and it slows to a crawl.

WearsABunchofHa
Contributor

After day #1, not a single complaint about the guest VM. I was able to see at least 6 TS sessions that went from ~09:00 straight through until ~16:00. I even called the most prolific user of the group, who stated it was "just like normal". Normally 3 or more sessions would trigger fatal death.

thanks again everyone.  I will continue to post as I go, hopefully with a full resolution soon.

KevinSandbek
Contributor

So just to clarify, your symptoms were CPU spiking to 100% and everyone slowing down to an unusable crawl, on win2k TS, on vmware esxi 5, hardware 8?

If so, can you reiterate your fix?

Thank you...

WearsABunchofHa
Contributor

I will try to post a complete summary here:

Started day #0 with 2 hosts running ESXi 4, latest update. All was well. I have about a dozen Windows 2000 terminal servers running Citrix MetaFrame XP that host my primary ERP as well as a few other very specific apps. I've been trucking along like that without change for a solid 2 years (aside from VMware updates / patches). I have a few other servers, but they are 2008 R2 for the most part. Every other server OS has been problem free and is not part of this scenario.

Day #1: upgraded 1 host to ESXi 5 and replaced the other host with new hardware, installing ESXi 5 from scratch. My current setup on this cluster is 2 hosts: 1 Dell R610, latest everything, and 1 Dell R710, latest everything. Identical CPUs (to the stepping) and identical RAM. Did the updates throughout the night. Storage is a NetApp FAS2020. Switching is Cisco 3750G. I got the hosts to ESXi 5 and updated the guest VMware Tools (change user /install on the TS boxes) but did not upgrade the hardware version on anything, just in case. There were no infrastructure changes, no other hardware changes, and all host settings were retained identically. The only hardware change was the Dell server replacement with greater horsepower, identical CPU and more RAM.

Next morning, 0800 sharp, phone is ringing off the hook. Primary ERP application is unusable.

I spent the next 4 days of my life doing everything under the sun. I mean everything to the point of even rebuilding the terminal servers. After 4 days of not sleeping, and my end users loading their muskets, I called VMware. The first thing they had me do was obviously update hardware version to 8. That did not resolve my problem at all, and I was then in a real pickle.

I managed to find this thread, and I saw that someone suggested they had resolved the problem by rolling back to esxi4. At this point, I just needed to solve the problem before the rioters broke down my cardboard barrier.

Because I had upgraded to HW8, I had to spend another sleepless night moving the host back to ESXi 4, and then using Converter to bring the servers back to HW7. We then built 4 additional terminal servers so we had a larger pool to play with. (This allowed us to have enough TSs to run a working production system, as well as a complete test system where we could make incremental changes.)

The next day, the phone calls stopped, and the people rejoiced. But I'm still on ESXi 4, and I cannot use automated DRS...

From there it went to gathering logs back and forth with VMware, but never finding much to work with. They did suggest bandwidth / NFS traffic contention, but I'm still a little suspicious, as simply moving everything back to ESXi 4 resolved the problem. That was about 4 weeks ago, so it clearly did resolve the problem.

I'm now on day #2 of testing after changing the monitor mode to HWMMU. That server is operational and has not had a complaint yet. I migrated another of the TSs (without switching monitor mode) onto the ESXi 5 host last night, and within 40 minutes I had phone calls reporting it was unresponsive. So clearly the problem still exists, but it would appear not to be affecting my server with the monitor mode changed.

I'm going to change the monitor mode to HWMMU on a second guest tonight, and upgrade the hardware version on the already repaired server.

The behavior seen on the guests when they are failing is that they become unresponsive. When I can actually get a process monitor up, it's not like the server is actually doing anything. I have a couple of nice screenshots I took where, out of the ~80 processes running, the task manager shows every process running at a minimum of about 3% CPU. This makes no sense. It's not like anything is happening; it's like the server is simply unable to continue processing tasks. It builds and builds and builds, and then slowly becomes unresponsive until it KO's. RAM is not even used. Both VMware and the local Windows task manager eventually show CPU at 100%. At that point, it is usually easier to just power off the box rather than try to log in to safely restart.

I will try and reiterate the fix tomorrow. I will post back.

KevinSandbek
Contributor

Nice post. That is literally identical to my issues, to a T. I did the same exact steps you did from start to finish and have the same exact behavior from the VMs. I tried the HWMMU fix and it didn't fix it, but I'm going to try once more for the heck of it. Thanks so much for this detailed post. I also had a ticket with VMware and they were of no help.

agesen
VMware Employee

Kevin, thanks for giving HWMMU one more try. I expect that it
should work.

But please note that the HWMMU option is only supported on
some CPU types. If your CPU doesn't support it, the setting
will be silently ignored.

To know if the setting takes effect, run the grep command that
I listed in my Nov. 23rd posting in this thread.

Also, we believe that we understand what is causing the
hang and have developed a fix for the problem. We are
tracking this as bug 789483. The fix is in 5.0 update 1.

Meanwhile, the HWMMU setting should completely work
around the problem that we found.
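
Because an unsupported HWMMU setting is ignored silently, it can help to automate the check on the "MONITOR MODE" log lines. The Python sketch below is only an illustration: the function name `hwmmu_status` and the log excerpt are hypothetical, and it assumes the line layout from the earlier sample, with the first "filtered list" entry being the mode in use and "allowed modes" listing what the host supports.

```python
def hwmmu_status(log_text):
    """Classify the HWMMU workaround from MONITOR MODE lines in a vmware.log."""
    allowed, filtered = [], []
    for line in log_text.splitlines():
        if "MONITOR MODE:" not in line:
            continue
        # Text after the marker is assumed to look like "<label> : <modes>".
        label, _, modes = line.split("MONITOR MODE:", 1)[1].partition(":")
        label = label.strip()
        if label == "allowed modes":
            allowed = modes.split()
        elif label == "filtered list":
            filtered = modes.split()

    if not filtered:
        return "unknown: no 'filtered list' line found"
    if filtered[0] == "HWMMU":
        return "HWMMU is in effect"
    if "HWMMU" not in allowed:
        return "host does not allow HWMMU; running " + filtered[0]
    return "HWMMU allowed but not selected; running " + filtered[0]


# Hypothetical excerpt in the same layout as the sample earlier in the thread.
bt_log = (
    "| vmx| MONITOR MODE: allowed modes          : BT HV HWMMU\n"
    "| vmx| MONITOR MODE: filtered list          : BT HWMMU HV\n"
)
print(hwmmu_status(bt_log))  # prints "HWMMU allowed but not selected; running BT"
```

A result other than "HWMMU is in effect" after changing the setting suggests the workaround is not actually active for that VM.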

Thanks,

Ole

WearsABunchofHa
Contributor

Just a quick update.  My end users are out of the office the next 2 days so I wanted to confirm or deny my resolution.

I just said the heck with it, and changed the monitor mode to HWMMU on each of the terminal servers and migrated them all onto the esxi5 host.

I did that around noon today, asked helpdesk to just forward every single ERP related call directly to me and sat back waiting in fear.

I didn't get a single phone call. Everything went to perfection. I even called my star prolific user again, whose comment was "should it be different than usual?"

So, I went ahead and updated my remaining host to ESXi 5, as I'm fairly confident in my own results and in Ole's most recent post.

I still have yet to upgrade the VM hardware version, and I'm not going to go down that rabbit hole again until my users are back (Monday).

Once they return, I will upgrade one guest VM's hardware to 8 and see what happens. If I don't hear anything, then I'm going to do them all, and be done with this.

Thanks again to everyone who contributed to this thread. It really helped.

agesen
VMware Employee

The following KB article provides the official guidance on this issue:

   http://kb.vmware.com/kb/2012205

Thanks,

Ole

gladder
Contributor

Hi guys!

I was experiencing exactly the same issue with my ESXi 5.0.0 hypervisor and my Windows 2000 Server VM using Terminal Services, and just wanted to tell you that updating to ESXi 5.0.0 Update 1 fixed this whole thing! Everything is working well right now. I've tested this with 35 concurrent RDP sessions, every session having Word, Excel & IE open. CPU usage has been very low (0-4%) and the overall performance is great now. (It was freezing with 12 sessions before.) That makes me happy, because the workaround mentioned here and in KB 2012205 did not work for me (I'm using software virtualization 😞 and don't have any hardware virtualization capabilities...)

Unfortunately, automatic startup of VMs after startup or reboot of the ESXi hypervisor is not working anymore. I tried several config resets without success. I can live with that....

Thanks for this discussion, which prompted me to look for the update, and sorry for my rusty English.

Kind regards!!!

Patrick

MarcoKohler
Contributor

Great, thanks for all your support. I installed Update 1 today and it looks really good now.

Best,

Marco

PRSwiss
Contributor

Hi,

I have the same problem with a Windows 2000 TS with only 10 users. Last week I updated to ESXi 5 Update 1 (623860).

I also tried all the CPU settings: Intel VT, MMU, etc.

Unfortunately, the server still has high CPU usage. It's not 100% like before, only between 40% and 60%, but my users cannot work at normal speed.

The only thing I didn't do is migrate the virtual hardware to version 8, because I fear the process of going back to version 7.

So today, for the second time, I moved the terminal server back to a 4.1 host... and the CPU is now at 0% to 15%.

Any ideas would be appreciated.

M. Dang
