VMware Cloud Community
kopper27
Hot Shot

VMs: pretty poor performance, but...

Hi guys, some info first:

  • 6 ESX 3.5u4 hosts - 1 cluster - 1 resource pool, all values at default

  • 32 GB RAM each ESX = 192 GB total

  • 8 CPUs x 2.9 GHz - Intel Xeon X5770 - each ESX

  • View 3.1.1

  • 270 Windows XP SP3 VMs - 1 vCPU - 512 MB RAM - powered on all day long, serving virtual desktops to Wyse thin clients. Software used: Internet Explorer, 2 third-party apps, Word/Excel viewers

  • IBM DS4700 storage - 7 LUNs, about 150 GB available per LUN - connected via Fibre Channel

This issue has been going on for 2 weeks now.

7:00 am - Some people start working - they take calls and use the Wyse thin clients - 30 people - so far so good.

8:00 am - More people log on to the thin clients - 30 more people - so far so good.

9:30 am - Nothing changes, the same number of people are logged on, but VM applications start running slow: clicking the Windows menu takes a long time to show up, Internet Explorer is pretty slow to open, and if they try typing in the third-party app they cannot see the app being updated - basically they cannot continue working.

An interesting point here (and it has nothing to do with VMware View) is that when you reboot a VM that is having the problem, it takes about 10-15 minutes to be up and running again - I mean to show the user/pass prompt again. Not all the machines are rebooted, just 3 or 4, since they are completely useless.

11:00 am - The VMs that were not rebooted are working fine again, and during the day all VMs stay working fine even when there are even more users, since at 1:00 pm more people log on to the systems. Yeah, this is a call center.

Any pointers on what logs I should check, or where to start troubleshooting?

I really don't see any high CPU or memory usage.

Is it OK that all 270 VMs are powered on even when a maximum of only 120 can actually be in use?

Any other idea what could be wrong?

And the most important fact here: every day it's the same time, about 9:28 - 9:32 am? :| Even on weekends, when the number of people is around 20.

CPU - Memory Usage

7909_7909.JPG

7910_7910.JPG

7911_7911.JPG

thanks a lot

1 Solution

Accepted Solutions
Scissor
Virtuoso

I saw that you are running Symantec. Here is what I had to do to solve this problem in my ESX environment:

-Schedule the AV Definition updates to happen after hours on the server.

-Disable the "quick scan" that happens when a client downloads new AV updates: http://service1.symantec.com/SUPPORT/ent-security.nsf/docid/2006032313184248?Open&src=tranus_con_sl&...

-Reduce the number of simultaneous client updates that the AV server performs, as well as throttle the speed at which the AV server sends each update. I think I set mine to 2 threads and a 750 ms delay: http://service1.symantec.com/support/ent-security.nsf/854fa02b4f5013678825731a007d06af/2e726ee5a5d07...

- If you have any scheduled client AV scans, stagger their scheduling so that you don't have multiple clients on the same LUN scanning themselves at the same time.

View solution in original post

20 Replies
depping
Leadership

I would suggest diving into esxtop: %RDY, DAVG, etc. There are various counters that need to be checked, but most likely the two mentioned will give you a hint where the problem lies. You are running roughly 40 VMs per datastore, which could mean that storage-wise you are hitting a bottleneck. DAVG (disk average latency) should reveal that.
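For reference, esxtop can dump these counters to CSV in batch mode (`esxtop -b -d 5 -n 60 > stats.csv`), and a short script can flag the bad samples. A minimal sketch - the column name and values below are made up for illustration, not taken from this environment:

```python
import csv, io

# Hypothetical excerpt of esxtop batch-mode output; real latency columns
# follow a pattern like "\\host\Physical Disk(vmhba1:0:1)\...MilliSec/Command".
SAMPLE = """\
"Time","\\\\esx1\\Physical Disk(vmhba1:0:1)\\Average Guest MilliSec/Command"
"09:28:05","472.31"
"09:33:05","14.02"
"""

def high_latency_samples(csv_text, threshold_ms=20.0):
    """Return (time, column, value) tuples where disk latency exceeds threshold_ms."""
    reader = csv.DictReader(io.StringIO(csv_text))
    hits = []
    for row in reader:
        for col, val in row.items():
            # Only look at per-command latency columns, not the timestamp.
            if "MilliSec/Command" in col and float(val) > threshold_ms:
                hits.append((row["Time"], col, float(val)))
    return hits

print(high_latency_samples(SAMPLE))
```

Anything consistently above ~20 ms is worth investigating; sustained values in the hundreds, as seen later in this thread, point squarely at the array.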

Duncan

VMware Communities User Moderator | VCP | VCDX


If you find this information useful, please award points for "correct" or "helpful".

kopper27
Hot Shot

+I would suggest diving into esxtop: %RDY, DAVG, etc. There are various counters that need to be checked, but most likely the two mentioned will give you a hint where the problem lies. You are running roughly 40 VMs per datastore, which could mean that storage-wise you are hitting a bottleneck. DAVG (disk average latency) should reveal that.+

I assume if those values are high we might have storage problems. BTW, it's about to start again in like an hour - almost 9:30.

Something else - do you think the page file has something to do with these VMs?

I read the article you recommended

http://virtuall.eu/blog/creating-a-vdi-template

and that sets it to 512 MB. These VMs are set to about 2 GB, since they were previously configured to 2 GB - oversized, I found them like that.

thank you

kopper27
Hot Shot

Here are 2 ESXs. I know they are pictures, not real-time.

Scissor
Virtuoso

You must have a scheduled process that runs every morning across your clients. My money is on a scheduled AntiVirus update. I'll bet if you look at your esxtop disk statistics you'll see a lot of disk IO during the time that things are slow.

You can access the disk statistics by running 'esxtop -a' then pressing 'u'.

kopper27
Hot Shot

Just updating after speaking with VMware Support: yeah, they confirmed there might be some issues with the storage.

They recommended that, as a test, I create a new LUN, add 4 or 5 VMs to it, and check the results tomorrow or whenever.

Something else he did: he changed my assigned service console memory from 272 MB to 800 MB.

Let's see. I still have some scripts to run, and I have to log a case with VMware View.

Rumple
Virtuoso

I would agree you are hitting storage limits, as you are averaging about 5 VMs per CPU and about 8 GB/host of usage. While the number of VMs per host is about the max recommended, given that they are single-vCPU VMs, scheduling should be pretty good.

You could ensure you have 800 MB of RAM assigned to the service console and bump up the CPU reservation for the service console a bit.

Overall though, as indicated, 10-15 VMs per LUN is about the recommendation; even 20 wouldn't be bad in this case, I don't think. It really depends on the storage design and how many disks are assigned to each aggregate.

kopper27
Hot Shot

Still diving into this.

Updates: same issue today, just 1 minute of difference.

We rebooted the server today - for the service console change.

I had someone from storage in today for the DS4700, and he found that the storage had only one active path - I mean, only one path was handling all the traffic. OK, the second path was set to active, and after that I assigned the paths like this:

LUN1 - HBA1, LUN2 - HBA2, LUN3 - HBA1, LUN4 - HBA1, LUN5 - HBA1, LUN6 - HBA2, LUN7 - HBA1

One thing I hear from everyone - I mean here at the forums and from VMware Support - is that we are hitting the datastore. Also, from VMware Support:

-The VMkernel logs show a lot of SCSI 0/2 error messages. These messages indicate the SCSI bus is busy due to resource contention.

-The replay of the esxtop performance data you gathered showed abnormally high DAVG values in the 400-500 range. DAVG is the average amount of time in milliseconds one I/O request takes to return to the vmkernel once it has been sent. Under normal operation this should be in the 0-20 range.

-We logged onto one of your Windows VMs and examined the Application event logs. A Symantec AntiVirus scan is being run during the time of your performance issue. If all of your VMs are doing this at the same time when your users log on, this is the cause of your I/O contention.

Thoughts in general:

1. OK, it looks like we are exhausting the datastore. So does that mean more LUNs to move those VMs to? Or any other ideas? Maybe just turning off the VMs not being used.

2. Some others say it's something like backups - antivirus/updates - cron.

Backups: no.

Cron: clean.

Antivirus/updates: could be, why not - but then what explains that rebooting machines at that time takes 10-12 minutes?

Tomorrow: still more testing, and the VMware Support guys reading the big .tgz files.

Rumple
Virtuoso

So what is your storage setup like anyhow?

Do you have one big RAID group with multiple LUNs on it, or do you have multiple RAID groups with one LUN per RAID group? How many disks per RAID group?

Are you using RAID 10 or RAID 5 groups?

Have you confirmed whether you have a true active/active array (because setting LUNs on the "passive" storage processor will cause path thrashing) and what the best pathing model is (fixed vs MRU, or Round Robin)?

mcowger
Immortal

1) Absolutely you need more LUNs - reservation errors mean you have too many VMs/LUN

2) Antivirus - it absolutely could cause this - if you have 100 machines beating the daylights out of the LUN into the 500 ms service-time range, a VM trying to boot from that LUN could very well take 10 minutes to boot.

You've overloaded your storage.






--Matt

VCP, vExpert, Unix Geek | VCDX #52 | blog.cowger.us
kopper27
Hot Shot

So what is your storage setup like anyhow?

DS4700 - attached to an FC switch.

Each ESX server has 2 HBA ports connected to the storage.

*Do you have one big RAID group with multiple LUNs on it, or do you have multiple RAID groups with one LUN per RAID group? How many disks per RAID group?*

There are 7 RAID 5 groups - each RAID 5 group is one LUN, so 7 RAID groups = 7 LUNs.

3 hard drives (300 GB each) per RAID 5 group.

Are you using RAID 10 or RAID 5 groups?

RAID 5.

*Have you confirmed whether you have a true active/active array (because setting LUNs on the "passive" storage processor will cause path thrashing) and what the best pathing model is (fixed vs MRU, or Round Robin)?*

I need to check that to be pretty sure. MRU is being used, but even though initially all ESX servers were using only HBA1, the storage guy said to enable the second one for better performance.

I hope that answers your questions.

kopper27
Hot Shot

What I am going to do tomorrow:

Since all VMs are powered on, they are all doing I/O and consuming CPU and RAM; and since it looks like the datastore is the culprit here,

I am going to turn off at least 20 VMs per LUN.

Does that make any sense? They will still be there, but will be doing nothing.

Rumple
Virtuoso

3 disks in a RAID 5 (that's really only 2 disks' worth of IOPS) is not nearly enough spindles per LUN with 40 VMs on it...

You are only getting something like 300 IOPS or so...

I would also think a whole tray in a RAID 10 might be a better option. That's way more IOPS, especially if you are not doing a whole lot of snapshots and/or backups during the day.

While it doesn't follow the 10-15 VM rule, it would be a whole lot better than 40 VMs sharing 300 IOPS...
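To put numbers on that: with the usual RAID write penalties (RAID5 = 4 back-end IOs per write, mirroring = 2) and a rule-of-thumb ~150 IOPS per 15K spindle, the front-end IOPS work out roughly as follows - a sketch with assumed figures, not measurements from this DS4700:

```python
def effective_iops(spindles, read_fraction, spindle_iops=150, write_penalty=4):
    """Front-end IOPS a RAID group can sustain, given its back-end spindle limit."""
    backend = spindles * spindle_iops
    # Each front-end read costs 1 back-end IO; each write costs `write_penalty` IOs.
    return backend / (read_fraction + (1 - read_fraction) * write_penalty)

# The thread's 3-disk RAID5 LUN, assuming a 50/50 read/write mix:
raid5 = effective_iops(3, read_fraction=0.5)                    # 450 / 2.5 = 180
# Four disks mirrored (RAID10-style, write penalty 2) for comparison:
raid10 = effective_iops(4, read_fraction=0.5, write_penalty=2)  # 600 / 1.5 = 400
print(round(raid5), round(raid10))
```

Either way, a 3-disk RAID5 group serving 40 desktops is deep in contention territory, which matches the 400-500 ms DAVG values Support saw.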

Rumple
Virtuoso

Won't hurt, that's for sure... the less activity the better...

kopper27
Hot Shot

Well, the test went well.

The service console was changed, the storage was balanced (and balanced across the ESXs too), and I powered off the VMs not in use.

What we found today was that VMs not being used - just powered on - had high disk usage at that time.

Also, I am adding the VMware Support summary:

1. From the networking side, Hasan did not find any problems

2. From the storage side, we noticed some contention due to overcommit.

3. From inside the VMs, we noticed some disk paging errors.

4. Also from inside the VMs, we noticed that all of them are set to update their AV definitions at the same time.

Recommendations:

1. Try to lower the number of VMs per LUN.

2. Try setting the AV definition updates to happen after business hours, maybe 3:00 AM, depending on your environment of course.

3. Try changing the AV definition update timing; I suggest grouping the VMs into smaller batches of 10 or so and setting each batch to update at 30-minute intervals:

Batch 1 (vm0..vm10): update at 3:00

Batch 2 (vm11..vm20): update at 3:30
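The batching recipe above is easy to script out. A sketch with hypothetical VM names, using the recommended batch size of 10 and a 30-minute stagger:

```python
from datetime import datetime, timedelta

# Hypothetical VM names standing in for the real inventory.
vms = [f"vm{i}" for i in range(270)]

def stagger_schedule(vms, start="03:00", batch_size=10, interval_min=30):
    """Map each staggered update time to the batch of VMs that updates then."""
    t = datetime.strptime(start, "%H:%M")
    schedule = {}
    for i in range(0, len(vms), batch_size):
        slot = (t + timedelta(minutes=interval_min * (i // batch_size))).strftime("%H:%M")
        schedule[slot] = vms[i:i + batch_size]
    return schedule

sched = stagger_schedule(vms)
print(len(sched), "batches, first at", min(sched))
```

Worth noting: 270 VMs in batches of 10 at 30-minute intervals stretches from 3:00 to 16:00, so in practice you'd want larger batches or a shorter interval to finish before business hours.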

Scissor
Virtuoso

I saw that you are running Symantec. Here is what I had to do to solve this problem in my ESX environment:

-Schedule the AV Definition updates to happen after hours on the server.

-Disable the "quick scan" that happens when a client downloads new AV updates: http://service1.symantec.com/SUPPORT/ent-security.nsf/docid/2006032313184248?Open&src=tranus_con_sl&...

-Reduce the number of simultaneous client updates that the AV server performs, as well as throttle the speed at which the AV server sends each update. I think I set mine to 2 threads and a 750 ms delay: http://service1.symantec.com/support/ent-security.nsf/854fa02b4f5013678825731a007d06af/2e726ee5a5d07...

- If you have any scheduled client AV scans, stagger their scheduling so that you don't have multiple clients on the same LUN scanning themselves at the same time.

kopper27
Hot Shot

Thanks a lot, Scissor.

That was what we were trying to do with my customer, but we did not pay attention until yesterday to any Windows/AV updates, scans, or anything.

They uninstalled Symantec from 1 VM and that one worked fine today - I tested it myself.

BTW, thanks a lot for the links - pretty helpful. That helped me show them that it's not just me saying "uninstall that stuff".

I will assign points tomorrow :smileygrin:

Rumple
Virtuoso

Another thing that can help a lot is making sure the .DEFAULT profile in the registry has screensavers disabled, as well as using a GPO to disable screensavers (or at least switch to a blank one with locking).

Screensavers running with no one logged in, and screensavers like Pipes and other such junk, will cause idle machines to need scheduled CPU time when you don't expect it.

kopper27
Hot Shot

Well, it ended up being a combination of the storage not really being balanced and the AV.

thanks a lot guys :smileygrin:

Erik_Zandboer
Expert

So basically you have 7 * 2 disks = 14 disks that actually perform IOPS in your setup, across 270 VMs.

The other way round: let's say we could still design your storage layout. You have 270 desktops. Let's assume each desktop uses on average 10 IOPS (VMware states 5.6, I think, but we see bigger values most of the time). So you'd need 270 * 10 = 2700 IOPS in your setup. A single 15K spindle delivers about 150 IOPS, so you would need about 2700 / 150 = 18 spindles - at least 18 effective spindles. So you could use 18 15K disks in RAID10, or 5x (4+1) RAID5 = 25 15K disks in RAID5. I do not know the IBM storage well enough, but often (read: with EMCs), RAID5 (4+1) or (8+1) sets are better performers than, for example, your (2+1). This has to do with the way parity is calculated in the controllers.

Now it all comes down to finance and required disk space (not performance from here on). Let's say you use 300 GB 15K drives. The RAID10 would give you 9 * 300 GB = 2.7 TB, and the RAID5 would give you 5 * 4 * 300 GB = 6 TB of storage.
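The sizing arithmetic above, laid out as a quick script (same rule-of-thumb inputs as the post: 10 IOPS per desktop, ~150 IOPS per 15K spindle, 300 GB drives - assumptions, not measurements):

```python
import math

desktops, iops_per_desktop, spindle_iops, drive_gb = 270, 10, 150, 300

required_iops = desktops * iops_per_desktop                # 270 * 10 = 2700
spindles_needed = math.ceil(required_iops / spindle_iops)  # 2700 / 150 = 18

# RAID10 option: 18 disks; half the raw capacity survives mirroring.
raid10_gb = spindles_needed // 2 * drive_gb                # 9 * 300 = 2700 GB

# RAID5 (4+1) option: 5 groups of 5 disks = 25 disks, 20 of them holding data.
raid5_groups = 5
raid5_disks = raid5_groups * 5                             # 25
raid5_gb = raid5_groups * 4 * drive_gb                     # 20 * 300 = 6000 GB

print(required_iops, spindles_needed, raid10_gb, raid5_gb)
```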

I would ALWAYS recommend using RAID10 in a VDI solution. We see more writes than reads in VDI deployments, especially when you use 512 MB of memory inside the vDesktops (swap files inside XP). RAID10 is a way better performer for writes (see Duncan's excellent story on write penalties and more at http://www.yellow-bricks.com/2009/12/23/iops/ ). Another often-forgotten problem with RAID5 is rebuild time, and even more often forgotten is the performance impact of a rebuilding RAID5. Most of the time, VDI performance drops below acceptable levels while a RAID5 set rebuilds (or the rebuild takes days to complete if you lower its priority). RAID10 is very simple and fast in rebuilds: it simply fills the new disk with a copy from its mirror buddy. Little impact, high speed.

Another nice thing about RAID10, as you can see in the example above, is that when you do not need the storage AMOUNT (enter linked clones!), you are actually CHEAPER with RAID10, needing fewer spindles for the required performance.

Gonna spend some blogging time on these items shortly :smileyhappy:




Visit my blog at http://www.vmdamentals.com
