I am running a View 5.0.0 environment with about 160 to 180 users logging in per day. We are basically an 8 am to 5 pm shop (county government). I am seeing a problem with a performance counter on all 10 ESX servers in my VDI farm. About once every 2 or 3 hours, sometimes more often, the Disk (ms) latency counter spikes to anywhere between 100 and 400 ms. The spike lasts only a minute or so and then drops back down to less than 10 ms. It also occurs outside peak VDI hours, including throughout the night, at the same interval. It occurs at exactly the same time across all ESX servers. We are using HP EVAs (Fibre Channel) for our back-end storage.
I am confident that “some” VDI end users are either losing the connection to their VDI desktop, or their desktop basically freezes, or both. The end user can reconnect after a minute or so. End users have reported the exact times they were disconnected, and those times line up exactly with these spikes.
Since all ESX servers are seeing this hiccup, I know it is not VM or ESX specific.
So my question to my fellow VDI admins: are there any VMware advanced settings within the vSphere Client that can address such disk latency spikes? I have a case open with HP as well. I have changed the Disk.DiskMaxIOSize setting from its default of 32,767 KB to 128 KB. This change did not resolve the spikes I described above. Please see the attached screenshots of the performance counter I have described.
So the screenshots show the highest latency counter. What do the more VM-specific counters show, such as read latency or write latency? Do you see specific VMs that are generating a lot of I/O, or is it across the board?
To be completely honest, it seems like some sort of scheduled process kicks off at those intervals on all VMs and potentially saturates your storage. In my experience the easiest thing to do is to identify the scheduled process and then see if you can randomize it, or maybe it's not needed at all.
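If the culprit does turn out to be a scheduled job inside the desktops, one way to randomize it is to give each VM its own start delay. The snippet below is only a rough sketch of that idea: the VM names and the 30-minute window are made up for illustration, and how you feed the delay into the actual scheduler depends on the agent in question.

```python
import hashlib


def stagger_offset_seconds(vm_name: str, window_seconds: int = 1800) -> int:
    """Map a VM name to a stable start delay in [0, window_seconds).

    Hashing the name (instead of calling random) gives every desktop the
    same delay on every run, so the schedule stays predictable per VM
    while the pool as a whole is spread across the window.
    """
    digest = hashlib.sha256(vm_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % window_seconds


if __name__ == "__main__":
    # Hypothetical desktop names; substitute your own pool naming scheme.
    for name in ("VDI-001", "VDI-002", "VDI-003"):
        print(name, "starts", stagger_offset_seconds(name), "seconds after the hour")
```

The point is simply that 180 desktops hitting the EVA over a 30-minute window is a very different load than 180 desktops hitting it in the same minute.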
Thanks for the reply mittim12:
I agree. It appears to be a recurring event or process on the SAN. I just got the latest "burp" again today on the ESX server, right at 10:05 am.
I have now reviewed the VMs' write and read disk latency. I have saved the screenshot as a JPEG and submitted it to this case. I am not sure whether it is relevant. Since I don't have a baseline for these two counters, I am not sure whether the values in the screenshot (18 ms peak read latency) are an issue or not.
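On the baseline question, one rough way to get one is to export the latency samples from the performance chart and compare each sample against the median. The sketch below assumes the samples are already a plain list of millisecond values; the 10x factor and the 50 ms floor are arbitrary starting points, not VMware guidance, and the sample numbers are invented for illustration.

```python
from statistics import median


def find_spikes(samples_ms, factor=10.0, floor_ms=50.0):
    """Return (baseline, spike_indices) for a list of latency samples.

    The baseline is the median, which a handful of short spikes barely
    moves. A sample counts as a spike only if it exceeds both
    factor * baseline and floor_ms, so a quiet datastore does not flag
    every small wobble.
    """
    baseline = median(samples_ms)
    threshold = max(baseline * factor, floor_ms)
    spikes = [i for i, value in enumerate(samples_ms) if value > threshold]
    return baseline, spikes


if __name__ == "__main__":
    # Invented samples: mostly single-digit latency with two large spikes.
    host_samples = [5, 6, 4, 7, 250, 5, 6, 380, 5]
    print(find_spikes(host_samples))
```

By that yardstick a peak of 18 ms against a single-digit median would not be flagged, while the 100 to 400 ms host-level spikes clearly would be.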
Do you have AV running on the VMs? If so, check your definition update logs and see if they correspond with the disk I/O spikes. Same for Persona Management if you're using it (although by default that updates every 10 minutes unless you've changed it).
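Lining up log entries with the spike times by eye gets tedious, so here is a small sketch of the comparison. It assumes you have already pulled both sets of timestamps out of whatever logs you have (AV update log, Persona Management, user reports); the function name, the 2-minute window, and the example times other than the 10:05 spike are placeholders.

```python
from datetime import datetime, timedelta


def events_near_spikes(spike_times, event_times, window=timedelta(minutes=2)):
    """For each spike, list the logged events within +/- window of it."""
    return {
        spike: [event for event in event_times if abs(event - spike) <= window]
        for spike in spike_times
    }


if __name__ == "__main__":
    # The date is arbitrary; only the 10:05 spike comes from this thread.
    spikes = [datetime(2012, 9, 10, 10, 5)]
    av_updates = [datetime(2012, 9, 10, 9, 0), datetime(2012, 9, 10, 10, 4, 30)]
    for spike, hits in events_near_spikes(spikes, av_updates).items():
        print(spike, "->", hits)
```

If the same agent shows up next to every spike, that is your suspect. If nothing inside the desktops lines up, that points back toward the SAN side.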
Got any sort of replication or snapshotting happening on the SAN at those intervals?
The most likely cause is some sort of agent or scanner that runs at these intervals on all of these desktops. It takes about a minute for all VMs to drop back down, so the process may need only a short time on a physical system. Because all of the VMs share storage and run the operation at the same time, though, it could take much longer to finish here.
Can you get a list of all agents that are running inside these desktop VMs?
I used Stratusphere from Liquidware Labs for this kind of problem.
It captures the performance counters of all your VMs and compiles them so you can generate reports and see exactly which programs on your VDI desktops are causing this.
And I guess you have followed VMware's best practices (disabling Windows Search, disk defrag and the 1xxx Microsoft scheduled tasks...).
does it always occur at the same time of the day, or is it sporadic? i guess it is happening to all the virtual desktops? since you get about a minute of this latency spike, i'd recommend running esxtop and checking the disk stats. i.e., run esxtop, type 'v' to show stats for all VMs. then check the stats columns for all the VMs to see whether the I/O hitting the storage is read or write, and whether it affects every single VM on the esx box. for more info on esxtop usage, check out the blog posts at http://www.supersonicdog.com/2012/09/05/esxtop5min/ or http://www.yellow-bricks.com/esxtop/
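since the spike only lasts a minute, it can be easier to leave esxtop running in batch mode across a spike (for example esxtop -b -d 5 -n 120 > esxtop.csv) and look at the CSV afterwards. below is a rough sketch for ranking columns by their peak value. the substring you search for is an assumption on my part: open the header row of your own file and pick the latency counter names it actually contains.

```python
import csv
import io


def peak_by_column(csv_text, name_contains):
    """Return (column_name, peak_value) pairs, highest peak first.

    Only columns whose header contains name_contains are considered.
    Cells that are missing or not numeric are skipped.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    wanted = [i for i, name in enumerate(header) if name_contains in name]
    peaks = {header[i]: 0.0 for i in wanted}
    for row in reader:
        for i in wanted:
            try:
                value = float(row[i])
            except (ValueError, IndexError):
                continue
            peaks[header[i]] = max(peaks[header[i]], value)
    return sorted(peaks.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # "MilliSec/Command" is a guess at the header text; adjust as needed.
    with open("esxtop.csv", newline="") as handle:
        for name, peak in peak_by_column(handle.read(), "MilliSec/Command")[:10]:
            print(f"{peak:10.1f}  {name}")
```

if one or two VMs or one LUN sit at the top of the list, start there. if everything peaks together, it is more likely something on the array.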
What kind of storage are you using? what type of IO/latency/CPU usage is being reported on the storage side?