Hi all,
We are experiencing about 30 minutes of slowness on our ESX cluster of 8 HP blades once a day, at the same time; they are all SAN-attached. The problem was first noticed a few weeks ago and, now that we know about it, is actually fairly evident. The slowness occurs roughly between 8:45am and 9:25am every day. Does anybody know of anything that could cause the whole cluster to slow down periodically? We have ruled out AV software scans, and the Storage Team claim that nothing is being done to the SAN at this time.
Is there any way of getting some good stats out of the ESX hosts, or any cluster-specific stats, that might indicate which processes or resource bottlenecks may be occurring?
This is only happening on our TEST cluster and not our LIVE cluster, but the testers and developers are getting fairly annoyed for half an hour each morning - any help would be greatly appreciated.
Thank you,
Jericho
This definitely sounds storage related, even though your SAN guys say nothing is going on.
Around the time that you expect the slowness to happen, open up VirtualCenter, click on a host, go to the Performance tab, and Change Chart Options. Expand Disk and click on Real-time. Choose a Chart Type of Stacked Graph (per VM). Click the All button under Objects to select all VMs and the server itself. In the Counters area, make sure only Disk Usage (Average/Rate) is checked. Click OK and watch the graph.
You are looking for all the VMs to start an upward trend in disk KBps used at around the same time. Get on some of the VMs and look in the event log to see what's going on. In our case, an A/V update was performed and a small scan of the A/V files was taking place.
If the VM disk usage graph for all VMs on a particular host doesn't show anything different, get on the console of that host and run esxtop -d 2 (a delay of 2 secs, the smallest possible). Then hit the 'd' key to get to the disk view. Maximize your console screen to see all the columns. Look for high values in MBREAD/s or MBWRTN/s, and especially LOAD and %USD as well as QUED. If you have a high QUED, then you need to increase the queue depth on your HBAs. If you have high LOAD and CMD/s then your VMs are doing a lot of disk activity, along with high MB READ/WRTN/s. Watch this before the event happens to get a handle on what normal I/O looks like.
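If QUED does turn out to be high, the queue depth is set as a driver module option. As a rough sketch for a QLogic HBA on ESX 3.x (the module name and option here are assumptions - they differ for Emulex and vary by driver version, so confirm against VMware's KB and your loaded modules before changing anything):

# list loaded modules to confirm the HBA driver name
vmkload_mod -l
# example only: set the QLogic queue depth to 64, then rebuild the boot config
esxcfg-module -s ql2xmaxqdepth=64 qla2300_707
esxcfg-boot -b
# a host reboot is required for the new queue depth to take effect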
If esxtop shows nothing out of the ordinary during the slowdown, go back to the SAN guys and ask them to check SAN fabric port performance, errors, health, etc.
In another situation that we had, we didn't see high disk I/O loads in either esxtop or the performance graphs. We did eventually dig into our SAN performance tools and found that our SAN replication was having a hard time keeping up during the time frame when our event happened daily. We were able to trace it to a faulty SFP which functioned fine at loads under 5 MB/s, but over that it started dropping frames, the SAN synchronization would get backed up, and everything would slow down for about an hour until it caught up and traffic dropped on the one SFP.
During this particular event, we noticed poor performance on our other SAN-connected servers, and running sar -d on our HP-UX servers, for example, showed incredibly high latencies while waiting for blocks to be written. We didn't notice it before because VMware is more sensitive to disk performance than the HP-UX servers and their applications.
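For reference, the HP-UX check was nothing fancier than something like:

# sample disk activity every 5 seconds, 60 times (about 5 minutes)
sar -d 5 60
# watch the avwait and avserv columns - both climb sharply when the array
# or fabric is struggling to service I/O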
I think you definitely need to take a closer look at the SAN infrastructure. We stressed about this for several weeks, and it's exactly the same situation you describe: sluggish or unresponsive VMs daily, around the same time.
Hi Jericho,
A bit more information on your environment would help us get an idea of what is going on, e.g. VMware version, how many hosts, configuration, third-party software?
If it were me, I would proceed as follows.
1. Review performance counters through VirtualCenter (CPU, memory, I/O, network).
See if you can pinpoint a specific VM that is causing issues.
2. If you have backup schedules, check their timing/activity.
3. Check I/O activity on the SAN (physically look, or use the GUI management tools) during the 30 minutes: lights flashing, high I/O? Check whether it is a specific LUN.
4. Start turning off Test/Dev VMs before 8:45 to rule them out one by one.
5. Check your ESX host patch levels and your VMware Tools versions (see the commands sketched below).
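For step 5, a quick check from the service console could look like this (ESX 3.x assumed; the exact esxupdate syntax varies a little between releases):

vmware -v          # reports the ESX version and build number
esxupdate query    # lists the patches installed on the host

The VMware Tools status for each guest is easiest to read off the Virtual Machines tab in the VI Client.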
Hope this helps.
best regards
Bernie.
It may also be useful to look at the performance from a VM perspective. For a Windows VM look for the obvious things such as memory, CPU, etc., but also look at I/O response times and % disk idle. Low % disk idle could indicate performance issues on your SAN. Does this affect all VMs? If so, are they all on the same LUN and RAID group?
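If you want to log those counters from inside a Windows guest rather than watching Perfmon live, something like this should work on Windows 2003/XP guests (the counter names assumed here are the standard PhysicalDisk set):

typeperf "\PhysicalDisk(_Total)\% Idle Time" "\PhysicalDisk(_Total)\Avg. Disk sec/Read" "\PhysicalDisk(_Total)\Avg. Disk sec/Write" -si 5 -sc 600 -o guestdisk.csv

That samples every 5 seconds for roughly 50 minutes and writes a CSV you can line up against the 8:45-9:25 window.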
Hi Berniebgf,
The whole cluster is 8 ESX hosts (BL465c with 16GB RAM each) running ESX 3.0.2, with about 130 guest OSes. DRS and HA are on. The vast majority of the guests are not being used very much at all, and from the analysis of each of the ESX hosts they have plenty of available RAM and processing power.
All VCB Backups / File level backups are performed between 6pm and 6am. The Storage Team have claimed that there is no unusual SAN activity during these hours.
Quite a few of the VMware Tools versions show as out of date - I will upgrade all of these and double-check all the ESX patch levels, although I think 3.0.2 pretty much brought them up to speed. I will repost after this has been done.
Thanks for your help so far.
Hello again.
I guess we missed one of the main points: who is complaining about performance, and from what perspective?
What we know:
You have told us that your developers are complaining, so we presume they are complaining about their VM performance?
But is the slowness they are complaining about in reference to the VirtualCenter management of their VMs, or the VM performance itself?
So to get an exact idea of the issue, we should know exactly who is complaining and what the specifics are:
1. Who? (persons)
2. What? (Which VMs? The specific VMs they manage - map it out.)
3. Info? (How are they seeing this slowness?
E.g. in managing the VM through VirtualCenter?
In managing it through Terminal Services?
In the general performance of the machine?
In DB testing?)
If we know more of the above we may be able to pinpoint the issue a bit further.
best regards
Bernard
The actual machine is responding very slowly, and if a VM is restarted it takes a long time to come back up (6 minutes to shut down, 4 minutes for the job to queue in VC, 20 minutes to come back up). Very slow! After 9:30am all machines respond perfectly fine, so it is definitely something in the background each day. Does anybody know of any good Linux commands that might be able to identify new processes starting?
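In the meantime I'm planning to try something crude like this on the service console (an untested sketch) - snapshot the process list every 30 seconds over the slow window and diff the files afterwards to see what appears around 8:45:

# take a timestamped process snapshot every 30 seconds
while true; do
    ps -eo pid,lstart,etime,pcpu,args > /tmp/ps.$(date +%H%M%S)
    sleep 30
done
# then diff consecutive snapshots to spot new processes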
I have also had one report that VC was slow, but I have never witnessed this myself. VC is running as a VM on our production cluster, which runs fine. I have logged a call with the storage guys to do some more in-depth analysis of the I/O for the LUNs presented to these hosts. I think I need to gather some more info and come back in a day or two.
Just a thought - is there any kind of reporting process that might run and collate statistics once a day?
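I'll also check whether anything is scheduled on the hosts themselves around that time, along the lines of:

# list scheduled jobs on the service console
cat /etc/crontab
ls /etc/cron.d /etc/cron.hourly /etc/cron.daily
crontab -l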
You could use esxtop and parse its output to a CSV file.
Credit to http://www.xtravirt.com
Description: Using esxtop command to provide performance reports
If you have a need to generate performance statistics from your ESX servers you have the option of running the ESX Service Console command esxtop in batch mode.
You can use esxtop to log performance information in batch mode to a .csv file which can then be analysed using a spreadsheet application or Windows Perfmon.
For example:
esxtop -b -d 2 -n 100 > perfstats.csv
Direct Link: http://www.xtravirt.com/index.php?option=com_content&task=view&id=109&Itemid=65
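Since the slowdown hits at a predictable time, you could also schedule the capture rather than babysitting the console, e.g. with a system crontab entry like the one below (the path and sample count are only examples, and note that cron needs the % signs escaped):

# /etc/crontab - start at 08:40, 1500 samples at 2 seconds = roughly 50 minutes
40 8 * * * root /usr/bin/esxtop -b -d 2 -n 1500 > /tmp/perf-$(date +\%Y\%m\%d).csv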
regards
Bernard
You might also want to look at the network for anything weird happening around that time. ESX puts its NICs in promiscuous mode, so if you have a ton of traffic, you might incur additional cost processing it.
chouse,
You seem to have solved my problem. I had another look at the AV and there was a virus definition update set to run at 9am. Although the update itself was not much of a performance hit on the disks, it initiated a small scan of the registry and of the files currently loaded in memory. We had two options: disable the post-update scan, or move the time of the scan. I moved it to 7:30am, and the peaks we experienced then appeared between 7:30am and 8am, which is acceptable.
The update schedule is actually set per update server rather than per AV server group, so this must have been happening on our prod servers as well. In fact we did see a spike on the performance charts in prod, but there was no noticeable difference there, hence our investigating a problem with the test cluster SAN. After talking to the SAN guys, the difference is probably not in the disk speeds but in the cache on the separate disk groups - our EMC Symmetrix in LIVE has 64GB of cache and the CLARiiON has 6GB (or something similar).
Thanks to everybody for your help. I have only been monitoring for one day but I think this has solved it.
Regards,
Jericho (and some happy developers)
That's great to hear!
We ended up moving our VDI infrastructure off our HP EVA arrays (which had only 8GB of cache) to HP XP arrays with something like 64GB of write cache, and things have been great even with updates and patches applied to large numbers of VMs at the same time.
It's interesting to hear it was happening to your server VMs as well, but obviously less noticeably, because normally people aren't remoted into those all day like they would be for virtual desktops. The VDI users are always the first to know when there's a problem, even before my status checks can notify me!