thierry_itty
Contributor

VMs have slack times, almost freeze

hello

we have a serious problem with several VMs. they suddenly get very slow, almost freezing, for a few seconds to a few minutes, then they wake up again and work fine.

VIC performance cpu graphs show "inverted peaks" at those times (imagine a sharp V going down abruptly to 0)

configuration info :

\- ibm blade center H with 12 HS20 blade servers, each with 2 Xeon 3.2 GHz cpus, 5 GB ram, fibre channel

\- vmware esx server 3.0.0

\- guests windows nt4, 2003 and linux red hat 9, fc3, fc4 and es4

\- 2 cisco giga ethernet switches connected to cisco giga eth backbone

\- 2 mcdata 4 gbps fc switches connected to mcdata/hitachi full 4 gbps san

additional notes :

\- the linux vm's seem much more affected than the windows ones

\- when the problem occurs several times in a row, it is generally at 60 minute intervals (say for example at 3:21, then 4:21, then 5:21, then it stops)

\- the problem may affect only one vm on a server, or several, or all of them

\- we found absolutely no useful information in the vm logs (/var/log/messages and others) nor in the vmware logs (hostd-xx and so on)

\- there's no sign of any network problem (cacti monitoring on all the ports of the blade switches and their backbone uplink ports)

\- there's no sign of any san problem (no monitoring evidence, though...)

\- there were vm clock sync problems but they are solved (vmware tools checked and vmx configuration files checked)

\- the problem is not related to the vm's load nor to the hosting esx's load; if anything, it is more likely to occur under average or low load than under heavy load

\- when the problem happens, everything is slow, even typing a command

\- we've been told that the V shapes in the VIC graphs were just a gap in monitoring data rather than a problem per se, but obviously if the agent on the esx or the vm is not responding, the VI server won't get any data at that time...

well, it's like the vm takes a nap without caring for users (to say nothing of system admins...)

we have 2 cases open with support, but no improvement yet.

we'd really appreciate any advice

tia

thierry

9 Replies
Texiwill
Leadership

Hello,

Is it possible to move all but 1 VM off a single system? If so, does the problem still occur for that one VM? What is running in that VM? Specifically, any agents?

What CPUs are in the HS20s?

Have you verified that the time in the VMs is actually correct?

Best regards,

Edward

--
Edward L. Haletky
vExpert XIV: 2009-2022,
VMTN Community Moderator
vSphere Upgrade Saga: https://www.astroarch.com/blogs
GitHub Repo: https://github.com/Texiwill
Jae_Ellers
Virtuoso

Are you monitoring %ready in esxtop? What are the numbers?

-=-=-=-=-=-=-=-=-=-=-=-=-=-=- http://blog.mr-vm.com http://www.vmprofessional.com -=-=-=-=-=-=-=-=-=-=-=-=-=-=-
admin
Immortal

What sort of RAID sets are you running these VMs on? Are you performing any file system operations on the same VMFS volume during times of VM "freeze"?

I've noticed on our IBM DS4300 SAN that if I have, say, a pair of 500GB SATA drives in RAID 1, all VMs on that volume are badly affected when I do something like clone a VM to that VMFS volume. There seems to be just too much disk contention for it to cope with intensive disk operations (like cloning VMDKs) without bringing the other VMs to their knees. Stop the clone and all the VMs start working again.

I know it must be the SATA drives (and the fact they're only RAID 1) because I can clone VMDKs to a 7-disk FC RAID 5 volume without (noticeably) affecting any of the VMs running on that volume.

Jae_Ellers
Virtuoso

We ran into this with iSCSI on NetApp. We told mgmt that we needed FC disks to run the vms on, but were told to use what we had, which was 500G SATA. After bumping the vm count up to 35 or so on 2 luns, plus the existing overhead of our ClearCase vobs and NFS traffic, the filer simply couldn't get data off the SATA disks fast enough to fulfill the requests, and we started to see a lot of retry errors.

We ended up moving the vms off to EVA SAN on FC disks and are now running 80+ vms on 6 or so luns with no issues.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=- http://blog.mr-vm.com http://www.vmprofessional.com -=-=-=-=-=-=-=-=-=-=-=-=-=-=-
thierry_itty
Contributor

Hello

in answer to Edward :

\- we could consider moving vms so as to keep only one per esx, but this isn't that easy; see the san config details further on

\- all the blades have 2 Xeon 3.2 GHz and 5 GB ram

\- we checked that time was correct

in answer to Jae :

uh, no, we're not monitoring esxtop/%rdy. i should rtfm to get an idea of the "good" values we should see (btw, any link to the right doc ?)

in answer to mitell :

\- no blade has internal disks. they are all connected full 4gbps to our mcdata/hitachi san (ams500) with 2 FC and 4 SATA raid groups (each raid 5 with 6+1 disks)

\- each esx has its own luns (system + vmfs)

\- most of the guests ("native" os installation) have their own luns (system + data)

\- some guests (moved through P2V from older physical machines) have only virtual disks, on the hosting esx dedicated vmfs lun

we experienced san overload during certain lun coupling operations, which led to (very) bad performance, but that problem has been addressed, and anyway the symptoms are not the same. moreover, this could absolutely not explain the following :

for example, yesterday on one esx with four vms (say A, B, C and D), we had a slack instant at 12:25 for A, B, C and D, then one at 13:25 for A, B, C and D, then one at 14:25 for A, B and D, and a last one at 15:25 for D only. A is p2v nt4, B is p2v fc3, C is p2v nt4, D is p2v rh9; all 4 share the same vmfs

this morning the same happened on another esx running 4 native rhel es4 guests, each one with its own luns (3 slack instants, at 8:55, 9:55 and 10:55)

I'm still searching

thanks

Jae_Ellers
Virtuoso

Dunno where %ready is documented these days, but it's usually a good idea to keep it below 5%, using shares and/or resource pools. Some systems will be more demanding and can't meet this, but at least you'll be able to identify the cpu-starved systems.
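For what it's worth, esxtop can also be captured in batch mode (something like `esxtop -b -d 5 -n 60 > stats.csv`) and analysed offline. Below is a rough Python sketch of such an analysis; the perfmon-style column naming (`\\host\Group Cpu(id:vm)\% Ready`) is an assumption on my part, so adjust the match to whatever header your esxtop version actually writes:

```python
import csv
import io

def high_ready_vms(csv_text, threshold=5.0):
    """Scan esxtop batch-mode CSV output and return the average value of
    every '% Ready' column whose average exceeds the threshold
    (5% being the usual rule of thumb)."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Pick out the columns that carry a '% Ready' counter.
    ready_cols = {i: name for i, name in enumerate(header) if "% Ready" in name}
    sums = {i: 0.0 for i in ready_cols}
    rows = 0
    for row in reader:
        for i in ready_cols:
            sums[i] += float(row[i])
        rows += 1
    return {ready_cols[i]: sums[i] / rows
            for i in ready_cols if rows and sums[i] / rows > threshold}

# Tiny synthetic sample in the assumed shape of a batch-mode capture.
sample = (
    '"Time","\\\\esx1\\Group Cpu(100:vmA)\\% Ready","\\\\esx1\\Group Cpu(101:vmB)\\% Ready"\n'
    '"12:25:00","1.2","12.5"\n'
    '"12:25:05","0.8","17.3"\n'
)
print(high_ready_vms(sample))  # only vmB's column averages above 5% and is flagged
```

On a real capture you would feed it the whole CSV file; any vm that shows up in the result is spending a suspicious share of its time waiting for a physical cpu.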

-=-=-=-=-=-=-=-=-=-=-=-=-=-=- http://blog.mr-vm.com http://www.vmprofessional.com -=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Texiwill
Leadership

Hello,

So outside of the P2V'd VMs, all VMs use raw LUNs (Raw Disk Maps)? It really sounds like you are in an iowait state, i.e. waiting on IO for something, perhaps the SAN. Are you seeing any SCSI errors in the logfiles?

Best regards,

Edward

--
Edward L. Haletky
vExpert XIV: 2009-2022,
VMTN Community Moderator
vSphere Upgrade Saga: https://www.astroarch.com/blogs
GitHub Repo: https://github.com/Texiwill
thierry_itty
Contributor

Hello all

I managed to find a reproducible problem with clear evidence :

\- ssh to a linux vm and start top with a 5 sec interval (top -d 5)

\- ssh to the hosting esx and start esxtop (5 sec default interval)

\- ssh to the same linux vm (2nd session) and start a command with high cpu and io load (say "find / -type f -exec grep something \{} \;") (remember all of the vm storage is on the san)

then things unfold in three steps :

\- a) vm top shows cpu load soaring up to 100%, esxtop shows the same for the vm, everything looks correct for a few seconds, then

\- b) vm top still shows 100% cpu load, but esxtop shows the vm's cpu falling to a few percent; vm top now refreshes only once or twice a minute, while esxtop refreshes normally, until

\- c) when I cancel the command (Ctrl-C in the 2nd ssh session), vm top shows cpu falling back to a few percent (normal activity) and refreshes every 5 seconds as it did during step a; esxtop keeps showing the same; everything looks correct again, except the vm clock, which has lost some minutes (depending on how long the vm stayed in state "b")

this has been tested on several blades, several esx, several vms, all with more or less the same results.
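To put a number on the suspected iowait from inside a guest during step b, /proc/stat can be sampled before and after the load. A rough sketch (the iowait field only exists on 2.6 kernels, so this would apply to the es4 guests but not to rh9's 2.4 kernel; the script falls back to 0 when the field is absent):

```python
import time

def cpu_times():
    """Return the aggregate cpu counters from the first line of /proc/stat
    as a list of ints. Field order on 2.6+ kernels:
    user, nice, system, idle, iowait, irq, softirq, ..."""
    with open("/proc/stat") as f:
        fields = f.readline().split()
    return [int(v) for v in fields[1:]]

def iowait_percent(before, after):
    """Share of elapsed cpu time spent in iowait between two samples."""
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    if total == 0 or len(deltas) < 5:
        return 0.0  # pre-2.6 kernel: no iowait field reported
    return 100.0 * deltas[4] / total

if __name__ == "__main__":
    a = cpu_times()
    time.sleep(5)  # run the find/grep load in the other session meanwhile
    b = cpu_times()
    print("iowait over the interval: %.1f%%" % iowait_percent(a, b))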

if I repeat the experiment somewhat later, the "a" state lasts quite a bit longer (say a whole minute or so) before entering the "b" state. this is presumably because the command first processes the data cached by the previous run; once it is done with the cached data, the vm enters the "b" state again.

so I have to agree that we have an IO wait problem, it can't be anything else. but why does the vm clock drift so hard ?
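On the clock: this matches VMware's documented timekeeping behaviour. The guest clock is driven by virtual timer interrupts, and a vm starved of cpu/io can't have them delivered fast enough, so it falls behind; VMware Tools' periodic sync then catches it up afterwards (it only corrects a clock that is behind, never one that is ahead). The era-appropriate knobs were roughly the following; treat them as pointers to verify against VMware's timekeeping documentation for your exact versions, not as confirmed settings:

```
# in the vm's .vmx file: let VMware Tools correct a guest clock that falls behind
tools.syncTime = "TRUE"

# on the linux guest's kernel boot line (grub/lilo), an option VMware
# recommended at the time to reduce timer-interrupt loss on 2.4/2.6 kernels:
clock=pit
```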

how do I go further and see whether the problem is on the vmware side or on the san side ?

TIA,

fontyyy
Contributor

I've noticed that VM's in resource pools with quite stringent restrictions tend to stop and start a lot rather than just run slowly, especially when given large workloads. This seems even more apparent when the VM is on some kind of networked storage rather than a disk local to the host. When removed from the resource pool, everything goes back to normal.

I've also noticed that if you just ping a VM exhibiting these symptoms, the reply times are all over the place; 300+ ms is normal. Once again, when removed from the resource pool, it drops right back to <1ms.
