VM becomes unresponsive, some vCPUs are 100% utili...

edwinstarkey · ‎11-28-2011

Hi all,

I have a cluster of about 10 hosts and 150 VMs. There are times when a Linux VM becomes unresponsive. The vCPU usage skyrockets and flat-lines at 100% until the VM is reset. The guest is not pingable and the console doesn't display anything nor respond to keyboard input. I've waited to see if the problem goes away on it's own (up to 4 hours), but it does not - a hard reset of the VM is the only solution.

It happens to various VMs, but most frequently to those running 32-bit Debian 6.0. The problem is not isolated to one VM, but there is a set of four VMs that experience the problem more often than others.

Some of the VMs have 8 vCPUs, but some only have 1 or 2. VMware support keeps saying the number of vCPUs should be reduced to 1 with more added as needed, but we've gone through that already - our VMs have the number of vCPUs that they need. I looked closely at the graphs of the 8 vCPU guests while the problem is happening. Sometimes there were 4 vCPUs pegged and 4 were idle, sometimes 5 were pegged, etc. There doesn't appear to be a pattern.

Our environment consists of Dell PowerEdge servers (1950, R810, R900) connected to an EMC Clariion SAN using 4Gb FC. All hosts are running ESXi, either 4.1 or 5.0. VMs are stored on various LUNs on the SAN using a range of RAID levels (5, 6, 10). At one point I thought the SAN utilization might be the source of the problem, so I setup a dedicated RAID-10 RAID group with a single LUN which contained only one VM. That VM was still affected. Also, in terms of SAN controller utilization, it's usually around 15% and only peaks at 30%.

All of the VMs use the vmxnet3 vNIC. On one occasion, after resetting the VM, I looked at the kernel log and it displayed some vmxnet3 messages which led me to believe the problem may be caused by http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=200571.... I moved some VMs to the host running ESXi v5.0 and the problem still occurred, so that theory was debunked. Also, this problem has happened quite a few times but those vmxnet3 kernel messages only showed up on one occasion.

All VMs are running VMware tools. Some have the open-vm-tools Debian package, some use the official VMware version. I enabled verbose logging in VMware Tools but that didn't return any helpful info.

There is no gradual increase in CPU/memory/utilization of any kind which would indicate an over-worked VM. It will be humming along just fine and then out of the blue the CPU will immediately spike and flatline at 100%. The guests are monitored using cacti and sar. Neither of these tools report any utilization increase leading up to the event. Also, there is no pattern regarding when the problem happens. It could be in the middle of the workday when load is high or in the middle of the night when load is practically nil.

I've experienced this problem off and on since I started using VMware several years ago. I'm just bringing it up now because it seems to be happening more often lately.

Since the problem cannot be reproduced on-demand or at least at regular intervals, it's difficult to determine what's causing it.

Here are the facts:
- It happens on various VMs running different applications, some are affected more than others.
- The guest OS is always Linux. It's happened on Debian 5.0, Debian 6.0, and Gentoo. It may affect Windows guests but we have very few of those.
- It happens on various hosts with differing hardware configurations.
- It happens on ESXi 4.1 and 5.0.
- It happens on when the VM is stored on a Clariion SAN, regardless of which RAID type is used (5, 6, 10).
- It happens at random times, regardless of the guest utilization level.
- The guest doesn't "know" what happens. There are no log entries indicating anything before or after the event (except in one isolated vmxnet3 case which I explained).
- It has been happening more often lately. It's hard to provide exact numbers, but it seems like it used to happen maybe once every 3 months. Now it happens about every other week.
- There haven't been any significant changes to the hardware or software infrastructure which correlate with this problem.

Now, based on all of that information, logic tells me that something is wrong with VMware. It's the only factor I can find which is common in all situations. I opened a case and have been frustrated with the help VMware has provided.

1. I called while the problem was happening. A tech logged into the host and viewed CPU stats with esxtop. Then he put me on hold. I assumed he was looking something up in the KB or asking an associate for advice. 5 minutes later he came back and said Debian 6.0 is only supported on hosts running ESXi 5.0 (it was running on a 4.1 host). He would provide no further support whatsoever. Alright, I understand that, it's not in the support matrix. But spending 10 minutes looking at the problem would have been appreciated.
2. I upgraded one of the hosts to 5.0 and experienced the same problem a few days later. I called support again and mentioned the fact that it is now running in a supported configuration. He ran esxtop and noted the high vCPU usage. He stated there wasn't any more he could do, it must be a problem with the application running inside the guest. I specifically asked if he could run strace to see what the runaway processes were doing, he said no. I also asked if the VM could be forcibly crashed (coredump) to examine the aftermath. He said no and insisted the problem lies with some application running in the guest. His recommendations were to reduce the number of vCPUs, and/or schedule regular reboots of the guests. I had a good chuckle over the latter, then became dismayed when I realized he was being dead serious.
3. I escalated the case to a supervisor who repeated the technicians statements - there's nothing more VMware can do, something is wrong on our side. I explained that not all of the affected VMs run the same application - and there is no pattern.

Honestly that just doesn't make sense to me, the evidence points to a bug or some other problem in VMware.

Anyway, I'm an open-minded person and accept the fact that it may indeed be a problem with our internal application. How can I troubleshoot this further to find the source of this problem?

I do have a 'test subject' available for poking and prodding. While the problem was affecting one of the VMs, I suspended it, copied the entire VM directory to another volume, then resumed the original VM. CPU usage spiked right back up to where it used to be so I reset the guest. But I do have the copy of the suspended VM which may be resumed at any time to reproduce the problem.

Any help would be greatly appreciated.

PUREJOY · ‎11-29-2011

Hi

Understand you are not happy with the way things are going (or not going) for you. I have no affilication with VMware, that said, what you are seeing may be bug in the linux kernel on the distro you are running.

Just to make sure you are not hitting known bugs, can you take a "problem VM" and upgrade it to the latest patch level and re-run the test. that would eliminate some suspicion.

If that doesn't work, when the problem occures, You can force a panic on the guest linux (by doing echo c > /proc/sysrq-trigger)

and send the core to the linux community to analyze. Make sure you have coredump configured on the guest linux vm before you do that.

You could then do a dump analysis or look at the panic strings in the web to see if there are known problems.

CPU spikes of this kind is due to some process holding a lock for excessive amount of time, or a busy spin loop which

will starve the machine (and burn the cpu 100%).

There is nothing one could do if the guest is not behaving correctly. I have seen this sort of behavior when i was

writing linux drivers and due to bugs in the code, the test VM would get hosed.

Let us know what you find out,

Architect @ Pure Storage || www.purestorage.com || http://www.purestorage.com/blog/ || http://twitter.com/#!/purestorage ||@ravivenk || VCAP-DCA5, VCP 4, VCP 5

edwinstarkey · ‎11-29-2011

PUREJOY,

Regarding upgrading the kernel to the latest patch level, I'm not sure what you mean by that. Download the latest source from kernel.org and compile it by hand? I'd rather not do that. Right now the VM is running the very latest packaged kernel from Debian squeeze. Even if I did compile the latest kernel by hand and the problem never happened again, I wouldn't consider that a solution since I'd have no idea what the source of the problem was. Also, if there were any other future un-related issues, VMware woudln't offer any support due to the custom kernel.

As for inducing a core dump when the problem occurs (echo c > /proc/sysrq-trigger), that isn't possible. I can't SSH in since the VM is unreachable over the network, and I can't console in from the vSphere client since it doesn't respond there either.

Like I said, I'd be glad to examine further from within the guest, but I'm not sure how. Right before the problem happens, the guest thinks everything is fine. Then the problem crops up and CPU skyrockets. After forcibly resetting it, the VM boots fine and does not report any details on what happened.

It happened again last night and I collected diagnostic information as described here: http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&docTypeID=DT_KB_1_1&e... Unfortunately I don't know how to interpret this data. The docs say the data should be sent to VMware support for examination, but I tried that already, they didn't help. That's why I'm posting this discussion - to see if others know of ways to root out the cause of this problem.

PUREJOY · ‎11-30-2011

I meant to say, if there are any known patches to the version of the OS you are running you would have to apply that before you start the tests. Common practice in the windows and RHEL world (that I am familiar with). Most distro has a patch upgrade available. I didn't mean to convey to build the kernel, that would be quite cumbersome and not practical.

There is a way to panic the guest from ESX by sending some magic sequence when the VM is stuck. I don't remember that now. VMware support should be able to help there.

That way one could gather some info, as the is no way to get to the VM in the "Stuck" state.

Architect @ Pure Storage || www.purestorage.com || http://www.purestorage.com/blog/ || http://twitter.com/#!/purestorage ||@ravivenk || VCAP-DCA5, VCP 4, VCP 5

drivera01 · ‎11-30-2011

We have a predominately Linux shop running on esxi4.1 update 1. I am having the EXACT same thing happen to me, We have had 6 VM just up and hang, always having CPU spike. I also open a support request with VMware, basically they came back with...so sorry..to bad. To their defense, how the heck do you try and figure out an issue that only happens once in a while.

I do not know about you, but the system did have a few log entires. I honestly think it is a kernel error as stated previously.

I think it is the kernel bug: : divide by zero bug in find_busiest_group

I was scouring the internet for hours referencing what little log information I could gather. I found a decent amount of info to at least warrant the in depth look into this possible solution.

As with you I suspected VMware in the begining, but while trying to look at this from a perspective of "outside the box" I believe there may be another problem.

What is your kernel level:

#uname -a

edwinstarkey · ‎11-30-2011

One of the affected VMs is running 2.6.32-5-686-bigmem. It's the latest stable kernel in 32-bit Debian 6.

Linux localhost 2.6.32-5-686-bigmem #1 SMP Mon Oct 3 05:03:32 UTC 2011 i686 GNU/Linux

drivera01 · ‎11-30-2011

That is the kernel level I keep seeing for the bug

Here are a few links to check out.... Keep in mind, from what I read, this is a kernel bug that is hitting many different distros of linux, so yuo may find information on distros you are not running. I did.

Again, this is only an educated guess as to the problem, but so far I am really pointing this way....

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/824304 (in reference to ubuntu, but they reference the symptoms)

http://kernel.opensuse.org/cgit/kernel-source/diff/patches.fixes/sched-divide-by-zero-workaround-and...

http://kernel.opensuse.org/cgit/kernel-source/commit/?id=425886b2e0f44b7cb6691ecd1da693b70dc878cc query keywords: ("update_cpu_power: cpu_power" kernel )

edwinstarkey · ‎11-30-2011

drivera01, have you found a solution? Have those patches fixed the problem for you? Also, where did you see log entries? In my case the console displays nothing and after reboot there aren't any messages in /var/log/kernel. I was thinking about adding a virtual serial port and having that logged to a file, maybe it would capture the oops.

drivera01 · ‎11-30-2011

I am just starting the search, a coworker of mine has offered to open a support case with Novel, but I do not know when that will happen.

The only messages I have seen are in /var/log/messages* we run Suse

you might want to open a support case if you have support.

Sorry thats all I know

David

ulli-2206 · ‎12-28-2011

Hi,

today one of our machines showed the same problem. It did became unresponsive and started using 100% vCPU.

Guest OS is SLES 11 SP1, Kernel 2.6.32.

I didn't find the problem that caused the crash but i wanted to share what i found out so far, maybe it is helpful:

I examined /var/log/messages and found several messages repeating over and over again until the crash:

Dec 28 08:54:36 vmguest01 kernel: [83801.848812] update_cpu_power: cpu_power = 83814858285515; scale_rt = 18428813490058355147
Dec 28 08:54:36 vmguest01 kernel: [83801.849402] update_cpu_power: cpu_power = 83814858881712; scale_rt = 18428813490058951344
Dec 28 08:54:36 vmguest01 kernel: [83801.851079] update_cpu_power: cpu_power = 83814860558860; scale_rt = 18428813490060628492
Dec 28 08:54:36 vmguest01 kernel: [83801.851757] update_cpu_power: cpu_power = 83814861225898; scale_rt = 18428813490061295530
Dec 28 08:54:36 vmguest01 kernel: [83801.852086] update_cpu_power: cpu_power = 83814861566880; scale_rt = 18428813490061636512

I went back to the first occurance of such a line and found something interesting at a point about 24 hours before the crash:

Dec 27 09:35:18 vmguest01 kernel: [18446744062.849328] ------------[ cut here ]------------
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849345] WARNING: at /usr/src/packages/BUILD/kernel-default-2.6.32.36/linux-2.6.32/kernel/sched.c:3847 update_cpu_power+0x151/0x160()
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849348] Hardware name: VMware Virtual Platform
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849349] Modules linked in: iptable_filter ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables x_tables tun nfs lockd fscache nfs_acl auth_rpcgss sunrpc af_packet st ide_cd_mod joydev vsock(X) vmci(X) vmmemctl(X) acpiphp microcode fuse loop dm_mod rtc_cmos ppdev rtc_core parport_pc rtc_lib parport ac shpchp container pci_hotplug floppy intel_agp button tpm_tis tpm tpm_bios sg mptctl i2c_piix4 pcspkr e1000 i2c_core sr_mod cdrom sd_mod crc_t10dif edd ext3 mbcache jbd fan processor ide_pci_generic piix ide_core ata_generic ata_piix libata mptspi mptscsih mptbase scsi_transport_spi thermal thermal_sys hwmon vmw_pvscsi scsi_mod vmxnet3
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849405] Supported: Yes
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849407] Pid: 11077, comm: export2online-e Tainted: G X 2.6.32.36-0.5-default #1
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849408] Call Trace:
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849426] [<ffffffff810061dc>] dump_trace+0x6c/0x2d0
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849436] [<ffffffff81397d88>] dump_stack+0x69/0x71
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849445] [<ffffffff8104d804>] warn_slowpath_common+0x74/0xd0
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849453] [<ffffffff8103d771>] update_cpu_power+0x151/0x160
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849458] [<ffffffff8103e3b3>] find_busiest_group+0xa83/0xce0
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849461] [<ffffffff810460dd>] load_balance_newidle+0xcd/0x380
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849465] [<ffffffff81398b7b>] thread_return+0x2a7/0x34c
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849468] [<ffffffff81399b9d>] do_nanosleep+0x8d/0xc0
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849475] [<ffffffff81068768>] hrtimer_nanosleep+0xa8/0x140
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849483] [<ffffffff8108b0e6>] compat_sys_nanosleep+0xd6/0x150
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849488] [<ffffffff81033c19>] sysenter_dispatch+0x7/0x27
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849496] [<00000000ffffe430>] 0xffffe430
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849497] ---[ end trace d164ede915d60e57 ]---
Dec 27 09:35:18 vmguest01 kernel: [18446744062.849498] update_cpu_power: cpu_power = 2149207707; scale_rt = 18428729677349277339
Dec 27 09:35:18 vmguest01 kernel: [18446744062.850151] update_cpu_power: cpu_power = 2150067177; scale_rt = 18428729677350136809

The most interesting part is the timestamp: 18446744062.850151.

I went through the sources of the linux kernel and it seems that this value gives the number of seconds since boot up. So this would mean boot up was about 18 billion seconds ago, that's nearly 585 years. Internally this value seems to be handled in nanoseconds. Expressed in nanoseconds, the value is 1.84 * 10^19. This is near to the maximum value an unsigned long integer (2 ^ 64) can hold. And a few lines later, it seems to have overflown:

Dec 27 09:35:28 vmguest01 kernel: [18446744072.839931] update_cpu_power: cpu_power = 12139859978; scale_rt = 18428729687339929610
Dec 27 09:35:28 vmguest01 kernel: [18446744072.840340] update_cpu_power: cpu_power = 12140269170; scale_rt = 18428729687340338802
Dec 27 09:35:33 vmguest01 kernel: [ 4.123417] __ratelimit: 909 callbacks suppressed
Dec 27 09:35:33 vmguest01 kernel: [ 4.123420] update_cpu_power: cpu_power = 17132887892; scale_rt = 18428729692332957524

The values of cpu_power and scale_rt kept increasing monotonically throughout the log file. I guess the point the machine crashes is when the value hits a critical maximum and may cause an overflow and/or negative values somewhere within the kernel.

Maybe the "high" timestamp value is caused by an adjustment to the guest OS's clock done by VMware? Maybe this adjustment causes the time to become a negative value (causing an underflow ending in a very high positive value)?

I hope someone finds this helpful

Best regards

Thorsten

ulli-2206 · ‎12-28-2011

Hi,

in the meantime i found out that the problem is caused by a bug in the Linux kernel: A timer overflow occurs after 208 days uptime.

Kernel 2.6.32 seems to be affected, earlier kernel versions do not seem to have this bug.

More information about this bug (and a patch) can be found here:

http://www.gossamer-threads.com/lists/engine?do=post_view_flat;post=1371841;page=1;sb=post_latest_re...

Best regards,

Thorsten

All

VM becomes unresponsive, some vCPUs are 100% utilized