VMware Communities
peter222
Contributor

kernel BUG: soft lockup - CPU stuck for ... Workstation 12 Pro

Hello All,

I hope somebody can help to fix this problem as soon as possible.

I had an old VMware Server 2.0.x installation, which I replaced with the latest VMware Workstation 12 Pro. It runs on a Debian Wheezy 7.9 host with kernel 3.2.0-4-amd64.

My problem is that my VMs sometimes hang for a couple of seconds, and I have no idea how to start debugging what causes this. I'm sure it is not the hardware, because the new machine is roughly twice as powerful as the old one (CPU, memory, and disk).

The HDD the VMs are stored on can easily do:

Timing cached reads:   15036 MB in  2.00 seconds = 7526.71 MB/sec
Timing buffered disk reads: 344 MB in  3.00 seconds = 114.49 MB/sec

Monitoring it with iotop on the host shows only 300-400 KB/sec average usage.
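Besides iotop, one way to correlate a guest hang with a host-side disk stall is to sample /proc/diskstats on the host and diff the "time spent doing I/O" counter. A minimal sketch (assumes a Linux host; the device-name prefixes are an assumption and may need adjusting for your disks):

```python
# Sketch: parse the "milliseconds spent doing I/O" counter (field 13 of
# /proc/diskstats). Sampling it once per second on the host and diffing
# successive values shows whether a disk was saturated during a guest hang.
# The device-name prefixes below are an assumption; adjust for your setup.
def parse_diskstats(text, prefixes=("sd", "vd", "nvme")):
    """Return {device: io_ms} for devices whose name matches the prefixes."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 13:
            continue
        dev = fields[2]
        if dev.startswith(prefixes):
            stats[dev] = int(fields[12])  # field 13: ms spent doing I/O
    return stats

# Usage on the host (assumes Linux): sample, sleep 1 s, sample again;
# a delta near 1000 ms means the device was busy the whole second.
#   before = parse_diskstats(open("/proc/diskstats").read())
```

A per-device delta close to 1000 ms per second during a hang would point at host storage rather than the guest.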

A quick search for this problem yields a lot of results on the net suggesting the exact same solution:

http://customerhelp.co.za/linux/ubuntu/fix-ubuntu-bug-soft-lockup-cpu-stuck-vmware-server.html

Turn off APIC and ACPI. Although this seems to squelch the symptom, it does not solve the problem in any way.

The VM guest also shows increased latency while it hangs. For example (the ping is run directly from the host the guest is bridged to, so the latency should never exceed 1 ms):

64 bytes from 10.0.0.5: icmp_req=16 ttl=64 time=0.228 ms

64 bytes from 10.0.0.5: icmp_req=17 ttl=64 time=2641 ms

64 bytes from 10.0.0.5: icmp_req=18 ttl=64 time=1719 ms

64 bytes from 10.0.0.5: icmp_req=19 ttl=64 time=719 ms

64 bytes from 10.0.0.5: icmp_req=20 ttl=64 time=0.204 ms

64 bytes from 10.0.0.5: icmp_req=21 ttl=64 time=0.245 ms

...

64 bytes from 10.0.0.5: icmp_req=235 ttl=64 time=23387 ms
64 bytes from 10.0.0.5: icmp_req=236 ttl=64 time=22387 ms
64 bytes from 10.0.0.5: icmp_req=237 ttl=64 time=21387 ms
64 bytes from 10.0.0.5: icmp_req=238 ttl=64 time=20387 ms
64 bytes from 10.0.0.5: icmp_req=239 ttl=64 time=19387 ms
64 bytes from 10.0.0.5: icmp_req=240 ttl=64 time=18387 ms
64 bytes from 10.0.0.5: icmp_req=241 ttl=64 time=17388 ms
64 bytes from 10.0.0.5: icmp_req=242 ttl=64 time=16388 ms
64 bytes from 10.0.0.5: icmp_req=243 ttl=64 time=15388 ms
64 bytes from 10.0.0.5: icmp_req=244 ttl=64 time=14388 ms
64 bytes from 10.0.0.5: icmp_req=245 ttl=64 time=13388 ms
64 bytes from 10.0.0.5: icmp_req=246 ttl=64 time=12389 ms
64 bytes from 10.0.0.5: icmp_req=247 ttl=64 time=11389 ms
64 bytes from 10.0.0.5: icmp_req=248 ttl=64 time=10389 ms
64 bytes from 10.0.0.5: icmp_req=249 ttl=64 time=9389 ms
64 bytes from 10.0.0.5: icmp_req=250 ttl=64 time=8390 ms
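The decreasing staircase above (23387 ms, 22387 ms, ... dropping by roughly 1000 ms per one-second interval) is the signature of a single long freeze: replies queue up during the stall and are all delivered in a burst once the guest wakes, so the first RTT of the run approximates the freeze length. A small parser to pull such episodes out of a saved ping capture (a sketch; the `time=... ms` field matches the output above):

```python
import re

# Sketch: extract RTTs from saved ping output and group consecutive
# over-threshold replies into freeze episodes. The regex matches the
# "time=2641 ms" / "time=0.228 ms" fields in the capture above.
RTT_RE = re.compile(r"time=([\d.]+) ms")

def freeze_episodes(lines, threshold_ms=100.0):
    """Return a list of (n_packets, max_rtt_ms), one per freeze episode."""
    episodes, run = [], []
    for line in lines:
        m = RTT_RE.search(line)
        if not m:
            continue
        rtt = float(m.group(1))
        if rtt > threshold_ms:
            run.append(rtt)
        elif run:
            episodes.append((len(run), max(run)))
            run = []
    if run:
        episodes.append((len(run), max(run)))
    return episodes
```

Run over the capture above, it should report a 16-packet episode peaking around 23.4 s, i.e. the guest stalled for roughly 23 seconds.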

Here is a detailed kernel log from the guest showing what happens inside:

[Wed Jan 27 07:27:06 2016] BUG: soft lockup - CPU#1 stuck for 90s! [kworker/1:2:149]
[Wed Jan 27 07:27:06 2016] Modules linked in: ext2 loop coretemp crc32c_intel vmwgfx ttm drm aesni_intel aes_x86_64 i2c_piix4 snd_pcm vmw_balloon psmouse aes_generic i2c_core snd_page_alloc snd_timer snd evdev cryptd soundcore serio_raw pcspkr processor ac container button shpchp thermal_sys power_supply ext4 crc16 jbd2 mbcache dm_mod sg sd_mod crc_t10dif ata_generic mptspi ata_piix scsi_transport_spi mptscsih mptbase e1000 libata scsi_mod [last unloaded: scsi_wait_scan]
[Wed Jan 27 07:27:07 2016] CPU 1
[Wed Jan 27 07:27:07 2016] Modules linked in: ext2 loop coretemp crc32c_intel vmwgfx ttm drm aesni_intel aes_x86_64 i2c_piix4 snd_pcm vmw_balloon psmouse aes_generic i2c_core snd_page_alloc snd_timer snd evdev cryptd soundcore serio_raw pcspkr processor ac container button shpchp thermal_sys power_supply ext4 crc16 jbd2 mbcache dm_mod sg sd_mod crc_t10dif ata_generic mptspi ata_piix scsi_transport_spi mptscsih mptbase e1000 libata scsi_mod [last unloaded: scsi_wait_scan]

[Wed Jan 27 07:27:07 2016] Pid: 149, comm: kworker/1:2 Not tainted 3.2.0-4-amd64 #1 Debian 3.2.73-2+deb7u2 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
[Wed Jan 27 07:27:07 2016] RIP: 0010:[<ffffffff810eb8ac>]  [<ffffffff810eb8ac>] arch_local_irq_enable+0x4/0x8
[Wed Jan 27 07:27:07 2016] RSP: 0018:ffff88001a479de8  EFLAGS: 00000206
[Wed Jan 27 07:27:07 2016] RAX: ffff88001f84c018 RBX: 0000000000000003 RCX: ffff880016f0d000
[Wed Jan 27 07:27:07 2016] RDX: 0000000000000010 RSI: ffff88001f84c2a8 RDI: ffff88001f84c158
[Wed Jan 27 07:27:07 2016] RBP: ffff88001f83d840 R08: ffff880010b04900 R09: ffff880010b049c0
[Wed Jan 27 07:27:07 2016] R10: ffff880010b04900 R11: ffff880010b049c0 R12: ffff88001f84b500
[Wed Jan 27 07:27:07 2016] R13: ffff880010add040 R14: ffff880010add040 R15: ffffffff810bb378
[Wed Jan 27 07:27:07 2016] FS:  0000000000000000(0000) GS:ffff88001d220000(0000) knlGS:0000000000000000
[Wed Jan 27 07:27:07 2016] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[Wed Jan 27 07:27:07 2016] CR2: 00007fc1d3803140 CR3: 000000001aa71000 CR4: 00000000000006e0
[Wed Jan 27 07:27:07 2016] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[Wed Jan 27 07:27:07 2016] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[Wed Jan 27 07:27:07 2016] Process kworker/1:2 (pid: 149, threadinfo ffff88001a478000, task ffff88001eddc840)
[Wed Jan 27 07:27:07 2016] Stack:
[Wed Jan 27 07:27:07 2016]  ffffffff810ec4cb 0000000100226f3e ffff880000000000 ffff88001f84b500
[Wed Jan 27 07:27:07 2016]  ffff88001f83d840 0000000000000000 0000000000000000 ffff88001d22f7f0
[Wed Jan 27 07:27:07 2016]  ffffffff810ec6c3 0000000000000000 00000000ffffffff ffff88001f819980
[Wed Jan 27 07:27:07 2016] Call Trace:
[Wed Jan 27 07:27:07 2016]  [<ffffffff810ec4cb>] ? drain_array+0xbc/0xc9
[Wed Jan 27 07:27:07 2016]  [<ffffffff810ec6c3>] ? cache_reap+0xfa/0x1a2
[Wed Jan 27 07:27:07 2016]  [<ffffffff8105b7eb>] ? process_one_work+0x161/0x269
[Wed Jan 27 07:27:07 2016]  [<ffffffff8105c7b4>] ? worker_thread+0xc2/0x145
[Wed Jan 27 07:27:07 2016]  [<ffffffff8105c6f2>] ? manage_workers.isra.25+0x15b/0x15b
[Wed Jan 27 07:27:07 2016]  [<ffffffff8105f901>] ? kthread+0x76/0x7e
[Wed Jan 27 07:27:07 2016]  [<ffffffff81358ff4>] ? kernel_thread_helper+0x4/0x10
[Wed Jan 27 07:27:07 2016]  [<ffffffff8105f88b>] ? kthread_worker_fn+0x139/0x139
[Wed Jan 27 07:27:07 2016]  [<ffffffff81358ff0>] ? gs_change+0x13/0x13
[Wed Jan 27 07:27:07 2016] Code: db b6 6d db b6 6d 48 c1 fb 03 48 0f af da 48 89 58 20 5b c3 90 90 90 57 9d 66 66 90 66 90 c3 fa 66 66 90 66 66 90 c3 fb 66 66 90 <66> 66 90 c3 50 9c 58 66 66 90 66 90 48 89 04 24 e8 df ff ff ff
[Wed Jan 27 07:27:07 2016] Call Trace:
[Wed Jan 27 07:27:07 2016]  [<ffffffff810ec4cb>] ? drain_array+0xbc/0xc9
[Wed Jan 27 07:27:07 2016]  [<ffffffff810ec6c3>] ? cache_reap+0xfa/0x1a2
[Wed Jan 27 07:27:07 2016]  [<ffffffff8105b7eb>] ? process_one_work+0x161/0x269
[Wed Jan 27 07:27:07 2016]  [<ffffffff8105c7b4>] ? worker_thread+0xc2/0x145
[Wed Jan 27 07:27:07 2016]  [<ffffffff8105c6f2>] ? manage_workers.isra.25+0x15b/0x15b
[Wed Jan 27 07:27:07 2016]  [<ffffffff8105f901>] ? kthread+0x76/0x7e
[Wed Jan 27 07:27:07 2016]  [<ffffffff81358ff4>] ? kernel_thread_helper+0x4/0x10
[Wed Jan 27 07:27:07 2016]  [<ffffffff8105f88b>] ? kthread_worker_fn+0x139/0x139
[Wed Jan 27 07:27:07 2016]  [<ffffffff81358ff0>] ? gs_change+0x13/0x13
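The watchdog message means the vCPU made no scheduling progress for the whole 90 s window, so the kworker named in brackets is a victim, not the cause. To see how often and on which CPUs this happens, the guest's kern.log can be scanned for these events (a sketch; the regex matches the message format in the log above):

```python
import re

# Sketch: collect "soft lockup" watchdog events from a saved kern.log,
# using the message format shown in the trace above.
LOCKUP_RE = re.compile(r"BUG: soft lockup - CPU#(\d+) stuck for (\d+)s! \[(.+?)\]")

def lockup_events(lines):
    """Return a list of (cpu, seconds, task) tuples, one per watchdog hit."""
    events = []
    for line in lines:
        m = LOCKUP_RE.search(line)
        if m:
            events.append((int(m.group(1)), int(m.group(2)), m.group(3)))
    return events
```

If the events cluster on all vCPUs at the same timestamps, the whole VM is being starved, which points at the host (scheduling, storage, or a CPU erratum) rather than at any guest workload.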

This basically makes the server unusable for any production activity, and it is incredibly irritating to work on: my SSH session just freezes sometimes, as if there were a network issue (there isn't one).

So my question is: how do I go about debugging what actually causes the VMs to hang?

I run a great variety of guests, from Windows Server 2003 and BSD to Linux, and they all behave roughly the same.

Thank you.

5 Replies
admin
Immortal

Please post a vmware.log file.

peter222
Contributor

Hello,

I have edited the original post; the vmware.log (covering the last couple of days) is attached there.

admin
Immortal

It's possible that this is Intel erratum BD105.  Please update your microcode to patch level 0x15 and see if that helps.  (Check with your system vendor for a BIOS update that would include this microcode patch.)

peter222
Contributor

Hi,

Thanks for your answer. Did you mean:

BT105 (No Fix): LBR May Contain Incorrect Information When Using FREEZE_LBRS_ON_PMI

I hope not, since that one has no fix. But this is a fairly old processor:

model name    : Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz

The microcode being used right now is: microcode       : 0x13

Upgrading it via

ii  intel-microcode                       1.20150121.1                       amd64        Processor microcode firmware for Intel CPUs

is definitely worth a shot. I'll let you know whether it fixed the problem after a reboot.
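After the reboot, whether the new revision actually took can be checked by parsing /proc/cpuinfo and comparing every core's revision against the 0x15 level suggested above. A small sketch (the 0x15 target comes from the advice in this thread; the `microcode` field name is standard in Linux /proc/cpuinfo):

```python
# Sketch: read the per-CPU "microcode" revision from /proc/cpuinfo text and
# check that every core is at or above the required patch level
# (0x15, per the suggestion above).
def microcode_ok(cpuinfo_text, required=0x15):
    revs = []
    for line in cpuinfo_text.splitlines():
        if line.startswith("microcode"):
            revs.append(int(line.split(":")[1].strip(), 16))
    return bool(revs) and min(revs) >= required

# Usage on a live host (assumes Linux):
#   microcode_ok(open("/proc/cpuinfo").read())
```

Checking the minimum across all cores matters because the microcode loader updates each core individually.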

admin
Immortal

No; I meant BD105:

BD105. Execution of VMPTRLD May Corrupt Memory If Current-VMCS Pointer is Invalid

Problem: If the VMCLEAR instruction is executed with a pointer to the current-VMCS (virtual-machine control structure), the current-VMCS pointer becomes invalid as expected. A subsequent execution of the VMPTRLD (Load Pointer to Virtual-Machine Control Structure) instruction may erroneously overwrite the four bytes at physical address 0000008FH.

Implication: Due to this erratum, the four bytes in system memory at physical address 0000008FH may be corrupted.

Workaround: It is possible for BIOS to contain a workaround for this erratum.
