fauzigo
Contributor

Kernel tracebacks after vSphere upgrade to 6.0


Hi all,

We have a vApp in vCloud whose VMs had been working flawlessly since day one. But after we upgraded our vSphere environment to version 6.0, the VMs started rebooting themselves all of a sudden and wouldn't stay up for more than about five days. We then disabled VM monitoring for those VMs in the cluster configuration, so HA no longer reboots them, but we are seeing all kinds of kernel call traces, for example:

Jul  5 06:19:35 compu kernel: hrtimer: interrupt took 42859741512 ns

Jul  5 06:19:36 compu kernel: INFO: rcu_sched self-detected stall on CPU

Jul  5 06:19:36 compu kernel: INFO: rcu_sched detected stalls on CPUs/tasks: { 3} (detected by 0, t=69334 jiffies, g=9237069, c=9237068, q=5)

Jul  5 06:19:36 compu kernel: Task dump for CPU 3:

Jul  5 06:19:36 compu kernel: kworker/3:2     R  running task        0 22078      2 0x00000088

Jul  5 06:19:36 compu kernel: sched: RT throttling activated

Jul  5 06:19:36 compu kernel: Workqueue: events_freezable vmballoon_work [vmw_balloon]

Jul  5 06:19:36 compu kernel: { 3}  (t=73140 jiffies g=9237069 c=9237068 q=5)

Jul  5 06:19:36 compu kernel: 0000000000000000 0000000000000200 000000003e25cd65 ffffffff81f0a540

Jul  5 06:19:36 compu kernel: 0000000000000202 0000000000000000 0000000000000000 ffff880716cd0000

Jul  5 06:19:36 compu kernel: ffff88041e9d3dc0 ffffffff811cf94a ffff880716cd0000 0000000000000000

Jul  5 06:19:36 compu kernel: Call Trace:

Jul  5 06:19:36 compu kernel: [<ffffffff811cf94a>] ? alloc_pages_current+0xaa/0x170

Jul  5 06:19:36 compu kernel: [<ffffffffa03b5ff8>] ? vmballoon_work+0x3b8/0x618 [vmw_balloon]

Jul  5 06:19:36 compu kernel: [<ffffffff810a845b>] ? process_one_work+0x17b/0x470

Jul  5 06:19:36 compu kernel: [<ffffffff810a9296>] ? worker_thread+0x126/0x410

Jul  5 06:19:36 compu kernel: [<ffffffff810a9170>] ? rescuer_thread+0x460/0x460

Jul  5 06:19:36 compu kernel: [<ffffffff810b0a4f>] ? kthread+0xcf/0xe0

Jul  5 06:19:36 compu kernel: [<ffffffff810bfde3>] ? finish_task_switch+0x53/0x180

Jul  5 06:19:36 compu kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140

Jul  5 06:19:36 compu kernel: [<ffffffff81697698>] ? ret_from_fork+0x58/0x90

Jul  5 06:19:36 compu kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140

Jul  5 06:19:36 compu kernel: Task dump for CPU 3:

Jul  5 06:19:36 compu kernel: kworker/3:2     R  running task        0 22078      2 0x00000088

Jul  5 06:19:36 compu kernel: Workqueue: events_freezable vmballoon_work [vmw_balloon]

Jul  5 06:19:36 compu kernel: ffff880716cd0000 000000003e25cd65 ffff880c3fd83c48 ffffffff810c47e8

Jul  5 06:19:36 compu kernel: 0000000000000003 ffffffff81a1e6c0 ffff880c3fd83c60 ffffffff810c80c9

Jul  5 06:19:36 compu kernel: 0000000000000004 ffff880c3fd83c90 ffffffff81137960 ffff880c3fd901c0

Jul  5 06:19:36 compu kernel: Call Trace:

Jul  5 06:19:36 compu kernel: <IRQ>  [<ffffffff810c47e8>] sched_show_task+0xa8/0x110

Jul  5 06:19:36 compu kernel: [<ffffffff810c80c9>] dump_cpu_task+0x39/0x70

Jul  5 06:19:36 compu kernel: [<ffffffff81137960>] rcu_dump_cpu_stacks+0x90/0xd0

Jul  5 06:19:36 compu kernel: [<ffffffff8113b0b2>] rcu_check_callbacks+0x442/0x720

Jul  5 06:19:36 compu kernel: [<ffffffff810f3610>] ? tick_sched_handle.isra.13+0x60/0x60

Jul  5 06:19:36 compu kernel: [<ffffffff81099697>] update_process_times+0x47/0x80

Jul  5 06:19:36 compu kernel: [<ffffffff810f35d5>] tick_sched_handle.isra.13+0x25/0x60

Jul  5 06:19:36 compu kernel: [<ffffffff810f3651>] tick_sched_timer+0x41/0x70

Jul  5 06:19:36 compu kernel: [<ffffffff810b4d72>] __hrtimer_run_queues+0xd2/0x260

Jul  5 06:19:36 compu kernel: [<ffffffff810b5310>] hrtimer_interrupt+0xb0/0x1e0

Jul  5 06:19:36 compu kernel: [<ffffffff81050ff7>] local_apic_timer_interrupt+0x37/0x60

Jul  5 06:19:36 compu kernel: [<ffffffff81699e4f>] smp_apic_timer_interrupt+0x3f/0x60

Jul  5 06:19:36 compu kernel: [<ffffffff8169839d>] apic_timer_interrupt+0x6d/0x80

Jul  5 06:19:36 compu kernel: [<ffffffff8108f5e8>] ? __do_softirq+0x98/0x280

Jul  5 06:19:36 compu kernel: [<ffffffff810f12fb>] ? clockevents_program_event+0x6b/0xf0

Jul  5 06:19:36 compu kernel: [<ffffffff816991dc>] call_softirq+0x1c/0x30

Jul  5 06:19:36 compu kernel: [<ffffffff8102d365>] do_softirq+0x65/0xa0

Jul  5 06:19:36 compu kernel: [<ffffffff8108f9d5>] irq_exit+0x115/0x120

Jul  5 06:19:36 compu kernel: [<ffffffff81699e55>] smp_apic_timer_interrupt+0x45/0x60

Jul  5 06:19:36 compu kernel: [<ffffffff8169839d>] apic_timer_interrupt+0x6d/0x80

Jul  5 06:19:36 compu kernel: <EOI>  [<ffffffff8118ac84>] ? get_page_from_freelist+0x434/0x9f0

Jul  5 06:19:36 compu kernel: [<ffffffff8118b3b6>] __alloc_pages_nodemask+0x176/0x420

Jul  5 06:19:36 compu kernel: [<ffffffff811cf94a>] alloc_pages_current+0xaa/0x170

Jul  5 06:19:36 compu kernel: [<ffffffffa03b5ff8>] vmballoon_work+0x3b8/0x618 [vmw_balloon]

Jul  5 06:19:36 compu kernel: [<ffffffff810a845b>] process_one_work+0x17b/0x470

Jul  5 06:19:36 compu kernel: [<ffffffff810a9296>] worker_thread+0x126/0x410

Jul  5 06:19:36 compu kernel: [<ffffffff810a9170>] ? rescuer_thread+0x460/0x460

Jul  5 06:19:36 compu kernel: [<ffffffff810b0a4f>] kthread+0xcf/0xe0

Jul  5 06:19:37 compu kernel: [<ffffffff810bfde3>] ? finish_task_switch+0x53/0x180

Jul  5 06:19:37 compu kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140

Jul  5 06:19:37 compu kernel: [<ffffffff81697698>] ret_from_fork+0x58/0x90

Jul  5 06:19:37 compu kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140

Jul  6 02:19:37 compu kernel: INFO: rcu_sched detected stalls on CPUs/tasks: { 3} (detected by 1, t=75951 jiffies, g=11716833, c=11716832, q=7)

Jul  6 02:19:37 compu kernel: Task dump for CPU 3:

Jul  6 02:19:37 compu kernel: kworker/3:0     R  running task        0  3126      2 0x00000088

Jul  6 02:19:37 compu kernel: Workqueue: events_freezable vmballoon_work [vmw_balloon]

Jul  6 02:19:37 compu kernel: ffff880291e33d98 0000000000000046 ffff880700474e70 ffff880291e33fd8

Jul  6 02:19:37 compu kernel: ffff880291e33fd8 ffff880291e33fd8 ffff880700474e70 ffff880291e30000

Jul  6 02:19:37 compu kernel: 00000000000000ab 00000000000000ab 0000000000000000 ffffffffa03b8558

Jul  6 02:19:37 compu kernel: Call Trace:

Jul  6 02:19:37 compu kernel: [<ffffffff810c2076>] __cond_resched+0x26/0x30

Jul  6 02:19:37 compu kernel: [<ffffffff8168c9ea>] _cond_resched+0x3a/0x50

Jul  6 02:19:37 compu kernel: [<ffffffffa03b6029>] vmballoon_work+0x3e9/0x618 [vmw_balloon]

Jul  6 02:19:37 compu kernel: [<ffffffff810a845b>] process_one_work+0x17b/0x470

Jul  6 02:19:37 compu kernel: [<ffffffff810a9296>] worker_thread+0x126/0x410

Jul  6 02:19:37 compu kernel: [<ffffffff810a9170>] ? rescuer_thread+0x460/0x460

Jul  6 02:19:37 compu kernel: [<ffffffff810b0a4f>] kthread+0xcf/0xe0

Jul  6 02:19:37 compu kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140

Jul  6 02:19:37 compu kernel: [<ffffffff81697698>] ret_from_fork+0x58/0x90

Jul  6 02:19:37 compu kernel: [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
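To gauge how often these events recur across the affected VMs, a quick pass over the syslog helps. A minimal sketch, assuming the CentOS 7 default log path `/var/log/messages` (adjust for your logging setup):

```shell
# Count RCU stall and hrtimer warnings per day.
# Syslog lines start with "Mon DD", so $1 and $2 give a per-day key.
grep -E 'rcu_sched.*detected stall|hrtimer: interrupt took' /var/log/messages |
  awk '{print $1, $2}' | sort | uniq -c
```

Each output line is a count followed by the month and day, which makes it easy to see whether the stalls cluster around particular days or hosts.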

I opened a case with VMware support looking for help with this issue, but all I got was a "go check the OS or the applications" response. To be honest, the issues started the day after we updated to vSphere 6.0. We initially updated only half of the nodes in the cluster, so the other half was still running 5.5. Furthermore, we split the VMs 50-50 between the 6.0 and 5.5 hosts, and the VMs hosted on the 5.5 nodes never showed this behavior. Since we eventually had to complete the update on the remaining nodes, they are all now running version 6.0 and the issue is far from gone.

The issue affects only VMs within this vApp. The vCloud account has more than enough resources to cover the pool reservation. The problem happens only on VMs running CentOS 7 with kernel 3.10 or 3.11.
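Worth noting that every trace above lands in `vmballoon_work`, so it may be useful to check balloon activity inside a guest while it misbehaves. A sketch, assuming VMware Tools (or open-vm-tools) is installed; the debugfs path requires root and a kernel that exposes it:

```shell
# Reports how much guest memory the hypervisor has reclaimed via ballooning.
vmware-toolbox-cmd stat balloon

# The vmw_balloon driver also exposes counters via debugfs on some kernels.
cat /sys/kernel/debug/vmmemctl 2>/dev/null
```

A consistently non-zero balloon size on a VM inside a fully reserved pool would point back at host-side memory pressure rather than the guest kernel.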

Now, I read in a post somewhere else that it might be an incompatibility between the hypervisor and said kernel version, so the Linux flavor running on those VMs should be the least of our concerns. Before trying to install kernel 4.x from unofficial repositories, I was wondering if anyone has run into the same situation and would share their approach, whether they actually troubleshot it or are still fighting the same problem, so we can pool information and hopefully nail this down.

Thanks in advance.

1 Solution

Accepted Solutions
fauzigo
Contributor

Updating the kernel to version 4.12 from the ELRepo repository did the job. We are no longer getting kernel call traces.

Not exactly the most elegant solution, since 4.x kernels are not yet available in the official CentOS 7 repositories.
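For anyone landing here, this is roughly the ELRepo procedure. A sketch only: the `elrepo-release` package version in the URL changes over time, so check elrepo.org for the current one, and all of this needs root:

```shell
# Enable the ELRepo repository and install a mainline kernel on CentOS 7.
rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
yum install https://www.elrepo.org/elrepo-release-7.0-3.el7.elrepo.noarch.rpm

# kernel-ml is ELRepo's mainline kernel package (4.12 at the time of writing).
yum --enablerepo=elrepo-kernel install kernel-ml

# Assumes the newly installed kernel is the first GRUB menu entry.
grub2-set-default 0
reboot
```

After the reboot, `uname -r` should report the new kernel version.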


8 Replies
dekoshal
Hot Shot

This looks close to the kernel backtrace you are getting:

https://translate.google.co.in/translate?hl=en&sl=es&u=https://bugs.centos.org/print_bug_page.php%3F...

According to the VMware KB article below, this could be caused by resource overcommitment.

Soft lockup messages from Linux kernel running in an SMP-enabled virtual machine (1009996) | VMware ...

Further, the above VMware KB article mentions: "Many Linux kernels have a soft lockup watchdog thread, and report soft lockup messages if that watchdog thread does not get scheduled for more than 10 seconds."

The default watchdog_thresh can be raised (reference link below) to stop messages similar to "BUG: soft lockup detected on CPU#1!".

However, you may want to check the resource allocation as well:

CentOS 7 - kernel: BUG: soft lockup - CPU#3 stuck for 23s! [rcuos/1:19] - CentOS
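For reference, that threshold change is a sysctl. A sketch assuming a 30-second value (needs root, and note this only silences the watchdog messages; it does not address the underlying stall):

```shell
# Apply a higher soft-lockup threshold immediately (default is 10 seconds).
sysctl -w kernel.watchdog_thresh=30

# Persist it across reboots.
echo 'kernel.watchdog_thresh = 30' > /etc/sysctl.d/90-watchdog.conf
```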

If you found this or any other answer helpful, please consider the use of the Correct or Helpful to award points.

Best Regards,

Deepak Koshal

CNE|CLA|CWMA|VCP4|VCP5|CCAH

fauzigo
Contributor

Hi,

I've gone through those same articles multiple times and stopped the soft-lockup messages from popping up in the logs, but the issue persists: call traces keep showing and the VMs start misbehaving.

The procedures suggested there only get rid of the soft-lockup messages in the logs.

dekoshal
Hot Shot

Try upgrading VMware Tools on the problem VMs and see if that helps.


fauzigo
Contributor

Already tried that, no difference.

dekoshal
Hot Shot

If there is no resource constraint and VMware Tools is already at the latest version, all I can suggest is to use the vmxnet3 network adapter. If that's already in place, I think this issue could very well be a bug.

Using a different, preferably the latest, kernel version may help.


fauzigo
Contributor

At this point, I believe that's all I have.

The VMs are in a pool whose reservation is greater than the total allocated to all the VMs, so there are no constraints at all.

The network adapter is vmxnet3 on all VMs. So I will try kernel 4.4 on CentOS 7 and see what the results are.

dekoshal
Hot Shot

Hi fauzigo,

Good to know that.

Cheers!!!
