On our recent SLES 12 SP4 servers, we noticed that the VMware balloon driver (vmw_balloon) on the guest side does not properly release memory after ballooning.
The host (ESXi 6.5.0) does not show any ballooning:
> vmware-toolbox-cmd stat balloon
0 MB
However, the guest driver shows dozens of gigabytes still held (target/current of 8204474 pages, i.e. roughly 31 GB with 4 KiB pages), with the balloon stuck in the "resetting" state:
> cat /sys/kernel/debug/vmmemctl
balloon capabilities: 0x1e
used capabilities: 0x1e
is resetting: y
target: 8204474 pages
current: 8204474 pages
timer: 40802
doorbell: 2467
start: 1 ( 0 failed)
guestType: 1 ( 0 failed)
2m-lock: 905 ( 0 failed)
lock: 21771 ( 0 failed)
2m-unlock: 1461 ( 0 failed)
unlock: 20602 ( 0 failed)
target: 40801 ( 1 failed)
prim2mAlloc: 240083 ( 94 failed)
primNoSleepAlloc: 10905780 ( 4 failed)
primCanSleepAlloc: 218318 ( 0 failed)
prim2mFree: 225282
primFree: 10449504
err2mAlloc: 0
errAlloc: 0
err2mFree: 0
errFree: 21
doorbellSet: 1
doorbellUnset: 2
This leaks a huge volume of memory and causes all kinds of instability. The only workaround we have found is to reboot the guest.
Further investigation shows that this is caused by a new feature introduced by recent kernel and ESXi versions: the ability to use a VMCI doorbell to resume the ballooning work.
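For context, the mechanism is roughly this (a minimal sketch with illustrative names, not the exact vmw_balloon source): the balloon work item normally re-arms itself periodically, and the doorbell callback lets the host kick it immediately:

#include <linux/workqueue.h>

/* Minimal sketch of the doorbell feature, illustrative names only:
 * the balloon work runs off a timer; the VMCI doorbell lets the host
 * trigger it without waiting for the next tick. */
static void balloon_work_fn(struct work_struct *work)
{
        /* ... inflate/deflate toward the host's target, then re-queue
         * itself with a delay ... */
}
static DECLARE_DELAYED_WORK(balloon_dwork, balloon_work_fn);

/* VMCI doorbell callback, invoked when the host rings the doorbell. */
static void balloon_doorbell(void *client_data)
{
        /* Run the balloon work immediately instead of waiting for the
         * next periodic tick. */
        mod_delayed_work(system_freezable_wq, &balloon_dwork, 0);
}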
In our case, we can see in the system logs:
vmw_balloon: vmballoon_send_batched_lock - batch ppn 17fd56c, hv returns 7
(Status 7 appears to be VMW_BALLOON_ERROR_RESET, i.e. the hypervisor asks the guest to reset the balloon, which matches the "is resetting: y" above.) Shortly after, we can observe one kworker thread stuck in the D (uninterruptible sleep) state:
> ps -e -o pid,state,cmd | grep 'D \[kworker'
182935 D [kworker/8:2]
and a stack trace of that thread shows the ballooning driver stuck in the VMCI doorbell cleanup:
Workqueue: events_freezable vmballoon_work [vmw_balloon]
Call Trace:
? __schedule+0x292/0x880
schedule+0x32/0x80
schedule_timeout+0x1e6/0x300
? __wait_rcu_gp+0xcf/0xf0
wait_for_completion+0xa3/0x110
? wake_up_q+0x70/0x70
vmci_doorbell_destroy+0x8d/0xc0 [vmw_vmci]
vmballoon_vmci_cleanup+0x43/0x70 [vmw_balloon]
vmballoon_work+0x7f/0x65e [vmw_balloon]
process_one_work+0x14c/0x390
worker_thread+0x47/0x3e0
kthread+0xff/0x140
? max_active_store+0x60/0x60
? __kthread_parkme+0x70/0x70
ret_from_fork+0x35/0x40
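The trace shows vmci_doorbell_destroy() blocked in wait_for_completion(). My current hypothesis, as a simplified sketch (illustrative names, not the actual vmw_vmci source): the doorbell resource is refcounted and the destroy path waits for the last reference to be dropped; if the interrupt path takes a reference but then finds the work already queued, that reference is never released, so the refcount never reaches zero and the completion never fires:

#include <linux/completion.h>
#include <linux/kernel.h>
#include <linux/kref.h>
#include <linux/workqueue.h>

/* Simplified sketch of the suspected leak; names are illustrative. */
struct doorbell_resource {
        struct kref kref;
        struct completion done;    /* fired on the final kref put */
        struct work_struct work;   /* runs the doorbell callback */
};

static void doorbell_release(struct kref *kref)
{
        struct doorbell_resource *db =
                container_of(kref, struct doorbell_resource, kref);

        complete(&db->done);       /* unblocks doorbell_destroy() */
}

/* Work handler: invokes the callback, then drops the reference the
 * interrupt path took on its behalf. */
static void doorbell_work_fn(struct work_struct *work)
{
        struct doorbell_resource *db =
                container_of(work, struct doorbell_resource, work);

        /* ... run the registered doorbell callback ... */
        kref_put(&db->kref, doorbell_release);
}

/* Interrupt path: the host rang the doorbell. */
static void doorbell_fire(struct doorbell_resource *db)
{
        kref_get(&db->kref);       /* ref on behalf of the work */
        schedule_work(&db->work);
        /* SUSPECTED LEAK: if the work was already queued,
         * schedule_work() returns false and the handler will run only
         * once, so the reference taken above is never dropped. */
}

/* Destroy path: what the stuck kworker in the trace is executing. */
static void doorbell_destroy(struct doorbell_resource *db)
{
        kref_put(&db->kref, doorbell_release);  /* drop the initial ref */
        wait_for_completion(&db->done);         /* hangs if a ref leaked */
}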
(To do: add the results of further investigation with dyndbg.)
Thanks for your report. I think the following patch should fix it: https://lkml.org/lkml/2019/8/20/1447
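For the record, my paraphrase of what the linked patch does (not the literal diff; reusing the illustrative names from the sketch in the report above): release the reference when the work turns out to be queued already:

/* Paraphrase of the fix in terms of the doorbell_fire() sketch from
 * the report above (not the literal patch): */
static void doorbell_fire(struct doorbell_resource *db)
{
        kref_get(&db->kref);       /* ref on behalf of the work */
        if (!schedule_work(&db->work))
                /* Work was already queued: drop the now-unneeded
                 * reference so the refcount can still reach zero and
                 * doorbell_destroy() can complete. */
                kref_put(&db->kref, doorbell_release);
}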
Hi Frigo42, can you check whether the patch fixes it for you?
> Thanks for your report. I think the following patch should fix it: https://lkml.org/lkml/2019/8/20/1447
Yes!
I finally managed to validate it properly:
With a 255 GB RAM ESXi 6.5 host running 4 VMs with 128 GB of memory each:
on each VM, run a small binary that progressively mallocs up to 100 GB (a sketch follows below). It gets OOM-killed (by VMware Tools, it seems!) and restarts.
I also reset some of the VMs from time to time.
With that set-up, the problem is easily reproducible (modprobe -r vmw_balloon gets stuck).
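The allocator binary is nothing fancy; here is a minimal sketch of something equivalent (the step size and target are illustrative):

/* Allocate and touch memory in 1 GB steps up to ~100 GB, then hold
 * it until the OOM killer terminates the process. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define STEP   (1UL << 30)        /* allocate 1 GB per step */
#define TARGET (100UL << 30)      /* stop around 100 GB */

int main(void)
{
        size_t total = 0;

        while (total < TARGET) {
                char *p = malloc(STEP);

                if (!p)
                        break;
                /* Touch every page so the memory is actually backed
                 * by RAM, not just reserved. */
                memset(p, 0xa5, STEP);
                total += STEP;
                printf("allocated %zu GB\n", total >> 30);
                sleep(1);         /* ramp up progressively */
        }
        pause();                  /* hold until OOM-killed */
        return 0;
}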
After applying the fix, we see no more of these issues; moreover, the problem appears to be fixed on all the machines where we deployed it.
Thanks for the fix!