VMware Cloud Community
Frigo42
Contributor

Ballooning memory stuck in "is resetting"

On our recent SLES 12 SP4 servers, we noticed that the VMware balloon driver on the guest side does not properly release memory after ballooning.

The host (ESXi 6.5.0) does not show any ballooning:

> vmware-toolbox-cmd stat balloon

0 MB

However, the guest driver shows dozens of gigabytes still held while "is resetting" is set:

> cat /sys/kernel/debug/vmmemctl

balloon capabilities:  0x1e
used capabilities:      0x1e
is resetting:          y
target:              8204474 pages
current:            8204474 pages
timer:                40802
doorbell:              2467
start:                    1 (  0 failed)
guestType:                1 (  0 failed)
2m-lock:                905 (  0 failed)
lock:                  21771 (  0 failed)
2m-unlock:              1461 (  0 failed)
unlock:                20602 (  0 failed)
target:                40801 (  1 failed)
prim2mAlloc:          240083 (  94 failed)
primNoSleepAlloc:  10905780 (  4 failed)
primCanSleepAlloc:    218318 (  0 failed)
prim2mFree:          225282
primFree:          10449504
err2mAlloc:                0
errAlloc:                  0
err2mFree:                0
errFree:                  21
doorbellSet:              1
doorbellUnset:            2

This leaks a huge amount of memory and causes all kinds of instability. The only workaround we have found is to reboot the guest.

Further investigation shows that this is caused by a new feature introduced in recent kernels and ESXi releases: the ability to use a VMCI doorbell to resume the ballooning work.

In our case, we can see the following in the system logs:

vmw_balloon: vmballoon_send_batched_lock - batch ppn 17fd56c, hv returns 7

Shortly afterwards, we can observe one kworker thread stuck in the D (uninterruptible sleep) state:

> ps -e -o pid,state,cmd | grep 'D \[kworker'

182935 D [kworker/8:2]

and a stack trace of that thread shows that the balloon driver is stuck in the VMCI doorbell cleanup path:

Workqueue: events_freezable vmballoon_work [vmw_balloon]
Call Trace:
? __schedule+0x292/0x880
schedule+0x32/0x80
schedule_timeout+0x1e6/0x300
? __wait_rcu_gp+0xcf/0xf0
wait_for_completion+0xa3/0x110
? wake_up_q+0x70/0x70
vmci_doorbell_destroy+0x8d/0xc0 [vmw_vmci]
vmballoon_vmci_cleanup+0x43/0x70 [vmw_balloon]
vmballoon_work+0x7f/0x65e [vmw_balloon]
process_one_work+0x14c/0x390
worker_thread+0x47/0x3e0
kthread+0xff/0x140
? max_active_store+0x60/0x60
? __kthread_parkme+0x70/0x70
ret_from_fork+0x35/0x40
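For the record, the stack above can be dumped directly with the PID from the ps output, assuming the kernel exposes /proc/<pid>/stack (CONFIG_STACKTRACE):

> cat /proc/182935/stack

Alternatively, echo w > /proc/sysrq-trigger writes the stacks of all blocked (D state) tasks to the kernel log.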

I will add the results of further investigation with dyndbg later.
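For anyone who wants to collect similar data, the debug messages of the balloon and VMCI drivers can be turned on at runtime via dynamic debug (assuming the kernel is built with CONFIG_DYNAMIC_DEBUG and debugfs is mounted):

echo 'module vmw_balloon +p' > /sys/kernel/debug/dynamic_debug/control
echo 'module vmw_vmci +p' > /sys/kernel/debug/dynamic_debug/control

The extra messages then show up in dmesg and the system logs.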

3 Replies
asajm
Expert

Hi Frigo42,

Check the VMware Knowledge Base.

If you think your query has been answered, please mark this response as the solution or give kudos.

ASAJM

anadav
VMware Employee
Accepted Solution

Thanks for your report. I think the following patch should fix it: https://lkml.org/lkml/2019/8/20/1447

Frigo42
Contributor

Yes!

I finally managed to validate it properly:

With a 255 GB RAM ESXi 6.5 host running 4 VMs with 128 GB of memory each:

On each VM, run a small binary that progressively mallocs up to 100 GB (a sketch is included below). It gets OOM-killed (by VMware Tools, it seems!) and restarts.

I also reset some of the VMs from time to time.

With that set-up, the problem is easily reproducible (modprobe -r vmw_balloon gets stuck).
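For reference, something along these lines is enough for the allocator binary (a minimal sketch, not the exact code we use; the 1 GiB chunk size and the 100 GiB cap are arbitrary):

/* Allocate and touch memory in 1 GiB steps up to ~100 GiB, then hold it
 * until the process is killed.  Build with: gcc -O2 -o memhog memhog.c */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK_BYTES (1UL << 30)   /* 1 GiB per step  */
#define MAX_CHUNKS  100UL         /* ~100 GiB total  */

int main(void)
{
    for (unsigned long i = 0; i < MAX_CHUNKS; i++) {
        char *p = malloc(CHUNK_BYTES);
        if (!p) {
            fprintf(stderr, "malloc failed after %lu GiB\n", i);
            break;
        }
        memset(p, 0xa5, CHUNK_BYTES); /* touch the pages so they are really backed */
        printf("allocated %lu GiB\n", i + 1);
        sleep(1);                     /* allocate progressively */
    }
    pause();                          /* hold the memory until OOM-killed */
    return 0;
}

Restarting it in a loop (for example from a shell while loop) after each OOM kill keeps the memory pressure up.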

After applying the fix, we see no more of these issues; moreover, the problem appears to be fixed on all the machines where we deployed it.

Thanks for the fix
