VMware Communities
swordfeng
Contributor

Issue in Workstation 12.5.5 Linux vmnet module

I've been seeing occasional 'kernel stack overflow' errors, with the RIP register pointing into the function 'VNetBridgeNotify'.

Looking into the source code, I found what appears to be a bug in vmnet/bridge.c:

--- bridge.c 2017-05-14 02:24:23.764324763 +0800
+++ bridge_new.c 2017-05-14 02:24:20.494352085 +0800
@@ -1146,7 +1146,7 @@
                  void *data) // IN: device pertaining to event
 {
    VNetBridge *bridge = list_entry(this, VNetBridge, notifier);
-   struct net_device *dev = (struct net_device *) data;
+   struct net_device *dev = netdev_notifier_info_to_dev(data);
 
    switch (msg) {
    case NETDEV_UNREGISTER:

Other similar source code uses 'netdev_notifier_info_to_dev' to extract the struct net_device, for example linux/drivers/net/ppp/pppoe.c - Elixir - Free Electrons
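To make the difference concrete, here is a small userspace model in C. The struct definitions are simplified stand-ins (the real struct net_device is far larger), and dev_from_data_old/dev_from_data_new are hypothetical names for the two versions of the line in the patch. Only the shape of struct netdev_notifier_info and the behavior of netdev_notifier_info_to_dev() follow the kernel since 3.11: the device pointer is wrapped inside the info struct, so the old cast yields the address of the wrapper itself rather than the device.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in; the real struct net_device has many more fields. */
struct net_device {
   char name[16];
};

/* Since Linux 3.11 the notifier's third argument points at this wrapper,
 * whose member is the device pointer. */
struct netdev_notifier_info {
   struct net_device *dev;
};

/* Models the kernel helper: unwrap the info struct to get the device. */
static struct net_device *
netdev_notifier_info_to_dev(const struct netdev_notifier_info *info)
{
   assert(info != NULL);
   return info->dev;
}

/* Hypothetical: the original extraction, wrong on >= 3.11 because it
 * treats the wrapper itself as the device. */
static struct net_device *
dev_from_data_old(void *data)
{
   return (struct net_device *)data;
}

/* Hypothetical: the patched extraction. */
static struct net_device *
dev_from_data_new(void *data)
{
   return netdev_notifier_info_to_dev(data);
}
```

With a wrapper built around a device, dev_from_data_new() returns the device, while dev_from_data_old() returns the wrapper's address, so any field read through it (such as the one at offset 0x500 in the disassembly later in this thread) lands in unrelated memory.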

6 Replies
dariusd
VMware Employee

Thank you very much for reporting this!  I'll see that it gets fixed.

Cheers,

--

Darius

dariusd
VMware Employee

Hi again, swordfeng!

Linux kernel 3.11 changed the third argument of the netdevice notifier function from a struct net_device * to a struct netdev_notifier_info *.  Your patch fixes it neatly for newer Linux kernels, but the official fix we have queued up is a little more complicated because we still support host OSes running kernels older than 3.11.
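One plausible shape for a fix that still builds on pre-3.11 kernels is a compile-time switch on LINUX_VERSION_CODE. This is only a sketch, not VMware's actual queued change: the helper name VNetBridgeNotifierDev is hypothetical, and the mock types plus the KERNEL_VERSION()/LINUX_VERSION_CODE definitions exist only so the example compiles in userspace (in a real module they come from <linux/version.h> and <linux/netdevice.h>).

```c
#include <assert.h>
#include <stddef.h>

/* Userspace mock of the traditional KERNEL_VERSION() encoding. */
#define KERNEL_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))

/* Mock: defaults to the reporter's 4.10.13 kernel unless overridden. */
#ifndef LINUX_VERSION_CODE
#define LINUX_VERSION_CODE KERNEL_VERSION(4, 10, 13)
#endif

/* Mock types; on a real pre-3.11 kernel netdev_notifier_info does not exist. */
struct net_device { const char *name; };
struct netdev_notifier_info { struct net_device *dev; };

static struct net_device *
netdev_notifier_info_to_dev(const struct netdev_notifier_info *info)
{
   assert(info != NULL);
   return info->dev;
}

/* Hypothetical helper: picks the right interpretation of the notifier's
 * third argument at compile time. */
static struct net_device *
VNetBridgeNotifierDev(void *data)
{
#if LINUX_VERSION_CODE >= KERNEL_VERSION(3, 11, 0)
   /* 3.11 and newer: data points to a struct netdev_notifier_info. */
   return netdev_notifier_info_to_dev(data);
#else
   /* Before 3.11: data is the struct net_device itself. */
   return (struct net_device *)data;
#endif
}
```

Compiling with -DLINUX_VERSION_CODE=0x030A00 (3.10.0) selects the old cast instead; the notifier body itself would then call the helper in place of the direct cast.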

We haven't seen any other reports of a stack overflow, though.  The misuse of the pointer is very nasty, but I don't yet see how it could result in a stack overflow.  Is there any chance you could share more details of the failure?  Knowing which Linux distribution you're running would help, as would a photo of the panic message on your screen.

Thanks,

--

Darius

swordfeng
Contributor

Hi, dariusd!

Nice to hear that this is going to be fixed soon.

I'm currently using Arch Linux with kernel 4.10.13, running patched modules (https://aur.archlinux.org/cgit/aur.git/tree/?h=vmware-modules-dkms), although those patches seem unrelated to this issue.

I took a picture of the panic and kept the kernel module build that produced the 'stack overflow'. Here's the relevant piece of its disassembly:

0000000000004390 <VNetBridgeNotify>:
    4390: e8 00 00 00 00        callq  4395 <VNetBridgeNotify+0x5>
    4395: 55                    push   %rbp
    4396: 48 89 e5              mov    %rsp,%rbp
    4399: 41 54                 push   %r12
    439b: 53                    push   %rbx
    439c: 48 89 fb              mov    %rdi,%rbx
    439f: 48 83 ec 08           sub    $0x8,%rsp
    43a3: 48 83 fe 02           cmp    $0x2,%rsi
    43a7: 0f 84 bf 00 00 00     je     446c <VNetBridgeNotify+0xdc>
    43ad: 48 83 fe 06           cmp    $0x6,%rsi
    43b1: 0f 84 8c 00 00 00     je     4443 <VNetBridgeNotify+0xb3>
    43b7: 48 83 fe 01           cmp    $0x1,%rsi
    43bb: 74 0b                 je     43c8 <VNetBridgeNotify+0x38>
    43bd: 48 83 c4 08           add    $0x8,%rsp
    43c1: 31 c0                 xor    %eax,%eax
    43c3: 5b                    pop    %rbx
    43c4: 41 5c                 pop    %r12
    43c6: 5d                    pop    %rbp
    43c7: c3                    retq
    43c8: 48 83 7f 28 00        cmpq   $0x0,0x28(%rdi)
    43cd: 75 ee                 jne    43bd <VNetBridgeNotify+0x2d>
    43cf: 48 81 ba 00 05 00 00  cmpq   $0x0,0x500(%rdx)
    43d6: 00 00 00 00
    43da: 75 e1                 jne    43bd <VNetBridgeNotify+0x2d>
    43dc: 4c 8d 67 18           lea    0x18(%rdi),%r12
    43e0: 48 89 d7              mov    %rdx,%rdi
    43e3: 48 89 55 e8           mov    %rdx,-0x18(%rbp)
    43e7: 4c 89 e6              mov    %r12,%rsi
    43ea: e8 00 00 00 00        callq  43ef <VNetBridgeNotify+0x5f>

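The instruction at 43cf, 'cmpq $0x0,0x500(%rdx)', reads offset 0x500 of 'data' (the third argument, passed in %rdx) as if it pointed to a struct net_device. That convinces me that this invalid access through *data is what causes the issue.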

dariusd
VMware Employee

Ahhhhh yes, that'd do it!  It's not really a stack overflow in the most common/traditional sense (e.g. too many function calls or stack frames that are too large); instead, it's a wild pointer dereference that lands "too far" past the current stack pointer and causes a page fault.  I agree that your patch will take care of it.
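The arithmetic behind this can be modelled in a few lines of C. The sketch assumes 16 KiB kernel stacks (THREAD_SIZE on x86-64 in 4.x kernels) and assumes the notifier-info wrapper is a local variable in a caller, so it sits at a higher address than VNetBridgeNotify's own frame. The load at offset 0x500 from the disassembly then either reads stale data from older frames or, when the wrapper sits within 0x500 bytes of the top of the stack, runs past the end of the stack. On kernels built with virtually mapped stacks (CONFIG_VMAP_STACK, available since 4.9) that area is unmapped, which would plausibly explain why the fault was reported as a 'kernel stack overflow'. The function and macro names below are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 16 KiB kernel stacks, as on x86-64 in 4.x kernels. */
#define MODEL_THREAD_SIZE (16u * 1024u)
/* Offset read by the buggy code, from 'cmpq $0x0,0x500(%rdx)'. */
#define BAD_FIELD_OFFSET  0x500u

/* Returns true if an 8-byte load at addr + offset stays inside the stack
 * region [base, base + size). The stack grows down from base + size, so
 * locals of outer callers sit just below that top boundary. */
static bool
load_stays_on_stack(uintptr_t base, size_t size, uintptr_t addr, size_t offset)
{
   assert(addr >= base && addr < base + size);
   uintptr_t target = addr + offset;
   return target + sizeof(uint64_t) <= base + size;
}
```

For example, with the wrapper 0x200 bytes below the top of the stack, the 0x500 read ends 0x308 bytes past the stack; with the wrapper deep in the stack, the read silently returns whatever older frames left behind, which would match the "unreliable network bridging" symptom rather than a crash.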

Thanks for the awesome help, swordfeng.  We greatly appreciate it!

Cheers,

--

Darius

dariusd
VMware Employee

I wasn't able to squeeze this fix into the Workstation 12.5.6 update which was released just now.  It's still in the queue for evaluation for the subsequent patch release.  No specific timeline to share, I'm afraid.  In the meantime, your patch should continue to work against Workstation 12.5.6.

Thanks again!

--

Darius

dariusd
VMware Employee

Workstation 12.5.7 has been released and contains a fix for this issue, which could cause host kernel panics or unreliable network bridging on Linux host kernels 3.11 and newer.

VMware Workstation 12 Pro Version 12.5.7 Release Notes

Download VMware Workstation Pro

VMware Workstation 12 Player Version 12.5.7 Release Notes

Download VMware Workstation Player

Cheers,

--

Darius
