I recently upgraded to macOS 10.15.6. Now if I leave VMware Fusion running while leaving my Mac unattended for some time, e.g. overnight, the system crashes. This did not happen with the older macOS 10.15.5 version.
VMware Fusion version: Professional Version 11.5.5 (16269456)
macOS Catalina version: 10.15.6 (19G73)
Awesome. I'll set up the debugging environment...
--
Darius
Thanks again for looking into this. To reiterate, the leak is very visible with "sudo zprint -d": you can see the kalloc.32 zone increase immediately as long as the VM is running. After a few hours, the logs indicate that this "kalloc.32" zone has exceeded its maximum size, so the kernel kills almost all processes, and possibly respawns them in a half-uninitialized state? (Which is perhaps why there is a need to re-authenticate iCloud and so forth; it's like these processes are restarted with a blank state - perhaps they are unable to read their own prefs or keys.)
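In case it helps anyone else reproduce the measurement, here's roughly how I've been sampling it. This is just a sketch: the awk field position assumes the "cur size" value is the third whitespace-separated column of zprint's per-zone output, which may differ on your build.

```shell
#!/bin/sh
# Sketch of the sampling loop I've been using. Assumes the "cur size" value
# is the 3rd whitespace-separated column of `zprint -d` output for each
# zone; adjust the field number if your zprint formats things differently.

extract_zone_cur_size() {
    # $1 = zone name; reads zprint output on stdin, prints that zone's cur size.
    awk -v zone="$1" '$1 == zone { print $3 }'
}

# Log a timestamped sample once a minute (Ctrl-C to stop):
if [ "$1" = "--watch" ]; then
    while true; do
        printf '%s kalloc.32 %s\n' "$(date '+%H:%M:%S')" \
            "$(sudo zprint -d | extract_zone_cur_size kalloc.32)"
        sleep 60
    done
fi
```

Comparing two samples a few minutes apart makes the growth rate obvious even while the absolute numbers still look harmless.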
A side question:
I went into my Time Machine backups and fished out /System/Library/Kernels/kernel from a 10.15.5 backup into a copy in a temporary directory for now. If this were Linux, I would simply choose the previous kernel in GRUB or whatever. Maybe this is too crazy on a Mac, but would it be possible to copy this 10.15.5 kernel file into "/System/Library/Kernels/kernel.10-15-5" and boot macOS 10.15.6 with a macOS 10.15.5 kernel? Googling didn't turn up much, although there were some examples on the web for the Kernel Debug Kit using a series of kextcache prelinking and nvram boot-args tricks. I'd be curious to see whether "sudo zprint -d" shows kalloc.32 leaks on the 10.15.5 kernel without having to reinstall and restore the entire macOS installation.
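(If I do end up experimenting with this, I figure I can at least confirm afterwards which kernel actually got loaded, with something like:)

```shell
# Print the version string of the currently booted kernel, to verify
# whether the 10.15.5 or the 10.15.6 kernel actually got loaded.
uname -v
# On macOS, `sysctl kern.version` reports the same build string.
```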
Another side question: Would it be possible to use dtrace to get some (kernel?) stack traces for all the kalloc.32 zone allocations? Or maybe I'll just wait until someone who knows what they are doing - like you - gets to take a closer look. Best of luck & hoping for a fast resolution.
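For the record, the kind of DTrace incantation I had in mind was something like the following. This is completely untested: the probe point is a guess, since the actual kalloc entry symbol varies between XNU versions, and SIP would need to be relaxed for the fbt provider to attach at all.

```shell
# Untested sketch: aggregate kernel stack traces for allocations that would
# land in the kalloc.32 zone (sizes 17-32 bytes). The function name "kalloc"
# is an assumption - the real symbol may be kalloc_canblock or similar on
# this kernel - and fbt probes require SIP to be disabled.
sudo dtrace -n '
    fbt::kalloc:entry
    /arg0 > 16 && arg0 <= 32/
    { @[stack()] = count(); }
    tick-30s { exit(0); }
'
```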
I would not try mixing and matching kernels like that. You might get away with it, but ... yeah, lots could break along the way. Probably not worth the risk. (We have in the past seen customers with broken RAID mirrors who would unintentionally do this from time to time... The system firmware would load the bootloader from the stale mirror, the bootloader would load the kernelcache from the same disk, then the OS would load and find the newer disk, and surprise! the OS had been upgraded since the RAID mirror broke so now we're an old kernel running on a new system... and then things sometimes worked, and sometimes didn't...)
It's not uncommon for resource-starvation issues to cause subsequent corruption... e.g. something manages to read in some prefs or whatever, tries to write out some updates to disk, and ends up writing out a zero-byte file before falling over or getting killed or whatever... In many ways a resource-starvation failure is worse than a hard crash, because at least in a hard crash you just have everything stop all at once in a moment where everything is close to consistent, rather than having the rambling corpses of not-quite-dead processes wreaking havoc on your system in its final moments.
I wouldn't say that I necessarily know what I'm doing, but at least I have a toolkit of strategies which I can try to apply... and then various sizes of hammer...
--
Darius
Just FYI, I looked more closely with zprint and yes, it's the very same situation for me, i.e. kalloc.32 sticks out with multiple gigs allocated:
zone name   elem size   cur size    max size    cur #elts   max #elts   cur inuse   alloc size  alloc count
kalloc.16   16          21228K      29927K      1358592     1915344     1322593     4K          256  C
kalloc.32   32          2278736K    2588635K    72919552    82836328    72919465    4K          128  C
kalloc.48   48          13092K      13301K      279296      283754      265929      4K          85   C
com.apple.security.sandbox is allocating millions and millions of blocks of memory containing just the text "/dev". Sigh.
Next step is to do a bit more tracing to see if I can figure out whether they are leaking or accumulating "live", and whether there is anything Fusion is doing which might unnecessarily exacerbate the problem (or that we can tweak to work around the problem).
--
Darius
Also, I noticed that on the Apple forums some people report that they use VirtualBox, which seems to crash the new system even more often. With Fusion, I only observed this when I was away for a few hours. Now I try to quit VMware every night; I will see later this week whether this helps. It does not crash every day. Just FYI, I do not put my Mac or the hosted Windows into sleep mode.
The last time I saw an issue like this was four years ago. Back then I had to disable the Power Nap feature and it stopped crashing... until this update. I am guessing Apple introduced some power-saving features again.
Thanks for your patience, everyone.
We have narrowed down the problem to a regression in the com.apple.security.sandbox kext (or one of its related components) included in macOS 10.15.6 (19G73), and we have now filed a comprehensive report with Apple including a minimal reproduction case which should allow them to easily identify and address the issue.
We have not yet identified any workaround other than refraining from installing macOS 10.15.6, which is a bit painful (and the advice will come too late for anyone who finds this thread because they are already running into this problem), or shutting down your VMs whenever you aren't using them and rebooting your host every day or every few hours or every hour, which is ... ugggggh.
We'll keep investigating possible workarounds to see if we can pop out a point release with a mitigation... To Be Determined. (In all honesty, it isn't looking good, but we've come up with some mighty creative workarounds in the past, so I'll never say never.)
Thanks again to everyone here for the epic assistance and for being so utterly polite and patient despite the mess. Y'all are the best.
--
Darius
That's great news, thanks for the amazingly quick turnaround in diagnosing this.
I really hope Apple doesn't drag its feet on this, leaving 10.15.x stranded without a fix and forcing us to wait for Big Sur in the fall :smileysilly:
Thanks for the update!
I wonder if Big Sur beta 3 has the same issue as 10.15.6?
I see Mac Rumors is reporting the problem now
https://www.macrumors.com/2020/07/27/vmware-confirms-macos-virtualization-bug-causes-crashes/
Had the same issues. Running a 20-Mac network. The server kept crashing/rebooting twice a day after upgrading to 10.15.6. WindowServer crashes like others here have reported.
Updated 5 other computers - all crash when up for >12 hours with VMware running.
No issues on 10.15.5 computers which haven't been updated.
Only solution: had to go back to 10.15.5 on the server. Workstations now get shut down overnight.
This was the only way to fix the issue.
Same issue here. My Mac crashed twice, after some 72 hours of running VMware Fusion. This never happened before.
(My report is not very helpful, I admit; I just wanted you to know that the problem seems to be quite frequent and widespread.)
Greetings from Bavaria
Joachim
Heise, an authoritative German computer magazine, reports it as well (they cite your work, Darius):
Virtualisierer auf dem Mac: Systemabstürze mit macOS 10.15.6 ("Virtualizers on the Mac: system crashes with macOS 10.15.6") | heise online
Hello guys,
I'm experiencing the same issue. I had two KPs (one related to the watchdog and the other related to linkage or something like that). In both cases my 32 GB of RAM was completely used up.
I have a 2018 MBP with an i9, 32 GB of RAM, 1 TB of storage, and macOS 10.15.6. The guest OS I use is Windows 10 Pro 2004.
Regards.
I am also running 10.15.6 and Fusion Pro 11.5.5 (16269456), but I am not seeing this problem. Is it possible that this is tied in some way to the guest type? My only guests are Linux machines, and it seems most of the folks in this thread having problems have Windows guests of one sort or another.
The steady increase in kalloc.32 is with both Windows and Linux VMs under 10.15.6. I don't have crashes, but that is because I have generous RAM and keep an eye on Wired/kernel memory.
I also have a steady (but slower) increase in kalloc.128 which started with 10.15.5, but I am pretty sure that one is not due to Fusion. I don't know what is causing it, though.
I am also running 10.15.6 and Fusion Pro 11.5.5 (16269456), but I am not seeing this problem. Is it possible that this is tied in some way to the guest type?
Yes. The severity of the issue will vary with guest OS type, with the selection of virtual devices present in the VM (USB, sound, NICs), and with the guest OS workload – not just the amount of CPU load but also the way in which the workload interacts with the virtual hardware.
I will try to generalize how I would expect this to behave – with the caveat that this is largely guesswork, I have not taken any measurements to back this up, and there are simply too many factors at play (and interplay) to even hope to characterize them all:
There is no VM configuration that I'm aware of which will be fully immune to this leak... it's just a matter of degree. If you are not observing problems with a VM running in Fusion on macOS 10.15.6, it will almost certainly still be leaking memory... just not at a high enough rate to cause a real-world problem.
--
Darius
Thanks for the info, dariusd, and for all the debugging efforts (and thanks to everyone who has helped contribute to debugging this). Fingers crossed for a fix from Apple soon.
I've been suffering from this crash for weeks, and am glad I found this info -- at least I know what the cause is! Mac mini 2018 here, 32 GB RAM, 2 running VMs (both Linux). kalloc.32 grows slowly but steadily; I usually get 3-5 days of uptime before things start to destabilize.
I usually get these symptoms:
- all apps start quitting
- no new apps will launch
- whatever process is responsible for keeping me signed into Google, iCloud etc. dies, resulting in popups prompting me to log into those services again
- popup "Unapproved caller - SecurityAgent may only be invoked by Apple software" and "Unrecoverable error - SecurityAgent was unable to create requested mechanism builtin:unlock-keychain" usually appears
- a gentle reboot doesn't work - just a grey screen/beachball for about 5 minutes until a watchdog timeout KPs the whole thing
I feel like this actually started before 10.15.6, but I could be wrong. Is everyone sure 10.15.5 and earlier don't have this bug?