It happened again even though I turned off 3D acceleration as you asked.
I'd point you to the analysis of kernel allocations in another response; I believe that is the root cause. Now, whether it is due to a "bug" in 10.15.6 or a latent issue surfaced by some other change, who knows.
I suggest you escalate this case within engineering so that you can sort it out with the kernel engineering folks at Apple. The zprint repro should be enough information; it might even be better than crash dumps.
The WindowServer crash looks like this, i.e. yes, it does receive a SIGKILL, probably in line with the earlier analysis by another poster about kernel allocation issues:
Process: WindowServer 
Version: 600.00 (451.4)
Code Type: X86-64 (Native)
Parent Process: launchd 
Responsible: WindowServer 
User ID: 0
Date/Time: 2020-07-26 12:16:30.218 +0200
OS Version: Mac OS X 10.15.6 (19G73)
Report Version: 12
Anonymous UUID: 5643BC9C-6405-FB15-C2CF-FDCCCCFAAFCE
Time Awake Since Boot: 160000 seconds
System Integrity Protection: enabled
Crashed Thread: 0 Dispatch queue: com.apple.main-thread
Exception Type: EXC_CRASH (SIGKILL)
Exception Codes: 0x0000000000000000, 0x0000000000000000
Exception Note: EXC_CORPSE_NOTIFY
Termination Reason: WATCHDOG, [0x1] monitoring timed out for service
Termination Details: WATCHDOG, checkin with service: WindowServer returned not alive with context:
is_alive_func returned unhealthy : WindowServer initialization not complete (post IOKitWaitQuiet)
40 seconds since last successful checkin, 16031 total successful checkins since load (0 induced crashes)
Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0 libsystem_kernel.dylib 0x00007fff6e797dfa mach_msg_trap + 10
1 libsystem_kernel.dylib 0x00007fff6e798170 mach_msg + 60
2 libdispatch.dylib 0x00007fff6e61190e _dispatch_mach_send_and_wait_for_reply + 632
3 libdispatch.dylib 0x00007fff6e611d4e dispatch_mach_send_with_result_and_wait_for_reply + 50
4 libxpc.dylib 0x00007fff6e899846 xpc_connection_send_message_with_reply_sync + 238
5 com.apple.CoreFoundation 0x00007fff345de083 __78-[CFPrefsPlistSource sendRequestNewDataMessage:toConnection:retryCount:error:]_block_invoke + 22
6 com.apple.CoreFoundation 0x00007fff345e5779 CFPREFERENCES_IS_WAITING_FOR_SYSTEM_CFPREFSD + 74
7 com.apple.CoreFoundation 0x00007fff345ddfdf -[CFPrefsPlistSource sendRequestNewDataMessage:toConnection:retryCount:error:] + 672
8 com.apple.CoreFoundation 0x00007fff345a12b9 -[CFPrefsPlistSource handleErrorReply:fromMessageSettingKeys:toValues:count:retryCount:retryContinuation:] + 810
9 com.apple.CoreFoundation 0x00007fff345a0f86 -[CFPrefsPlistSource handleErrorReply:retryCount:retryContinuation:] + 40
10 com.apple.CoreFoundation 0x00007fff3459fea4 -[CFPrefsPlistSource handleReply:toRequestNewDataMessage:onConnection:retryCount:error:] + 187
11 com.apple.CoreFoundation 0x00007fff345ddf37 -[CFPrefsPlistSource sendRequestNewDataMessage:toConnection:retryCount:error:] + 504
12 com.apple.CoreFoundation 0x00007fff345a12b9 -[CFPrefsPlistSource handleErrorReply:fromMessageSettingKeys:toValues:count:retryCount:retryContinuation:] +
Good catch, I believe you have found the root cause. Why this happens under 10.15.6 is another question, one that will probably take cooperation between VMware engineering and macOS kernel engineers at Apple.
Given that this problem probably affects a lot of people, I would suggest escalating it on Monday morning.
I have three panic reports. I do not feel comfortable sharing these publicly. I tried to send you a private message, but not sure if it worked. Let me know how I can send the archive to you.
I suggest you escalate this case within engineering
I am in engineering... I've been working mostly on Fusion and macOS internals at VMware for over a decade now. If you have a case going through support it will quite likely end up being assigned to my team to figure out what's going on... and it might even end up assigned to me personally.
Thanks for the information you've all provided so far... it's great. I've got a few leads to investigate...
My apologies, I'd never posted in this forum before and didn't know engineering looked directly at forum traffic.
I'll stay on 10.15.6 for now since it only happens once every 24 to 36 hours for me. Good luck with the debugging.
Awesome. I'll set up the debugging environment...
Thanks again for looking into this. To reiterate, the leak is very visible with "sudo zprint -d": you can see the kalloc.32 zone grow immediately, for as long as the VM is running. After a few hours the logs indicate that this kalloc.32 zone has exceeded its maximum size, and the kernel then kills almost all processes, possibly respawning them in a half-initialized state. (Which is perhaps why there is a need to re-authenticate iCloud and so forth; it's as if these processes restart with a blank slate, perhaps unable to read their own prefs or keys.)
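For anyone who wants to watch it happen, here's a small sketch I've been using. It assumes zprint's usual column layout, where the first column is the zone name and the third column is the zone's current size; set MONITOR=1 to actually start the loop:

```shell
#!/bin/sh
# Sketch: log the kalloc.32 zone's current size once a minute so the
# leak rate is easy to eyeball. Column positions are an assumption and
# may differ between macOS releases.
extract_kalloc32() {
  awk '$1 == "kalloc.32" { print $3 }'
}

if [ "${MONITOR:-0}" = "1" ]; then
  while :; do
    printf '%s kalloc.32 cur=%s\n' "$(date '+%H:%M:%S')" \
      "$(sudo zprint -d | extract_kalloc32)"
    sleep 60
  done
fi
```

With a VM running you can see the printed size tick upward sample after sample.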
A side question:
I went into my Time Machine backups and fished out /System/Library/Kernels/kernel from a 10.15.5 backup, keeping a copy in a temporary directory for now. If this were Linux, I would simply choose the previous kernel in GRUB or whatever. Maybe this is too crazy on a Mac, but would it be possible to copy this 10.15.5 kernel file to "/System/Library/Kernels/kernel.10-15-5" and boot macOS 10.15.6 with the 10.15.5 kernel? Googling didn't turn up much, although there were some examples on the web for the Kernel Debug Kit using a series of kextcache prelinking and nvram boot-args tricks. I'd be curious to see whether "sudo zprint -d" shows kalloc.32 leaks on the 10.15.5 kernel without having to reinstall and restore the entire macOS installation.
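For reference, the recipes I found look roughly like the following. This is an untested sketch pieced together from Kernel Debug Kit write-ups; the kcsuffix mechanism is meant for debug/development kernels of the same release, so whether it tolerates a 10.15.5 kernel on a 10.15.6 system is exactly my question:

```shell
# UNTESTED sketch, based on Kernel Debug Kit instructions; booting an
# older release's kernel this way is unsupported and may not work at all.
sudo mount -uw /                          # Catalina's system volume is read-only
sudo cp kernel.10.15.5 /System/Library/Kernels/kernel.1055
sudo kextcache -invalidate /              # rebuild the prelinked kernel caches
sudo nvram boot-args="kcsuffix=1055"      # requires SIP relaxed; then reboot
```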
Another side question: would it be possible to use dtrace to get some (kernel?) stack traces for all the kalloc.32 zone allocations? Or maybe I'll just wait until someone who knows what they are doing - like you - gets to take a closer look. Best of luck, and here's hoping for a fast resolution.
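In case it helps, the kind of one-liner I had in mind is below. It's an untested sketch: it assumes SIP's dtrace restriction has been lifted (e.g. "csrutil enable --without dtrace" from Recovery), and without knowing the kalloc entry points' signatures it can't filter to the 32-byte zone specifically, so it aggregates kernel stacks for the whole kalloc family:

```shell
# Untested sketch: count kernel stacks for every kalloc-family entry point.
# With SIP fully enabled, dtrace will refuse to attach to fbt probes at all,
# and the fbt function names vary between kernel releases.
sudo dtrace -x stackframes=16 \
  -n 'fbt::kalloc*:entry { @[stack()] = count(); }' \
  -o /tmp/kalloc-stacks.txt
```

Let it run for a minute while the VM is leaking, then Ctrl-C; the aggregation written to the output file should show which call sites dominate.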
I would not try mixing and matching kernels like that. You might get away with it, but ... yeah, lots could break along the way. Probably not worth the risk. (We have in the past seen customers with broken RAID mirrors who would unintentionally do this from time to time... The system firmware would load the bootloader from the stale mirror, the bootloader would load the kernelcache from the same disk, then the OS would load and find the newer disk, and surprise! the OS had been upgraded since the RAID mirror broke so now we're an old kernel running on a new system... and then things sometimes worked, and sometimes didn't...)
It's not uncommon for resource-starvation issues to cause subsequent corruption... e.g. something manages to read in some prefs or whatever, tries to write out some updates to disk, and ends up writing out a zero-byte file before falling over or getting killed or whatever... In many ways a resource-starvation failure is worse than a hard crash, because at least in a hard crash you just have everything stop all at once in a moment where everything is close to consistent, rather than having the rambling corpses of not-quite-dead processes wreaking havoc on your system in its final moments.
I wouldn't say that I necessarily know what I'm doing, but at least I have a toolkit of strategies which I can try to apply... and then various sizes of hammer...
Just FYI, I looked more closely with zprint and yes, it's the very same situation for me, i.e. kalloc.32 sticks out with multiple gigs allocated:
kalloc.16 16 21228K 29927K 1358592 1915344 1322593 4K 256 C
kalloc.32 32 2278736K 2588635K 72919552 82836328 72919465 4K 128 C
kalloc.48 48 13092K 13301K 279296 283754 265929 4K 85 C
com.apple.security.sandbox is allocating millions and millions of blocks of memory containing just the text "/dev". Sigh.
Next step is to do a bit more tracing to see if I can figure out whether they are leaking or accumulating "live", and whether there is anything Fusion is doing which might unnecessarily exacerbate the problem (or that we can tweak to work around the problem).
I removed some IDs, but otherwise these are three complete panic reports. The last one does not mention some Intel drivers because I had removed them from the system, thinking it might help; it did not.
panic.zip 9.1 K
Also, I noticed that on the Apple forums some people report using VirtualBox, which seems to crash the new system even more often. With Fusion, I only observed this when I was away for a few hours. Now I quit VMware every night; I will see later this week whether that makes a difference. It does not crash every day. Just FYI, I do not put my Mac or the hosted Windows into sleep mode.
The last time I saw an issue like this was four years ago. Back then I had to disable the Power Nap feature and it stopped crashing... until this update. I am guessing Apple has introduced some power-saving features again.
Thanks for your patience, everyone.
We have narrowed down the problem to a regression in the com.apple.security.sandbox kext (or one of its related components) included in macOS 10.15.6 (19G73), and we have now filed a comprehensive report with Apple including a minimal reproduction case which should allow them to easily identify and address the issue.
We have not yet identified any workaround other than refraining from installing macOS 10.15.6, which is a bit painful (and that advice will come too late for anyone who finds this thread because they are already running into the problem), or shutting down your VMs whenever you aren't using them and rebooting your host every day, or every few hours, or every hour, which is ... ugggggh.
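If you'd rather have advance warning than a surprise watchdog kill, here is a rough stopgap sketch you could run periodically (from cron or a launchd agent). Column positions assume the zprint output pasted earlier in the thread and may differ on other releases; set MONITOR=1 to run the actual check:

```shell
#!/bin/sh
# Stopgap sketch: warn when kalloc.32 has consumed most of its maximum
# size, so you can reboot on your own schedule rather than the watchdog's.
# The 90% threshold is arbitrary; pick to taste.
k32_pct_of_max() {
  # columns 3 and 4 are the zone's current and maximum size ("...K")
  awk '$1 == "kalloc.32" { sub(/K$/, "", $3); sub(/K$/, "", $4);
                           printf "%d\n", ($3 * 100) / $4 }'
}

if [ "${MONITOR:-0}" = "1" ]; then
  pct=$(sudo zprint -d | k32_pct_of_max)
  if [ "$pct" -ge 90 ]; then
    osascript -e 'display notification "kalloc.32 is nearly full - reboot soon" with title "zone leak"'
  fi
fi
```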
We'll keep investigating possible workarounds to see if we can pop out a point release with a mitigation... To Be Determined. (In all honesty, it isn't looking good, but we've come up with some mighty creative workarounds in the past, so I'll never say never.)
Thanks again to everyone here for the epic assistance and for being so utterly polite and patient despite the mess. Y'all are the best.
That's great news, thanks for the amazingly quick turnaround in diagnosing this.
I really hope Apple doesn't drag its feet on this, leaving 10.15.x stranded without a fix and us having to wait for Big Sur in the fall.
Thanks for the update!
I wonder if Big Sur beta 3 has the same issue as 10.15.6?