VMware Horizon Community
cdubz
Enthusiast

Windows 10 Pool Disconnects and Guest OS Crash

We have a Windows 10 linked-clone floating desktop pool that is experiencing frequent disconnects on Dell Wyse zero clients (P25 and P45). When the disconnect happens, the guest OS crashes and resets, and the event is registered in vSphere. The VM is then placed in an "already used" state in View.


This is the only error I can find in the PCoIP log.

SERVER :map_agent_to_tera: DISCONNECT_FOR_RECONNECT -> TERA_DISCONNECT_CAUSE_HOST_BROKER_RECONNECT

2016-06-07T07:43:34.220-04:00> LVL:0 RC:   0       MGMT_SESS :Tearing down the session

2016-06-07T07:43:34.220-04:00> LVL:1 RC:   0           VGMAC :Stat frms: R=000000/000000/000000  T=000000/000000/000000 (A/I/O) Loss=0.00%/0.00% (R/T)

2016-06-07T07:43:34.220-04:00> LVL:2 RC:   0          COMMON :TERA_PCOIP: SESSION_EVENT=TERA_MGMT_SYS_SESS_EVENT_RESET, disconnect cause (0x105)

2016-06-07T07:43:34.220-04:00> LVL:2 RC:   0          SERVER :server main: cb_notify_session_status called (mask 0x10) with tera_disconnect_cause (0x105)
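
For anyone triaging similar logs, here is a minimal Python sketch that pulls the timestamp, component, and disconnect cause code out of lines in the format shown above. The regular expressions and the function name `parse_disconnects` are assumptions based only on these sample lines, not an official PCoIP log specification.

```python
import re

# Line layout assumed from the sample above: timestamp>, LVL, RC, component, message.
LINE_RE = re.compile(
    r"^(?P<ts>\S+)>\s+LVL:(?P<lvl>\d+)\s+RC:\s*(?P<rc>-?\d+)\s+"
    r"(?P<component>\S+)\s*:(?P<msg>.*)$"
)
# Matches both "disconnect cause (0x105)" and "tera_disconnect_cause (0x105)".
CAUSE_RE = re.compile(r"disconnect[_ ]cause\s*\((0x[0-9a-fA-F]+)\)")


def parse_disconnects(lines):
    """Return (timestamp, component, cause_code) for lines that report a disconnect cause."""
    results = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not in the assumed format (e.g. a truncated line)
        cause = CAUSE_RE.search(m.group("msg"))
        if cause:
            results.append((m.group("ts"), m.group("component"), int(cause.group(1), 16)))
    return results


sample = [
    "2016-06-07T07:43:34.220-04:00> LVL:0 RC:   0       MGMT_SESS :Tearing down the session",
    "2016-06-07T07:43:34.220-04:00> LVL:2 RC:   0          COMMON :TERA_PCOIP: SESSION_EVENT=TERA_MGMT_SYS_SESS_EVENT_RESET, disconnect cause (0x105)",
    "2016-06-07T07:43:34.220-04:00> LVL:2 RC:   0          SERVER :server main: cb_notify_session_status called (mask 0x10) with tera_disconnect_cause (0x105)",
]
for ts, component, code in parse_disconnects(sample):
    print(ts, component, hex(code))
```

Grouping the extracted codes by timestamp across many sessions may make it easier to see whether every crash lines up with the same disconnect cause.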

The Connection Servers are sitting behind an F5 virtual NLB.

Has anyone else run into issues with Win10 crashing in a VDI environment?

39 Replies
eccl1213
Enthusiast

Memory dumps are already uploaded to my SR.

cdubz
Enthusiast

Mine as well

h3nkY
VMware Employee

Hi cdubz,

The memory dump you uploaded looks different from the uploaded screenshot.

The memory dump shows that the crash was triggered not by the Persona driver (vmwvvpfsd.sys) but by the VMware video driver.

The bugcheck code is different as well.

Screenshot: 1A (MEMORY_MANAGEMENT)

Memory dump: 7E (SYSTEM_THREAD_EXCEPTION_NOT_HANDLED)

h3nkY
VMware Employee

Did you use the following KB to upload the memory dump?

I can't find it on our server.

http://kb.vmware.com/kb/2070100

eccl1213
Enthusiast

Hi h3nkY,

If you look at the notes of my SR, you will see several links to the downloads for our memory dumps.  They do indeed show the Persona driver as the culprit.

Our dumps were not originally uploaded to the FTP site but were posted to our corporate public file share, which you should be able to download from.

I have just now added them directly to the FTP support site as well.

cdubz
Enthusiast

@h3nkY

That most recent memory dump is from yesterday.  This is the new error we are getting after making the registry modification to disable CSC. The older MEMORY_MANAGEMENT BSOD and dumps were uploaded to VMware via FTP as a large support package for analysis.

eccl1213
Enthusiast

After much testing we now have a workaround that's been pretty solid for us.  VMware support was pretty stuck on the whole CSC thing (though we do have it disabled).

I noticed that after login, the appdata\roaming folder was missing a lot of stuff, even though the folders and files are in the Persona repository.

My guess is that Persona and some app are racing to copy a file down first, and Persona loses.

So we added the appdata\roaming folder to the Persona preloaded folders in the GPO.

We are going to try to narrow it down to a specific folder or app, but I suspect we will eventually hit a new app or update that triggers it if we don't just preload the whole appdata\roaming.

VMware support is still investigating but I'm not holding my breath.  So far our workaround hasn't seemed to slow down logins too much.
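
One way to check for the missing appdata\roaming content described in this post is to compare the Persona repository copy against the local profile right after login. The following is a rough Python sketch under stated assumptions: the function name `missing_locally` and both paths are hypothetical placeholders, and it only compares file names, not contents or timestamps.

```python
import os


def missing_locally(repo_dir, local_dir):
    """Return sorted relative paths of files found under repo_dir but absent under local_dir."""
    missing = []
    for root, _dirs, files in os.walk(repo_dir):
        rel_root = os.path.relpath(root, repo_dir)
        for name in files:
            rel_path = os.path.normpath(os.path.join(rel_root, name))
            if not os.path.exists(os.path.join(local_dir, rel_path)):
                missing.append(rel_path)
    return sorted(missing)


if __name__ == "__main__":
    # Hypothetical paths: substitute your own repository share and user profile.
    repo = r"\\fileserver\persona\jdoe\AppData\Roaming"
    local = os.path.expandvars(r"%APPDATA%")
    for path in missing_locally(repo, local):
        print(path)
```

Running it before and after launching the suspect application could help narrow down which subfolder needs to be preloaded.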

h3nkY
VMware Employee

Hi eccl1213,

I have looked at your memory dump. It crashed at exactly the place where I fixed the bug in 7.0.

It should not BSOD at that place, and I found that your Win10 is the Professional edition, which is not a supported OS.

Please use the Enterprise edition and let me know if you hit the same issue.

h3nkY
VMware Employee

Hi cdubz,

I just found the agent log bundle and BSOD screenshot in your ticket.

The memory dump (MEMORY.DMP), which is located under %SystemRoot%, is not collected by the agent log bundle.
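
Since the log bundle does not pick up the dump, it has to be collected separately. As a minimal sketch, the Python snippet below lists dump files in the Windows default locations (`MEMORY.DMP` in the system root and `*.dmp` files in the `Minidump` subfolder). The function name `find_crash_dumps` is hypothetical, and the locations are only the defaults; they can be changed in the Startup and Recovery settings.

```python
import glob
import os


def find_crash_dumps(system_root):
    """List the full dump and any minidumps found under a Windows system root."""
    candidates = [os.path.join(system_root, "MEMORY.DMP")]
    candidates += sorted(glob.glob(os.path.join(system_root, "Minidump", "*.dmp")))
    # Keep only the files that actually exist.
    return [p for p in candidates if os.path.isfile(p)]


if __name__ == "__main__":
    root = os.environ.get("SystemRoot", r"C:\Windows")
    for dump in find_crash_dumps(root):
        size_mb = os.path.getsize(dump) / (1024 * 1024)
        print(f"{dump}  ({size_mb:.1f} MB)")
```

On a floating pool the desktop may be refreshed after the crash, so the dump would need to be copied off before the VM is reset.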

cdubz
Enthusiast

The memory dumps have been emailed directly to VMware support rather than uploaded via FTP.  Here is what the last dump looks like:

Mini Kernel Dump File: Only registers and stack trace are available

Symbol search path is: srv*

Executable search path is:

Windows 10 Kernel Version 10586 MP (2 procs) Free x64

Product: WinNt, suite: TerminalServer SingleUserTS

Built by: 10586.306.amd64fre.th2_release_sec.160422-1850

Machine Name:

Kernel base = 0xfffff802`1d017000 PsLoadedModuleList = 0xfffff802`1d2f5cd0

Debug session time: Mon Jul 11 12:24:14.205 2016 (UTC - 4:00)

System Uptime: 0 days 4:02:45.085

Loading Kernel Symbols

..

Press ctrl-c (cdb, kd, ntsd) or ctrl-break (windbg) to abort symbol loads that take too long.

Run !sym noisy before .reload to track down problems loading symbols.

.............................................................

................................................................

..........................

Loading User Symbols

Loading unloaded module list

...........

*** WARNING: Unable to verify timestamp for vm3dmp.sys

*** ERROR: Module load completed but symbols could not be loaded for vm3dmp.sys

*******************************************************************************

* *

* Bugcheck Analysis *

* *

*******************************************************************************

Use !analyze -v to get detailed debugging information.

BugCheck 1000007E, {ffffffffc0000005, fffff80098582f54, ffffd0002292a0e8, ffffd00022929900}

Probably caused by : vm3dmp.sys ( vm3dmp+12f54 )

Followup:     MachineOwner

---------

1: kd> !analyze -v

*******************************************************************************

* *

* Bugcheck Analysis *

* *

*******************************************************************************

SYSTEM_THREAD_EXCEPTION_NOT_HANDLED_M (1000007e)

This is a very common bugcheck.  Usually the exception address pinpoints

the driver/function that caused the problem.  Always note this address

as well as the link date of the driver/image that contains this address.

Some common problems are exception code 0x80000003.  This means a hard

coded breakpoint or assertion was hit, but this system was booted

/NODEBUG.  This is not supposed to happen as developers should never have

hardcoded breakpoints in retail code, but ...

If this happens, make sure a debugger gets connected, and the

system is booted /DEBUG.  This will let us see why this breakpoint is

happening.

Arguments:

Arg1: ffffffffc0000005, The exception code that was not handled

Arg2: fffff80098582f54, The address that the exception occurred at

Arg3: ffffd0002292a0e8, Exception Record Address

Arg4: ffffd00022929900, Context Record Address

Debugging Details:

------------------

DUMP_CLASS: 1

DUMP_QUALIFIER: 400

BUILD_VERSION_STRING: 10586.306.amd64fre.th2_release_sec.160422-1850

SYSTEM_MANUFACTURER:  VMware, Inc.

VIRTUAL_MACHINE:  VMware

SYSTEM_PRODUCT_NAME:  VMware Virtual Platform

SYSTEM_VERSION:  None

BIOS_VENDOR:  Phoenix Technologies LTD

BIOS_VERSION:  6.00

BIOS_DATE:  09/21/2015

BASEBOARD_MANUFACTURER:  Intel Corporation

BASEBOARD_PRODUCT:  440BX Desktop Reference Platform

BASEBOARD_VERSION:  None

DUMP_TYPE:  2

DUMP_FILE_ATTRIBUTES: 0x8

  Kernel Generated Triage Dump

BUGCHECK_P1: ffffffffc0000005

BUGCHECK_P2: fffff80098582f54

BUGCHECK_P3: ffffd0002292a0e8

BUGCHECK_P4: ffffd00022929900

EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%p referenced memory at 0x%p. The memory could not be %s.

FAULTING_IP:

vm3dmp+12f54

fffff800`98582f54 488b09 mov     rcx,qword ptr [rcx]

EXCEPTION_RECORD:  ffffd0002292a0e8 -- (.exr 0xffffd0002292a0e8)

ExceptionAddress: fffff80098582f54 (vm3dmp+0x0000000000012f54)

   ExceptionCode: c0000005 (Access violation)

  ExceptionFlags: 00000000

NumberParameters: 2

   Parameter[0]: 0000000000000000

   Parameter[1]: 0000000000000000

Attempt to read from address 0000000000000000

CONTEXT:  ffffd00022929900 -- (.cxr 0xffffd00022929900)

rax=0000000000000000 rbx=ffffe00027d2f000 rcx=0000000000000000

rdx=0000000000000000 rsi=0000000000000001 rdi=0000000000000000

rip=fffff80098582f54 rsp=ffffd0002292a320 rbp=0000000000000000

r8=0000000000000000  r9=0000000000000000 r10=ffffd0010d3e7e20

r11=ffffe00027d2f018 r12=ffffe00027bc6020 r13=0000000000000000

r14=ffffe00027bc2040 r15=0000000000bc2000

iopl=0         nv up ei pl nz na pe nc

cs=0010  ss=0018  ds=002b  es=002b fs=0053  gs=002b efl=00010202

vm3dmp+0x12f54:

fffff800`98582f54 488b09 mov     rcx,qword ptr [rcx] ds:002b:00000000`00000000=????????????????

Resetting default scope

CPU_COUNT: 2

CPU_MHZ: 960

CPU_VENDOR:  GenuineIntel

CPU_FAMILY: 6

CPU_MODEL: 3f

CPU_STEPPING: 0

CPU_MICROCODE: 6,3f,0,0 (F,M,S,R)  SIG: 37'00000000 (cache) 37'00000000 (init)

CUSTOMER_CRASH_COUNT:  1

DEFAULT_BUCKET_ID:  NULL_DEREFERENCE

PROCESS_NAME:  System

CURRENT_IRQL:  0

ERROR_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%p referenced memory at 0x%p. The memory could not be %s.

EXCEPTION_CODE_STR:  c0000005

EXCEPTION_PARAMETER1:  0000000000000000

EXCEPTION_PARAMETER2:  0000000000000000

READ_ADDRESS: fffff8021d395520: Unable to get MiVisibleState

0000000000000000

FOLLOWUP_IP:

vm3dmp+12f54

fffff800`98582f54 488b09 mov     rcx,qword ptr [rcx]

BUGCHECK_STR:  AV

ANALYSIS_SESSION_HOST:  S3I-42VTD42

ANALYSIS_SESSION_TIME:  07-11-2016 12:37:35.0031

ANALYSIS_VERSION: 10.0.10586.567 amd64fre

LAST_CONTROL_TRANSFER:  from 0000000000000000 to fffff80098582f54

STACK_TEXT: 

ffffd000`2292a320 00000000`00000000 : fffff780`0000000c 00000000`00000000 fffff800`9857e379 ffffe000`27d2f000 : vm3dmp+0x12f54

THREAD_SHA1_HASH_MOD_FUNC: 65a71e6939b656ad8ce64fc8b8614f344ca68962

THREAD_SHA1_HASH_MOD_FUNC_OFFSET: a3e495be754ce2c0ac6393d2477684f5bdb3b9a6

THREAD_SHA1_HASH_MOD: 65a71e6939b656ad8ce64fc8b8614f344ca68962

FAULT_INSTR_CODE:  b3098b48

SYMBOL_STACK_INDEX:  0

SYMBOL_NAME:  vm3dmp+12f54

FOLLOWUP_NAME:  MachineOwner

MODULE_NAME: vm3dmp

IMAGE_NAME:  vm3dmp.sys

DEBUG_FLR_IMAGE_TIMESTAMP:  57112b96

STACK_COMMAND:  .cxr 0xffffd00022929900 ; kb

BUCKET_ID_FUNC_OFFSET:  12f54

FAILURE_BUCKET_ID:  AV_vm3dmp!Unknown_Function

BUCKET_ID:  AV_vm3dmp!Unknown_Function

PRIMARY_PROBLEM_CLASS:  AV_vm3dmp!Unknown_Function

TARGET_TIME:  2016-07-11T16:24:14.000Z

OSBUILD:  10586

OSSERVICEPACK:  0

SERVICEPACK_NUMBER: 0

OS_REVISION: 0

SUITE_MASK:  272

PRODUCT_TYPE:  1

OSPLATFORM_TYPE:  x64

OSNAME:  Windows 10

OSEDITION:  Windows 10 WinNt TerminalServer SingleUserTS

OS_LOCALE: 

USER_LCID:  0

OSBUILD_TIMESTAMP:  2016-04-23 00:04:21

BUILDDATESTAMP_STR:  160422-1850

BUILDLAB_STR:  th2_release_sec

BUILDOSVER_STR: 10.0.10586.306.amd64fre.th2_release_sec.160422-1850

ANALYSIS_SESSION_ELAPSED_TIME: 627

ANALYSIS_SOURCE:  KM

FAILURE_ID_HASH_STRING:  km:av_vm3dmp!unknown_function

FAILURE_ID_HASH:  {b745e4d2-cc07-25ff-6a27-de8bfb8b3c02}

Followup:     MachineOwner
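
When comparing several dumps like the one above, it can help to extract just the bugcheck code, its arguments, and the suspected module. The sketch below is an assumption-laden helper (the name `summarize_dump` and both regular expressions are mine, written against the two summary lines in this output), not part of WinDbg.

```python
import re

# Patterns written against the "BugCheck ..." and "Probably caused by ..." lines above.
BUGCHECK_RE = re.compile(r"BugCheck\s+([0-9A-Fa-f]+),\s*\{([^}]*)\}")
MODULE_RE = re.compile(r"Probably caused by\s*:\s*(\S+)")


def summarize_dump(text):
    """Extract bugcheck code, arguments, and suspected module from WinDbg output text."""
    summary = {}
    m = BUGCHECK_RE.search(text)
    if m:
        summary["code"] = int(m.group(1), 16)
        summary["args"] = [int(a, 16) for a in m.group(2).split(",")]
    m = MODULE_RE.search(text)
    if m:
        summary["module"] = m.group(1)
    return summary


sample = """\
BugCheck 1000007E, {ffffffffc0000005, fffff80098582f54, ffffd0002292a0e8, ffffd00022929900}
Probably caused by : vm3dmp.sys ( vm3dmp+12f54 )
"""
info = summarize_dump(sample)
print(hex(info["code"]), info["module"])
# The low 32 bits of Arg1 are the NTSTATUS exception code (access violation here).
print(hex(info["args"][0] & 0xFFFFFFFF))
```

Feeding each dump's text through this makes it quick to spot when the faulting module changes between crashes, as it did here from the Persona driver to vm3dmp.sys.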

cdubz
Enthusiast

We also just upgraded VMware Tools to 10.0.9 after that dump, at Support's request yesterday.  We are monitoring for any additional crashes now.  I will say that since the CSC registry edit the crashes have been fewer: instead of 4-7 a day, it's maybe 1-2, sometimes none.

eccl1213
Enthusiast

Yes, we first had this issue with our first build, which was the Enterprise edition.  When we first had the bluescreens we assumed it was a bad Windows install, and there was some thought that one of the programs did not support Enterprise.  So we built a new pool on Pro just to test.

But the BSOD was identical between the two editions.

h3nkY
VMware Employee

Hi eccl1213,

Unfortunately, we do not support Win10 Pro yet.

If you hit a BSOD with Win10 Enterprise, you can attach the dump to your support ticket so we can see what is happening.

Thanks.

eccl1213
Enthusiast

Understood, the bug still exists in Win10 Enterprise.  We just happened to upload logs from the Win10 Pro pool.  But our ENT pool BSODs at the same place.

We run a lot of different apps but we have isolated our issue to Revit 2017.  If you open the program but don't open any drawings, you're fine.  You can log out and repeat as often as you want.

But if you log in, open Revit, open any file (like the sample drawing), and then log out, the next time you open Revit it will almost instantly BSOD.  100% of the time.  It's easy to replicate.

So far we have 100% success with the "preload" Persona option.  So we have a workaround.

Though I suspect other users will continue to hit this bug moving forward.

cdubz
Enthusiast

eccl1213, I'm trying out your appdata\roaming Persona preload today.  VMware support basically told me they don't know what it is and to contact Microsoft. Not really a helpful answer.

cdubz
Enthusiast

So far no crashes within the last 2 working days with the appdata\roaming preload setting enabled via GPO. Still monitoring for a few more days before calling it "good".

eccl1213
Enthusiast

My support ticket was just updated to say that VMware has now been able to replicate this.  So hopefully a permanent fix will follow soon.

Preloading the profile has worked fine for us and hasn't really slowed down the login much.

cdubz
Enthusiast

Good to hear.  We have been running the preload option since last Friday and, like you, find it barely adds any time to user login.  We did however have one VM crash the other day, but it's been the only one.

h3nkY
VMware Employee

I have root-caused the issue and plan to include the fix in the next release.

The issue occurs with particular applications that appear to invoke asynchronous I/O with cancellation.

Please ask your support contact whether you can get an early patch.

Thanks a lot for your report.
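
To make "asynchronous I/O with cancel" concrete, here is a generic user-mode illustration using Python's asyncio. It is only an analogy for the pattern (a request is started, then abandoned before it completes); it is not the Persona driver's code or the actual fix, and `slow_read` is a hypothetical stand-in for a pending file read.

```python
import asyncio


async def slow_read(delay):
    """Stand-in for a pending I/O request that completes after `delay` seconds."""
    try:
        await asyncio.sleep(delay)
        return "data"
    except asyncio.CancelledError:
        # Cleanup path: must tolerate the request being only partly finished.
        print("read cancelled before completion")
        raise


async def main():
    task = asyncio.create_task(slow_read(5.0))
    await asyncio.sleep(0.01)  # let the request start
    task.cancel()              # caller abandons the pending request
    try:
        await task
    except asyncio.CancelledError:
        return "cancelled"
    return "completed"


result = asyncio.run(main())
print(result)
```

The cancellation branch is the one that rarely runs in normal use, which is one plausible reason only particular applications trigger the crash.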

cdubz
Enthusiast

Good to hear that as well, h3nkY.  As an update on my case, I have since removed the disposable disk from one of the desktop pools for testing, to try to capture a full memory dump file instead of a minidump, and have not had a VM in that pool crash yet.  I don't know whether that is related to your fix at all.  (We are running View 7.0.1.)
