VMware Cloud Community
esnmb
Enthusiast

x64 Windows 2003 VMs seem to crash a lot

Anyone else notice this trend?

13 Replies
fjacobs_vm
VMware Employee

What service pack of Windows 2003 are you using, and what additional Microsoft hotfixes, if any, have been applied to the VMs which are crashing?

Are there any patterns in the crash output (common STOP codes across BSODs)?

What sorts of applications are running in the VMs when the crashes occur? Are these applications 32-bit or 64-bit?

esnmb
Enthusiast

It is running Service Pack 2, fully patched. The server runs SAP and SQL Server 2005 (Service Pack 2); both applications are 64-bit. Here is the crash dump analysis.

*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

KMODE_EXCEPTION_NOT_HANDLED (1e)

This is a very common bugcheck. Usually the exception address pinpoints

the driver/function that caused the problem. Always note this address

as well as the link date of the driver/image that contains this address.

Arguments:

Arg1: ffffffffc000001d, The exception code that was not handled

Arg2: fffffadfa01c0cd8, The address that the exception occurred at

Arg3: 0000000000000002, Parameter 0 of the exception

Arg4: 0000000000000000, Parameter 1 of the exception

Debugging Details:

------------------


PEB is paged out (Peb.Ldr = 00000000`7efdf018). Type ".hh dbgerr001" for details

PEB is paged out (Peb.Ldr = 00000000`7efdf018). Type ".hh dbgerr001" for details

EXCEPTION_CODE: (NTSTATUS) 0xc000001d - Illegal Instruction An attempt was made to execute an illegal instruction.

FAULTING_IP:

+fffffadfa01c0cd8

Page 95c0 not present in the dump file. Type ".hh dbgerr004" for details

Page 95c0 not present in the dump file. Type ".hh dbgerr004" for details

fffffadf`a01c0cd8 ?? ???

EXCEPTION_PARAMETER1: 0000000000000002

EXCEPTION_PARAMETER2: 0000000000000000

DEFAULT_BUCKET_ID: DRIVER_FAULT

BUGCHECK_STR: 0x1E

PROCESS_NAME: VMwareTray.exe

CURRENT_IRQL: 2

EXCEPTION_RECORD: fffff800003cdd10 -- (.exr 0xfffff800003cdd10)

ExceptionAddress: fffffadfa01c0cd8

ExceptionCode: c000001d (Illegal instruction)

ExceptionFlags: 00000000

NumberParameters: 0

TRAP_FRAME: fffff800003cdda0 -- (.trap 0xfffff800003cdda0)

NOTE: The trap frame does not contain all registers.

Some register values may be zeroed or incorrect.

rax=0000000000000004 rbx=fffffadf96467620 rcx=0000000000000201

rdx=0000000000000002 rsi=fffffadfa2cbf7ff rdi=fffffadfa0abfa20

rip=fffffadfa01c0cd8 rsp=fffff800003cdf38 rbp=fffffadf937e7cf0

r8=00000000000c002f r9=fffffadfa2c95040 r10=0000000000000003

r11=0000000000000000 r12=0000000000000000 r13=0000000000000000

r14=0000000000000000 r15=0000000000000000

iopl=0 nv up ei pl nz na pe nc

Page 95c0 not present in the dump file. Type ".hh dbgerr004" for details

Page 95c0 not present in the dump file. Type ".hh dbgerr004" for details

fffffadf`a01c0cd8 ?? ???

Resetting default scope

LAST_CONTROL_TRANSFER: from fffff80001080da6 to fffff8000102e7d0

FAILED_INSTRUCTION_ADDRESS:

+fffffadfa01c0cd8

Page 95c0 not present in the dump file. Type ".hh dbgerr004" for details

Page 95c0 not present in the dump file. Type ".hh dbgerr004" for details

fffffadf`a01c0cd8 ?? ???

STACK_TEXT:

fffff800`003cd618 fffff800`01080da6 : 00000000`0000001e ffffffff`c000001d fffffadf`a01c0cd8 00000000`00000002 : nt!KeBugCheckEx

fffff800`003cd620 fffff800`0102e5ef : fffff800`003cdd10 fffffadf`a01c0c10 fffff800`003cdda0 00000000`00000001 : nt!KiDispatchException+0x128

fffff800`003cdc20 fffff800`0102cb83 : fffff800`003cdda0 00000000`00000002 00000000`00000000 fffffadf`00000000 : nt!KiExceptionExit

fffff800`003cdda0 fffffadf`a01c0cd8 : fffffadf`a2cbf844 00000000`7efdb000 00000000`92175b46 00000000`00000000 : nt!KiInvalidOpcodeFault+0xc3

fffff800`003cdf38 fffffadf`a2cbf844 : 00000000`7efdb000 00000000`92175b46 00000000`00000000 fffffadf`937e7cf0 : 0xfffffadf`a01c0cd8

fffff800`003cdf40 00000000`7efdb000 : 00000000`92175b46 00000000`00000000 fffffadf`937e7cf0 00000000`00000246 : 0xfffffadf`a2cbf844

fffff800`003cdf48 00000000`92175b46 : 00000000`00000000 fffffadf`937e7cf0 00000000`00000246 fffff800`010284e1 : 0x7efdb000

fffff800`003cdf50 00000000`00000000 : fffffadf`937e7cf0 00000000`00000246 fffff800`010284e1 00000000`00000246 : 0x92175b46

fffff800`003cdf58 fffffadf`937e7cf0 : 00000000`00000246 fffff800`010284e1 00000000`00000246 fffffadf`a2a14060 : 0x0

fffff800`003cdf60 00000000`00000246 : fffff800`010284e1 00000000`00000246 fffffadf`a2a14060 00000000`01187806 : 0xfffffadf`937e7cf0

fffff800`003cdf68 fffff800`010284e1 : 00000000`00000246 fffffadf`a2a14060 00000000`01187806 fffffadf`a1fe6748 : 0x246

fffff800`003cdf70 fffff800`0103109f : fffff800`011b0180 fffff800`011b0180 fffffadf`937e7cf0 fffffadf`a2a199d0 : nt!KiRetireDpcList+0x150

fffff800`003ce000 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiDispatchInterrupt+0x4f

STACK_COMMAND: kb

FOLLOWUP_IP:

nt!KiDispatchException+128

fffff800`01080da6 cc int 3

SYMBOL_STACK_INDEX: 1

SYMBOL_NAME: nt!KiDispatchException+128

FOLLOWUP_NAME: MachineOwner

MODULE_NAME: nt

IMAGE_NAME: ntkrnlmp.exe

DEBUG_FLR_IMAGE_TIMESTAMP: 46237547

FAILURE_BUCKET_ID: X64_0x1E_BAD_IP_nt!KiDispatchException+128

BUCKET_ID: X64_0x1E_BAD_IP_nt!KiDispatchException+128

Followup: MachineOwner

------------------


0: kd> .exr 0xfffff800003cdd10

ExceptionAddress: fffffadfa01c0cd8

ExceptionCode: c000001d (Illegal instruction)

ExceptionFlags: 00000000

NumberParameters: 0
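To look for the common STOP codes asked about earlier in the thread, the analysis text from several dumps can be summarized mechanically. The following is a minimal sketch in Python, assuming each dump's `!analyze -v` output has been saved as text in the same layout as the one quoted above. The function names `summarize_analysis` and `tally_stop_codes` are hypothetical helpers, not part of any debugger tooling.

```python
import re
from collections import Counter

# Field names exactly as they appear in the analysis output quoted above.
FIELDS = ("BUGCHECK_STR", "PROCESS_NAME", "DEFAULT_BUCKET_ID",
          "IMAGE_NAME", "FAILURE_BUCKET_ID")


def summarize_analysis(text):
    """Pull selected 'KEY: value' fields out of one saved analysis text."""
    summary = {}
    for line in text.splitlines():
        match = re.match(r"^[ \t]*([A-Z_]+):[ \t]*(\S.*)$", line)
        if match and match.group(1) in FIELDS:
            # Keep the first occurrence of each field.
            summary.setdefault(match.group(1), match.group(2).strip())
    # The bugcheck line looks like "KMODE_EXCEPTION_NOT_HANDLED (1e)".
    stop = re.search(r"^[ \t]*([A-Z0-9_]+) \(([0-9a-fA-F]+)\)[ \t]*$",
                     text, re.MULTILINE)
    if stop:
        summary["STOP_NAME"] = stop.group(1)
        summary["STOP_CODE"] = "0x%08X" % int(stop.group(2), 16)
    return summary


def tally_stop_codes(analysis_texts):
    """Count how often each STOP code appears across several analyses."""
    codes = Counter()
    for text in analysis_texts:
        codes[summarize_analysis(text).get("STOP_CODE", "unknown")] += 1
    return codes
```

Feeding it the text of each saved analysis gives a quick count per STOP code, which makes it easier to see whether the crashes cluster on one bugcheck or are spread across several.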

pharist
Contributor

Yes, we are seeing the same thing: ESX 3.0.2, Windows 2003 x64 SP2, SAP ECC 5.0 and Oracle 10.0.2, both 64-bit. We have only seen this instability in the last couple of months; before that we were rock solid for a good year. I suspect ESX updates, Windows updates, or some combination of the two (they happened around the same time), but I am not sure yet. What hardware are you running on? We are running on HP BL685c blades with QLogic HBAs and an EMC CLARiiON CX3-40 on the back end, with Cisco 9134 FC switches.

I'm going to try to dig into this today and see what I can find as the crashes seem to be getting more frequent. I am suspecting something at the storage level - I think I remember one of the ESX updates addressed a storage driver.

I'll post if I find anything. Otherwise I plan to update firmware on the servers and apply all the latest ESX patches and cross fingers...

esnmb
Enthusiast

I'm glad it's not just me noticing it then.

Keep me posted!


pharist
Contributor

All signs point to KB932596, a security update. We installed this in our environment on 3/28/08, so the timing fits for when we first started to notice problems. From that KB:

Known issues with this update

After you install this update on a computer that is running an x64-based version of Windows Server 2003, of Windows Vista, or of Windows Server 2008, the computer may randomly restart, and then you may receive a Stop error message. The Stop error code may be 0x0000001E, 0x000000D1, or another Stop error code.

To resolve this problem, install hotfix 950772.

For more information, click the following article number to view the article in the Microsoft Knowledge Base: 950772 (http://support.microsoft.com/kb/950772/), "A computer that is running an x64-based version of Windows Server 2003, of Windows Vista, or of Windows Server 2008 randomly restarts and then generates a Stop error".

The symptoms fit exactly: we have seen both 0x0000001E and 0x000000D1, and the crashes seem random, as we have not been able to tie them to any specific application activity. I will be installing hotfix 950772 this weekend. I will give it a couple of weeks; if we have no crashes (we have 10 Windows 2003 x64 VMs, and from recent history we would definitely see one somewhere in 2 weeks), I will consider it a success and post again. Actually, I'll post either way, but let's hope for good news.
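To find which VMs have the security update but not yet the hotfix, the installed-update listing from each guest can be scanned for the two KB numbers. This is a rough sketch in Python, assuming the listing has been captured as text (for example from `systeminfo` or `wmic qfe get HotFixID`, both of which print identifiers of the form `KB` followed by digits). The function names are hypothetical.

```python
import re


def installed_hotfixes(listing_text):
    """Return the set of KB identifiers found in an update listing."""
    found = re.findall(r"\bKB\d{6,7}\b", listing_text, re.IGNORECASE)
    return {kb.upper() for kb in found}


def needs_hotfix(listing_text, trigger="KB932596", fix="KB950772"):
    """True if the triggering update is present and the fix is missing."""
    found = installed_hotfixes(listing_text)
    return trigger in found and fix not in found
```

A guest whose listing contains KB932596 but not KB950772 is a candidate for the hotfix; a guest with neither, or with both, is not flagged.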

jfierberg
Contributor

We've had to go into the BIOS of our PowerEdge Servers and enable the x64 ability. Has this already been done on your hardware?

esnmb
Enthusiast

That is a great find! I'm downloading it now and will install it in our test environment. Thanks!

esnmb
Enthusiast

I believe so. I actually don't remember seeing the setting in our IBM LS41 blades. I'll have to check the next chance I get though.

pharist
Contributor

Installed on 2 sandbox systems without incident, so we'll see what happens. If they stay solid throughout the weekend I will apply it to our DEV and QA systems, although I'll probably hold off on PRD for a week to make sure no new problems are introduced.

esnmb
Enthusiast

Good idea.

pharist
Contributor

Confirmation that MS hotfix 950772 has resolved the problems in our environment. It's been 3+ weeks now in all of our x64 systems (10 VMs) and no crashes. We were having a couple systems crash per week prior to applying the hotfix.
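As a rough sanity check on why 3+ crash-free weeks is convincing, the chance of seeing no crashes by luck alone can be estimated. The sketch below assumes crashes arrive independently at a constant rate (a Poisson model) and reads "a couple systems per week" as 2 crashes per week across the fleet; both are assumptions for illustration, not measurements from this thread.

```python
import math


def prob_no_crashes(crashes_per_week, weeks):
    """Probability of zero crashes over the period under a Poisson model."""
    return math.exp(-crashes_per_week * weeks)


# Assumed rate of 2 crashes per week, observed over 3 weeks.
print(round(prob_no_crashes(2, 3), 4))  # prints 0.0025
```

Under those assumptions, a quiet stretch of that length would be very unlikely if the underlying crash rate had not changed.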

esnmb
Enthusiast

Same here. Thanks all!

denalitom
Contributor

Is this just happening on AMD procs? We're seeing the same issue with Windows 2008 64-bit (running on AMD Opterons), and according to the KB article and a phone call to MS there is no patch for the 2008 x64 OS (yet). Is anyone having these crashes on Intel procs, or does anyone have a recommendation/solution for Win2008 64-bit systems?

Thanks, tom
