VMware Cloud Community
DataRecLab
Contributor

Purple Screen of Death related to RDM drops?

I have ESXi 5.0 U1 (build 768111) running on a Dell R815 with four 8-core AMD Opteron 6136 processors and 64 GB of RAM.  There is a PERC H700 on board, as well as a PERC 5/E PCI-E attached to two Dell MD1000 arrays.  It's only running a handful of guests, mostly Windows based.  I set up one of the guests (Win 7 x64) with a manual RDM link (created using David Warburton's method) to two 11 TB NTFS RAID 5 arrays that were pre-existing on the MD1000s.  Data I/O to the arrays is mostly single machines on the internal network writing large (2 GB +/-) files.
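
(For reference, the manual RDM mapping from that method is basically a vmkfstools pointer file created against the raw device.  A rough sketch of the commands involved; the datastore and folder names are placeholders, and the naa ID is just taken from the log excerpt below as an example device:

# Identify the raw device's naa identifier
ls -l /vmfs/devices/disks/

# Create a physical-mode (pass-through) RDM pointer file in the guest's folder
vmkfstools -z /vmfs/devices/disks/naa.60013720572d3e001246035cd7c09eb5 /vmfs/volumes/datastore1/Win7-Guest/md1000-rdm.vmdk

# Use -r instead of -z for a virtual-mode RDM

The resulting .vmdk is then attached to the guest as an existing disk.)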

Everything had been running without fail for the last 90 days of testing.  Then one afternoon last week, during a routine copy process, the Win 7 guest's shares went missing from the network.  It turned out the guest had completely locked up; even the guest console was unresponsive (though ALL OTHER guests were functioning normally).  I powered down the guest, but it got to 95% and stayed there.  I went into the CLI over SSH and tried every trick and technique possible to kill the VM without rebooting the whole server.  I finally shut down all other guests (successfully, without issue), but the Win 7 guest would not die.  A command-line shutdown of ESXi reported that it completed successfully, but the console still showed ESXi as running, and the physical console was unresponsive to commands.  A warm reboot was required.
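
(In case it helps anyone else, "every trick" was roughly the standard escalation; the world ID here is just an example value:

# List running VMs and their world IDs
esxcli vm process list

# Escalate from a soft kill to a forced kill
esxcli vm process kill --type=soft --world-id=123456
esxcli vm process kill --type=hard --world-id=123456
esxcli vm process kill --type=force --world-id=123456

# Alternative path via vim-cmd (substitute the VM ID reported by getallvms)
vim-cmd vmsvc/getallvms
vim-cmd vmsvc/power.off <vmid>

None of these would release the stuck guest.)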

Upon reboot, we checked the logs and found some latency issues with the MD1000s, along with what appear to be HBA-related errors (below).  All guests came back online with no errors.  To rule out an issue with the OS or the guest itself, I created a new guest running Win 2008 R2 and connected the RDMs to it.  There are two logical volumes on the HBA.  We tested by copying the same dataset (about 25 x 2 GB files) to the same volume and got the same result: the OS locked up, ESXi could not kill the guest, and a warm reboot was required.

We brought everything back up and started copying the same dataset to the second volume, and it completed without failure.  So I started an integrity check of the first volume to rule out any disk errors; no problems there.  I restarted the remaining guest systems (all of which read and write random small chunks through the guest with the RDM to the array) and watched the system for a couple of hours: no crashes, no logged errors.  That was Friday night (three days ago).  Early this morning (around 55 hours after the last interaction), I came in to find a PSoD (Purple Screen of Death), shown below and attached.  I repeated the same troubleshooting, including making sure the latest firmware and drivers were in place for the server, the HBAs, and the hard drives.
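
(The controller errors in the excerpt at the end of this post were pulled straight out of the vmkernel log, with something along these lines against the standard ESXi 5.x log location:

# MegaRAID/megasas controller messages
grep -iE 'megasas|megaraid' /var/log/vmkernel.log

# VSCSI retries and NMP LUN-reset warnings for the RDM device
grep -E 'VSCSI|nmpDeviceTaskMgmt' /var/log/vmkernel.log
)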

So my big question is: the PSoD seems to indicate that CPUs 23 and 24 are having issues, but does it seem likely that this would be related to the RDM'd arrays?  If there's a problem in the array environment, why would it have functioned without fail for almost 90 days?  There's been no change to hardware or software that precipitated the RDM failure, let alone the PSoD, and there have been no weather or environmental issues.  Any ideas or suggestions?

2012-08-20T20:03:51.653Z cpu22:8381)megasas: moving cmd[3]:0x410020c59310:0:0x4124403d8f80 on the defer queue as internal reset in progress.
2012-08-20T20:03:51.653Z cpu22:8381)megasas: moving cmd[4]:0x410020c34b10:0:0x412440518ac0 on the defer queue as internal reset in progress.
2012-08-20T20:03:51.653Z cpu22:8381)megasas: moving cmd[5]:0x410020cd8a50:0:0x4124403cbe80 on the defer queue as internal reset in progress.
2012-08-20T20:03:51.653Z cpu22:8381)megasas: moving cmd[6]:0x410020ce42e0:0:0x4124403e5840 on the defer queue as internal reset in progress.
2012-08-20T20:03:51.653Z cpu22:8381)megasas: moving cmd[7]:0x410020cecae0:1:0x0 on the defer queue as internal reset in progress.
2012-08-20T20:03:51.653Z cpu22:8381)megasas: moving cmd[8]:0x410020cf9370:0:0x4124403d8d80 on the defer queue as internal reset in progress.
2012-08-20T20:03:51.653Z cpu22:8381)megasas: waiting_for_outstanding: after issue OCR.
2012-08-20T20:03:51.653Z cpu22:8381)<5>megasas: reset successful
2012-08-20T20:03:51.653Z cpu0:8896)<7>megaraid_sas: process_fw_state_change_wq:  instance addr:  0x0x410019012370, adprecovery:  0x1
2012-08-20T20:03:51.653Z cpu0:8896)megaraid_sas: FW detected to be in fault state, restarting it...
2012-08-20T20:03:51.653Z cpu22:8381)VSCSI: 2637: handle 8193(vscsi0:1):Reset [Retries: 1/0]
2012-08-20T20:03:51.653Z cpu22:8381)WARNING: NMP: nmpDeviceTaskMgmt:2210:Attempt to issue lun reset on device naa.60013720572d3e001246035cd7c09eb5. This will clear any SCSI-2 reservations on the device.
2012-08-20T20:03:51.653Z cpu22:8381)<5>0 :: megasas: RESET -844536 cmd=0 retries=0
2012-08-20T20:03:51.653Z cpu22:8381)megasas: HBA reset handler invoked while adapter internal reset in progress, wait till that's over...
2012-08-20T20:03:52.537Z cpu19:8382)VSCSI: 2763: Retry 1 on handle 8193 still in progress after 153 seconds
2012-08-20T20:03:54.687Z cpu16:8896)pcidata = 30400
2012-08-20T20:03:54.687Z cpu16:8896)megaraid_sas: FW was restarted successfully, initiating next stage...
2012-08-20T20:03:54.687Z cpu16:8896)megaraid_sas: HBA recovery state machine, state 2 starting...
2012-08-20T20:04:24.699Z cpu16:8896)<6>megasas: Waiting for FW to come to ready state
2012-08-20T20:04:52.656Z cpu8:8382)VSCSI: 2763: Retry 1 on handle 8193 still in progress after 213 seconds
2012-08-20T20:05:52.776Z cpu16:8382)VSCSI: 2763: Retry 1 on handle 8193 still in progress after 273 seconds
2012-08-20T20:06:52.896Z cpu3:8382)VSCSI: 2763: Retry 1 on handle 8193 still in progress after 333 seconds
2012-08-20T20:06:53.792Z cpu25:8381)megasas: HBA reset handler timedout for internal reset. Stopping the HBA.
2012-08-20T20:06:53.792Z cpu25:8381)<3>megasas: failed to do reset
2012-08-20T20:06:53.793Z cpu25:8381)VSCSI: 2637: handle 8193(vscsi0:1):Reset [Retries: 2/0]
2012-08-20T20:06:53.793Z cpu25:8381)WARNING: NMP: nmpDeviceTaskMgmt:2210:Attempt to issue lun reset on device naa.60013720572d3e001246035cd7c09eb5. This will clear any SCSI-2 reservations on the device.
2012-08-20T20:06:53.793Z cpu25:8381)<5>0 :: megasas: RESET -845511 cmd=0 retries=0
2012-08-20T20:06:53.793Z cpu25:8381)<3>megasas: cannot recover from previous reset failures
2012-08-20T20:06:58.346Z cpu17:12042)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1055 Host Busy vmhba2:0:32:0 (driver name: LSI Logic SAS based MegaRAID driver) - Message repeated 1 time
2012-08-20T20:06:58.656Z cpu8:11055)VMW_SATP_LOCAL: satp_local_updatePathStates:439: Failed to update path "vmhba2:C2:T1:L0" state. Status=Transient storage condition, suggest retry
2012-08-20T20:06:58.658Z cpu17:8404)VMW_SATP_LOCAL: satp_local_updatePathStates:439: Failed to update path "vmhba2:C0:T16:L0" state. Status=Transient storage condition, suggest retry
2012-08-20T20:06:58.658Z cpu16:12042)VMW_SATP_LOCAL: satp_local_updatePathStates:439: Failed to update path "vmhba2:C0:T32:L0" state. Status=Transient storage condition, suggest retry
2012-08-20T20:06:58.658Z cpu25:8410)VMW_SATP_LOCAL: satp_local_updatePathStates:439: Failed to update path "vmhba2:C2:T2:L0" state. Status=Transient storage condition, suggest retry
2012-08-20T20:06:58.658Z cpu14:12225)VMW_SATP_LOCAL: satp_local_updatePathStates:439: Failed to update path "vmhba2:C2:T0:L0" state. Status=Transient storage condition, suggest retry
2012-08-20T20:07:23.958Z cpu25:8381)VSCSI: 2637: handle 8193(vscsi0:1):Reset [Retries: 3/0]
2012-08-20T20:07:23.958Z cpu25:8381)WARNING: NMP: nmpDeviceTaskMgmt:2210:Attempt to issue lun reset on device naa.60013720572d3e001246035cd7c09eb5. This will clear any SCSI-2 reservations on the device.
2012-08-20T20:07:23.958Z cpu25:8381)<5>0 :: megasas: RESET -845767 cmd=0 retries=0
2012-08-20T20:07:23.958Z cpu25:8381)<3>megasas: cannot recover from previous reset failures
2012-08-20T20:07:54.016Z cpu25:8381)VSCSI: 2637: handle 8193(vscsi0:1):Reset [Retries: 4/0]
2012-08-20T20:07:54.016Z cpu25:8381)WARNING: NMP: nmpDeviceTaskMgmt:2210:Attempt to issue lun reset on device naa.60013720572d3e001246035cd7c09eb5. This will clear any SCSI-2 reservations on the device.
2012-08-20T20:07:54.016Z cpu25:8381)<5>0 :: megasas: RESET -845866 cmd=0 retries=0
2012-08-20T20:07:54.016Z cpu25:8381)<3>megasas: cannot recover from previous reset failures
2012-08-20T20:08:24.075Z cpu25:8381)VSCSI: 2637: handle 8193(vscsi0:1):Reset [Retries: 5/0]
2012-08-20T20:08:24.075Z cpu25:8381)WARNING: NMP: nmpDeviceTaskMgmt:2210:Attempt to issue lun reset on device naa.60013720572d3e001246035cd7c09eb5. This will clear any SCSI-2 reservations on the device.
2012-08-20T20:08:24.075Z cpu25:8381)<5>0 :: megasas: RESET -846029 cmd=0 retries=0

[Attachment: Capture.JPG (PSoD screenshot)]

5 Replies
ScottBerger
Contributor

Scott Berger is no longer employed with Scentsy. For any Scentsy-related issues, please contact Corporate at 877-895-4160.

Thank you,

System Administration

DataRecLab
Contributor

It's nice to know that Scott is no longer with Scentsy... but who is he, and why do I care?

Well, either I have committed some faux pas, or no one has any answers?

mcowger
Immortal

It's just an indication that he was subscribed to this forum.  I'll fix it.

To respond to your question: I don't think they are related.

The PSOD points at a problem with the CPU TLB, so I'd suspect issues with the CPU first.

--Matt VCDX #52 blog.cowger.us
sparrowangelste
Virtuoso

Probably a bad CPU or motherboard.

--------------------- Sparrowangelstechnology : Vmware lover http://sparrowangelstechnology.blogspot.com
DataRecLab
Contributor

Late follow-up, but the following addresses each issue; the two appear to have been independent...

The PSOD was being caused by processor C-States, which had been reset to enabled in the BIOS during a BIOS update.  On this hardware, C-States must be turned off in the BIOS for ESXi to function properly.
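
(If anyone wants to confirm whether a host is actually dropping into deep C-states before touching the BIOS, esxtop has a power panel in 5.x; from memory the check looks roughly like this, so treat the key and column names as approximate:

# From the ESXi shell (or SSH), start esxtop
esxtop
# press 'p' to switch to the power panel; the %C0/%C1/... columns show how much
# time each physical CPU is spending in each C-state
)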

The RDM drops were being caused by a single volume on the MD1000 array that had write-through caching enabled and, under heavy load, could not keep up.  Ideally, the PERC 5/E controller should not be used with ESXi 5.x at all, as the controller firmware is deprecated.
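
(For anyone hitting the same symptom: the cache policy on the PERC logical drives can be checked and changed with LSI's MegaCli, since the PERC 5/E is a rebranded LSI MegaRAID controller.  A rough sketch, assuming MegaCli is installed on the host and the PERC 5/E is adapter 0:

# Show the current cache/write policy for every logical drive on adapter 0
MegaCli -LDGetProp -Cache -LALL -a0

# Switch logical drive 1 from write-through (WT) to write-back (WB)
MegaCli -LDSetProp WB -L1 -a0

# The controller falls back to write-through if the BBU is missing or
# discharged, so check the battery before relying on write-back
MegaCli -AdpBbuCmd -GetBbuStatus -a0
)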
