ESXi 5.1: Host stops unexpectedly

92656_VM · ‎03-25-2013

Hi everyone,

I recently built a system to run ESXi and host a bunch of virtual servers.

The ESXi host seems to run very well, but about once every 24 hours it appears as if it loses its ability to use the locally attached disks. When the problem occurs, I am still able to ping the host, and to SSH into it, but the moment I do anything that requires disk IO, my SSH session hangs.

I have spent a lot of time trying to make sure it's not a heat or power issue. I have a multimeter with a temperature probe connected to the cooling fins of the RAID controller's main processor, and the RAID chip's surface never gets above 39 degrees Celsius. Everything else in the host is nice and cool, with surface temperatures never above 35 degrees Celsius, measured using an infrared thermometer while the system is running.

The "hang" seems to happen regardless of the load I put on the host.

A bit of system info:

CPU: Intel i7-3930K

64 GB RAM

Adaptec 6405 RAID controller

4x Western Digital 1 TB drives in a RAID5 array with 2 volumes: 1 for booting ESXi, 1 to host my VMs.

750 watts power supply

Yesterday, I had the system running for a few hours booted from a CD with the Memtest86 tool, because I wanted to see if I had flaky RAM. The tests ran just fine, and given the symptom (suddenly can't access the RAID controller) I doubt it's a RAM issue.

I grabbed vmkernel.log (file is attached) and I see sporadic messages along the lines of:

2013-03-25T11:48:56.666Z cpu8:34375)ScsiDeviceIO: 2316: Cmd(0x4124007addc0) 0x85, CmdSN 0xe0 from world 5150 to dev "mpx.vmhba3:C0:T1:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2013-03-25T11:48:56.666Z cpu8:34375)ScsiDeviceIO: 2316: Cmd(0x4124007addc0) 0x4d, CmdSN 0xe1 from world 5150 to dev "mpx.vmhba3:C0:T1:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2013-03-25T11:48:56.666Z cpu0:5150)WARNING: ScsiDeviceIO: 6678: IEC page to device "mpx.vmhba3:C0:T0:L0" has bad pagecode: 0x30
2013-03-25T11:48:56.671Z cpu8:32976)NMP: nmp_ThrottleLogForDevice:2319: Cmd 0x85 (0x4124007addc0, 5150) to dev "mpx.vmhba3:C0:T0:L0" on path "vmhba3:C0:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE

vmhba3 is my Adaptec RAID controller.

I have no idea what this means, or if it's even important, but I thought I'd throw it out there since it seems somehow related to the RAID controller.
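For readers unsure how to read these entries, here is a minimal Python sketch that decodes the status fields in a line like the ones above. It relies on standard SCSI values: device status 0x2 is CHECK CONDITION, sense key 0x5 with ASC/ASCQ 0x20/0x0 is ILLEGAL REQUEST (invalid command operation code), opcode 0x85 is ATA PASS-THROUGH (16), and 0x4d is LOG SENSE. The `decode` function, the lookup tables, and the regular expression are illustrative assumptions, not a VMware tool, and the tables cover only the values that appear in this thread.

```python
import re

# Lookup tables (assumption: limited to values seen in this thread).
OPCODES = {0x85: "ATA PASS-THROUGH (16)", 0x4D: "LOG SENSE"}
DEVICE_STATUS = {0x0: "GOOD", 0x2: "CHECK CONDITION", 0x8: "BUSY"}
SENSE_KEYS = {0x5: "ILLEGAL REQUEST"}
ASC_ASCQ = {(0x20, 0x00): "INVALID COMMAND OPERATION CODE"}

# Matches both the "Cmd(0x...) 0x85" and the "Cmd 0x85 (0x..." log forms.
LINE_RE = re.compile(
    r"Cmd(?:\(0x[0-9a-f]+\))? (?P<op>0x[0-9a-f]+).*?"
    r"H:(?P<h>0x[0-9a-f]+) D:(?P<d>0x[0-9a-f]+) P:(?P<p>0x[0-9a-f]+) "
    r"Valid sense data: (?P<key>0x[0-9a-f]+) (?P<asc>0x[0-9a-f]+) "
    r"(?P<ascq>0x[0-9a-f]+)",
    re.IGNORECASE,
)


def decode(line):
    """Return the decoded fields of one failure line, or None if no match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    v = {name: int(text, 16) for name, text in m.groupdict().items()}
    return {
        "opcode": OPCODES.get(v["op"], hex(v["op"])),
        "host_status": v["h"],      # H: host (adapter/driver) status
        "device_status": DEVICE_STATUS.get(v["d"], hex(v["d"])),
        "plugin_status": v["p"],    # P: multipathing plugin status
        "sense_key": SENSE_KEYS.get(v["key"], hex(v["key"])),
        "additional_sense": ASC_ASCQ.get(
            (v["asc"], v["ascq"]), f"{v['asc']:#x}/{v['ascq']:#x}"
        ),
    }


# One of the lines quoted above, split only for readability.
SAMPLE = (
    '2013-03-25T11:48:56.666Z cpu8:34375)ScsiDeviceIO: 2316: '
    'Cmd(0x4124007addc0) 0x85, CmdSN 0xe0 from world 5150 to dev '
    '"mpx.vmhba3:C0:T1:L0" failed H:0x0 D:0x2 P:0x0 '
    'Valid sense data: 0x5 0x20 0x0.'
)

if __name__ == "__main__":
    print(decode(SAMPLE))
```

Read this way, the entries say the device behind vmhba3 rejected a command opcode it does not implement. That by itself would not obviously explain a hang, so treat it as an interpretation of the log fields rather than a diagnosis.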

My ESXi is using what I believe to be the most recent Adaptec driver.

I have run ESXi for a few years and it's always been rock solid, so I'm afraid I don't know much about how to troubleshoot these types of issues. Any suggestions or advice would be greatly appreciated.

AleShima · ‎04-08-2013

Exactly the same problem here, but it started after a memory upgrade last Saturday.

The memory was replaced twice by the datacenter, and now they think the memory is not the problem.

I'm not sure if the problem was caused by this upgrade (messing with memory, cables etc) or if it started just because the server was rebooted (just some update that was not completely active yet perhaps?).

The datacenter will replace some SAS cables and maybe the RAID controller tonight.

These are some logs from the "dmesg" command in an SSH session on the ESXi host:

2013-04-09T01:58:33.362Z cpu8:9363)WARNING: ScsiDeviceIO: 6678: IEC page to device "mpx.vmhba0:C0:T1:L0" has bad pagecode: 0x0

2013-04-09T01:58:33.382Z cpu15:8207)NMP: nmp_ThrottleLogForDevice:2319: Cmd 0x85 (0x41244039aa40, 9363) to dev "mpx.vmhba0:C0:T1:L0" on path "vmhba0:C0:T1:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE

2013-04-09T01:58:33.382Z cpu15:8207)ScsiDeviceIO: 2316: Cmd(0x41244039aa40) 0x85, CmdSN 0xba from world 9363 to dev "mpx.vmhba0:C0:T1:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.

2013-04-09T01:58:33.382Z cpu15:8207)ScsiDeviceIO: 2316: Cmd(0x41244039aa40) 0x4d, CmdSN 0xbb from world 9363 to dev "mpx.vmhba0:C0:T1:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.

2013-04-09T01:58:33.382Z cpu8:9363)WARNING: ScsiDeviceIO: 6678: IEC page to device "mpx.vmhba0:C0:T0:L0" has bad pagecode: 0x0

2013-04-09T01:58:33.402Z cpu15:10179)NMP: nmp_ThrottleLogForDevice:2319: Cmd 0x85 (0x41244039aa40, 9363) to dev "mpx.vmhba0:C0:T0:L0" on path "vmhba0:C0:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE

2013-04-09T01:58:33.402Z cpu15:10179)ScsiDeviceIO: 2316: Cmd(0x41244039aa40) 0x85, CmdSN 0xbd from world 9363 to dev "mpx.vmhba0:C0:T0:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.

2013-04-09T01:58:33.402Z cpu15:10179)ScsiDeviceIO: 2316: Cmd(0x41244039aa40) 0x4d, CmdSN 0xbe from world 9363 to dev "mpx.vmhba0:C0:T0:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.

This is the hardware information:

Motherboard SuperMicro X8DTU-F_R2 Intel Xeon HexCore DualProc [2Proc]

Processor Intel Xeon-Westmere 5620-Quadcore [2.4GHz] Hardware upgrade

RAM slot 1 Hynix 8GB DDR3 2Rx4 8GB DDR3 2Rx4 [8GB] Hardware upgrade

RAM slot 2 Hynix 8GB DDR3 2Rx4 8GB DDR3 2Rx4 [8GB] Hardware upgrade

RAM slot 3 Hynix 8GB DDR3 2Rx4 8GB DDR3 2Rx4 [8GB] Hardware upgrade

RAM slot 4 Hynix 8GB DDR3 2Rx4 8GB DDR3 2Rx4 [8GB] Hardware upgrade

RAM slot 5 Hynix 8GB DDR3 2Rx4 8GB DDR3 2Rx4 [8GB] Hardware upgrade

RAM slot 6 Hynix 8GB DDR3 2Rx4 8GB DDR3 2Rx4 [8GB] Hardware upgrade

Drive Controller 0b321170a3f Adaptec \ 5805 Z \ SATA/SAS RAID Hardware upgrade

Battery Adaptec Super Capacitor ZMM-100CC

Hard Drive 1 6sl0kn81 Seagate Cheetah ST3600057SS [600GB] Hardware upgrade

Hard Drive 2 6sl0k531 Seagate Cheetah ST3600057SS [600GB] Hardware upgrade

Hard Drive 3 6sl0k58t Seagate Cheetah ST3600057SS [600GB] Hardware upgrade

Hard Drive 4 z291byjb Seagate ConstellationES.2 ST33000650NS [3000GB] Hardware upgrade

Remote Mgmt Card SuperMicro Winbond WPCM450 - Onboard IPMI-KVM

Network Card SuperMicro AOC-PG-i2+ SuperMicro Gigabit Port

Power Supply SuperMicro PWS-561-1H20 (20) R6.1 560W

Backplane SuperMicro BPN-SAS-815TQ 4 Port Passive

The server was running ESXi 5/5.1 for almost a year without problems before this memory upgrade.

AleShima · ‎04-09-2013

The datacenter changed almost everything: RAID controller, cables, memory (again). Same problem.

I asked for another server, with exactly the same configuration, and after a fresh ESXi 5.1 install the problem was there again:

2013-04-09T17:49:45.346Z cpu15:9287)WARNING: ScsiDeviceIO: 6678: IEC page to device "mpx.vmhba2:C0:T1:L0" has bad pagecode: 0x0

2013-04-09T17:49:45.347Z cpu10:8202)NMP: nmp_ThrottleLogForDevice:2319: Cmd 0x85 (0x4124403ce940, 9287) to dev "mpx.vmhba2:C0:T1:L0" on path "vmhba2:C0:T1:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE

2013-04-09T17:49:45.347Z cpu10:8202)ScsiDeviceIO: 2316: Cmd(0x4124403ce940) 0x85, CmdSN 0x6 from world 9287 to dev "mpx.vmhba2:C0:T1:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.

2013-04-09T17:49:45.347Z cpu10:8604)ScsiDeviceIO: 2316: Cmd(0x4124403ce940) 0x4d, CmdSN 0x7 from world 9287 to dev "mpx.vmhba2:C0:T1:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.

2013-04-09T17:49:45.348Z cpu15:9287)WARNING: ScsiDeviceIO: 6678: IEC page to device "mpx.vmhba2:C0:T0:L0" has bad pagecode: 0x0

2013-04-09T17:49:45.349Z cpu10:8604)NMP: nmp_ThrottleLogForDevice:2319: Cmd 0x85 (0x4124403ce940, 9287) to dev "mpx.vmhba2:C0:T0:L0" on path "vmhba2:C0:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE

2013-04-09T17:49:45.349Z cpu10:8604)ScsiDeviceIO: 2316: Cmd(0x4124403ce940) 0x85, CmdSN 0x9 from world 9287 to dev "mpx.vmhba2:C0:T0:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.

2013-04-09T17:49:45.349Z cpu10:8604)ScsiDeviceIO: 2316: Cmd(0x4124403ce940) 0x4d, CmdSN 0xa from world 9287 to dev "mpx.vmhba2:C0:T0:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
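To compare output like this across hosts, a small Python sketch can tally the failed commands per device and opcode. The `summarize` function and its regular expression are illustrative assumptions based only on the line formats quoted in this thread. Note that the same failed command can be logged twice, once by NMP and once by ScsiDeviceIO, so the counts are log entries, not distinct commands.

```python
import re
from collections import Counter

# Captures the opcode and the device name from a "failed"/"Failed" line.
FAIL_RE = re.compile(
    r'Cmd(?:\(0x[0-9a-f]+\))? (0x[0-9a-f]+).*?to dev "([^"]+)".*?failed',
    re.IGNORECASE,
)


def summarize(log_text):
    """Count failure log entries per (device, opcode) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        m = FAIL_RE.search(line)
        if m:
            opcode, device = m.groups()
            counts[(device, opcode.lower())] += 1
    return counts


# The first four lines of the excerpt above.
SAMPLE_LOG = "\n".join([
    '2013-04-09T17:49:45.346Z cpu15:9287)WARNING: ScsiDeviceIO: 6678: IEC page to device "mpx.vmhba2:C0:T1:L0" has bad pagecode: 0x0',
    '2013-04-09T17:49:45.347Z cpu10:8202)NMP: nmp_ThrottleLogForDevice:2319: Cmd 0x85 (0x4124403ce940, 9287) to dev "mpx.vmhba2:C0:T1:L0" on path "vmhba2:C0:T1:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE',
    '2013-04-09T17:49:45.347Z cpu10:8202)ScsiDeviceIO: 2316: Cmd(0x4124403ce940) 0x85, CmdSN 0x6 from world 9287 to dev "mpx.vmhba2:C0:T1:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.',
    '2013-04-09T17:49:45.347Z cpu10:8604)ScsiDeviceIO: 2316: Cmd(0x4124403ce940) 0x4d, CmdSN 0x7 from world 9287 to dev "mpx.vmhba2:C0:T1:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.',
])

if __name__ == "__main__":
    for (device, opcode), n in sorted(summarize(SAMPLE_LOG).items()):
        print(device, opcode, n)
```

Running the same tally on logs from two servers makes it easy to see whether the same opcodes fail on every target, as they appear to in the excerpts posted here.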

I have another server with almost the same configuration, but with ESXi 5.0 and 24 GB of RAM.

I'll try to install ESXi 5.0 to see if the problem happens in this version too.

PS: I know this server is not on the HCL, but it worked for months without problems. There is one configuration I haven't tried yet: ESXi 5.1 with the RAM back at 24 GB.

Alexandre Wendt Shima

92656_VM · ‎04-10-2013

I ended up upgrading the firmware on my RAID controller and replacing the fan-out cable. Since then, it's been rock solid.

rrich · ‎05-28-2013

Seeing the exact same issue here, also using an Adaptec controller (don't have the model number handy). Will look at updating firmware, thanks for posting your results!

AleShima · ‎05-28-2013

Hello,

The problem is indeed related to Adaptec and ESXi 5.1.

The same server with version 5.0 works fine.

Alexandre

rrich · ‎06-03-2013

I just wanted to provide an update. We have two ESXi hosts running Adaptec 2405 cards and saw this issue on both of them. On Wednesday of last week I updated one of them to the latest firmware; the other was found to already be running it. After the update, they ran fine until yesterday at 1 PM. We had a Veeam job replicating guests from one to the other, and both hosts shut down within minutes of each other. One of the systems restarted, and the last messages in vmkernel.log were of the same type that 92656_VM posted above. The other machine is still down and will be booted later today.

I don't really have any specific proof that it's related to the controller cards/drivers, but it certainly seems to be that way at the moment. They had been running ESXi 4 stably for ~2 years prior to the upgrade to 5.1, so the cards are probably fine...

Cory_S78 · ‎07-29-2013

I have a Supermicro server running ESXi 5.1 that recently started having the same issues. My HP P410 SAS controller died and I replaced it with an Adaptec 3405, recreated the RAID array, and reloaded all my VMs. Within hours I was getting errors such as 'WARNING: ScsiDeviceIO: 6693: IEC page to device "mpx.vmhba2:C0:T0:L0" has bad pagecode: 0x0', and VMs would lock up intermittently. Server performance monitoring would show I/O to the disk drop to zero; if I left it alone, it would come back after a few minutes. Attempting to shut down VMs, log in, etc. would result in a host hang most of the time.

I checked the HCL and my Adaptec controller is listed and the firmware and driver are correct: VMware Compatibility Guide: I/O Device Search

Any ideas? I'd hate to have to replace my RAID controller again so soon...

EDIT: I wanted to post an update in case anyone else finds this useful. Adaptec is not supporting some of the older cards on ESXi 5.1, per their site:

http://ask.adaptec.com/app/answers/detail/a_id/17086/~/adaptec-raid-controllers-and-vmware-support

Seems like only the Series 7 cards are supported as of 5.1.

samdeng · ‎04-23-2014

Hi, I have the same problem. I disabled the write cache this afternoon and am still observing the host. Do you have any suggestions?

VMware ESXi 5.5 U1

Supermicro X9DRL-3F, newest BIOS.

Controller 1	Adaptec 6805
Driver version	1.2-1 (40301)
Firmware version	5.2-0 (19144)
BIOS version	5.2-0 (19144)

AleShima · ‎04-23-2014

Sorry to hear that. My solution was not to upgrade to 5.1.

samdeng · ‎04-23-2014

Hi AleShima,

I sent an email to Adaptec yesterday asking for support with this problem, but so far I have not received any response.
