I have a support case open for this, but I wanted to throw this out there and see if anyone else has come across anything like this.
This is my build
The issue we are currently seeing is with I/O. Guest performance degrades to the point that the guests stop responding. The hosts then degrade in turn until they too stop responding, and we are forced to physically power down the servers.
Once everything comes back up, things start working normally again. But inevitably the cycle will start all over again hours or days later.
While monitoring the system, after contacting support, we saw continuous resets to the local MegaRAID SAS storage device.
So far we have tested the following:
After reviewing the HP notes on P2000 compatibility with ESXi 4.1, I'm wondering if there are settings on the SAN we need to change, or if there is something else within ESXi 5 that we need to look at.
At this point we are also close to needing to take one host down and reload it with ESXi 4.1 to rule out this being an ESXi 5 issue. Before we go through that, though, I just wanted to see if there was any feedback from the community or any recommendations. We are trying to avoid having to bring down a production environment and downgrade the hosts, VM versions, etc.
I appreciate any comments.
Still looking for a resolution to this. So far the only recommendation from VMware Support is to downgrade the environment to ESXi 4.1.
Hi! We have the same issue,
also with the 9280-4i4e,
but our system never runs for more than 12 hours.
The errors in our log are the same as yours.
How did you downgrade the LSI drivers to 4.32, given that ESXi 5.0 ships with 5.34?
We have tried different LSI firmwares, thinking the trouble is in the LSI 9280, but still no luck.
I have no idea what else to try, since a similar system with a 3ware 9750 (based on the same LSI 2108 chip as the 9280) works perfectly in ESXi 5.0.
One thing we still haven't tested is upgrading the storage to VMFS5. Have you tried this?
The 5.34 LSI driver was recommended by VMware Support, but as an async driver, not the in-box one. After installing it, we lost all connectivity to the SAN, so we rolled the host back to its previous state, which was using the 4.32 driver.
All of our datastores are currently VMFS5.
Right now the system runs fine for upwards of 3-5 days with no indication of a problem. But without warning it starts acting up and then quickly, within 30 minutes, becomes completely unresponsive. One other thing of note is that it doesn't always affect all of the hosts. More often than not (roughly 75% of the time) only one host is affected.
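For anyone else following the driver discussion above: on ESXi 5.x you can check which megaraid_sas driver VIB the host is running, and install an async driver offline bundle, from the ESXi shell. This is only a sketch; the bundle path below is a placeholder, and the host should be in maintenance mode before installing.

```shell
# Check which megaraid_sas driver VIB is currently installed
esxcli software vib list | grep -i megaraid

# Confirm the version of the module the host is actually running
vmkload_mod -s megaraid_sas | grep -i version

# Install an async driver offline bundle (placeholder path;
# put the host in maintenance mode first)
esxcli software vib install -d /vmfs/volumes/datastore1/offline-bundle.zip

# A reboot is required for the new module to load
reboot
```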
Just as a thought, what about using another (low-cost) SAN until the problems are solved, or at least for testing? Or another controller, since that seems to be causing the most problems?
Maybe you can borrow one somewhere to test and see if it has the same issues? Sometimes vendors do this.
It will cost a lot of time to downgrade to 4.1. Maybe it will cost less time to use another SAN or a new controller for the time being?
Just thinking out loud.
We thought about that. We don't have access to any other storage, but we did try different HBAs (HP SC08E). HP actually recommended them over the ones we currently have installed. But after switching them in, we couldn't keep the hosts up more than 3-6 hours before they went down. We switched the old HBAs back in, and now we can keep the hosts up for days rather than hours.
The attachments show the type of errors we started getting after installing the HP SC08E HBAs.
SKIRK,
If I were you, I wouldn't roll with ESXi 5.0 for a new environment unless Update 1 is already out.
Unless you designed your environment to use datastore clusters, storage profiles, and hypervisor pre-deploy, I would stick with ESX or ESXi 4.1 until ESXi 5.0 has all its driver issues resolved.
Not really helping here, but my 2 cents.
RJ
I think we may have found our problem. We haven't made the change yet to verify it, but while going through the MSA P2000 statistics, we believe that the SAS cabling between the SAN and the 3 ESXi hosts is wrong. We had taken it for granted that everything in the hardware setup was done correctly by the integrator/reseller, but after reviewing the HP documentation, the cabling is wrong. This would explain just about all of the issues we are seeing. I will post again once we have corrected the cabling.
SKIRK505 wrote:
I think we may have found our problem. We haven't made the change yet to verify it, but while going through the MSA P2000 statistics, we believe that the SAS cabling between the SAN and the 3 ESXi hosts is wrong. We had taken it for granted that everything in the hardware setup was done correctly by the integrator/reseller, but after reviewing the HP documentation, the cabling is wrong. This would explain just about all of the issues we are seeing. I will post again once we have corrected the cabling.
Did that fix it?
Unfortunately no. Per HP Support, we replaced the LSI MegaRAID SAS 9280-4i4e HBAs (which we had installed) with HP SC08E HBAs (which, we found out, are the only HBAs supported in conjunction with the P2000 G3 SAS MSA). We cabled everything per their directions and upgraded to the newest firmware for the P2000 MSA, but instead of the "ABORT" messages, now about every 6-12 hours the hosts will, without warning, start reporting the following messages:
2011-12-13T11:11:13.018Z cpu12:4786)WARNING: vmw_psp_rr: psp_rrSelectPath:1146:Could not select path for device "naa.600c0ff000123d035313724e01000000".
2011-12-13T11:11:13.018Z cpu12:4786)WARNING: NMP: nmp_IssueCommandToDevice:2954:I/O could not be issued to device "naa.600c0ff000123d035313724e01000000" due to Not found
2011-12-13T11:11:13.018Z cpu12:4786)WARNING: NMP: nmp_DeviceRetryCommand:133:Device "naa.600c0ff000123d035313724e01000000": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.
2011-12-13T11:11:13.018Z cpu12:4786)WARNING: NMP: nmp_DeviceStartLoop:721:NMP Device "naa.600c0ff000123d035313724e01000000" is blocked. Not starting I/O from device.
2011-12-13T11:11:14.018Z cpu15:41049)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate:972:Could not select path for device "naa.600c0ff000123d035313724e01000000".
2011-12-13T11:11:14.018Z cpu12:43188)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate:972:Could not select path for device "naa.600c0ff000123dd17213724e01000000".
2011-12-13T11:11:14.018Z cpu16:45617)WARNING: NMP: nmpDeviceAttemptFailover:599:Retry world failover device "naa.600c0ff000123d035313724e01000000" - issuing command 0x412441152ac0
2011-12-13T11:11:14.018Z cpu16:45617)WARNING: vmw_psp_rr: psp_rrSelectPath:1146:Could not select path for device "naa.600c0ff000123d035313724e01000000".
2011-12-13T11:11:14.018Z cpu18:4786)WARNING: vmw_psp_rr: psp_rrSelectPath:1146:Could not select path for device "naa.600c0ff000123dd17213724e01000000".
2011-12-13T11:11:14.018Z cpu16:45617)WARNING: NMP: nmpDeviceAttemptFailover:658:Retry world failover device "naa.600c0ff000123d035313724e01000000" - failed to issue command due to Not found (APD), try again...
2011-12-13T11:11:14.018Z cpu16:45617)WARNING: NMP: nmpDeviceAttemptFailover:708:Logical device "naa.600c0ff000123d035313724e01000000": awaiting fast path state update...
The "Dreaded All Paths Down"
We are currently working with HP Level 2 support to troubleshoot this from the storage side, while keeping VMware Support engaged, but we haven't found a smoking gun yet.
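For others hitting the same APD symptoms, the path and device state for the NAA IDs in the log above can be inspected from the ESXi 5 shell. A hedged sketch; substitute your own device ID for the one shown:

```shell
# List every path the host sees to the affected device
esxcli storage core path list -d naa.600c0ff000123d035313724e01000000

# Show the NMP view of the device, including PSP and path states
esxcli storage nmp device list -d naa.600c0ff000123d035313724e01000000

# Quick scan of recent vmkernel messages for NMP/path-selection warnings
grep -E "NMP|psp_rr" /var/log/vmkernel.log | tail -20
```

If every path shows dead, that matches the "all paths down" state described above rather than a single failed path.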
Damn, I'm actually in the process of doing an ESXi 5 build on the P2000 G3 SAS, however I'm using 2 DL380 G7's instead of the UCS boxes. I've got the SC08 cards. If you need any help testing since I have similar hardware, don't hesitate to let me know since my environment won't go live for a couple weeks.
I'll be sure to keep this thread updated with anything we come across with HP Support. One thing we did notice while going through the documentation is that there are two different (and current) HP docs giving two different examples of how to cable the SAS connections. HP hasn't commented on them yet though.
Document 1 (how we currently have it cabled)
HP P2000 G3 SAS MSA System
User Guide
http://bizsupport2.austin.hp.com/bc/docs/support/SupportManual/c02254382/c02254382.pdf
Page 39
Document 2 (how we previously had it cabled)
HP P2000 G3 MSA System
Installation Instructions
http://bizsupport1.austin.hp.com/bc/docs/support/SupportManual/c02523110/c02523110.pdf
Page 4
As you can see these documents are in disagreement with one another.
Also, since starting this thread, we have installed the HP VAAI plugin for vSphere 5 (the HP VAAI version 2 plugin).
Hi, I think this is a bug or a missing driver in this version of ESXi 5.
Try installing the driver from http://downloads.vmware.com/d/info/datacenter_cloud_infrastructure/vmware_vsphere/5_0#drivers_tools
and use the latest version of ESXi 5, or install the latest patch.
Or try installing the version of ESXi from HP that has the drivers integrated:
http://communities.vmware.com/message/1822056
Cheers,
Denis
To: JohnTatumRVA
Well, we think we have found the problem, and it is specific to the Cisco UCS (at least in our case; it could possibly affect other hardware as well, depending on the vendor).
To recap: the issue we are seeing is that the vSphere host reports that it is losing connectivity to the HBAs. While troubleshooting, we previously believed this was isolated to the external storage via the SAS HBA, but it appears that it is also losing connectivity to the internal storage. In the Cisco UCS the local storage controller is in fact a PCIe HBA, and after really digging in we saw that the vSphere host was also reporting loss of connectivity to it.
We got this link from our Integrator/Reseller..
We haven't found these exact messages in any logs (yet, we are still digging through logs), but after reading this KB article we confirmed with Cisco that there is a bug in the current UCS firmware related to the "Interrupt Remapping" function.
Cisco TAC knew right off the bat what we were experiencing and immediately recommended changing the iovDisableIR setting in vSphere, or disabling the feature in the UCS BIOS. They stated that you can disable it in either place, whichever you feel more comfortable with; it is not required to disable it in both. The change requires a reboot to take effect, and it is recommended that you put the vSphere host in maintenance mode beforehand.
TAC stated that this is the recommended workaround for the bug, and that there is expected to be a new firmware release for the UCS coming in 2-3 weeks that will resolve this.
We normally see the events about every 6-18 hours, so after making the change this morning we plan to monitor the system for 48 hours. If everything runs clean for the duration, then I think we can safely say this is the root cause and the workaround until the new firmware release.
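For reference, the vSphere side of the interrupt-remapping workaround described above can be applied from the ESXi shell; this follows VMware's documented iovDisableIR kernel setting. A sketch, assuming ESXi 5.x, with the maintenance-mode and reboot steps the post mentions:

```shell
# Check the current interrupt remapping setting
esxcli system settings kernel list -o iovDisableIR

# Disable interrupt remapping (the workaround Cisco TAC recommended)
esxcli system settings kernel set --setting=iovDisableIR -v TRUE

# Enter maintenance mode, then reboot for the change to take effect
esxcli system maintenanceMode set --enable true
reboot
```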
Hello SKIRK505
did you still get any errors after workaround execution?
Thanks
Actually, yes we have. It's looking now like this is actually two different issues with the UCS.
1) There is a bug in the firmware, but the work around is confirmed to resolve it.
2) Driver / Firmware Interoperability issue with the HP SC08E HBA and the UCS.
It's almost a no-win here. HP will only support the SAS MSA if you use the SC08E HBA, but Cisco bluntly states that they do not support the SC08E HBA in the UCS C210, and furthermore goes on to say that there is most likely an issue between the drivers that VMware uses for the HBA and the BIOS firmware that the UCS is using.
So we have switched back to using the LSI MegaRaid 9280-4i4e HBA, which is supported by Cisco. We are monitoring it now, but aren't getting our hopes up.
So far this hardware combination plus vSphere 5 has turned out to be a huge problem. Everyone, even VMware, keeps saying to downgrade ESX to 4.1.
This is the Cisco bug ID for UCS C210 M2 - CSCtw68712
You can track it here http://www.cisco.com/cisco/psn/bssprt/bss
Hi.
We also have a similar problem with the P2000 SAS and HP DL380 G7 servers with the SC08e card on ESXi 5 U1.
Did you find out what the reason for this is?
Thanks for any useful information.