rossanderson
Enthusiast

LSI_SAS and other disk errors in several VMs all of a sudden

We have suddenly started seeing similar issues in our environment on some older 2008 R2 VMs running on both M3 and M4 blades. The main error message is LSI_SAS. I've seen these issues all over the internet and on this board, but nothing has fixed our issue yet. We think we need to do a full UCS firmware upgrade at this point, but would like to know the root cause and any workarounds.

UCS Firmware- INFRA - 2.2.3b, M4 blades - 4.1.30c, M3 blades - 2.2.3b

Core - Nexus 5548s

Storage - Nimble AF-5000 with iSCSI

VMware - 6.0 U2

Windows Errors - LSI_SAS, Event ID 129: "Reset to device, \Device\RaidPort0, was issued." OR Disk, Event ID 51: "An error was detected on device \Device\Harddisk0\DR0 during a paging operation."
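
For anyone trying to line these events up against host or storage activity, here is a rough sketch (not something from the original troubleshooting) that tallies the two events per hour from a System log exported to CSV. The column names `TimeCreated`, `Source` and `EventID`, the timestamp format, and the sample rows are all assumptions made up for illustration; adjust them to match your own export.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Events of interest from the post: LSI_SAS / 129 and Disk / 51.
# Sources are compared in lowercase so "Disk" and "disk" both match.
WATCHED = {("lsi_sas", 129), ("disk", 51)}

# Hypothetical timestamp format of the CSV export.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def tally_by_hour(csv_text):
    """Count watched (source, event ID) pairs per clock hour."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["Source"].lower(), int(row["EventID"]))
        if key not in WATCHED:
            continue
        stamp = datetime.strptime(row["TimeCreated"], TIME_FORMAT)
        hour = stamp.replace(minute=0, second=0)
        counts[(hour, key)] += 1
    return counts


# Made-up sample rows, only to show the expected shape of the input.
SAMPLE = """TimeCreated,Source,EventID
2019-01-01 10:05:00,LSI_SAS,129
2019-01-01 10:40:00,Disk,51
2019-01-01 10:45:00,LSI_SAS,129
2019-01-01 11:02:00,Service Control Manager,7036
"""

if __name__ == "__main__":
    for (hour, (source, event_id)), n in sorted(tally_by_hour(SAMPLE).items()):
        print(hour, source, event_id, n)
```

Running the same tally on several affected VMs and comparing the busy hours may help show whether the resets cluster around a shared event on the hosts rather than something inside a single guest.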

This has only just become a problem after running this config for many years (3-5 years at least, more for the UCS config). Has anyone come up with a solid root cause and/or solution? The proposed VMware KB article says to update the LSI SCSI controller drivers in the OS, but that hasn't helped on any affected VMs. All vendors say it's not their issue, so as far as I can tell, the UCS is the only commonality at this point.

Thanks!

Accepted Solutions
rossanderson
Enthusiast

FYI - just closing the loop on this. Our issue didn't end up being due to LSI_SAS controller drivers in the OS (as the VMware KB article suggests). It was caused by the data replication software we use, which replicates VMware storage traffic to our DR site by way of an I/O splitter plug-in installed on the VMware hosts. The plug-in had stopped communicating correctly with vCenter, so some I/O requests were failing intermittently. This ultimately caused some disk corruption on one machine, but after we reconnected the replication software to vCenter, normal operation resumed.

In summary, the issue was not due to LSI_SAS drivers, UCS firmware, VMware issues, back-end storage saturation or anything else - the issue was due to the data splitter software that we use to replicate data to our DR site.

Rgds


3 Replies
sjesse
Leadership

I would agree with the upgrade, but make sure the network card in the UCS is supported with the version of ESXi you're on, and with the upgraded firmware as well. I saw something similar with old blades using Emulex OneConnect converged network adapters, and it was solved by upgrading the firmware.

rossanderson
Enthusiast

Thank you for the response. I cannot see how there would be a NIC support issue now after running this way for many years. Just in case, we upgraded the firmware on one M4 blade yesterday but the latency appears to have actually gone up, even though there are now just one or two VMs running on this host.
