By the way, Max, there is another critical update that may be relevant to you: the backplane firmware, since you are using a lot of drives.
Dell 13G PowerEdge Server Backplane Expander Firmware
(Dell 12Gb Expander Firmware for 13G Servers)
Fixes & Enhancements
- This release addresses issues with SATA devices in topologies with a mix of SAS and SATA drives, where a drive could be incorrectly marked offline.
That sounds like great news! Love it when a driver combo pulls through in the end; such a relief. Always remember that you do this stuff, so you are inherently very lucky with it.
Hi there - I've gone through every combination of passthrough and RAID0 for the past 5 months. Stability was sketchy until I found this combination, and support validated it.
I'm not sure about the sense errors, but I did find this. About time, VMware... http://www.vmware.com/files/pdf/products/vsan/VSAN-Troubleshooting-Reference-Manual.pdf
Can you give me your thoughts on my driver combination and the HCL link above? Should I upgrade to the latest firmware and driver releases? And is it really best practice to nuke VSAN when doing this?
I would follow the OP's example. Try his combination of firmware/drivers, which I think is the very latest for each.
I would stick with the combo you mentioned.
The newest one is megaraid_perc9 version 6.902.73.00-1OEM with firmware 25.3.0.0015.
The PERC driver is available on VMware's HCL page, but 25.3.0.0015 is not listed on Dell's site. It may be available directly from LSI, but that may be riskier.
As VMware reminded me, these are the vendors' drivers. They are vetted by VMware, but it's only when a problem is reported to the vendor that the drivers are updated, fixed, and then re-vetted.
Don't go with the newest firmware unless it's a critical fix like 25.2.2.00037, or else you are essentially a guinea pig...
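Before deciding, it's worth confirming what you're actually running from the ESXi shell. A quick sketch; it assumes the perc9 module name, so adjust if your host loaded a different one:

    # List the installed megaraid driver VIBs and their versions
    esxcli software vib list | grep -i megaraid
    # Show version details for the loaded module (assumes megaraid_perc9)
    esxcli system module get -m megaraid_perc9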
Sounds like we're similar. To recap, I'm running:
- H730 Minis in HBA mode
- Firmware 25.2.2-0004
- Driver 6.901.57.00.1vmw
- Backplane update 1.09
I had been running this combo for a few months and it was stable, then it went completely sideways without any changes to the environment: latency goes crazy, and the only way out is to start rebooting. Which really sucks, I might add.
Any additional thoughts are appreciated. Thanks guys.
At this point you really need to gather some logs. I would deploy a Log Insight appliance so you can rule things out beyond the VSAN layer.
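In the meantime, the next time latency goes sideways, capture esxtop in batch mode before you reboot so you have hard DAVG/KAVG numbers to look at afterwards. A rough sketch, with the sample interval and count picked arbitrarily:

    # Sample everything every 10 seconds for ~1 hour, save for offline review
    esxtop -b -d 10 -n 360 > /tmp/esxtop-latency.csv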
Hi Jon, thanks for the advice. I haven't used the Log Insight appliance; will it pull VSAN-specific logging?
It pulls everything in the vSphere stack, so you get host-level hardware/software, your vSwitches/netstack, the guests, and the VCS itself. If you set up a syslog relationship with your switches, you can get that as well. By default ESXi hosts put themselves in verbose logging mode, so that will help you out of the gate. If you're having contention for switching bandwidth, you can then look into a NetFlow setup. Log Insight also creates a nice default dashboard layout, allowing you to instantly drill down to, say, "VSAN:Errors".
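For the hosts themselves it's just a syslog target once the appliance is up. Something like this, where the hostname is a placeholder and udp/514 is the common default:

    # Point host logs at the Log Insight appliance (placeholder hostname)
    esxcli system syslog config set --loghost='udp://loginsight.example.local:514'
    esxcli system syslog reload
    # Make sure the outbound syslog ruleset is open
    esxcli network firewall ruleset set --ruleset-id=syslog --enabled=true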
Minus the backplane update, we are similar. Dell specified that the backplane update affects users with more than 8 drives. We only have 6 per box at the moment, so I decided to skip it.
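For reference, here is a rough way to count the disks a host actually sees (rough because naa.* will also pick up any non-backplane LUNs):

    # Approximate per-host disk count; naa.* covers the devices behind the PERC
    esxcli storage core device list | grep -c '^naa.'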
By the way, I forgot to mention that this high-IO problem was not limited to VSAN. We had two Windows 2012 R2 730xd boxes with the same PERC controllers, configured with RAID and used as file servers, with 10Gb networking between them, and we could not copy a 5GB file. We thought we were going crazy.
There was a serious issue with 25.2.1.0037 and the H730P controller. No issues since.
That should bring some comfort that your issue may be resolved.
In times like this I have found it very useful to have a VSAN Observer gatherer script, so that you can quickly launch it and gather stats on the VSAN during times of uncertainty.
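Something like the sketch below is what I mean. It assumes RVC, which ships with the vCenter appliance, and the vCenter address, credentials, and cluster path are placeholders:

    #!/bin/sh
    # Gather VSAN Observer stats on demand. VC and CLUSTER below are
    # placeholders; swap in your own vCenter login and cluster path.
    VC='administrator@vsphere.local@vcenter.example.local'
    CLUSTER='/vcenter.example.local/Datacenter/computers/VSAN-Cluster'
    OUT="/tmp/vsan-observer-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$OUT"
    # Run the observer for up to 1 hour, sampling every 30s, and drop a
    # self-contained HTML bundle into $OUT for later review.
    rvc -c "vsan.observer $CLUSTER --run-webserver --force --generate-html-bundle $OUT --interval 30 --max-runtime 1" -c 'quit' "$VC"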
Thanks for all of the recommendations, guys. Are you of the opinion that 6.0 will fix stability problems like this, or at least make similar problems more obvious? I just saw the VSAN health check plugin for 6 and it looks like it checks all the boxes. For the amount of time and frustration spent on VSAN, I could be sitting on a stack of EQLs right now ;-|
Unless you know what your problems are now, I wouldn't recommend upgrading. VSAN 6 is noticeably faster, and VCS 6 is certainly better, but you could just be having moments of contention, or some sort of hardware issue. Personally I don't like upgrading anything; if it's that important, it should be rebuilt from gold. That's my best practice: an ESXi host is quickly killed and re-installed with the latest version, settings and policies are pushed, and its storage destroyed. Why bother dealing with data on the host when you can move it off? So in essence, yes, a rolling clean install of 6 for each host makes sense. But what if an underlying issue makes a clean install impossible, and an upgrade even more unreliable? You could end up in downtime. On a slightly related note: most don't like the expense, but having enough reserve NFS/iSCSI capacity to accommodate a full VSAN migration allows for incredible nimbleness in all situations.
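For the rolling part, the step that matters is pulling the host's VSAN data off before you wipe it. A sketch, assuming the cluster has enough spare capacity to absorb a full evacuation:

    # Evacuate all VSAN components off this host before the clean reinstall
    esxcli system maintenanceMode set --enable true --vsanmode evacuateAllData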
What SSDs are you running?
I just discovered that I have Lite-On SSD drives that don't seem to be on the VMware VSAN HCL for ESXi 5.5 U2 hybrid.
They are on the list for ESXi 6.0 all-flash, though.
This was resolved by finally upgrading the firmware on the SSD drives using the Dell Nautilus utility. The Lite-On drives had a bug that caused the poor performance.
Once it was updated, the latency disappeared. We were not happy with Dell, because these were sold to us as VSAN nodes and the drives were not on the HCL. We got replacement SSD drives (Intel 3700) and have had no issues since.
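If anyone else hits this, the firmware revision the drives report is visible straight from the host, so you can confirm a Nautilus run actually took. A quick check:

    # Show each device's name and reported firmware revision
    esxcli storage core device list | egrep 'Display Name|Revision'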