VMware Cloud Community
Isalmon
Enthusiast

Experiencing random high disk latency with Dell H730P controller and VSAN

So we have been experiencing random periods of poor performance on our Dell R730 VSAN cluster. We are using the H730P in HBA mode, with two 400GB SSDs and four 600GB 15K SAS magnetic drives in each box. What happens is we notice poor performance on VMs and on the vSphere Web Client (appliance), and when we SSH into each server, run esxtop, and check the disks, the DAVG is all over the place, anywhere from 30 to over 1000 ms.

We opened a ticket with VMware and they suggested updating the BIOS and firmware. Dell had a firmware update, 25.2.2-0004, to deal with high I/O latency. We updated the firmware and all seemed OK, but then the high disk latency randomly pops up again on any one of the servers.

We are running ESXi 5.5 U2, build 2068190.

I know this card was only recently certified, so is something amiss? The build version? Crappy firmware? I am prepared to give up on pass-through and redo everything with RAID 0; we have a similar environment using Dell R720s with H710 cards and NO issues whatsoever.
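
In case anyone wants to see the numbers themselves, this is roughly how we have been catching the DAVG spikes; just a sketch, and the output path, interval, and sample count are arbitrary placeholders:

  # batch-mode capture: 60 samples, 5 seconds apart, for later review in perfmon/Excel
  esxtop -b -d 5 -n 60 > /tmp/esxtop-latency.csv

  # or interactively: run esxtop, press 'u' for the device view, and watch the DAVG/cmd column
  esxtop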

36 Replies
jonretting
Enthusiast

What driver version are you using for the 730P? VMware shows the latest for 5.5 being "megaraid_perc9 version 6.902.73.00-1OEM".

You might want to try disabling ASPM in the BIOS. What are the server specs? Without that information I can't really recommend any BIOS changes, as they may not be applicable to your servers. As a crazy long shot you could also try flashing your HBAs with the LSI IT firmware that matches your Dell 730P re-brand...
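
If it helps, you can confirm what is actually installed and loaded from the ESXi shell with something like the following; a rough sketch, and the grep patterns are only guesses at how the VIB is named on your hosts:

  # show the installed megaraid/PERC driver VIB and its version
  esxcli software vib list | grep -iE 'megaraid|perc'

  # confirm which driver module the H730P adapter is bound to
  esxcli storage core adapter list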

Isalmon
Enthusiast

The servers are Dell PowerEdge R730s with two 400GB SATA SSDs and four 600GB SAS drives each, plus 256GB of RAM per host.

The BIOS and firmware were all upgraded to the latest (at least to what Dell directed me to). The PERC firmware version is 25.2.2.0004; they had us upgrade from 25.2.1.0037.

Now, checking the VMware HCL today, I saw this:

- ESXi 5.5 U2: megaraid_perc9 version 6.902.73.00-1OEM, firmware 25.3.0.0015 (Partner Async)

- ESXi 5.5 U2: megaraid_perc9 version 6.901.57.00-1OEM, firmware 25.2.2.0004 (Partner Async)

- ESXi 5.5 U2: megaraid_perc9 version 6.901.55.00.1vmw, firmware 25.2.1.0037 (Partner Async)

Our boxes still have megaraid_perc9 version 6.901.55.00.1vmw, and Dell made no mention of this. So I believe the mismatch of driver and firmware could be the issue.

jonretting
Enthusiast

After updating the drivers, I would strongly suggest rebuilding the entire VSAN array: switch off VSAN on the cluster, delete all partition information on both the magnetic and flash devices, then "mklabel gpt" them. Your current VSAN array was created under non-optimal conditions, and even after the possible driver update "fix" it will still experience problems, causing you to chase your tail. Good luck.
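
To be concrete about what I mean, a rough sketch of the per-disk wipe from the ESXi shell is below; the naa.* name is a placeholder, so map it to your own devices first and only touch disks that have already been evacuated:

  # list the disks VSAN sees on this host
  vdq -q

  # remove the disk from its VSAN disk group (use -s instead of -d for an SSD)
  esxcli vsan storage remove -d naa.XXXXXXXXXXXXXXXX

  # check what partitions are on it, then write a fresh GPT label to clear them
  partedUtil getptbl /vmfs/devices/disks/naa.XXXXXXXXXXXXXXXX
  partedUtil mklabel /vmfs/devices/disks/naa.XXXXXXXXXXXXXXXX gpt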

Isalmon
Enthusiast

I hope I don't have to do that in a production environment. We have hundreds of VMs and people working.

jonretting
Enthusiast

Well, hopefully you don't and your problems go away. But if they don't... figuring out your next troubleshooting path will be rather difficult without eliminating that possibility. Obviously you could try to gather enough space on an NFS/iSCSI datastore and just Storage vMotion the VM storage onto that. Also, I didn't realize the cluster in question was a production environment.

Forgot to mention that Log Insight would really help you here, as you could identify problems in your array in much greater detail. I would definitely do this before scratching the VSAN array if it's production. Needless to say, I would focus on migrating off the problematic storage first; better safe than sorry.

Isalmon
Enthusiast

Yes, unfortunately it is a production environment, and this is happening in two separate sites.

I really regret going with the 730P and using pass-through mode.

We have had no issues with the H710 using RAID 0.

jonretting
Enthusiast

Hmm... You could try the following, but I have never attempted such a thing, nor would I ever in a production environment. But in theory you might be able to rebuild onto RAID 0.

  1. Put the ESXi (VSAN contributor) host into maintenance mode, ensuring accessibility.
  2. Shut down the host and install a fresh magnetic disk (the same size as, or larger than, the largest disk in the vSAN).
  3. Take the controller out of pass-through and enable AHCI.
  4. Mount a *nix live ISO over IPMI on that machine.
  5. Do a dd copy of one of the vSAN disks to the new disk you put in (see the sketch after this list).
  6. Reboot and put that disk into RAID 0.
  7. Boot back into the live environment and dd from the new disk to the new RAID 0 disk.
  8. Repeat the process for each magnetic disk (with more spare disks added you could, in theory, do them all at once).
  9. Once all of them have been migrated, remove the extra spare disks and boot back into ESXi.
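
For steps 5 and 7, the dd part would look roughly like this from the live environment; purely a sketch, and /dev/sdX and /dev/sdY are placeholders you would have to map to the real source and destination disks before running anything:

  # sanity-check that the destination is at least as large as the source
  blockdev --getsize64 /dev/sdX /dev/sdY

  # clone the old vSAN disk onto the freshly added disk, block for block
  dd if=/dev/sdX of=/dev/sdY bs=1M conv=noerror,sync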

Once again, I have never done an operation like this on VSAN, so I could be completely wrong. My list probably needs tweaking as well, but I think you get the idea.

Just an idea!

Isalmon
Enthusiast

Thanks for your help.

I thought about this as well. My thinking was to do a full data migration on one host, then wipe the disks and configure them as RAID 0, then migrate back and repeat. But I am not sure whether VSAN will like a mix of RAID 0 and pass-through in a cluster while the migration is in progress. Let's hope this resolves the issue.
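
For what it's worth, the host-at-a-time evacuation can also be driven from the shell; this is from memory, so check the --vsanmode option and its values against the built-in help on your 5.5 build before relying on it:

  # confirm the VSAN data-migration options available for maintenance mode on this build
  esxcli system maintenanceMode set --help

  # enter maintenance mode and evacuate all VSAN data off this host first
  esxcli system maintenanceMode set -e true -m evacuateAllData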

jonretting
Enthusiast

Yes, evacuating a host at a time would be best. Granted, performance wouldn't be consistent across hosts during the transition, but in theory it should work, since eventually all hosts will be RAID 0. I don't think there would be a compatibility issue with the mix of IT/RAID between hosts during the transition. Good luck!

Forgot to mention that you should find out what kind of hit, if any, you take on queue depths going from IT to RAID 0.
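
A quick way to eyeball that before and after the switch is sketched below; which fields get populated depends on the driver, so treat it as a starting point only:

  # per-device queue depth as reported by the driver
  esxcli storage core device list | grep -E 'Display Name|Queue Depth'

  # or in esxtop: press 'd' for the adapter view and compare the AQLEN column
  esxtop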

zdickinson
Expert

I don't believe RAID 0 vs. pass-through will change the 895 queue depth.

I would also do the rolling migration from pass-through to RAID 0, especially if you have at least 4 hosts so you always maintain FTT = 1. Thank you, Zach.

maxduncan
Contributor

Hi there, I think I am running into the -exact- same problem as you, and just came across this post.

I'm running four R730xds with the PERC H730 Mini. I was having countless problems with SSDs showing as permanently failed, PSODs, etc. when this cluster was built in February of this year. After working exhaustively with VMware it seemed to have settled down, until just recently. Over the past few weeks I have watched the cluster slow to an absolutely unusable crawl: writes went from 230MB/s to 7MB/s, and putting hosts into VSAN maintenance mode is taking hours. I'm at my wits' end, and thankfully found this thread.

Server config: ESXi 5.5.0 build 2718055 (Dell)

- Twelve 4TB drives and two 800GB SSDs per host

- H730 controllers are in HBA mode

- Firmware 25.2.2-0004

- Driver 6.901.57.00.1vmw

HCL: VMware Compatibility Guide: I/O Device Search

I've gone through endless support calls with VMware on the matter, and they had me install the firmware and driver noted above. They also confirmed that the H730P listing applies to the H730 Mini, although the HCL does not note that.


Any thoughts on the combination above? I'd greatly appreciate any advice or assistance. This has been a very frustrating journey.

jonretting
Enthusiast

Are you using pass-through? It sounds like the OP will be reconfiguring as RAID 0. Have you tried RAID 0? If you are crawling along at 7MB/s, your latency must be out of control as well. I would gather VSAN stats for one of the hosts, evacuate it and rebuild it as RAID 0, then gather stats again; makes sense? Are you getting sense errors? What errors, if any, are you seeing?
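
If you're not sure about the sense errors, a quick pass over the vmkernel log like the one below is a start; only a sketch, and the exact log strings and rotated-log paths can vary between builds:

  # look for SCSI command failures and sense data reported against your disks
  grep -iE 'valid sense data|failed H:0x' /var/log/vmkernel.log

  # include the rotated logs as well
  zcat /var/run/log/vmkernel*.gz | grep -iE 'valid sense data|failed H:0x'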

Isalmon
Enthusiast

Sorry to hear about your woes.

Well, if you had firmware version 25.2.1.0037 previously, then that is exactly the issue that 25.2.2-0004 was released to fix:


PERC H730/H730P/H830 Mini/Adapter/FD33xS/FD33xD RAID Controllers, firmware version 25.2.2-0004

Fixes & Enhancements

  Fixes:

- Corrects issue with excessive PERC I/O timeouts and SATA SSDs falling offline under heavy I/O.

Your exact issue.

Things have settled down for me so far in both sites with this firmware/driver combo:

- Firmware 25.2.2-0004

- Driver 6.901.57.00.1vmw

Dell had me update to 25.2.2-0004 while I still had the 6.901.55.00.1vmw driver. I just updated the VIB and so far so good.
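
For anyone following along, the driver side of that is just a VIB install from the ESXi shell; a sketch, and the bundle path below is only a placeholder for whatever offline bundle you download for megaraid_perc9:

  # install the async megaraid_perc9 driver from its offline bundle, then reboot the host
  esxcli software vib install -d /tmp/megaraid_perc9-offline_bundle.zip
  reboot

  # after the reboot, confirm the new driver version is the one installed
  esxcli software vib list | grep -i perc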

Isalmon
Enthusiast

Things have calmed down for now. I will be watching for the next few weeks, but if the latency returns I will switch from pass-through.

I am monitoring the DAVG daily on all boxes and watching with VSAN Observer.
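
In case it helps anyone else, this is roughly how we launch VSAN Observer from RVC on vCenter; the login and cluster path are placeholders, so cd/ls around in RVC to find your own cluster object:

  # connect to vCenter with the Ruby vSphere Console
  rvc administrator@vsphere.local@localhost

  # from the RVC prompt, start the Observer web UI against the cluster
  vsan.observer ~/computers/YourCluster --run-webserver --force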

Isalmon
Enthusiast

By the way, Max, there is another critical update that may be relevant to you, and that is the backplane firmware, since you are using a lot of drives:

Firmware_PG71P_WN64_1.09_A00-00.EXE

Dell 13G PowerEdge Server Backplane Expander Firmware

Dell 12Gb Expander Firmware for 13G Servers 

Fixes & Enhancements

  Fixes

- This release addresses issues with SATA devices in topologies with a mix of SAS and SATA drives, where a drive could be incorrectly marked offline.

jonretting
Enthusiast

That sounds like great news! Love it when a driver combo pulls through in the end; such a relief. Always remember that you're the one doing this stuff, so you are inherently very lucky with it.

maxduncan
Contributor

Hi there - I've gone through every combination of pass-through and RAID 0 over the past 5 months. Stability was sketchy until I found this combination, and support validated it.

I'm not sure about the sense errors, but I did find this. About time, VMware: http://www.vmware.com/files/pdf/products/vsan/VSAN-Troubleshooting-Reference-Manual.pdf

Can you give me your thoughts on my driver combination and the HCL link above? Should I upgrade to the latest firmware and driver releases? And is it really best practice to nuke VSAN when doing this?

Thank you!

jonretting
Enthusiast

I would follow the OP's example. Try his combination of firmware and driver, which I think is the very latest for each.

Reply
0 Kudos
Isalmon
Enthusiast
Enthusiast

I would stick with the combo you mentioned.

There is a newer one: megaraid_perc9 version 6.902.73.00-1OEM with firmware 25.3.0.0015.

The PERC driver is available on VMware's HCL page, but the 25.3.0.0015 firmware is not listed on Dell's site. It may be available directly from LSI, but that may be more risky.

As VMware reminded me, these are the vendor's drivers. They are vetted by VMware, but it's only when a problem is reported to the vendor that the drivers are updated, fixed, and then re-vetted.

Don't go with the newest firmware unless it's a critical fix like 25.2.2-0004, or else you are essentially a guinea pig...
