What version driver are you using for the 730P? VMware shows the latest for 5.5 as "megaraid_perc9 version 6.902.73.00-1OEM".
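If you're not sure what's installed, you can check from the ESXi shell. A quick sketch, assuming SSH access to the host (the module name matches the driver discussed here):

```shell
# List installed megaraid VIBs and their versions
esxcli software vib list | grep -i megaraid

# Confirm the perc9 module is actually loaded
vmkload_mod -s megaraid_perc9
```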
You might want to try disabling ASPM in the BIOS. What are the server specs? Without that information I can't really recommend any BIOS changes, as they may not be applicable to your servers. As a crazy long shot you could also try flashing your HBAs with LSI IT firmware that matches your Dell 730P re-brand...
The servers are Dell PowerEdge 730s with two 400 GB SATA SSDs and four 600 GB SAS drives each, with 256 GB of RAM.
The BIOS and firmware were all upgraded to the latest (at least what Dell directed me to). The PERC firmware version is 25.2.2.0004; they had us upgrade from 25.2.1.0037.
Checking today, I saw this:
ESXi 5.5 U2 megaraid_perc9 version 6.902.73.00-1OEM 25.3.0.0015 Partner Async
ESXi 5.5 U2 megaraid_perc9 version 6.901.57.00-1OEM 25.2.2.0004 Partner Async
ESXi 5.5 U2 megaraid_perc9 version 6.901.55.00.1vmw 25.2.1.0037 Partner Async
Our boxes still have megaraid_perc9 version 6.901.55.00.1vmw. Dell made no mention of this, so I believe the mismatch of driver and firmware could be the issue.
After updating the drivers, I would strongly suggest rebuilding the entire VSAN array: switch off VSAN on the cluster, delete all partition information on both the magnetic disks and the flash, then "mklabel gpt" them. Your current VSAN array was created under non-optimal conditions, and even after the possible driver-update "fix" it may still experience problems that leave you chasing your tail. Good luck.
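For the wipe itself, something along these lines from the ESXi shell. This is a destructive sketch only; the naa device IDs are placeholders (find the real ones with `esxcli storage core device list`):

```shell
# DESTRUCTIVE: only after vSAN is disabled on the cluster.
# Show the current partition table for a disk (naa ID is a placeholder)
partedUtil getptbl /vmfs/devices/disks/naa.XXXXXXXX

# Delete each data partition reported by getptbl (partition numbers vary)
partedUtil delete /vmfs/devices/disks/naa.XXXXXXXX 1
partedUtil delete /vmfs/devices/disks/naa.XXXXXXXX 2

# Lay down a fresh GPT label
partedUtil mklabel /vmfs/devices/disks/naa.XXXXXXXX gpt
```

Repeat per magnetic and flash device in each disk group.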
I hope I don't have to do that in a production environment. We have hundreds of VMs and people working.
Well, hopefully you don't and your problems go away. But if they don't, figuring out your next troubleshooting path will be rather difficult without eliminating that possibility. You could try to gather enough space on an NFS/iSCSI datastore and just Storage vMotion the VM storage over to it. I didn't realize the cluster in question was a production environment.
Forgot to mention that Log Insight would really help you here, as you could identify problems in your array in greater detail. I would definitely do this before scratching the VSAN array if it's production. Needless to say, I would focus on migrating off the problematic storage first; better safe than sorry.
Yes, unfortunately it is a production environment, at two separate sites.
I really regret going with the 730P and using pass-through mode.
No issues with the 710 using RAID 0.
Hmm... You could try the following, but I have never attempted such a thing, nor would I ever in a production environment. In theory, though, you might be able to rebuild onto RAID 0:
- Put the ESXi (vSAN contributor) host into maintenance mode, ensuring accessibility.
- Shut down the host and install a fresh magnetic disk (same size as or larger than the largest disk in the vSAN).
- Take the controller out of pass-through and enable AHCI.
- Mount a *nix live ISO over IPMI on that machine.
- Do a dd copy of one of the vSAN disks to the new disk you put in.
- Reboot and configure that disk as RAID 0.
- Boot back into the live environment and dd from the new disk to the new RAID 0 disk.
- Repeat the process for each magnetic disk (with more spare disks added you could, in theory, do them all at once).
- Once all of them have been migrated, remove the spare disks and boot into ESXi.
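The dd step above can be sketched as follows. This illustration clones between image files so it's safe to run; on the live system you would point SRC/DST at the real block devices (e.g. /dev/sdb), identified very carefully with lsblk first, since dd is unforgiving:

```shell
# Stand-ins for the real disks (64 MiB image files for illustration only)
SRC=/tmp/vsan_src.img   # would be the original pass-through vSAN disk
DST=/tmp/vsan_dst.img   # would be the freshly installed spare disk
dd if=/dev/urandom of="$SRC" bs=1M count=64 2>/dev/null
dd if=/dev/zero    of="$DST" bs=1M count=64 2>/dev/null

# The actual clone step: block-for-block copy, continuing past read
# errors and padding failed reads with zeros
dd if="$SRC" of="$DST" bs=4M conv=noerror,sync 2>/dev/null
cmp "$SRC" "$DST" && echo "clone verified"
```

On a real multi-TB copy, adding `status=progress` (coreutils 8.24+) is worth it so you can see the transfer rate.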
Once again, I have never done an operation like this on VSAN, so I could be completely wrong. My bullet list probably needs tweaking as well, but I think you get the idea.
Just an idea!
Thanks for your help.
I thought about this as well. My thinking was: do a full data migration on one host, then wipe the disks and configure as RAID 0, then migrate and repeat. But I am not sure vSAN will like a mix of RAID 0 and pass-through in a cluster while doing the migration. Let's hope this resolves the issue.
Yes -- evacuating a host at a time would be best. Granted, the performance wouldn't be consistent across hosts, but in theory it should work, since eventually all hosts will be RAID0. I don't think there would be a compatibility issue with the mix of IT/RAID between hosts during the transition. Good luck!
Forgot to mention that you should find out what hit, if any, you take on queue depth going from IT to RAID0.
I don't believe RAID0 vs Pass Through will change the 895 queue depth.
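One way to compare, as a sketch run from the ESXi shell on a host before and after the conversion:

```shell
# Adapter-level view: which HBA and driver each vmhba is using
esxcli storage core adapter list

# Per-device view; compare the reported queue depth before/after RAID0
esxcli storage core device list | egrep "Display Name|Queue Depth"
```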
I would also do the rolling migration from pass-through to RAID0, especially if you have at least 4 hosts, so you always maintain FTT = 1. Thank you, Zach.
Hi there, I think I am running into the -exact- same problem as you, and just came across this post.
I'm running 4 R730xd's with the PERC H730 Mini. I was having countless problems with SSDs showing as permanently failed, PSODs, etc. when this cluster was built in February of this year. After working exhaustively with VMware, it seemed to have settled down, until just recently. The past few weeks, I have watched the cluster slow to an absolutely unusable crawl. Writes went from 230 MB/s to 7 MB/s, and putting hosts into vSAN maintenance mode is taking hours. I'm at my wits' end, and thankfully found this thread.
Server config: ESX 5.5.0 build 2718055 (Dell)
- 12 4TB drives and 2 800GB SSD's per host
- H730 controllers are in HBA mode
- Firmware 25.2.2-0004
- Driver 6.901.57.00.1vmw
I've gone through endless support calls with VMware on the matter, and they had me install the firmware and driver noted above. They also confirmed that the H730P entry applies to the H730 Mini, although the HCL does not note this.
Any thoughts on the combination above? Greatly appreciate any advice or assistance. This has been a very frustrating journey.
Are you using pass-through? It sounds like the OP will be reconfiguring as RAID0; have you tried RAID0? If you are crawling along at 7 MB/s, your latency must be out of control as well. I would gather vSAN stats for one of the hosts, evacuate it and rebuild it as RAID0, then gather stats again. Does that make sense? Are you getting sense errors? What errors, if any, are you seeing?
Sorry to hear your woes.
Well, if you had firmware version 25.2.1.0037 previously, then that is exactly the issue that Dell released 25.2.2.0004 to address:
PERC H730/H730P/H830 Mini/Adapter/FD33xS/FD33xD RAID Controllers firmware version 25.2.2-0004
Fixes & Enhancements
- Corrects issue with excessive PERC I/O timeouts and SATA SSDs falling offline under heavy I/O.
Your exact issue.
Things have settled down for me so far in both sites with this combo:
- Firmware 25.2.2-0004
- Driver 6.901.57.00.1vmw
Dell had me update to 25.2.2-0004 while I still had the 6.901.55.00.1vmw driver. I just updated the VIB and so far so good.
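For anyone following along, the VIB update itself looks roughly like this; the file path is a placeholder for wherever you uploaded the driver bundle, and the host should be in maintenance mode first:

```shell
# What's installed now?
esxcli software vib list | grep -i megaraid

# Install the new driver VIB (path is a placeholder), then reboot
# the host so the new module loads
esxcli software vib install -v /tmp/megaraid_perc9-6.901.57.00-1OEM.vib
reboot
```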
Things have calmed down for now. I will be watching for the next few weeks, but if the latency returns I will switch from pass-through.
I am monitoring the DAVG daily on all boxes and watching with VSAN Observer.
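For reference, DAVG can be watched live in esxtop on each host, and VSAN Observer is started from RVC on the vCenter server; a rough sketch (the cluster path and credentials are placeholders):

```shell
# On each host: interactive esxtop, press 'u' for the device view;
# DAVG/cmd is the average device latency in milliseconds
esxtop

# From RVC on vCenter, start VSAN Observer's live web UI:
# rvc administrator@vcenter
# vsan.observer ~/computers/YourCluster --run-webserver --force
```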