Isalmon
Enthusiast

Experiencing random high disk latency with Dell H730P controller and VSAN

So we have been experiencing random periods of poor performance on our Dell R730 VSAN cluster. We are using the H730P in HBA mode, with two 400 GB SSDs and four 600 GB 15K SAS magnetic drives in each box. What happens is we notice poor performance on VMs and the vSphere Web Client (appliance), and when we SSH into each server, run esxtop, and check the disks, the DAVG is all over the place, from 30 to over 1000 ms. We opened a ticket with VMware and they suggested updating the BIOS and firmware; Dell had a firmware update to deal with high I/O latency, FW update 25.2.2-0004. We updated the firmware and all seemed OK, then the high disk latency randomly pops up on any one of the servers.
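
In case it helps anyone trying to catch this, a minimal way to capture the spikes is esxtop in batch mode and reviewing the DAVG/cmd column afterwards (the interval, sample count, and output path here are just example values):

esxtop -b -a -d 10 -n 360 > /tmp/esxtop-disk.csv

That logs all counters every 10 seconds for an hour, so the latency excursions show up with timestamps instead of having to catch them live.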

We are running ESXi 5.5 U2, build 2068190.

I know this card was only recently certified; is something amiss? The build version? Crappy firmware? I am prepared to give up on pass-through and redo everything with RAID 0. We have a similar environment using Dell R720s with H710 cards and no issues whatsoever.

36 Replies
maxduncan
Contributor

Thanks, Salmon.

Sounds like we're similar; to recap, I'm running:

- H730 Minis in HBA mode

- Firmware 25.2.2-0004

- Driver 6.901.57.00.1vmw

- the 1.09 backplane update

I have been running this combo for a few months now and it had been stable, then it went completely sideways without any changes to the environment and the latency went crazy. The only way out of it is to start rebooting, which really sucks, I might add.

Any additional thoughts are appreciated. Thanks, guys.

jonretting
Enthusiast

At this point you really need to gather some logs. I would deploy a Log Insight appliance so you can rule things out beyond the vSAN layer.

maxduncan
Contributor

Hi Jon, thanks for the advice. I haven't used the Log Insight appliance; will it pull vSAN-specific logging?

jonretting
Enthusiast

It pulls everything in vSphere, so you get host-level hardware and software logs, your vSwitches/netstack, guests, and the VCS itself. If you set up a syslog relationship with your switches, you can get that as well. By default, ESXi hosts put themselves in verbose logging mode, so that will help you out of the gate. If you're seeing contention for switching bandwidth, you can then look into a NetFlow setup. Log Insight also creates a nice default dashboard layout, allowing you to instantly drill down to, say, "VSAN: Errors".
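
Once the appliance is up, pointing a host at it is just a couple of esxcli calls (the hostname below is a placeholder for your Log Insight address):

esxcli system syslog config set --loghost='udp://loginsight.example.local:514'
esxcli system syslog reload
esxcli network firewall ruleset set --ruleset-id=syslog --enabled=true

The last line opens the outbound syslog firewall rule, which is easy to forget.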

Isalmon
Enthusiast

Max,

Minus the backplane update, we are similar. Dell specified that the backplane update affected users with more than 8 drives. We only have 6 per box at the moment, so I decided to skip it.

By the way, I forgot to mention this high I/O problem was not limited to VSAN. We had two Windows 2012 R2 730xd boxes with the same PERC controllers, configured with RAID and used as file servers with 10 GbE networking between them, and we could not copy a 5 GB file. We thought we were going crazy.

There was a serious issue with firmware 25.2.1.0037 and the H730P controller. No issues since.

That should bring some comfort that your issue may be resolved.

jonretting
Enthusiast

In times like this I have found it very useful to script a VSAN Observer gatherer, so you can quickly launch it and collect stats on the VSAN during times of uncertainty.
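
As a rough sketch of what that looks like from RVC (the vCenter address and cluster path below are placeholders):

rvc administrator@vsphere.local@vcenter.example.local
cd /vcenter.example.local/Datacenter/computers
vsan.observer MyCluster --run-webserver --force --max-runtime 2

That serves the live stats UI (port 8010) for two hours; if I remember right, swapping in --generate-html-bundle /tmp writes an offline bundle you can keep instead.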

maxduncan
Contributor

Thanks for all of the recommendations, guys. Are you of the opinion that 6.0 will fix stability problems like this, or at least make similar problems more obvious? I just saw the vSAN health check plugin for 6.0 and it looks like it checks all the boxes. For the amount of time and frustration spent on vSAN, I could be sitting on a stack of EQLs right now ;-|

jonretting
Enthusiast

Unless you know what your problems are now, I wouldn't recommend upgrading. VSAN 6 is noticeably faster, and VCS 6 is certainly better. You could just be having moments of contention, plus some sort of hardware issue. Personally I don't like upgrading anything; if it's that important, it should be rebuilt from a gold image. That's my best practice: an ESXi host is quickly killed and re-installed with the latest version, settings and policies are pushed, and storage is destroyed. Why bother dealing with data on the host when you can move it off? So in essence, yes, a rolling clean install of 6 for each host makes sense. But what if an underlying issue makes a clean install impossible, and an in-place upgrade even more unreliable? You could end up in downtime. On a slightly related note: most don't like the expense, but having enough reserve NFS/iSCSI capacity to accommodate a full VSAN migration gives you incredible nimbleness in all situations.

Isalmon
Enthusiast

What SSDs are you running?

I just discovered that I have Lite-On SSDs that don't seem to be on the VMware VSAN HCL for ESXi 5.5 U2 hybrid configurations.

They are on the list for ESXi 6.0 all-flash arrays.

Isalmon
Enthusiast

This was resolved by finally upgrading the firmware on the SSDs using the Dell Nautilus utility. The Lite-On drives had a bug that caused the poor performance.

Once they were updated, the latency disappeared. We were not happy with Dell, because these were sold to us as VSAN nodes and the drives were not on the HCL. We got replacement SSDs (Intel 3700) and have had no issues.

zdickinson
Expert

I'm glad you found a solution.  I wanted to confirm that you were delivered a vSAN ready node that included hardware not on the vSAN HCL.  Is that correct?  Thank you, Zach.

Isalmon
Enthusiast

That is correct, though I don't know what defines a VSAN ready node for Dell. For the 13G servers I only see a Dell R630. I do know the servers were spec'd for VSAN 5.5/ESXi 5.5 U2.

When I updated the firmware, I noticed we had Lite-On drives. I plugged them into the HCL and they were not on the list for ESXi 5.5 U2 with hybrid disks. They only showed up for ESXi 6.0 all-flash arrays.

I confronted them about it, and I think now they will do a deeper check of all components. While the update did fix the issue, I pressed Dell to replace the drives with SSDs that are on the HCL.

We replaced them with Dell-branded Intel 3700s.

The issue was a serious firmware flaw in the garbage collection process: under heavy I/O the drives eventually stop responding.

This was firmware version LPCF11XC. Dell is shipping the R730/R730xd with these drives or with Intel 3610s (also not on the list for ESXi 5.5 U2). Updating to LPCF11XU with the Dell Nautilus SSD update utility corrected the problem, but there was no way we were going to keep these; the specs don't even compare to the Intel 3700s (the drives we got with our 12G servers).
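
For anyone who wants to confirm their drive firmware before booting the Nautilus ISO, the revision is visible straight from the host (the device ID below is a placeholder):

esxcli storage core device list -d naa.xxxxxxxxxxxxxxxx | grep -i revision

Run it without -d to dump every device; the Revision field is the drive's firmware version.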

So in the end, though Dell claims there was a documented issue with firmware version 25.2.2.0037 for the H730P, the SSDs were the deciding factor here.

zdickinson
Expert

My understanding is that VMware provides the vSAN ready node configs:  https://partnerweb.vmware.com/programs/vsan/Virtual%20SAN%20Ready%20Nodes.pdf  Did you order from that?  Thank you, Zach.

Isalmon
Enthusiast

We worked with Dell and assumed that, based on the specs we gave, the VMware-focused reps would put together a node based on this list. Our past order, when VSAN first went GA, was spec'd from this list: R720/H710 controller, etc.

When we ordered, the 13G servers were out and you could no longer purchase R720s, so they put together certified hardware for 13G servers. I can tell by that list it was not a VSAN ready node.

There are no 13G configs on that list for ESXi 5.5, only 6.0.

We got an R730, not an R730xd. The gamble with Dell, on the VSAN ready node list or otherwise, is the line item "Solid State Drive SAS Mix Use MLC 12Gbps 2.5in Hot-plug Drive": Dell substitutes in Dell-branded SSDs from various manufacturers, some of which are not on the HCL. A big oversight if you ask me.

maxduncan
Contributor

Hi everyone, if you're running the Dell-provided Lite-On SSDs, run the Nautilus utility and flash them ASAP.

I just ran it on Monday; performance is now acceptable and everything appears to be stable. STOKED!

Anyone have some performance metrics from CrystalDiskMark that you can share, based on our similar configurations?

I'm seeing 330/240 MB/s on the 2 GB sequential test.

Isalmon
Enthusiast

Hey Max, I tried to reach out to you when I found out about it almost two weeks ago. Glad you resolved it. We did not keep those Lite-On drives, as we are on ESXi 5.5 and they were only certified for ESXi 6.0 all-flash.

Hope your troubles are over.

Ian

jonretting
Enthusiast

What's your storage policy for that benchmarked VM's disk? What stripe width? (There's a quick way to check the on-disk layout; see the RVC note after my numbers below.)

Lowjax Cluster updated specs, and some performance details. | LowJax

FTT=1 Stripe-Width=1

ParaVirtual SCSI Adapter

Server 2012 R2 w/ 2 vCPUs and 4 GB memory (probably should have used four vCPUs)

CrystalDiskMark 2 GB sequential (avg of four fixed test runs)

480-700 MB/s Read

300-550 MB/s Write

AS SSD Benchmark Sequential test (avg of ten tests during component re-syncs and VDP backups)

1000-1700 MB/s Read

600-1300 MB/s Write
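
And the RVC check I mentioned: vsan.vm_object_info dumps a VM object's RAID tree, which shows the stripe components per mirror (the VM path below is a placeholder):

vsan.vm_object_info /vcenter.example.local/Datacenter/vms/BenchVM

It's the quickest way I know to verify the stripe width actually took effect on disk.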
