VMware Cloud Community
gmtx
Hot Shot

VSAN on Dell 730 - Ready for Production?

Getting ready to pull the trigger on a Dell server cluster upgrade to R730s and VSAN. I can't help but notice all the issues with Dell PERC controllers and PSODs, NVMe drivers and poor performance, and extended delays getting updated drivers on the HCL. I'm having serious doubts now about VSAN replacing my current (extremely stable) SAN, but I need better disk performance and capacity increases that will result in an expensive SAN forklift upgrade if I don't make the switch.

Is there a general consensus out there about the stability of VSAN on Dell hardware at this point? Is it just too early to be thinking about replacing my SAN with VSAN on Dell?

Thanks for any thoughts on my situation.

14 Replies
zdickinson
Expert

Good morning, I hope all is well.  I have a hybrid vSAN in DR and a traditional SAN (VNX2 5400) in production.  What I have learned with vSAN in DR is that I love converged infrastructure... and that I hate to manage it.  In my case (team of 2, one SAN to rule them all), I would not replace my SAN with vSAN.

Or would I.

Our next infrastructure refresh is planned to be VxRail, which will run on Dell hardware and use vSAN.  I just don't want to manage the converged infrastructure for all the reasons you mentioned.  For my use case I would not implement vSAN in production.

Or would I.

If I didn't have the budget for VxRail, I would implement vSAN.  Everything on the HCL.  All flash.  6 nodes.  RAID 6.  Dedupe, Compression, 10 Gb for vSAN.

Hope that helps, Zach.

gmtx
Hot Shot

Thanks, Zach. You sound conflicted. :) But I get what you're saying and appreciate you taking the time to respond.

elerium
Hot Shot

I had very few issues after the last driver release and the fixes provided for VSAN 6.1. I would say VSAN 6.1 is very stable and production ready if you use the HCL drivers and firmware and the timeout settings in the KB. I ran on this for many months with zero issues.

With VSAN 6.2 I'm seeing problems with my NVMe setup. It may be specific to Intel NVMe, and the workaround is to disable the checksum feature in 6.2. If you use any third-party integrations with vSphere (I use Zerto, for example), you may also run into odd issues in VSAN 6.2, as some .vmdk operations seem to be a problem for third-party apps (this isn't specific to the R730/H730). Anyone else on VSAN 6.2 can check by downloading a .vmdk file from the datastore browser (full or web client): all you get is the header, not the content, which in turn can cause problems for plugins that need to access the underlying VMDKs.

I would say you can upgrade to VSAN 6.1 and not run into any major issues.

gmtx
Hot Shot

Looking to use Intel NVMe 3700s for the cache layer, so good to know about the checksum issue. Interesting to see Intel's response to this as pretty much "trust our drives, you don't need checksum".

I've been running 400GB 3700s in a VDI environment for almost two years now (starting with Intel's beta NVMe driver), and so far they've been pretty much flawless, so perhaps they have a point. ;)

elerium
Hot Shot

Yeah, I don't buy Intel's "you don't need checksum" statement, as it doesn't cover other possible corruption scenarios, such as on the capacity drive side or at the network layer.

I unfortunately don't have any other enterprise SSDs (NVMe or SATA) to test with so I don't know if this is an issue with NVMe, checksum implementation or Intel driver/hardware. I'm using P3600 and P3700 for Intel NVMe.

adamroffler
Contributor

Watch out for the PERC H730P Mini controller that Dell will ship with those chassis. We recently purchased and set up a 6-host VSAN-enabled cluster using Dell PE R730xd servers with Intel Xeon E5 processors. They also shipped with the PERC H730P Mini RAID controller. When they shipped in August last year, the controller firmware was at version 25.2.2-0004 and the backplane was at 1.09.

In January we experienced a major outage in which the disk groups on three of the hosts were put into a permanent error state by VSAN. VSAN did this proactively in response to adapter resets of the H730P controllers on the hosts. It was a little-known issue at that point, and Dell and VMware were working on a major bug-fix patch for it. We upgraded the firmware to the latest version available in January (basically an emergency patch made available by Dell). We were also required to upgrade the MEGA_RAID_Perc9 drivers on the ESX hosts themselves. This stabilized the VSAN array, but the hosts still had some random PSOD issues.

Then in May a new, more stable release of the patch was pushed out, and we had to do the firmware upgrade on the H730P cards again. No more PSOD issues, and VSAN works well again, but it was still not a fun experience! :(

zdickinson
Expert

Bummer, sorry you had to go through that. That's why, if you have "pet" applications and one storage/compute cluster to run all workloads, it's hard to recommend vSAN. Also, from my experience and observation, choose an LSI card. Thank you, Zach.

elerium
Hot Shot

Going forward, I think it will be harder to recommend an LSI-branded card, as none of them are on the HCL for all-flash clusters. It does seem like VMware is putting more work into ensuring that controllers provided by large hardware OEMs like HP/Cisco/Dell will work with VSAN, and these controllers are usually rebranded LSI controllers anyway. The Dell controller problems have been caused by bugs in the LSI 3108 chipset (Required VSAN and ESXi configuration for controllers based on the LSI 3108 chipset (2144936) | VMwar...), which after a year of bugs finally seems to be stable.

A13x
Hot Shot

Similar experience here, with many PSODs. One host came up and, as soon as it did, BOOM, PSOD, and all hosts were up and down like yo-yos. I performed a firmware and driver update, which seems to have stabilised things. I have not had any further issues to date on Dell 730s; however, I keep checking the forums, as the PSOD yo-yo with VSAN was something out of a horror movie.

wreedMH
Hot Shot

We have 64+ R730xds with the HBA330 and they have been solid as a rock. They are running 6.5 U1+ builds.

A13x
Hot Shot

In the early days of version 6, the buggy firmware on the RAID controller crippled most of us. Now, with ESXi 6.5 and more up-to-date firmware, they are a lot more stable.

Vdiallstar
Contributor

I'm running Dell R730 and R640 clusters on 6.6 U1 and they are rock solid. Is this a DIY approach, or have you purchased Ready Nodes through Dell? The HBA330 is the best controller, IMO. It does no harm to run HCIBench on the environment to burn it in and flush out any potential issues. Also, once Health and Configuration Assist are green, you are good. If you want to reach out directly, I'm happy to chat.

wreedMH
Hot Shot

We just ordered 80 Dell C6420s with the HBA330. They are going to go into 4x 20-node vSAN clusters, or a breakout close to that. We have been running a POC on them and they also seem rock solid.

4 nodes in a 2U form factor! Oh, and each node has 1 TB of physical memory. :)

johandijkstra
Enthusiast

We had 10 Dell R730s with the PERC controllers. Lots of issues...

Since we replaced the PERC controllers with the 330 Mini controllers, they have been solid as a rock!
