foodandbikes
Enthusiast

My VSAN nightmare

This is a long post, but hopefully someone will get some useful info out of it.

Hopefully I can get some useful info from replies as well.

I have deployed one VSAN cluster, and based on my experience with it so far I would be hesitant to recommend it again.

Let me walk you through my nightmare.

(3) Dell R530 servers, each with:

- PERC H730 Mini storage controller (has a queue depth of 925 last time I checked)

- 1 x 200G SSD

- 4 x 2T HDDs

- 10G networking with redundant switches

- 48G RAM

Dataset size

6T of data

Based on VMware's recommendation to start with SSD capacity equal to 10% of the dataset size, we started with 600G of SSD capacity (one 200G SSD per host).
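As a quick sanity check, here is that sizing math in a small Python sketch. The 10% figure and the capacities are just the numbers quoted above, not an official VMware sizing formula.

```python
# Back-of-the-envelope check of the "cache = 10% of dataset" rule quoted
# above. Numbers come straight from this post; nothing here is an official
# VMware sizing formula.

DATASET_GB = 6_000        # ~6T of data
CACHE_RATIO = 0.10        # "SSD capacity that is 10% of the dataset size"
HOSTS = 3

target_cache_gb = DATASET_GB * CACHE_RATIO      # 600 GB across the cluster
per_host_cache_gb = target_cache_gb / HOSTS     # 200 GB per host -> one 200G SSD each

print(f"target cache for the cluster: {target_cache_gb:.0f} GB")
print(f"cache needed per host:        {per_host_cache_gb:.0f} GB")
```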

Issue #1

Customer calls complaining of poor performance. VSAN benchmarked very well in the lab and the VMs were flying along for several weeks so I did not immediately think storage was an issue.

Looking into the performance issue I eventually found VSAN doing Component Resyncs and it had several hundred Gig remaining, with 4 hours left until completion.

During this time VM performance degraded so much that the customer sent employees home because email, files, print, applications, etc. were all effectively down. Storage latency went into the thousands of ms. Exchange complains above 20ms; imagine how happy it was at 2000ms.

I noticed the components doing a resync belonged to VMs that another engineer had made changes to earlier in the day. He had increased the virtual disk size since the VMs were running out of room. Expanding the disks of VMs has never caused an issue and is something we regularly do during business hours.

When a drive is expanded on VSAN, the drive is not simply expanded in place; here's the process as best I can tell.

1. Tech expands drive from 500G to 650G

2. VSAN creates 3 new components (a component is basically a chunk of data up to 256G in size)

3. VSAN then copies data from the 2 existing 256G components into the 3 new components.

By default, a component has a stripe width of 1, so a single disk gets hammered hard while the data is being read or written.

Fault tolerance is set so the data resides on 2 hosts, so the heavy I/O is happening on multiple disks across multiple hosts.

4. When that copy is done, VSAN deletes the original 2 components.

KEY THING TO NOTE:

1. You need double the disk space during an expansion. If you have a 500G disk you are expanding to 650G, you need enough free capacity for that 500G of data to be duplicated. If you watch free capacity on the VSAN while this is happening, space is slowly consumed as the copy runs, and once it completes there is a large jump in free space when the original components are deleted. If your disks are too large and there is not enough free space on the VSAN, you will not be able to expand your drive. A rough sketch of the math is below.
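Here is a rough Python sketch of that expansion math as I understand it. The ~256G component size and the 2-copy mirroring (FTT=1) are taken from the behavior described above, so treat it as an illustration of the process, not an exact accounting.

```python
# Rough estimate of what a virtual disk expansion costs on VSAN, based on
# the behaviour described above: components are chunks of up to ~256G, the
# object is mirrored on 2 hosts (FTT=1), and the old components are only
# deleted after the new ones are fully written.
import math

COMPONENT_MAX_GB = 256   # "a component is basically a chunk of data up to 256G"
MIRROR_COPIES = 2        # fault tolerance keeps the data on 2 hosts

def expansion_footprint(current_gb, new_gb):
    """Estimate component counts and peak extra raw capacity for an expansion."""
    old_components = math.ceil(current_gb / COMPONENT_MAX_GB)
    new_components = math.ceil(new_gb / COMPONENT_MAX_GB)
    # While the copy runs, the old and new components coexist on every replica,
    # so the existing data is temporarily duplicated.
    peak_extra_gb = current_gb * MIRROR_COPIES
    return old_components, new_components, peak_extra_gb

# The example from this post: growing a 500G disk to 650G.
old, new, extra = expansion_footprint(500, 650)
print(f"components per replica: {old} -> {new}")                  # 2 -> 3
print(f"extra raw capacity needed during the copy: ~{extra} GB")  # ~1000 GB
```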

Resolution to #1

After discussions with VMware, and within VMware, it was decided there was not enough SSD capacity, so the system was going to spinning disk too often, causing the slowness. At the time that sounded reasonable, but a resync copies all of the data, so it has no choice but to go to spinning disk regardless.

So we bought 3 new 200G SSDs and I started a project of adding an SSD to each server.

I couldn't get a clear set of instructions on how to do it with a 3-host VSAN, so I will lay it out here for anyone who might need it.

The final result will be 2 disk groups per server, each group with 1 200G SSD and 2 2T HDDs. The current state is 1 200G SSD and 4 2T HDDs.

1. Update firmware and drivers on everything in your server to meet the VMware HCL

2. Install the SSD drive into a host, and ONLY one host (See Issue #3 for why only one host)

3. Make sure the disk is in pass-thru mode on the storage controller

4. Edit the existing disk group on that host.

- Remove 2 of the HDDs from the existing group. You'll get the option to evacuate data to another disk or to just limp along.

- Evacuate data if you can; otherwise limp along. I had to limp along since that was my only option with a 3-node cluster.

5. Create a new disk group using the new SSD and the 2 HDDs you just removed.

- VSAN will start to resync data back to these disks. It does not check whether those components already exist, so it does a full copy of all the data they already contained.

6. When the resync is complete, repeat steps 1-5 on each remaining host. (The sketch below is a quick way to check the resulting disk-group layout.)
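If it helps anyone, here is a small pyVmomi sketch I would use to double-check the resulting disk-group layout on each host (one cache SSD plus its capacity disks). The vCenter hostname and credentials are placeholders, and the property path (configManager.vsanSystem.config.storageInfo.diskMapping) may vary slightly between vSphere versions, so treat it as a starting point rather than a supported tool.

```python
# Quick disk-group inventory per host, to verify the end state of the steps
# above (2 disk groups per host, each with 1 cache SSD and 2 capacity HDDs).
# vcenter.example.com and the credentials are placeholders; the property path
# used here is from the vSphere API and may vary between versions.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()   # lab shortcut; use proper certs in production
si = SmartConnect(host="vcenter.example.com",
                  user="administrator@vsphere.local",
                  pwd="********",
                  sslContext=ctx)
try:
    content = si.RetrieveContent()
    hosts = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True).view
    for host in hosts:
        vsan = host.configManager.vsanSystem
        if not vsan or not vsan.config or not vsan.config.storageInfo:
            continue
        print(host.name)
        for i, dg in enumerate(vsan.config.storageInfo.diskMapping, start=1):
            cache_gb = dg.ssd.capacity.block * dg.ssd.capacity.blockSize / 1e9
            print(f"  disk group {i}: cache {dg.ssd.displayName} (~{cache_gb:.0f} GB), "
                  f"{len(dg.nonSsd)} capacity disk(s)")
finally:
    Disconnect(si)
```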

Issue #2

After adding the SSD to one of the hosts and reconfiguring the storage group I figured I'd get to the next host the next day. I was hoping to be able to do them all in a single weekend.

About 24 hours after the first host was done, and before I started the second, we got a bunch of VSAN errors and VSAN on the host I had just updated had gone belly up, serving no data at all.

Working through the issue with VMware support, it was determined that the SSDs were running different firmware versions, which was the only explanation anyone could come up with for why VSAN had crashed.

Resolution #2

Update the disk firmware.

I updated the firmware on the SSDs and HDDs and it did a complete resync of data again. Recall that during resyncs VMs were basically useless, so it wasn't a pleasant experience for anyone.

Once that was sorted and it had run smoothly for a week, I proceeded to do the same steps on the remaining 2 hosts, one week apart.

Issue #3

We decided we needed to add a dedicated HDD to each host that is not part of VSAN. Since our hosts boot off SD cards, there are capacity issues and warnings that never go away unless temp files can be redirected to an HDD.

We ordered 3 500G HDDs from Dell. One of our engineers put them in today but did no configuration. 45 minutes later we started getting errors about VSAN being down on a host: "Virtual SAN device is under permanent failure". That can't possibly be good.

VSAN disk claiming is in manual mode, so I know the disk wasn't added to the disk group.

I had him immediately pull the 500G disks from all hosts. It was too late, however. Host #2 had lost both disk groups, and host #1 had lost one of its disk groups. That means some data components were inaccessible and VMs started crashing.

Resolution #3

Reboot host #2 since it was in the worst shape. It came up normally and the disk groups looked fine, but the data was out of sync. Data started doing a resync, so instead of rebooting host #1 I just let the resync finish. When it was done there were still ~50% of the components in a degraded state, and some that were inaccessible. Mind you, I had been on hold with VMware support for 1.5 hours waiting for some help. When the resync was done I decided to reboot host #1. It came up normally, all components are now showing normal, and it's doing a resync of a ton of data that will take an estimated 5 hours.

Remember Issue #1, poor performance during resync? I can definitively say adding the SSDs made little difference. Most of my VMs are currently powered off since latency is so high that they won't boot successfully. Some machines that are running show disk latency of 500-4000ms.

Cause of #3

I think it might be due to disk firmware mis-match, but won't know until VMware answers the phone and we can take a look at it.

-----------------------------

My key takeaways from my experiences:

1. If it ain't broke, don't touch it.

2. If you have to touch it, only ever work on a single host at a time, and do host changes a week apart.

3. Expand disks after hours or during slow times. The resync can affect VMs you did not make changes to.

4. Expect VSAN to crash when adding new disks to the host, thus only work on one host at a time. I am 2 for 2 on crashes when adding disks.

5. Update firmware on everything when making any physical changes, especially the disks themselves, which I think is often overlooked.

6. Get more SSD capacity than you think you need; cache never hurts anyone.

7. Never use a single disk group; multiple groups make managing and repairing easier.

8. Don't use 7200 rpm drives. They are generally fine, but during a resync you will wish you had gotten 10k or better.

9. Consider using stripe width greater than 1.

10. Use 4 hosts minimum if you can.

11. A resync won't cause slowness on all VMs, just the ones whose components share disks with the components being resynced.

9 Replies
AlexanderLiucka
Enthusiast

I think you have missed the recommendation of a minimum of 4 hosts. With 3 hosts you will always have problems. You need 1 spare host so you can do maintenance.

Also, I can't imagine how you worked with 3 hosts before you went to VSAN. What did you use for storage before VSAN?

I think you have to be very brave to do maintenance during working hours without spare capacity for that purpose.

With the current version of VSAN I think it is a great product compared to the other alternatives on the market. I have used Promise iSCSI, Open-E, StarWind, and Microsoft Windows for iSCSI.

Maybe after 2 or 3 more versions VSAN will be very solid. Right now it is a good product without enough best practices, documentation, monitoring, or troubleshooting tools.

I also see you were brave again to put 4 x 2 TB HDDs behind 1 x 200 GB SSD.

About the HDD RPM: I don't know how heavy your load is, but I'm almost sure 7200 RPM is enough for it. As you said, "you will wish you had gotten 10k or better," but the same will happen with faster HDDs. Maybe only with all-flash will you be happy!

And again: do not do maintenance during working hours if you don't have enough capacity for it. Always do maintenance on weekends.

Go and get VSAN SexiPanels | SexiGraf and be happy to see your VSAN status in "real time"; the stats are pulled every minute.

Lawrie201110141
Contributor

Hi, for my benefit can you tell me which version of vSAN was in use when running all of this, foodandbikes?

zdickinson
Expert

Good morning, I would agree with all of your recommendations. When I hear of these stories I see two problems. First, I wish 4 hosts were not just a recommendation but a requirement. The second is that IT seems to treat vSAN differently than their other storage. I know I did. I treated it more like my existing VMware environment: vMotion and do maintenance during the day. I realized early on that the problems I was having with vSAN were due to my state of mind. I was doing things to it that I would NEVER do to my EMC array during business hours. Treat your vSAN like storage, not compute. Thank you, Zach.

foodandbikes
Enthusiast

6.0 at the start, currently on 6.1.

After talking with support today, they have no idea why the systems had issues yesterday after installing the new disks. There is a new SSD firmware version out that I should upgrade to, but the thought was the crash should not have happened.

We also discussed whether to even have the drive connected, and the recommendation is to not have any non-VSAN disks connected to the same storage controller as the VSAN disks.

The Impact/Risks section of the KB below describes exactly what happened to our servers: disks erroneously reported as failed.

VMware KB: Considerations when using both VSAN and non-VSAN disks with the same storage controller

The original thought was to use the new disks for VSAN trace files, but that's no longer an option unless we add a second controller. We are now having to rethink how we want to handle the trace files.

I agree with the comment that VSAN needs to be treated like a real SAN and not a server. I would never think of mixing disk speeds and types in a regular SAN (Nimble, NetApp, EMC) unless it was specifically designed for it, so why do it in VSAN?

zdickinson
Expert

It sounds like your use case for one HDD outside of vSAN is a perfect fit. We do the same thing. It's best if it is not on the same controller. If it is, make sure it's in the same mode (RAID or passthrough) as the other disks. And then only place trace and log files on it.

Disk types and speeds are mixed all the time in arrays.  You could easily have the controller cache, FAST SSD Cache, and then the actual data sitting on 7200 SATA drives on an EMC VNX.  To extend it further, you can do tiered storage at the persistence layer.  Mix SSD, 15k, 10k, etc...

Another issue I've raised before is that vSAN is made to seem easy to implement, maintain, upgrade, etc. I think that's incorrect; there seems to be more skill, planning, and care needed for vSAN than for a typical 3-hosts-connected-to-an-array solution. I would only recommend it for DR, VDI, non-critical apps, etc. unless there is a lot of in-house expertise. Thank you, Zach.

AlexanderLiucka
Enthusiast

zdickinson,

I have more hope for vSAN. I've been evaluating it for around a month, and I have put vSAN in very bad situations and haven't lost any data. The only big problem was short downtime for the VMs running on it. Also, my biggest problem was with Linux-based VMs, which are very fragile when it comes to disk latency; the Windows VMs are more durable. :)

As I said, maybe after two more versions vSAN will be a very strong competitor on the market.

elerium
Hot Shot

There are issues with the H730 RAID controller to be aware of. I'm running 2 different VSAN clusters, and the H730s are the only thing that still causes me some pain. From my own testing, if you run in RAID0 mode you will likely be quite stable if you disable the RAID card caches. In HBA mode (the only supported HCL mode), I've run into every KB bug possible and still today run into host crash instabilities roughly every 35 days. Rumor has it a new H730 ESXi driver/firmware will be out soon that fixes most of these issues, but it is still undergoing testing/qualification. In the meantime, pay close attention to the KB articles below and make sure you are updated to the recommended HCL firmware/drivers for the H730, backplane, BIOS, your SSDs, and your HDDs.

Things I've observed include disk resets issued by the RAID controller, which cause all the drives to drop off, and PSODs (a reboot will fix this) that show up in the iDRAC Lifecycle Controller logs, in addition to the failures described in the KB articles below.

VMware KB: Avoiding a known drive failure issue when Dell PERC H730 controller is used with VMware V...

VMware KB: Using a Dell Perc H730 controller in an ESXi 5.5 or ESXi 6.0 host displays IO failures or...

VMware KB: Deployment guidelines for running VMware Virtual SAN and VMware vSphere VMFS datastores o...

I would say it's more a problem with Dell and the H730 card/driver/firmware than it is with VSAN, but still an issue if this is the hardware you're running.

AnatolyVilchins

@Alexander Liuckanov, thanks for mentioning us. Just wanted to confirm that VMware VSAN is a really great product and only needs a couple more iterations/versions to become brilliant.

I'm just curious, so may I ask what exactly you did not like about our product? You should definitely give it another try, since we have a great free product (https://www.starwindsoftware.com/starwind-virtual-san-free) that lets you turn a couple of old or decommissioned servers into a fault-tolerant SAN or NAS with SMB 3.0 or NFS on top.

BTW, how did you manage to use Microsoft iSCSI at all? 🙂 It is not on the VMware HCL, has half-minute delays on the I/O path even under light load, and basically barely works at all.


It's hard to add anything reliable to the RPM discussion since we don't have any information about the required IOPS/workload, but I still have to confirm that 10k or even 15k disks will not give you much more speed with only 4 of them.

Kind Regards, Anatoly Vilchinsky
Anton_Kolomyeyt
Hot Shot

Interesting! We've seen quite a different picture: we had to increase the default Windows I/O timeout to keep Windows VMs from hitting the "Delayed Write Failed" issue (I have to mention this is a stress environment, and I doubt anybody has anything close to it in production; there's simply no reason to).

The upcoming version of VMware VSAN will get more sophisticated QoS, so the situation will get MUCH better. ;)

--

I have more hope for vSAN. I've been evaluating it for around a month, and I have put vSAN in very bad situations and haven't lost any data. The only big problem was short downtime for the VMs running on it. Also, my biggest problem was with Linux-based VMs, which are very fragile when it comes to disk latency; the Windows VMs are more durable.
