VMware Cloud Community
VirtualizingStu
Enthusiast

Homelab vSAN woes

Hello Everyone,

I have been having some vSAN issues where drives will show as degraded when in actuality they are fine, confirmed both in the LSI controller and via hardware status. The drives (SSDs) seem to become degraded once I/O hits the disks (e.g. creating a VM). All hosts have 2 x SSD, with one of those SSDs tagged as an HDD. I wrote a post about it back in March here, which has screenshots and more detail. Life has been very busy (chasing my 17-month-old around :)), but I finally have time to continue troubleshooting. Everything is on the vSAN HCL except the actual SSD drives.
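
In case it helps anyone reproduce the setup: the capacity SSDs are tagged as HDDs the usual way for 5.5, via a SATP claim rule, roughly as below (the device ID is a placeholder, not my actual disk):

esxcli storage nmp satp rule add --satp=VMW_SATP_LOCAL --device=naa.<device_id> --option="disable_ssd"
esxcli storage core claiming reclaim -d naa.<device_id>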


Below are the hardware specs:

Hosts:

Supermicro Servers (X9SCM-F)  x 2:

  • CPU: 2 x Intel Xeon E3-1230v2 “Ivy Bridge”
  • Motherboard: 2 x Supermicro X9SCM-F
  • RAID Controller: 2 x LSI Internal SATA/SAS 9211-8i
  • Memory: 2 x Kingston 32GB Kit DDR3 1600MHz PC3
  • Disks: 2 x Lexar Echo ZX 16GB
    • SSD: 4 x SanDisk Ultra II 240GB
  • Network Cards:
    • 2 x HP Infiniband DDR Dual Port HCA Adapter 20Gbps
    • 2 x HP NC360T Dual Port PCI-e Gigabit Card
  • Power Supply: 2 x Seasonic 400W 80 Plus Platinum Fanless ATX12V/EPS12V

Supermicro X10SAE-O:

  • CPU: Intel Xeon E3-1231 “Haswell”
  • Motherboard: Supermicro X10SAE-O
  • RAID Controller: LSI Internal SATA/SAS 9211-8i
  • Memory: Kingston 32GB Kit DDR3 1600MHz PC3
  • Disks:
    • Lexar Echo ZX 16GB
    • SSD: 2 x SanDisk Ultra II 240GB
  • Network Cards:
    • HP Infiniband DDR Dual Port HCA Adapter 20Gbps
    • HP NC360T Dual Port PCI-e Gigabit Card
  • Power Supply:
    • SS-520FL2 520W ATX12V / EPS12V 80 PLUS PLATINUM

Any assistance would be greatly appreciated.

Thanks

Dave

15 Replies
VirtualizingStu
Enthusiast

I am going to see if I can get 6 different SSDs that are on the HCL and test.

zdickinson
Expert

I would be curious about two things.

1.) What is the behavior if the capacity tier is an actual HDD instead of an SSD marked as one?
2.) What is the behavior if this is set up as an all-SSD vSAN in v6?

Thank you, Zach.

VirtualizingStu
Enthusiast

Hello Zach,


1. I will put in three HDDs tomorrow evening and try it again.

2. I can try that, but my HBA controller (9211-8i) was removed from the HCL, last I checked.

I will report back. Thank you for the reply.

beeguar
Enthusiast

Mine isn't on the HCL for vSAN 6 either, but I'll probably be trying the upgrade soon.

Considering vSAN is a software abstraction layer for the storage, as long as it's functional on 5, I'm not aware of anything in version 6 that would make it less so.

jonretting
Enthusiast

Your systems are probably choking on the queue depths needed at the flash-disk level. Also, what magnetic disks are you using? If they aren't SAS, those will choke on the queues as well. I have also always had problems with VSAN 5.5 when marking an SSD as magnetic and placing it in a VSAN: usually tons of sense errors and frequent drops/disconnects from the VSAN. I would recommend switching to an Intel enterprise PCIe NVMe SSD and some SAS magnetics. If you upgrade to VSAN 6 you can go all-flash, but trying this in 5.5 will end in tears. Cheers
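
If you want to see what queue depth your devices are actually getting, something like this shows it per device (the device ID is just an example); esxtop shows the same under d (adapters, AQLEN column) and u (devices, DQLEN column):

esxcli storage core device list -d naa.<device_id> | grep -i "queue depth"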

ChrisKuhns
Enthusiast

I too have had issues with marking the SSDs, but I have never once had them drop or disconnect from the VSAN. Also, I haven't had any issues with the magnetic disks. The 7.2K SATA disks have performed perfectly fine, but that may be because the SSDs I have are top tier.

jonretting
Enthusiast

The dropping I have had with SSDs marked as magnetic usually surfaces during boot storms or mass storage-policy changes. All in all I have never had a problem with SATA, except for crazy latencies. And you are totally right: your top-tier SSDs make all the difference in the world. Cheers

cheesyboofs01
Contributor

Not to hijack your thread, but I am having a similar issue.

I have been reluctant to post anything because people tend to just wave the HCL at you.

I have three HP Gen8 MicroServers. Each server has a 500GB Crucial SSD and a 4TB 7.2K WD Red Pro, offered out as RAID0.

I have installed ESXi 5.5 U3, set up a cluster, and enabled and licensed VSAN.

I had to downgrade the hpvsa driver to scsi-hpvsa-5.5.0-88OEM.550.0.0.1331820.x86_64.vib to get the B120i to play properly, as I understand HP broke the driver.
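
For anyone else who has to do the same downgrade, the sequence was roughly this (host in maintenance mode first; the /tmp path is just wherever you upload the vib):

esxcli software vib remove -n scsi-hpvsa
esxcli software vib install -v /tmp/scsi-hpvsa-5.5.0-88OEM.550.0.0.1331820.x86_64.vib --no-sig-check
reboot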

All the drives are seen fine by the server under iLO and SSA.

The problem is that everything is hunky-dory until I try to use the new datastore, at which point the SSDs will randomly drop offline and the transfer fails. Sometimes one SSD, sometimes two, sometimes all three.

I can tear down the datastore and instantly recreate it and the disks are all picked up.

Can't help thinking this is related to that B120i driver I had to downgrade.

(Screenshots attached: esx-a.jpg, esx-b.jpg)

Bit stumped what to do now, though! I don't want to throw any more money at it. I could try upgrading to ESXi 6, as I know VSAN was overhauled, but it's a lot of effort just to rule it out, and there is still the driver issue.

jonretting
Enthusiast

Well, firstly I would say your flash disks are not up to the task. Besides that, they shouldn't be dropping. Do you have logs for the event? Any sense error codes? There is probably a steady stream of them.
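
If you do have them, vmkernel.log is where the sense codes land; something like this should surface the recent ones:

grep -i "sense data" /var/log/vmkernel.log | tail -20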

You may want to try placing the flash disks on a different controller (if possible) and see if the problem persists. Does the drop happen on each host?

The basic first steps would be to check your firmware (all around) and make sure you have the appropriate storage driver. When I say all around, I mean motherboard, HBA, and backplanes. When disks are dropping, it's usually fixed via the right firmware/driver combo.
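
A quick way to confirm which driver each HBA is bound to and which driver vib version is installed (hpvsa is just the example here):

esxcli storage core adapter list
esxcli software vib list | grep -i hpvsa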

Also, if you can, try replacing the Crucial SSDs with something else. I am running 4 x Intel 750 Series 400GB NVMe disks: a great price point and outstanding performance. They are not on the HCL or the VSAN HCL, but so far they are an exceptional value. I am assuming you're not using the Crucial SAS SSDs but the consumer SATA models. Even if you get "stable", your storage hardware won't deliver anything usable. Expect very high latencies and very low IOPS, and it will get crazy worse during VSAN object operations. The VSAN HCL is there to help you choose the right hardware. All in all, consumer flash doesn't suffice, and without it SATA magnetics perform terribly. Also, four hosts is really the minimum for a VSAN cluster.

Good Luck!

jonretting
Enthusiast

Oops, almost forgot... Try re-conditioning your SSD drives for each host. Put VSAN into "Manual" mode. Start with the first host: evacuate all the data, and then boot that host into a Linux live distro. Use "sudo fdisk -l" to find the path to your SSD (e.g. "/dev/sdb"). Then use "parted" to re-initialize the drive clean:

% sudo parted /dev/sdb
(parted) mklabel msdos
(parted) mklabel gpt

That will clear all partitions.

Boot back into ESXi and add the host back into the VSAN using Disk Management. Verify the host is now part of the VSAN and that all the hosts are communicating with each other (look for a yellow exclamation point on the host icon). Take the host out of maintenance mode, and repeat the process for the next host in the cluster.
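
If you prefer the shell to the Web Client for that verification step, these should show the host's cluster membership and the disks VSAN has claimed:

esxcli vsan cluster get
esxcli vsan storage list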

I have fixed various VSAN issues with drives dropping by using this procedure.

cheesyboofs01
Contributor

Thanks for your comments.

As the OP's title suggests, this is a home-lab setup also. I think 4 x PCIe flash cards may be a little overkill.

I think the problem (for me) was the almost non-existent queue depth on the B120i RAID controller. I have now installed 3 x H220 SAS host bus adapters in my three Gen8 MicroServers, and now they seem to be playing ball.
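
If anyone wants to see that difference for themselves, the adapter queue depth is visible in esxtop (press d for the disk adapter view and compare the AQLEN column between the two controllers):

esxtop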

Cheers

VirtualizingStu
Enthusiast

Apologies for the delay. Over the last couple of days I updated my homelab to vSphere 6 (thanks to EVALExperience) and am happy to report vSAN is running beautifully over the 20Gbps InfiniBand network. I was able to create and clone VMs, and the SSDs did not go into an unhealthy status. Thanks, everyone, for your input and suggestions!

bilbobagginz
Contributor

VirtualizingStu, have you installed ESXi 6.0 on the X9SCM-F?

I am currently setting up a home lab and am wondering which version I should use.

VirtualizingStu
Enthusiast

Yep, currently have the latest version of ESXi 6.x running in the homelab with no issues. ;)

bilbobagginz
Contributor

Thanks! I was wondering whether to stay with 5.5 on my X9SCM-F, because I have found no info about people using 6.0 on it.
