Hello Everyone,
I have been having some vSAN issues where drives show as degraded when in actuality they are fine, confirmed both in the LSI controller and via hardware status. The drives (SSDs) seem to become degraded once I/O hits the disks (e.g. creating a VM). All hosts have 2 x SSD, with one of those SSDs tagged as an HDD. I wrote a post about it back in March here which has screenshots and more detail. Life has been very busy (chasing my 17-month-old around), but I finally have time to continue troubleshooting. Everything is on the HCL for vSAN except the actual SSD drives.
Below are the hardware specs:
Hosts:
Supermicro X10SAE-O:
Any assistance would be greatly appreciated.
Thanks
Dave
Apologies for the delay. Over the last couple of days I updated my homelab to vSphere 6 (thanks to EVALExperience) and am happy to report vSAN is running beautifully over the 20Gb InfiniBand network. I was able to create and clone VMs, and the SSDs did not go into an unhealthy status. Thanks everyone for your input and suggestions!
I am going to see if I can get 6 different SSDs that are on the HCL and test.
I would be curious about two things.
1.) What is the behavior if the capacity tier is an actual HDD instead of an SSD marked as such?
2.) What is the behavior if this is set up as an all-SSD vSAN in v6?
Thank you, Zach.
Hello Zach,
1. I will put in three HDDs tomorrow evening and try it again.
2. I can try that, but my HBA controller (9211-8i) was removed from the HCL, last I checked.
I will report back. Thank you for the reply.
Mine isn't on the HCL either for vSAN 6, but I'll probably be trying out the upgrade soon.
Considering vSAN is a software abstraction layer for the storage, as long as it's functional on 5.5, I'm not aware of any version 6 features that would make it less so.
Your systems are probably choking on the queue depths needed at the flash-disk level. Also, what magnetic disks are you using? If they aren't SAS, those will choke on the queues as well. I have also always had problems with vSAN 5.5 when marking an SSD as magnetic and placing it in a vSAN: usually tons of sense errors and frequent drops/disconnects from the vSAN. I would recommend switching to Intel enterprise PCIe NVMe SSDs and some SAS magnetics. If you upgrade to vSAN 6 you can use all-flash arrays, but trying this in 5.5 will end in tears. Cheers
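If you want to see what queue depth your controller actually exposes before blaming it, something like this from the ESXi shell should show it (standard esxcli namespaces; the grep pattern is just a convenience and may need adjusting for your device names):

```shell
# Per-device view: "Device Max Queue Depth" is what the HBA driver
# advertises for each disk vSAN will be hammering.
esxcli storage core device list | grep -E "^(naa|t10|mpx)|Device Max Queue Depth"

# Per-adapter view: shows which driver each HBA is bound to,
# useful for cross-checking against the vSAN HCL.
esxcli storage core adapter list
```

Entry-level SATA AHCI controllers often report a depth of 31-32 (or less per device), which is where the choking under vSAN I/O tends to come from.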
I too have had issues with marking the SSDs, but I have never once had them drop or disconnect from the vSAN. Also, I haven't had any issues with the magnetic disks. The SATA 7.2K drives have performed perfectly fine, but that may be because the SSDs I have are top tier.
The drops I have had with SSDs marked as magnetic usually surface during a boot storm or mass storage-policy changes. All in all, I have never had a problem with SATA, except for crazy latencies. And you are totally right: top-tier SSDs make all the difference in the world. Cheers
Not to hijack your thread, but I am having a similar issue.
I have been reluctant to post anything because people tend to just wave the HCL at you.
I have three HP Gen8 MicroServers. Each server has a 500GB Crucial SSD and a 4TB 7.2K WD Red Pro, offered out as RAID 0.
I have installed ESXi v5.5 U3, set up a cluster, and enabled and licensed vSAN.
I had to downgrade the hpvsa driver to scsi-hpvsa-5.5.0-88OEM.550.0.0.1331820.x86_64.vib to get the B120i to play properly, as I understand HP broke the driver.
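For anyone hitting the same B120i problem, a driver downgrade like this usually follows the same shape on ESXi (a sketch, not the poster's exact steps; the /tmp path is an assumption for wherever you copied the .vib):

```shell
# Put the host into maintenance mode before swapping storage drivers.
esxcli system maintenanceMode set --enable true

# Remove the current hpvsa driver, then install the older OEM build.
esxcli software vib remove -n scsi-hpvsa
esxcli software vib install -v /tmp/scsi-hpvsa-5.5.0-88OEM.550.0.0.1331820.x86_64.vib

# The driver change only takes effect after a reboot.
reboot
```

Afterwards, `esxcli software vib list | grep hpvsa` should confirm the older build number stuck.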
All the drives are seen fine by the server under iLO and SSA.
The problem is that everything is hunky-dory until I try to use the new datastore, when the SSDs will randomly drop offline and the transfer fails. Sometimes one SSD, sometimes two, sometimes all three.
I can tear down the datastore and instantly recreate it, and the disks are all picked up.
Can't help thinking this is related to that B120i driver I had to downgrade.
A bit stumped what to do now, though! I don't want to throw any more money at it. I could try upgrading to ESXi v6, as I know vSAN was overhauled, but it's a lot of effort to rule it out and there is still the driver issue.
Well, firstly I would say your flash disks are not up to the task. Besides that, they shouldn't be dropping. Do you have logs for the event? Any sense error codes? There is probably a steady stream of these.
You may want to try placing the flash disks on a different controller (if possible) and see if the problem persists. Does the drop happen for each host?
The basic first steps would be to check your firmware (all around) and make sure you have the appropriate storage driver. When I say all around, I mean motherboard, HBA, and backplanes. When disks are dropping, it's usually fixed via a firmware/driver combo.
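A quick way to gather the evidence for both points, from the ESXi shell (standard esxcli commands and the default vmkernel.log location; the grep patterns are just common markers of SCSI errors, not an exhaustive list):

```shell
# Inventory the installed storage driver build and which HBAs use it.
esxcli software vib list | grep -i hpvsa
esxcli storage core adapter list

# Scan the kernel log for SCSI sense errors and failed commands
# around the time a disk dropped offline.
grep -iE "sense data|failed H:" /var/log/vmkernel.log | tail -20
```

If the drops leave a steady stream of sense errors in vmkernel.log, the H:/D:/P: status codes in those lines are what VMware support (and the HCL notes) key off.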
Also, if you can, try replacing the Crucial SSDs with something else. I am running 4 x Intel 750 Series 400GB NVMe disks: great price point and outstanding performance. They are not on the vSAN HCL, but so far are an exceptional value. I am assuming you're not using the Crucial SAS SSDs but the consumer SATA models. Even if you get "stable", your storage hardware won't deliver anything usable: expect very high latencies and very low IOPS, which get crazy worse during vSAN object operations. The vSAN HCL is there to help you choose the right hardware. All in all, consumer flash doesn't suffice, and without good flash in front of them, SATA magnetics perform terribly. Also, four hosts in a vSAN cluster really is the minimum.
Good Luck!
Oops, almost forgot... Try re-conditioning your SSD drives on each host. Put vSAN into "Manual" mode. Start with the first host: evacuate all the data, then boot that host into a Linux live distro. Use "sudo fdisk -l" to find the path to your SSD (e.g. "/dev/sdb"). Then use "parted" to re-initialize the drive clean.
% sudo parted /dev/sdb
(parted) mklabel msdos
(parted) mklabel gpt
(parted) quit
That will clear all partitions.
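If you want to dry-run the relabel before pointing it at a real disk, parted happily operates on a plain image file, and a single fresh label is enough to discard the old partition table (disk.img here is a stand-in for /dev/sdb):

```shell
# Create a small image file as a safe stand-in for the real SSD.
truncate -s 64M disk.img

# Writing one new label wipes the old partition table in a single step;
# gpt alone suffices, the intermediate msdos label is belt-and-braces.
parted -s disk.img mklabel gpt

# Verify: should report a gpt partition table with no partitions.
parted -s disk.img print
```

Once the output looks right against the image file, repeat the mklabel against the real device path found via `sudo fdisk -l`.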
Boot back into ESXi and add the host back into the vSAN using Disk Management. Verify the host is now part of the vSAN and that all the hosts are communicating with each other (watch for the yellow exclamation point on the host icon). Take the host out of maintenance mode, and repeat the process for the next host in the cluster.
I have fixed various vSAN issues with drives dropping by using this procedure.
Thanks for your comments.
As the OP title suggests, this is a home-lab setup. I think 4 x PCIe flash cards may be a little overkill.
I think the problem (for me) was the almost non-existent queue depth on the B120i RAID controller. I have now installed 3 x H220 SAS host bus adapters in my 3 x Gen8 MicroServers, and now they seem to be playing ball.
Cheers
VirtualizingStuff, have you installed ESXi 6.0 on the X9SCM-F?
I am currently setting up a home lab and I am wondering which version I should use.
Yep, I currently have the latest version of ESXi 6.x running in the homelab with no issues.
Thanks! I was wondering about staying with 5.5 for my X9SCM-F, because I have found no info about people using 6.0 on it.