Hi - my current setup:
4-node vSAN cluster running vSAN 6.2 (vSphere 6.0 U2)
3x Dell R720 with PERC H710P Mini (latest drivers)
1x Dell R730 with PERC H730P (latest drivers)
10G dedicated vSAN switches/network
Each of the 4 nodes has:
2x disk groups each
Disk group 1 = 200GB SSD plus 5x 1.2TB 10K SAS
Disk group 2 = 200GB SSD plus 4x 1.2TB 10K SAS
I want to replace all eight of the 200GB SSD caching disks with new 800GB SSD drives that I have just bought.
I'm looking for advice on the best way to do this, bearing in mind this is my production environment, so I'm trying to avoid any problems or downtime (he says hopefully).
My understanding is that the procedure I need to follow is:
Some questions jump to mind around this procedure:
Any other suggestions or advice, feel free to shout.
Thanks,
I don't have any experience with the Dell R720 and RAID 0 setup of vSAN; you may want to use maintenance mode on those hosts for your upgrade, as I'm not sure how vSAN and the H710's RAID 0 work together. I have R730xd systems (H730 passthrough), and disk group operations and physical disk swaps don't require my nodes to be in maintenance mode.
When removing disk groups, vSAN presents three options (Ensure Accessibility, Full Data Migration, No Data Migration). In your case you should choose Full Data Migration so that two copies of the data are always present during your upgrade. These are the same three options offered for maintenance mode; maintenance mode does essentially the same thing except it won't remove the disk groups, so either choice you make between removing disk groups or entering maintenance mode will result in similar operations being performed by vSAN.
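If you'd rather drive this from the host shell than the web client, the same three evacuation modes are exposed when entering maintenance mode via esxcli. A minimal sketch (mode names as I remember them from the 6.x esxcli help):

    # Enter maintenance mode, fully evacuating vSAN data off this host
    # (other -m values: ensureObjectAccessibility, noAction)
    esxcli system maintenanceMode set -e true -m evacuateAllData

    # When the disk work is done, exit maintenance mode
    esxcli system maintenanceMode set -e false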
Data evacuation takes a long time: I have a 200TB cluster that took about 16 days to finish migrating disk groups. Disk performance DOES go down during these large moves; whether it will impact things just depends on your normal load. In my case it put a constant ~200MB/s of recovery throughput on my cluster and on average increased congestion to 50-75 (normally I see congestion values of 0-15; at 100 you notice slowdowns, at 200 things are very slow, and 220+ is unusable). On the plus side, vSAN throttles recovery I/O depending on system load: during business hours resync speeds would slow to 30-100MB/s, and at night they sometimes reached 300MB/s. In my own case I could tell my cluster was slower than normal, but everything was still usable.
To minimize impact, you can kick off the disk group removal (or maintenance mode with full migration) at the end of a business day, or on a Friday before the weekend. That at least gives the resync a large window of time to do its thing. It's possible it won't finish by the time business starts again, but at least the impact is minimized.
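If you want to watch the resync while it runs, RVC on the vCenter server has a dashboard for it; the cluster path below is a placeholder for your own inventory path:

    # From RVC: show objects still resyncing and bytes left to move
    vsan.resync_dashboard /localhost/<Datacenter>/computers/<Cluster>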
In regards to creating the 3rd disk group: while it would help, I don't think adding one spindle to each host is going to have any meaningful impact on migration performance. Personally, I would just destroy the old disk group and add in the extra disk(s) when you recreate it.
If you're able to physically remove/add the SSDs while the host is powered on, the hosts don't need to be placed into maintenance mode and you can do disk operations while the host is online. If you can't physically remove/add the SSDs while the machine is powered on (the RAID controller doesn't support it, or your SSDs are NVMe), then you will need to use maintenance mode, and I would recommend Full Data Migration since this is production data. I don't see a benefit to using maintenance mode in this case if I don't have to.
There isn't a way to remove the SSD cache without destroying its disk group, so you will really be rebuilding entire disk groups one at a time to do your SSD upgrade. For this reason I'd recommend you use Full Data Migration instead of Ensure Accessibility, since you need to rebuild the data anyway. If you use Ensure Accessibility, the rebuilding of the disk group happens while your data is effectively at FTT=0, and if some other failure occurs at that time your data will be lost.
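Before pulling anything, it's worth confirming which device is the cache tier in each disk group. From the host shell (a sketch; field names as I remember them from 6.x, device ID is a placeholder):

    # List all disks claimed by vSAN on this host; disks in the same group
    # share a "VSAN Disk Group UUID", and the cache SSD is the one showing
    # "Is Capacity Tier: false"
    esxcli vsan storage list

    # Removing the cache SSD destroys the whole disk group. Note this does
    # NOT migrate data by itself, so only run it after the data has been
    # evacuated (e.g. maintenance mode with Full Data Migration):
    esxcli vsan storage remove -s <naa.id_of_old_cache_ssd>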
To recap:
Rinse and repeat for each disk group, and only upgrade one disk group at a time. The whole process can take a while, as the resync/data move step for multiple TB of data can take many hours.
If vSAN resyncs/data moves cause problems for you during normal operation, then you should wait for a weekend or a low-load period to do your upgrade. If you don't observe any performance issues during resync, it's okay to do it during regular hours; this really just depends on how much I/O load your cluster is normally under. I'm not sure if you've ever had to upgrade vSAN disk format versions, where it does a rolling disk group replacement, but your upgrade here is going to be very similar to that and will have a similar performance impact.
Edit: clarification/reword on FTT=0 scenario
Hi Elerium - many thanks for taking the time to give such a detailed response, it's much appreciated.
A couple of follow-up questions:
I have a potential other option to consider and would love any feedback from you all.
Apart from my new 800GB SSD drives, I also have 4x 1.2TB SAS drives that I am going to be adding to my vSAN capacity layer for extra disk space. I have been holding back on that because my cache tier capacity was so low. So in theory I do have the option of creating a third disk group on each host with one 800GB SSD and one 1.2TB SAS drive.
Do you think it's a good option to create a third disk group, get it up and running, and THEN remove one of the other disk groups completely and simply add the other SAS drives into the third disk group?
The advantage of this is that I will have my new disk group created with very little risk before I delete an existing disk group; the disadvantage is that, initially at least, it will be a much smaller disk group alongside the other bigger ones, but vSAN will hopefully cope with that.
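If it helps anyone, I believe the third disk group could also be created from the host shell rather than the web client. A sketch with placeholder device IDs:

    # Create a new disk group: one cache SSD plus one capacity disk
    # (I believe -d can be repeated to claim more capacity disks at once)
    esxcli vsan storage add -s <naa.id_of_800GB_ssd> -d <naa.id_of_1.2TB_sas>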
Cheers,
(Edit: number of available SAS disks mentioned)
Thanks again for the great info and apologies for the slow response but I had been out of the office on business.
I will follow your advice and just try to wait out the disk sync times. My only problem now is trying to find a quiet weekend...
I just wanted to update this thread since I have finally completed the project some 6+ months later. In case anyone else needs to do the same, this may help.
At the time I was too busy with other projects and couldn't guarantee I could have this completed in one weekend, which, as it turned out, was a good decision.
So I decided to wait until the December shutdown period, when most people are on leave and things are quiet in the office. I cancelled the leave I had planned (which was not cool) and did the change in the week before Christmas.
Another reason I had opted to wait is that I was planning to buy a new Dell server in the latter part of the year, which I then used to Storage vMotion some of my critical production boxes onto before I made any changes on the vSAN cluster. I did this for two reasons:
My vSAN still held plenty of terabytes of data after the removal of the critical boxes, though.
On Day 1 I removed the first Disk Group (Full Data Migration) on Server 1 and waited. And waited. And waited.
It took many hours but finally finished successfully. I powered the server down, booted into the RAID controller interface, removed the first SSD from the first Disk Group, replaced it with the new one, configured RAID 0 again, and booted back into ESXi.
I then added the new Disk Group back into the vSAN cluster. All went well, but first I needed to mark my new drive as an SSD, which was easy enough to do in v6.x.
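In case it saves someone a search: because the RAID 0 virtual disk hides the drive type from ESXi, the tagging can also be done from the shell with a SATP claim rule (the web client has a Mark as Flash option too; the device ID below is a placeholder):

    # Tag the device as flash so vSAN will accept it as a cache disk
    esxcli storage nmp satp rule add -s VMW_SATP_LOCAL -d <naa.id_of_new_ssd> -o enable_ssd

    # Reclaim the device so the new rule takes effect
    esxcli storage core claiming reclaim -d <naa.id_of_new_ssd>

    # Verify: "Is SSD" should now read true
    esxcli storage core device list -d <naa.id_of_new_ssd>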
It basically took the whole first day to do just one Disk Group change. I repeated the same procedure for the second Disk Group on the first host, and all went well again, but it took forever.
When it came to the second host I took a slightly different approach: I put the entire host into maintenance mode first, choosing the Full Data Migration option, which had the advantage of moving data off both Disk Groups at the same time. This still took between 12 and 24 hours per host, so I left it running overnight, and when I came back in the morning it was either done or mostly done. I then took the host out of maintenance mode, made sure all was good, and removed the first Disk Group, which now took less than a minute to complete; I rebooted, changed the SSD, and repeated the same for the second Disk Group. I only did one Disk Group at a time, and I would suggest you do too if you are ever doing the same.
I then did the same for the remaining hosts: engage maintenance mode, wait forever, then make the quick disk changes. In all it took the whole week to complete, which was very tedious, but I'm glad it's done now.
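For anyone who wants to script it, the per-host flow I ended up with boils down to roughly the following (a sketch from memory, with placeholder device IDs; verify each step in the client before moving on):

    # 1. Evacuate the host (moves data off both Disk Groups at once)
    esxcli system maintenanceMode set -e true -m evacuateAllData

    # 2. Once evacuation completes, exit maintenance mode
    esxcli system maintenanceMode set -e false

    # 3. Remove the (now empty) disk group via its old cache SSD - quick
    esxcli vsan storage remove -s <naa.id_of_old_200GB_ssd>

    # 4. Shut down, swap the SSD, rebuild the RAID 0 virtual disk, boot,
    #    tag the new device as flash, then recreate the disk group
    esxcli vsan storage add -s <naa.id_of_new_800GB_ssd> -d <naa.id_of_capacity_disk>

    # 5. Repeat steps 3-4 for the second disk group, then the next host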
Couple of tips or things to watch out for:
And that's it. Hope that helps if you're doing the same.