Manual (or scripted) preferred I/O path change - impact on guests?

NFerrar · ‎01-25-2021

Hi,

We're changing the SAN fabric switches in our environment. Fortunately the hosts all have spare HBA ports, so we have cabled them up to the new switches (and zoned for a test LUN/datastore). It's a 16-host vSphere ESXi 5.5 cluster (yes, I'm aware this isn't supported anymore!).

However, I'm not sure how best to do the cut-over. In testing, a manual preferred path change via the GUI didn't seem to interrupt I/O (copy and zip operations were running in a VM on the test datastore whose preferred path I was flipping), but I realise that's no guarantee that, say, a DB transaction on a busy SQL server wouldn't be affected. I can't find any low level details about what happens with a manual preferred path change though. Presumably if it doesn't interrupt I/O that's already in progress (but just sends the next I/O up the new preferred path) then it should always be invisible to a guest VM (isn't that similar to having a round robin pathing policy and it hitting the IOPS count?).

The alternative is putting a host in maintenance mode and changing the preferred path on each datastore once all the VMs are off it, but we have a separate issue: several of our critical app VMs won't vMotion, because it's a 1Gb vMotion network and the vRAM size plus the rate of memory change means vMotion times out. So that means incurring downtime, which is going to be a pain to arrange.

Has anyone done this sort of cut-over with on-line VMs running on the host you're making the change on (and has any advice)?

I'm also thinking that if we do do it on-line we probably don't want to script it, as we'd either need a script for each host/datastore combination (at which point running them is more hassle than just using the GUI) or, if we did all datastores on a host at the same time, the mass pathing change might trigger an issue in itself?
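For anyone weighing up the scripted route, here is a rough sketch of one way to avoid both the per-combination scripts and the all-at-once change: generate the commands from a device-to-path mapping and put a pause between them. It assumes the hosts use the Fixed PSP and that the preferred path is set with `esxcli storage nmp psp fixed deviceconfig set`. The device IDs, path names and the 30-second pause are made up for illustration, and the script only prints a plan for review rather than running anything.

```python
import shlex


def fixed_preferred_path_commands(device_to_path):
    """Build one esxcli command per device that sets the Fixed PSP preferred path.

    device_to_path maps a device identifier (naa.*) to the runtime name of the
    path that should become preferred. Both are placeholders in this sketch.
    """
    commands = []
    for device, path in sorted(device_to_path.items()):
        args = [
            "esxcli", "storage", "nmp", "psp", "fixed", "deviceconfig", "set",
            "--device", device,
            "--path", path,
        ]
        commands.append(" ".join(shlex.quote(a) for a in args))
    return commands


def staggered_plan(commands, pause_seconds=30):
    """Interleave the commands with sleeps so paths change one device at a time."""
    plan = []
    for i, cmd in enumerate(commands):
        if i > 0:
            plan.append(f"sleep {pause_seconds}")
        plan.append(cmd)
    return plan


if __name__ == "__main__":
    # Hypothetical device IDs and path names, for illustration only.
    example = {
        "naa.600000000000000000000000000000a1": "vmhba3:C0:T0:L1",
        "naa.600000000000000000000000000000a2": "vmhba3:C0:T0:L2",
    }
    for line in staggered_plan(fixed_preferred_path_commands(example)):
        print(line)
```

The same mapping could be kept per host in a small file, so there is one generator and one data file per host instead of one script per host/datastore pair.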

nachogonzalez · ‎01-25-2021

Every time I have done it I have put the host(s) into maintenance mode, just to be sure and to minimize any possible impact.
Also, I have used scripts (there are quite a few available).

"I can't find any low level details about what happens with a manual preferred path change though. Presumably if it doesn't interrupt I/O that's already in progress (but just sends the next I/O up the new preferred path) then it should always be invisible to a guest VM (isn't that similar to having a round robin pathing policy and it hitting the IOPS count?)."

I'm not 100% sure on this, but I think you are right.

Blog Nachogonzalez.com.ar Twitter @nachogon_

Ardaneh · ‎01-27-2021

Hi

Changing the preferred SAN path manually can affect your environment, it's 100%! But that doesn't mean your VMs or applications will become unavailable or get stuck in a not-responding state (the OS can handle it perfectly). To be more precise, it depends on your applications. SQL Server can handle some I/O loss and will retry its I/O requests (with no data lost!). We have lots of SQL servers in our environment and did this kind of thing to check that the service stayed available, and it was successful!

Another thing you should consider is your SAN storage capability. If you have an Active/Active storage array, you can easily change your path policy to RR before powering off your old switches. We did it like this:

- Cabling our storage and hosts to the new switches and creating the new zones (testing them as you did)

- Changing the datastore path policy to RR (for all hosts)

- Shutting down ports on our old switches (host by host: we disabled one port for our first host and, after checking everything, moved on to the next)

- Powering off the old switches

But for Active/Passive storage arrays, only the paths to the active controller will be used in the RR policy, so be more careful.
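As a rough illustration of the "checking everything" part before each port shutdown, the sketch below flags any device that would be left without an active path if a given HBA lost its fabric. The tuple layout, HBA names and device IDs are hypothetical. In practice the path states would have to be collected from the host first (for example from the output of `esxcli storage core path list`), which is not shown here.

```python
def devices_at_risk(paths, hba_to_disable):
    """Return devices with no active path left once hba_to_disable goes away.

    paths is a list of (device, hba, state) tuples, where state is a string
    such as "active", "standby" or "dead". Only "active" paths on other HBAs
    are counted as survivors, which is the cautious choice for an
    Active/Passive array.
    """
    surviving = {}
    for device, hba, state in paths:
        surviving.setdefault(device, 0)
        if hba != hba_to_disable and state == "active":
            surviving[device] += 1
    return sorted(device for device, count in surviving.items() if count == 0)


if __name__ == "__main__":
    # Hypothetical inventory: vmhba1 is on the old fabric, vmhba3 on the new one.
    inventory = [
        ("naa.a1", "vmhba1", "active"),
        ("naa.a1", "vmhba3", "active"),
        ("naa.a2", "vmhba1", "active"),
        ("naa.a2", "vmhba3", "standby"),
    ]
    print(devices_at_risk(inventory, "vmhba1"))
```

An empty result would suggest the port can be shut without leaving a device pathless. Anything listed needs a closer look before touching the old switch.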

I hope the best

NFerrar · ‎01-27-2021

"Changing the preferred SAN path manually can affect your environment, it's 100%!"

Could you explain why though? I get that a path failure, or even manually disabling the active I/O path, might cause some I/O to be lost or need to be resent, but I was sort of hoping that if all you're doing is changing the preferred path, ESXi would do it in a more controlled manner (i.e. finish off active I/O on the current path, then switch to the new preferred path). It must have a mechanism to do this itself when you use an RR policy?
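To make the comparison with RR concrete, here is a toy model of what I mean by "hitting the IOPS count": the selector only decides which path the next I/O is issued on, and I/O already issued is left alone. This is purely an illustration of the idea, not ESXi's actual implementation, and the class, path names and limit are all made up.

```python
class RoundRobinSelector:
    """Toy path selector that moves to the next path after iops_limit I/Os."""

    def __init__(self, paths, iops_limit=1000):
        self.paths = list(paths)
        self.iops_limit = iops_limit
        self.index = 0   # position of the path currently in use
        self.issued = 0  # I/Os issued on the current path so far

    def next_path(self):
        """Pick the path for the next I/O; earlier I/Os are not touched."""
        if self.issued >= self.iops_limit:
            self.index = (self.index + 1) % len(self.paths)
            self.issued = 0
        self.issued += 1
        return self.paths[self.index]


if __name__ == "__main__":
    selector = RoundRobinSelector(["old-fabric", "new-fabric"], iops_limit=3)
    print([selector.next_path() for _ in range(7)])
```

If a manual preferred path change behaves like this switch point, the guest should see nothing more than it does during normal RR operation, but that is exactly the assumption I'd like confirmed.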

I'm also not sure I follow how your use of the RR policy to cut over helps. When you do the step to shut down the port on the switch, if the active I/O is on that path at the time, isn't it just going to cause the same issue as a path failure? And with RR, don't you have less control over which path is being used when you shut down the switch port?

I'm far from an expert on storage policies and how the I/O works at a low level though so maybe I'm just not understanding why the RR policy would help. We have an active/active SAN but use a Fixed storage policy (but I'd be fine with temporarily changing it to RR if it would guarantee the fabric switch cut-over wouldn't cause an issue in a guest VM).

We've got around 200 VMs (with 16GB+ vRAM) that likely won't vMotion, so doing this properly by putting hosts into maintenance mode means arranging downtime, which is going to be a nightmare.
