I'll detail the design; I'd like to know whether the result is expected or whether I've designed something wrong!
ToR-1 and ToR-2 have a 20Gb LAG between them, and I'm running VRRP to move the routing engine back and forth. I can restart either switch, and the SAN fails over along with the ISP link that terminates on both switches. No network outages at this level.
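For reference, the ToR first-hop redundancy looks conceptually like this. This is only an illustrative sketch in IOS-like syntax; the actual vendor syntax, VLAN, IPs, and priorities here are placeholders, not my real config:

```
! ToR-1 (preferred VRRP master, illustrative syntax only)
interface Vlan10
 ip address 10.0.10.2 255.255.255.0
 vrrp 10 ip 10.0.10.1        ! virtual gateway the hosts point at
 vrrp 10 priority 110        ! higher priority = preferred master
 vrrp 10 preempt             ! reclaim master role after reload

! ToR-2 (backup, default priority 100)
interface Vlan10
 ip address 10.0.10.3 255.255.255.0
 vrrp 10 ip 10.0.10.1
```

With this shape, reloading either ToR just moves the virtual gateway to the surviving switch, which matches the behavior I'm seeing at this layer.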
Leaf-1 and Leaf-2, in fabrics A1 and A2 of a blade chassis, have a 20Gb LAG between them, plus a 20Gb LAG from Leaf-1 to ToR-1 and from Leaf-2 to ToR-2. Reloading Leaf-2, ToR-2, or ToR-1 causes no outages, but reloading Leaf-1 causes an outage of about one minute; I think that's the time the switch takes to reload and become available again.
The blades have a single 10Gb dual-port NDC: one port goes to Leaf-1 and the other to Leaf-2.
The dvSwitch port groups are configured with 'Route based on originating virtual port' and Failback set to Yes, in these combinations:
Active (vmnic0)/Standby (vmnic1)
Active (vmnic1)/Standby (vmnic0)
Active (vmnic0)/Unused (vmnic1)
Active (vmnic1)/Unused (vmnic0)
Active (vmnic0)/Active (vmnic1)
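To confirm what the host actually applies and whether it even notices the Leaf-1 ports dropping, the uplink state can be checked from the ESXi shell. A sketch, assuming shell access to the host (nothing here is specific to my config):

```shell
# List distributed switches and their uplink assignments as seen by this host
esxcli network vswitch dvs vmware list

# Show physical NIC link state -- if vmnic0 stays "Up" while Leaf-1 reloads,
# the host never triggers failover and traffic blackholes for the reload window
esxcli network nic list

# Watch for link-down/failover events in real time while reloading a leaf
tail -f /var/log/vmkernel.log | grep -i -E 'vmnic|uplink|link'
```

If the vmnic never reports link-down during the Leaf-1 reload (e.g. because the NDC port hangs off an intermediate module), that would explain the roughly one-minute gap: the teaming policy only fails over on link state by default.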
Running vSphere 6.7 with the latest update and patches.
The hope is to see no outage and have everything keep humming along, but I feel I've missed something at the dvSwitch level that is preventing the NICs from routing traffic out of the remaining online switch. If I power down Leaf-1, traffic does route out via Leaf-2, but not without an outage.
When you say "outage" in this case, what specifically is unavailable? Is it storage? ESXi management vmkernel connectivity? Everything? Or do you not know yet? The first step is to figure out which network services remain reachable when you reload that leaf switch, and go from there. Also, what is this IP-based storage: NFS or iSCSI?
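One way to narrow that down: run per-vmkernel pings from the ESXi shell while the leaf reloads, so you can see which service actually drops. A sketch; the vmk numbers and target IPs are placeholders for your own storage target and gateway:

```shell
# Map vmkernel interfaces to their port groups first
esxcli network ip interface list

# Ping the storage target from the storage vmkernel (assumed vmk1 here)
vmkping -I vmk1 192.168.50.10

# Ping the default gateway from the management vmkernel (assumed vmk0)
vmkping -I vmk0 10.0.10.1
```

Whichever ping dies during the Leaf-1 reload tells you which port group's teaming policy to focus on.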