Can anyone elaborate on why a storage outage causes hosts to go into a "not responding" state, or get into a state so bad they need to be rebooted? I thought APD/PDL handling was supposed to help with this. Sometimes, in a four-host cluster for example, three hosts will be completely hung after a storage loss (iSCSI) while one remains perfectly accessible in vCenter, via direct connect, and over SSH. The other three cannot be accessed in any way, and even the DCUI is extremely slow and will not respond to anything.
We have always just accepted this as normal behavior after a storage outage, but since the hosts sometimes remain online (with no datastores, of course) and other times hang completely, I wonder if there is a better way to handle this. Any ideas?
Thank You
Hello hendersp3,
This sounds like SCSI reservation conflicts. Consider looking at the logs of the hosts while this is occurring to find out what the problem actually is and fix it, rather than resorting straight to a reboot (/var/log/vmkernel.log and vobd.log are a good place to start).
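To make the log check concrete, here is a rough sketch of the kind of grep you could run over vmkernel.log during or after the outage. The sample log lines below are made up for illustration; the exact message text varies between ESXi builds, but APD/PDL and reservation events generally contain strings like "APD" or "reservation conflict":

```shell
# Illustrative sample only -- real entries live in /var/log/vmkernel.log
# on the host and will look different in detail.
cat > /tmp/sample-vmkernel.log <<'EOF'
2023-01-01T00:00:01Z cpu2:12345)StorageApdHandler: APD start for device naa.600aaaa (hypothetical line)
2023-01-01T00:00:02Z cpu2:12345)ScsiDeviceIO: SCSI reservation conflict on naa.600aaaa (hypothetical line)
2023-01-01T00:00:03Z cpu2:12345)Vol3: unrelated informational message
EOF

# On a real host, point this at /var/log/vmkernel.log (and vobd.log) instead.
grep -iE 'apd|pdl|reservation conflict' /tmp/sample-vmkernel.log
```

If the storage drops and comes back, you would expect matching APD-start/exit pairs; a flood of reservation-conflict lines with no recovery is the sort of thing worth chasing before rebooting.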
Bob
I've also seen this behavior a few times in my career, and it didn't matter whether the storage was connected via iSCSI, NFS, or FC. To be honest, I could never solve the puzzle. At the time I suspected datastore heartbeating or a persistent scratch partition as the cause. For us, the solution was to buy SAN systems that never fail, which we also succeeded in doing. Since then I have internalized it as a best practice not to skimp on the storage system, but to buy something where the manufacturer guarantees at least 99.999% availability in writing (at the time that was, for example, Hitachi or EMC, or NetApp if it shouldn't be too expensive).
But TheBobkin is basically right: the behaviour should not be like this, and there must be a reason for it.