Can anyone elaborate on why a storage outage causes hosts to go into a "not responding" state, or get into a state so bad they need to be rebooted? I thought APD/PDL handling was supposed to help with this. Sometimes, in a four-host cluster for example, three hosts will be completely hung after a storage loss (iSCSI) while one remains perfectly accessible in vCenter, via direct connect, and over SSH. The other three cannot be accessed in any way, and even the DCUI is extremely slow and will not respond to anything.
We have always just accepted this as normal behavior after a storage outage, but since the hosts sometimes remain online (with no datastores, of course) and other times hang completely, I wonder if there is a better way to handle this. Any ideas?
Thank You
Hello hendersp3,
This sounds like SCSI reservation conflicts. Consider looking at the logs of the hosts while this is occurring to find out what the problem actually is and fix it, rather than resorting straight to a reboot (/var/log/vmkernel.log and vobd.log are a good place to start).
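To make the log check concrete, here is a rough sketch of the kind of grep you could run over vmkernel.log during or after the outage. The sample log lines below are made up for illustration; the exact message text varies between ESXi builds, but APD/PDL and reservation events generally contain strings like "APD" or "reservation conflict":

```shell
# Illustrative sample only -- real entries live in /var/log/vmkernel.log
# on the host and will look different in detail.
cat > /tmp/sample-vmkernel.log <<'EOF'
2023-01-01T00:00:01Z cpu2:12345)StorageApdHandler: APD start for device naa.600aaaa (hypothetical line)
2023-01-01T00:00:02Z cpu2:12345)ScsiDeviceIO: SCSI reservation conflict on naa.600aaaa (hypothetical line)
2023-01-01T00:00:03Z cpu2:12345)Vol3: unrelated informational message
EOF

# On a real host, point this at /var/log/vmkernel.log (and vobd.log) instead.
grep -iE 'apd|pdl|reservation conflict' /tmp/sample-vmkernel.log
```

If the storage drops and comes back, you would expect matching APD-start/exit pairs; a flood of reservation-conflict lines with no recovery is the sort of thing worth chasing before rebooting.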
Bob
I've also seen this behavior a few times in my career, and it didn't matter whether the storage was connected via iSCSI, NFS, or FC. To be honest, I could never solve the puzzle. At the time I suspected datastore heartbeating or a persistent scratch partition as the cause. For us, the solution was to buy SAN systems that never fail, which we also succeeded in doing. Since then I have internalized it as a best practice not to skimp on the storage system, but to buy something where the manufacturer guarantees at least 99.999% availability in writing (at the time that was, for example, Hitachi or EMC, or NetApp if it shouldn't be too expensive).
But TheBobkin is basically right: the behaviour should not be like this, and there must be a reason for it.