Are you positive you replaced the faulty drive? And how did you re-install ESXi? (assuming that's what you meant by "in use by OS")
From what you describe, hostd was likely kaput as a knock-on effect of something else failing. In that case, always use localcli instead of esxcli (localcli does not go through hostd); if that returns no output either, use dmesg or Alt+F12 at the DCUI to view the vmkernel logging and figure out what is breaking.
"So after all this, and considering we are going to deploy a stretched cluster configuration, what is the best recommended way to add new capacity / cache drives to a vSAN cluster ?"
All of this information can be found on docs.vmware in the sub-sections here:
"Should I add new drives only when all hosts are here ?"
No, this shouldn't matter.
"Or no hot add at all ?"
Hot-add depends on the controller in use supporting this feature (and usually on the firmware version it is running).
"Put host one by one in maintenance mode, with data evacuation, then add new drive ?"
If hot-add is supported as per the above, then this isn't necessary. If you do want or need to use MM or power off the host, you can expedite the process by first checking that your back-ups are good and then entering MM with the 'Ensure Data Accessibility' option (increase the clom repair delay timer if you think adding the disks and rebooting will take longer than 60 minutes).
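To make the timer point concrete, here is a minimal Python sketch of how you might pick a repair delay for a planned window. The function names and the 30-minute safety margin are my own assumptions, not anything from VMware. The printed `esxcfg-advcfg` line is the per-host advanced setting as I remember it, so verify it against the documentation for your build (newer releases expose this as a cluster-level repair timer instead).

```python
# Sketch only: helper names and the margin are hypothetical.
DEFAULT_REPAIR_DELAY_MIN = 60  # the 60-minute default mentioned above


def repair_delay_for(expected_downtime_min: int, margin_min: int = 30) -> int:
    """Return a repair delay (minutes) that covers the planned downtime.

    Never goes below the default, so a short job leaves the timer unchanged.
    """
    if expected_downtime_min < 0 or margin_min < 0:
        raise ValueError("durations must be non-negative")
    return max(DEFAULT_REPAIR_DELAY_MIN, expected_downtime_min + margin_min)


def advcfg_command(delay_min: int) -> str:
    """Build the per-host command string (assumed syntax, verify before use)."""
    return f"esxcfg-advcfg -s {delay_min} /VSAN/ClomRepairDelay"


if __name__ == "__main__":
    # Example: disk install plus reboot is expected to take about 90 minutes.
    delay = repair_delay_for(90)
    print(advcfg_command(delay))
```

Remember to set the timer back to its previous value once the host is out of MM and resync has finished.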
"And what about stretched cluster considerations ?"
No special considerations, other than ensuring you add the storage evenly to each site (preferably homogeneous per node).
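If you want to sanity-check an expansion plan before touching hardware, something like the following Python sketch would do. The host names, site labels, and disk sizes are invented for illustration, and the helpers are hypothetical; it only sums what you plan to add per site and checks whether every node gets the same set of disks.

```python
from collections import defaultdict

# Hypothetical plan: (host, site, sizes in GB of the disks to be added)
PLAN = [
    ("esx01", "site-a", [1920, 1920]),
    ("esx02", "site-a", [1920, 1920]),
    ("esx03", "site-b", [1920, 1920]),
    ("esx04", "site-b", [1920, 1920]),
]


def added_capacity_per_site(plan):
    """Sum the planned additional capacity (GB) for each site."""
    totals = defaultdict(int)
    for _host, site, disks_gb in plan:
        totals[site] += sum(disks_gb)
    return dict(totals)


def sites_balanced(plan):
    """True if every site receives the same total additional capacity."""
    return len(set(added_capacity_per_site(plan).values())) <= 1


def nodes_homogeneous(plan):
    """True if every host receives the same set of disk sizes."""
    return len({tuple(sorted(disks)) for _h, _s, disks in plan}) <= 1


if __name__ == "__main__":
    print(added_capacity_per_site(PLAN))
    print("balanced across sites:", sites_balanced(PLAN))
    print("homogeneous per node:", nodes_homogeneous(PLAN))
```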
The failed disk is not part of the problem: it belongs to an internal RAID 1 (on a different controller) and did not disturb the server at all (it failed a few days before my vSAN issues).
Thanks for the two links you provided, but I don't see a recommended best way to do it in them.
So, in "conclusion": if my HBA card accepts hot-add, I should not have any issues hot-adding disks to ESXi hosts running in production, and there is no need to put the hosts in maintenance mode or follow any other precautionary procedure?
About the stretched cluster: should I claim the new disks only once all disks have been added to all hosts in both sites, or can I expand site 1 now and expand site 2 later?