Since you mention that you are seeing multiple disks getting reset, the issue could be on the controller, on the cache-tier device of an affected Disk-Group, or on a single capacity-tier device (if the DG is deduped).
Can you attach/PM the vmkernel.log and the output of vdq -Hi so we can try to narrow this down?
Other things to look for:
- aberrant latency on some devices (e.g. grep -i increased /var/log/vmkernel.log OR esxtop 'u')
- Indications of the controller driver/firmware having issues (dmesg OR less /var/log/vmkernel.log)
- Validate that the controller and its driver+firmware are on the HCL. Please note that the Health check showing Green for this is not sufficient, as it won't check devices that are not on the HCL at all (it just assumes you are using them for non-vSAN purposes), so validate that it has actually checked the controllers in question.
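For a quick first pass over the log checks above, a small helper along these lines could tally how often a given pattern shows up. This is only a rough sketch: count_matches is a hypothetical name (not an ESXi tool), it wraps nothing more than grep -ci, and the patterns you feed it are your own choice.

```shell
# Hypothetical helper (not part of ESXi): count the lines in a log file
# that match a case-insensitive pattern.
#   $1 = pattern to search for
#   $2 = path to the log file
count_matches() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # there are no matches, so "|| true" keeps the function's status clean.
    grep -ci -- "$1" "$2" || true
}
```

For example, count_matches 'increased' /var/log/vmkernel.log gives the number of lines mentioning increased latency, and you could run it again with your controller driver name as the pattern. A count on its own proves nothing, but a large number clustered around the time of the resets is a reasonable hint about where to read the log more closely.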
My findings so far...
HBA firmware/driver matches vSAN VCG
Commands run; the disks with the power reset issue in the logs match the disks listed in vdq -Hi
I do see a few events relating to increased latency around the time the issue started (however, I do not see particularly high latency spikes in the VM/Backend graphs)
In vmkernel.log, amongst others, I see events related to lsi_mr3, megasas_hotplug_work etc... I will need to check further, as a VMware Knowledge Base article I am seeing might be related, although the servers are running ESXi 6.5U2
I am thinking of placing the host in MM and rebooting it, leaving it running for a few days to see if I get similar events, and if so raising it with GSS