Has anyone faced the same issue as me? Some of my VMs became inaccessible after a VM reboot. I logged in to the VMware vCenter Web Client and found that one of the SSDs in host 1 reported a permanent disk failure. Can anyone advise what I should do next?
Losing a disk group's cache device degrades the whole disk group, and the disk group will need to be rebuilt.
If you are not familiar with the procedure to fix such an issue, I strongly recommend that you open a support case with VMware.
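For reference, the general shape of a cache-device disk-group rebuild looks like the below. The device names are placeholders, and with a failed cache device the group's data is typically already lost (so `noAction` on evacuation) - confirm the exact state with support before running anything destructive:

```shell
# List disk groups to identify the failed group's cache-tier SSD
esxcli vsan storage list

# Remove the failed disk group by its cache device
# (placeholder device name; -m noAction skips evacuation, which is
# generally moot when the cache device is already dead)
esxcli vsan storage remove -s naa.xxxxxxxxxxxxxxxx -m noAction

# After physically replacing the SSD, recreate the disk group
# (-s = new cache device, -d = capacity device; repeat -d per capacity disk)
esxcli vsan storage add -s naa.yyyyyyyyyyyyyyyy -d naa.zzzzzzzzzzzzzzzz
```

Once the group is recreated, vSAN will resync components back onto it according to policy.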
Did the VMs' objects (e.g. vmdks, namespaces, vswps) have an FTT=1 storage policy, and were they compliant with it? If so, a single failure shouldn't cause anything to become inaccessible.
With this in mind, I see you have a red warning on the disk group adjacent to the one you are seeing marked as Unhealthy. What is the triggered alarm for this other disk/disk group, or is this some other alert?
Whether the cache-tier device of the unhealthy disk group failed due to a logical or a physical issue is another question. If you can, please attach the vmkernel.log from the two hosts in question and I will take a look. That said, if you do have a support contract, I would advise opening a P1 Support Request with my colleagues in GSS immediately; this is what we are here for.
Also, please share the output of the commands below, run on both hosts mentioned above and/or any other hosts with issues (if you don't want to post it here, that is fine - you can PM it to me):
# esxcli vsan storage list
# vdq -Hi
# vdq -q
and from any host:
# cmmds-tool find -f python | grep CONFIG_STATUS -B 4 -A 6 | grep 'uuid\|content' | grep -o 'state\\\":\ [0-9]*' | sort | uniq -c
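If the hosts are on vSAN 6.x or later, a quicker per-state summary of object health is also available directly from esxcli (availability of this namespace depends on your build, so treat it as a best-effort check):

```shell
# Summarise vSAN object health across the cluster
# (counts of healthy, reduced-availability, and inaccessible objects)
esxcli vsan debug object health summary get
```

That output, alongside the CONFIG_STATUS state counts from the cmmds-tool command above, will show how many objects are actually impacted.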