We deployed 8 Dell R730xd hosts, each equipped with an H730 Mini controller, at the end of last year. All hosts are running vSphere 6 build 4600944. Each host has 2 vSAN disk groups in a hybrid configuration. Each disk group has 1 Intel SSDSC2BX40 (400GB) and 6 Toshiba AL14SEB060N (600GB).
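For anyone who wants the totals at a glance, here is a rough Python sketch of what that layout adds up to. It uses only the nominal drive sizes listed above. The function and variable names are made up for illustration, and the result is raw capacity only: it does not account for vSAN overhead or storage policy settings.

```python
# Rough raw-capacity arithmetic for the cluster layout described above.
# All names here are illustrative; sizes are the nominal vendor sizes in GB.

HOSTS = 8
DISK_GROUPS_PER_HOST = 2
CACHE_SSDS_PER_GROUP = 1
CACHE_SSD_GB = 400            # Intel SSDSC2BX40
CAPACITY_DISKS_PER_GROUP = 6
CAPACITY_DISK_GB = 600        # Toshiba AL14SEB060N


def raw_totals(hosts=HOSTS, groups=DISK_GROUPS_PER_HOST):
    """Return (cache_gb, capacity_gb, cache_to_capacity_ratio) for the cluster."""
    disk_groups = hosts * groups
    cache_gb = disk_groups * CACHE_SSDS_PER_GROUP * CACHE_SSD_GB
    capacity_gb = disk_groups * CAPACITY_DISKS_PER_GROUP * CAPACITY_DISK_GB
    return cache_gb, capacity_gb, cache_gb / capacity_gb


if __name__ == "__main__":
    cache_gb, capacity_gb, ratio = raw_totals()
    print(f"Disk groups in cluster: {HOSTS * DISK_GROUPS_PER_HOST}")
    print(f"Raw cache tier:         {cache_gb} GB")
    print(f"Raw capacity tier:      {capacity_gb} GB")
    print(f"Cache / raw capacity:   {ratio:.1%}")
```

Calling `raw_totals(hosts=1)` gives the per-host figures, which is the amount of raw storage that drops out each time both disk groups on one host go offline.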
The firmware of all devices has been on the vSAN HCL since the deployment. Since March of this year, the vSAN has been having issues. It reports “Flash drives dead or error” on one of the hosts, and both disk groups on that host drop out of the vSAN. iDRAC shows all disks as fine, but the log shows that all disks were reset at the time the issue occurred. When this happens, we have to reboot the server to get the disk groups on that host working again. However, within the next day or so, the same problem occurs on another host, and then another, one after the other. After a host is rebooted, it works fine for about a month, and then the same cycle starts again.
We have worked with VMware and Dell support for months on this issue. Early on, we replaced the controllers and backplanes on some of the hosts. That didn’t help. When newer firmware came out for the controller (25.5.2.0001) and backplane (3.32), we upgraded. That didn’t help either. We have purchased additional drives for each host and planned to expand the vSAN capacity. However, support recommended not making any changes while they’re still trying to figure out the issue. We had also planned to purchase additional servers to build other vSAN clusters. Because of this issue, those projects have been put on hold. It feels like we’re stuck. All we can do is wait for support to come up with new ideas. Every time they find something (e.g. a new firmware release) and we make a change, we have to wait a month to see whether the problem comes back. This has been dragging on for months. Does anyone have any comments or suggestions?