As part of my planned preventative maintenance, I'm looking to be able to identify SSD devices that will need to be replaced, prior to a failure. In essence, I'd like to access predicted failure information.
Our ESXi installation is on RAID-1, but the main VM host area has no RAID.
I'm running ESXi 7.0U3 on a Dell XR12.
Any pointers to documents or KB articles would be welcomed as I've not had that much success finding anything.
You can use the hardware tab in ESXi to check the status of the underlying hardware. Better still, use iDRAC to check it.
Regards,
Sachchidanand
Hello,
You can check the hardware health status of ESXi servers from the GUI or the CLI. Here are some references:
From vSphere Client: Monitor Hardware Health Status in the vSphere Client
From CLI: KB 2040405
Hey! Identifying a failing SSD is crucial to ensure data safety and maintain a smooth operating environment, especially in an ESXi environment.
For Dell servers, the iDRAC interface is a valuable tool. It often provides predictive failure alerts for storage devices. Here's what you can do:
Dell iDRAC: Log into the iDRAC web interface and navigate to the hardware section to check the status of the SSDs. Any issues are typically flagged, including predictive failures.
ESXi: From your ESXi host, you can utilize the esxcli command to fetch storage device information. Here's a quick command:
    esxcli storage core device smart get -d <device_id>
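Replace device_id with an identifier from the output of esxcli storage core device list. Look for attributes such as 'Media Wearout Indicator', 'Reallocated Sector Count', 'Program Fail Count', etc., where the drive reports them. A value that drifts well away from its usual reading, or that approaches the threshold shown next to it, can hint at an impending SSD failure.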
vCenter Server: If you're using vCenter, it might provide alerts and notifications related to hardware health, including SSD status.
Dell OMSA (OpenManage Server Administrator): This tool provides a comprehensive health status of Dell server components, including SSDs. If it's installed on your ESXi host, it can be used to monitor hardware health.
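As a rough illustration of the esxcli check described above, here is a minimal Python sketch that parses the text table printed by the smart get command and lists attributes whose normalized value has reached the threshold. The sample output is made up for illustration, the column layout (Parameter, Value, Threshold, Worst) is an assumption that may differ between ESXi builds, and the function names are hypothetical. Treating "value at or below threshold" as a warning follows the usual SMART convention, but how each attribute is reported varies by drive and driver, so take this as a starting point only.

```python
import re

# Illustrative sample only; real output depends on the drive and ESXi build.
SAMPLE_OUTPUT = """\
Parameter                     Value  Threshold  Worst
----------------------------  -----  ---------  -----
Health Status                 OK     N/A        N/A
Media Wearout Indicator       12     10         12
Reallocated Sector Count      5      36         5
Drive Temperature             33     0          53
"""


def parse_smart_table(text):
    """Turn the esxcli text table into {parameter: {column: value}}."""
    lines = [ln.rstrip() for ln in text.splitlines() if ln.strip()]
    # Columns are separated by two or more spaces; names may contain one space.
    header = re.split(r"\s{2,}", lines[0].strip())
    rows = {}
    for line in lines[1:]:
        if set(line.strip()) <= {"-", " "}:
            continue  # skip the dashed separator line
        cells = re.split(r"\s{2,}", line.strip())
        if len(cells) != len(header):
            continue  # skip anything that does not match the header layout
        rows[cells[0]] = dict(zip(header[1:], cells[1:]))
    return rows


def attributes_at_or_below_threshold(rows):
    """List attributes whose numeric Value is <= a non-zero Threshold."""
    flagged = []
    for name, cols in rows.items():
        value = cols.get("Value", "")
        threshold = cols.get("Threshold", "")
        # Ignore N/A entries and thresholds of 0, which carry no limit.
        if value.isdigit() and threshold.isdigit() and int(threshold) > 0:
            if int(value) <= int(threshold):
                flagged.append(name)
    return flagged


if __name__ == "__main__":
    table = parse_smart_table(SAMPLE_OUTPUT)
    print("Health:", table["Health Status"]["Value"])
    print("Needs a closer look:", attributes_at_or_below_threshold(table))
```

On a real host, the text would come from running the command over SSH and feeding its output to parse_smart_table instead of SAMPLE_OUTPUT.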
Finally, for detailed procedures and potential alarms, check Dell's official documentation or VMware's Knowledge Base articles. Dell's community forums can also be a valuable resource, as many administrators share their experiences and solutions there.
Remember, while predictive failures give you a heads-up, it's always a good idea to maintain regular backups of crucial data.
Hope this helps and wishing you a seamless maintenance!
Cheers,
Ansar
Hello,
As Sachchidanand (and others) already said, I also think the better option is to use the iDRAC, ideally configured to send you alerts for a whole range of events related to the underlying hardware; more than one method is available for this.
However, in practice, this information can be somewhat misleading, because the time between detection of the conditions that "could lead to a malfunction" and the "malfunction" itself can be too short to intervene proactively. It has actually happened to me on a couple of occasions that not even ten minutes passed between the "alarm" and the subsequent "fault". In a RAID array (or another solution that is inherently reliable by design) it is different.
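For anyone who would rather poll the iDRAC than wait for an alert, here is a minimal Python sketch of the kind of check a monitoring script could apply to a drive record. It assumes the iDRAC exposes drives through Redfish and that the record carries the standard Redfish Drive properties FailurePredicted, PredictedMediaLifeLeftPercent and Status.Health; whether a given firmware and SSD populate them is something to verify. The function name and the 20% cutoff are hypothetical choices, and fetching the JSON from the iDRAC is left out.

```python
def drive_needs_attention(drive, min_life_left_percent=20):
    """Return a list of reasons why a Redfish Drive record looks unhealthy.

    `drive` is the decoded JSON of one drive resource (a dict). An empty
    list means nothing suspicious was found in the fields that are present.
    """
    reasons = []

    # Boolean flag set by the controller when it predicts a failure.
    if drive.get("FailurePredicted") is True:
        reasons.append("failure predicted")

    # Remaining rated endurance; may be missing or null on some drives.
    life_left = drive.get("PredictedMediaLifeLeftPercent")
    if isinstance(life_left, (int, float)) and life_left < min_life_left_percent:
        reasons.append(
            f"media life left {life_left}% below {min_life_left_percent}%"
        )

    # Overall health as reported by the controller ("OK", "Warning", "Critical").
    health = (drive.get("Status") or {}).get("Health")
    if health not in (None, "OK"):
        reasons.append(f"health is {health}")

    return reasons


if __name__ == "__main__":
    # Made-up record for illustration only.
    worn_drive = {
        "FailurePredicted": False,
        "PredictedMediaLifeLeftPercent": 15,
        "Status": {"Health": "OK"},
    }
    print(drive_needs_attention(worn_drive))
```

A wear-based cutoff like this gives far more lead time than a predictive-failure flag, which matters given how short the gap between alarm and fault can be.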
Regards,
Ferdinando
Better to use esxcli commands, of course.
Hello,
Sorry, but I don't quite agree with this: receiving a warning that something "may go wrong" is quite different from finding out when the failure has perhaps already occurred. I don't log in every day at regular intervals to check via the command line what a host running ESXi might tell me; I do that, if at all, when my monitoring systems warn me of a (possible) unfortunate event.
That said, everyone manages their infrastructure as they see fit (for their own good reasons), and I will never question that.
Regards,
Ferdinando
