VMware Cloud Community
srednausab
Enthusiast

How to identify a failing physical SSD device

As part of my planned preventative maintenance, I'm looking to be able to identify SSD devices that will need to be replaced, prior to a failure.  In essence, I'd like to access predicted failure information.

The ESXi installation itself runs on RAID-1, but the main area hosting the VMs has no RAID.

I'm running ESXi 7.0U3 on Dell XR12.

Any pointers to documents or KB articles would be welcomed as I've not had that much success finding anything.

6 Replies
Sachchidanand
Expert

You can use the Hardware tab in ESXi to check the status of the underlying hardware. Better still, use iDRAC to check it.

Regards,

Sachchidanand

HassanAlKak88
Expert

Hello,

You can check the hardware health status of ESXi servers from the GUI or the CLI. Here are some references:

From vSphere Client: Monitor Hardware Health Status in the vSphere Client 

From CLI: KB 2040405 


If my reply was helpful, I kindly ask you to like it and mark it as a solution

Regards,
Hassan Alkak
ansarabass
Enthusiast

Hey! Identifying a failing SSD is crucial to ensure data safety and maintain a smooth operating environment, especially in an ESXi environment.

For Dell servers, the iDRAC interface is a valuable tool. It often provides predictive failure alerts for storage devices. Here's what you can do:

Dell iDRAC: Log into the iDRAC web interface and navigate to the hardware section to check the status of the SSDs. Any issues are typically flagged, including predictive failures.

ESXi: From your ESXi host, you can utilize the esxcli command to fetch storage device information. Here's a quick command:

Code
esxcli storage core device smart get -d <device_id>

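Look at parameters such as 'Health Status', 'Media Wearout Indicator', and 'Reallocated Sector Count' (the exact set reported depends on the drive and controller, and some fields may show N/A). A value that is approaching its reported threshold, or drifting steadily away from its usual level, can hint at an impending SSD failure.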
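If you want to check that output automatically rather than by eye, here is a minimal Python sketch. It assumes the command prints a table with Parameter, Value, Threshold and Worst columns, and that values are normalized so that lower means closer to failure. The sample text, the function names and the `margin` parameter are made up for illustration; verify the comparison rule against what your own drives actually report.

```python
import re

# Illustrative sample only: real output differs per drive and controller.
SAMPLE_OUTPUT = """\
Parameter                 Value  Threshold  Worst
------------------------  -----  ---------  -----
Health Status             OK     N/A        N/A
Media Wearout Indicator   12     10         12
Reallocated Sector Count  100    10         100
Drive Temperature         31     N/A        45
"""


def parse_smart_table(text):
    """Parse the table into {parameter: (value, threshold, worst)}."""
    rows = {}
    for line in text.splitlines():
        # Columns are separated by runs of two or more spaces.
        parts = re.split(r"\s{2,}", line.strip())
        # Skip blank lines, the header row and the dashed separator row.
        if len(parts) != 4 or parts[0] == "Parameter" or set(parts[0]) == {"-"}:
            continue
        rows[parts[0]] = tuple(parts[1:])
    return rows


def warnings_for(rows, margin=5):
    """Return warnings for one device (margin is a hypothetical safety band)."""
    found = []
    health = rows.get("Health Status", ("N/A",))[0]
    if health not in ("OK", "N/A"):
        found.append(f"Health Status is {health}")
    for name, (value, threshold, _worst) in rows.items():
        # Only compare rows where both columns are plain integers.
        if not (value.isdigit() and threshold.isdigit()):
            continue
        # Assumption: normalized values fall toward the threshold as the drive wears.
        if int(threshold) > 0 and int(value) <= int(threshold) + margin:
            found.append(f"{name}: value {value} is near threshold {threshold}")
    return found


if __name__ == "__main__":
    for warning in warnings_for(parse_smart_table(SAMPLE_OUTPUT)):
        print(warning)
```

In practice you would feed it the text captured from the esxcli command for each device instead of the sample string.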

vCenter Server: If you're using vCenter, it might provide alerts and notifications related to hardware health, including SSD status.

Dell OMSA (OpenManage Server Administrator): This tool provides a comprehensive health status of Dell server components, including SSDs. If it's installed on your ESXi host, it can be used to monitor hardware health.
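For the iDRAC route mentioned above, the same information is also exposed through iDRAC's Redfish interface, where each drive is a JSON resource. The sketch below only evaluates a drive payload that has already been fetched. It assumes the standard Redfish Drive properties `FailurePredicted`, `Status.Health` and `PredictedMediaLifeLeftPercent` are populated on your hardware; the sample payloads, the function name and the 20% cutoff are invented for illustration, and fetching the JSON from iDRAC is not shown.

```python
def drive_findings(drive, min_life_percent=20):
    """Return a list of concerns for one Redfish Drive resource (a dict)."""
    findings = []
    # FailurePredicted is a boolean in the Redfish Drive schema.
    if drive.get("FailurePredicted") is True:
        findings.append("failure predicted")
    # Status.Health is typically "OK", "Warning" or "Critical".
    health = (drive.get("Status") or {}).get("Health")
    if health not in (None, "OK"):
        findings.append(f"health is {health}")
    # Remaining endurance; may be missing or null on some drives.
    life = drive.get("PredictedMediaLifeLeftPercent")
    if isinstance(life, (int, float)) and life < min_life_percent:
        findings.append(f"only {life}% media life left")
    return findings


# Made-up payloads, trimmed to the fields used above.
HEALTHY = {
    "Name": "SSD 0",
    "FailurePredicted": False,
    "Status": {"Health": "OK"},
    "PredictedMediaLifeLeftPercent": 93,
}
WORN = {
    "Name": "SSD 1",
    "FailurePredicted": True,
    "Status": {"Health": "Warning"},
    "PredictedMediaLifeLeftPercent": 7,
}

if __name__ == "__main__":
    for d in (HEALTHY, WORN):
        print(d["Name"], drive_findings(d) or "no findings")
```

A scheduled job could run a check like this against every drive and mail the findings, which complements rather than replaces iDRAC's own alerting.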

Finally, for detailed procedures and potential alarms, check Dell's official documentation or VMware's Knowledge Base articles. Dell's community forums can also be a valuable resource, as many administrators share their experiences and solutions there.

Remember, while predictive failures give you a heads-up, it's always a good idea to maintain regular backups of crucial data.

Hope this helps and wishing you a seamless maintenance!
Cheers,
Ansar

Kinnison
Commander

Hello,


As Sachchidanand (and others) already told you, I also think the better option is to use iDRAC, ideally configured to send you alarms/alerts when any of a whole range of events related to the underlying hardware occurs. More than one method is available for this.


However, in practice this information can be somewhat misleading, because the time between detection of the conditions that "could lead to a malfunction" and the actual "malfunction" can be too short to intervene proactively. It has happened to me on a couple of occasions that not even ten minutes passed between the "alarm" and the subsequent "fault". In a RAID array (or another solution that is inherently reliable by design) it is different.


Regards,
Ferdinando

maksym007
Expert

It is better to use esxcli commands, of course.

Kinnison
Commander

Hello,


Sorry, but I don't really agree with this. Receiving a warning that something "may go wrong" is quite different from noticing a problem only when the "failure has already occurred". I don't check every day, repeatedly and at regular intervals, what a host running ESXi can tell me via the command line. I do that, if at all, when my monitoring systems warn me of a (possible) unfortunate event.


That said, everyone manages their infrastructure as they see fit (for their own good reasons), and I will never question that.


Regards,
Ferdinando
