VMware Cloud Community
NucleusVM
Enthusiast

Checking ESXi hardware for problems using PowerCLI

Is there a way to check all the ESXi hosts in a vCenter for hardware issues?

Currently I have to open each ESXi host and click the "Hardware Status" tab to see if there are any errors.

It would be much faster if I could just run a script that outputs a report (HTML or CSV) and check that.

I currently have an ESXi server with a memory issue, so it's a good opportunity to test a script.

Thanks

26 Replies
LucD
Leadership

Try something like this

foreach($esx in Get-VMHost){
    $hs = Get-View -Id $esx.ExtensionData.ConfigManager.HealthStatusSystem
    $hs.Runtime.SystemHealthInfo.NumericSensorInfo |
        where{$_.HealthState.Label -ne 'Green' -and $_.Name -notmatch 'Rollup'} |
        Select @{N='Host';E={$esx.Name}},Name,@{N='Health';E={$_.HealthState.Label}}
}
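
If the result should end up in a CSV or HTML report, as asked in the question, the rows can be collected in a variable and written out with the standard Export-Csv and ConvertTo-Html cmdlets. A minimal sketch: Export-HealthReport and its parameter names are made up for illustration and are not part of PowerCLI, and the file names are placeholders.

```powershell
# Hypothetical helper (not a PowerCLI cmdlet): saves report rows as CSV and HTML.
function Export-HealthReport {
    param(
        [object[]] $Rows = @(),
        [string] $CsvPath  = 'hardware-health.csv',   # placeholder file name
        [string] $HtmlPath = 'hardware-health.html'   # placeholder file name
    )
    # Drop nulls so the export still works when no sensor matched the filter
    $clean = @($Rows | Where-Object { $null -ne $_ })
    $clean | Export-Csv -Path $CsvPath -NoTypeInformation
    $clean | ConvertTo-Html -Title 'ESXi hardware health' | Set-Content -Path $HtmlPath
}

# Usage, assuming a Connect-VIServer session is already open:
# $rows = foreach ($esx in Get-VMHost) {
#     $hs = Get-View -Id $esx.ExtensionData.ConfigManager.HealthStatusSystem
#     $hs.Runtime.SystemHealthInfo.NumericSensorInfo |
#         Where-Object { $_.HealthState.Label -ne 'Green' -and $_.Name -notmatch 'Rollup' } |
#         Select-Object @{N='Host';E={$esx.Name}}, Name, @{N='Health';E={$_.HealthState.Label}}
# }
# Export-HealthReport -Rows $rows
```

Collecting the rows first keeps the query and the output format separate, so the same rows can go to both files.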


Blog: lucd.info  Twitter: @LucD22  Co-author PowerCLI Reference

NucleusVM
Enthusiast

Under the "Health" column the output is always "Unknown". Shouldn't it say something like "Healthy" or "Faulty"?

LucD
Leadership

Does it also show "Unknown" in the vSphere Client or the Web Client?


NucleusVM
Enthusiast

When I open the "Hardware Status" tab on the vSphere Client I get a list of 772 sensors.

If there are any alarms, they are shown there.

LucD
Leadership

The script can filter out "Unknown" as well

foreach($esx in Get-VMHost){
    $hs = Get-View -Id $esx.ExtensionData.ConfigManager.HealthStatusSystem
    $hs.Runtime.SystemHealthInfo.NumericSensorInfo |
        where{$_.HealthState.Label -notmatch "Green|Unknown" -and $_.Name -notmatch 'Rollup'} |
        Select @{N='Host';E={$esx.Name}},Name,@{N='Health';E={$_.HealthState.Label}}
}


NucleusVM
Enthusiast

Now it doesn't produce any output. It's as if everything is "Unknown".

Get-VMHost lists 43 ESXi servers, and one of them is the one with the faulty RAM.

[Screenshot: 1.jpg]

LucD
Leadership

Run the script, without the Where-clause, against that specific ESXi host, just to check what comes out.

# Replace <faulty-vmhost> with the name of the affected host
foreach($esx in Get-VMHost -Name '<faulty-vmhost>'){
    $hs = Get-View -Id $esx.ExtensionData.ConfigManager.HealthStatusSystem
    $hs.Runtime.SystemHealthInfo.NumericSensorInfo |
        Select @{N='Host';E={$esx.Name}},Name,@{N='Health';E={$_.HealthState.Label}},Rollup
}


NucleusVM
Enthusiast

Now everything under the "Health" column is listed as green, which is an improvement, but nothing indicates that there's a memory problem.

LucD
Leadership

When you open the Memory line with the alarm, which sensor shows the error?

The top alarm is a rollup, which the script didn't show.


NucleusVM
Enthusiast

[Screenshot: 2.jpg]

This is what it shows me. And when I scroll down to check all the DIMMs none of them has an alert.

LucD
Leadership

Ok, that seems to be a roll-up issue then.

Since the original script skipped the roll-ups, you didn't see it.


NucleusVM
Enthusiast

Yes, but now it doesn't skip the rollups, right?

And again, the output of the script gives no indication that there is a problem with the server's memory.

LucD
Leadership

Could this be an issue that only manifests itself in the vSphere Client?

Did you already restart your vSphere Client?

Or try with the Web Client?


NucleusVM
Enthusiast

I get the same thing in the web client

[Screenshot: 3.jpg]

LucD
Leadership

This is definitely an issue, but not really PowerCLI related as I see it.

Would a reset of the sensors be an option?


NucleusVM
Enthusiast

The goal here is to replace the manual checks with this PowerCLI script.

If vCenter shows an error and PowerCLI doesn't pick it up, then PowerCLI is not a reliable solution and can't be used.

I've reset the sensors and vSphere client still shows the faulty memory.

The other thing I've noticed is that vSphere client tells me that there are 768 sensors, but the output of the script only lists 395 lines. Could this be relevant?

LucD
Leadership

This is not a PowerCLI issue, all the more so since we obtain the sensor data directly from the vSphere API.

The other way of obtaining the sensor readouts, via CIM SMASH, returns approximately the same number of sensors.

In fact, if you export the Hardware Status sensors to an XML file, you will notice that the number of entries is also approximately the same as the number returned by the earlier script.

To me it looks as if the number of sensors shown on the page is off, or it is calculated in a different way.
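
To compare the numbers, the sensors returned by the API can be counted per sensor type. A rough sketch: Get-SensorCountByType is a made-up helper name, and it only assumes that each sensor object carries the SensorType property that NumericSensorInfo entries have.

```powershell
# Hypothetical helper (not a PowerCLI cmdlet): counts sensors per SensorType.
function Get-SensorCountByType {
    param(
        [object[]] $Sensor = @()   # e.g. $hs.Runtime.SystemHealthInfo.NumericSensorInfo
    )
    $Sensor |
        Group-Object -Property SensorType -NoElement |
        Sort-Object -Property Count -Descending |
        Select-Object @{N='SensorType';E={$_.Name}}, Count
}

# Usage, assuming a connected session and the host name in $name:
# $esx = Get-VMHost -Name $name
# $hs  = Get-View -Id $esx.ExtensionData.ConfigManager.HealthStatusSystem
# Get-SensorCountByType -Sensor $hs.Runtime.SystemHealthInfo.NumericSensorInfo
```

The per-type counts can then be set against the groups shown on the Hardware Status page to see where the totals diverge.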


NucleusVM
Enthusiast

Could there be other parameters that can be added to the script that will display additional data?

Any chance we're just not querying all the hardware on the server?

NucleusVM
Enthusiast

OK, it seems there were other elements that were not included in the script, and that's why it's not displaying the faulty memory module. I tried the below, and it showed me that it can actually detect the problem.

Can you help me add this to the script, and anything else it could be missing?

[Screenshot: 1.jpg]
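
The screenshot is not reproduced here, so which elements were queried is not known. One candidate, and this is an assumption rather than something confirmed in the thread, is Runtime.HardwareStatusInfo on the same HealthStatusSystem view, which in the vSphere API reports memory, CPU and storage element status separately from the numeric sensors. A sketch of folding that in, with Get-HardwareElementIssue as a made-up helper name:

```powershell
# Hypothetical helper (not a PowerCLI cmdlet): lists hardware elements whose status is not green.
# Assumes the object passed in is shaped like $hs.Runtime.HardwareStatusInfo.
function Get-HardwareElementIssue {
    param(
        [string] $HostName,
        $HardwareStatusInfo
    )
    foreach ($group in 'MemoryStatusInfo', 'CpuStatusInfo', 'StorageStatusInfo') {
        foreach ($element in $HardwareStatusInfo.$group) {
            # Status.Key is expected to be Green, Yellow, Red or Unknown
            if ($element.Status.Key -ne 'Green') {
                [pscustomobject]@{
                    Host   = $HostName
                    Type   = $group
                    Name   = $element.Name
                    Status = $element.Status.Key
                }
            }
        }
    }
}

# Usage, assuming a Connect-VIServer session is already open:
# foreach ($esx in Get-VMHost) {
#     $hs = Get-View -Id $esx.ExtensionData.ConfigManager.HealthStatusSystem
#     Get-HardwareElementIssue -HostName $esx.Name -HardwareStatusInfo $hs.Runtime.HardwareStatusInfo
# }
```

Elements with an Unknown status are kept in the output here; add a second condition if they should be filtered out like in the earlier script.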
