Is there a way to check all the ESXi hosts in a vCenter for hardware issues?
Currently I have to open each ESXi host and click the "Hardware Status" tab to see if there are any errors.
It would be much faster if I could run a script that outputs a report (HTML or CSV) and just check that.
I currently have an ESXi server with a memory issue, so it's a good opportunity to test a script.
Thanks
Try something like this:

    foreach ($esx in Get-VMHost) {
        $hs = Get-View -Id $esx.ExtensionData.ConfigManager.HealthStatusSystem
        $hs.Runtime.SystemHealthInfo.NumericSensorInfo |
            Where-Object { $_.HealthState.Label -ne 'Green' -and $_.Name -notmatch 'Rollup' } |
            Select-Object @{N='Host';E={$esx.Name}}, Name, @{N='Health';E={$_.HealthState.Label}}
    }
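Since the goal is an HTML or CSV report, here is a minimal sketch of how the rows from that loop could be written to files. The helper name Select-SensorAlert and the file names are made up for illustration; the filter is the same as in the script above, and Export-Csv and ConvertTo-Html are standard PowerShell cmdlets.

```powershell
# Hypothetical helper (the name is illustrative, not part of PowerCLI):
# keeps the non-green, non-rollup sensors and shapes them into
# Host/Name/Health rows.
function Select-SensorAlert {
    param(
        [Parameter(Mandatory)][string]$HostName,
        [object[]]$Sensor
    )
    if (-not $Sensor) { return }   # host returned no sensor data
    $Sensor |
        Where-Object { $_.HealthState.Label -ne 'Green' -and $_.Name -notmatch 'Rollup' } |
        Select-Object @{N='Host';E={$HostName}}, Name, @{N='Health';E={$_.HealthState.Label}}
}

# Live usage (needs PowerCLI and an open Connect-VIServer session):
# $report = foreach ($esx in Get-VMHost) {
#     $hs = Get-View -Id $esx.ExtensionData.ConfigManager.HealthStatusSystem
#     Select-SensorAlert -HostName $esx.Name -Sensor $hs.Runtime.SystemHealthInfo.NumericSensorInfo
# }
# $report | Export-Csv -Path .\hardware-health.csv -NoTypeInformation
# $report | ConvertTo-Html -Title 'ESXi hardware health' | Out-File .\hardware-health.html
```

Collecting the rows in $report first means the same data can feed both output formats.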
Blog: lucd.info Twitter: @LucD22 Co-author PowerCLI Reference
Under the "Health" column the output is always "Unknown". Shouldn't it say something like "Healthy" or "Faulty"?
Does it also show "Unknown" in the vSphere Client or Web Client?
When I open the "Hardware Status" tab on the vSphere Client I get a list of 772 sensors.
If there are any alarms, they are shown there.
The script can filter out "Unknown" as well:

    foreach ($esx in Get-VMHost) {
        $hs = Get-View -Id $esx.ExtensionData.ConfigManager.HealthStatusSystem
        $hs.Runtime.SystemHealthInfo.NumericSensorInfo |
            Where-Object { $_.HealthState.Label -notmatch 'Green|Unknown' -and $_.Name -notmatch 'Rollup' } |
            Select-Object @{N='Host';E={$esx.Name}}, Name, @{N='Health';E={$_.HealthState.Label}}
    }
Now it doesn't produce any output. It's as if everything is Unknown.
Get-VMHost lists 43 ESXi servers, and one of them is the one with the faulty RAM.
Run the script, without the Where-clause, against that specific ESXi host, just to check what comes out.

    foreach ($esx in Get-VMHost -Name '<faulty-vmhost>') {
        $hs = Get-View -Id $esx.ExtensionData.ConfigManager.HealthStatusSystem
        $hs.Runtime.SystemHealthInfo.NumericSensorInfo |
            Select-Object @{N='Host';E={$esx.Name}}, Name, @{N='Health';E={$_.HealthState.Label}}, Rollup
    }
Now everything under the "Health" column is listed as green, which is an improvement, but nothing indicates that there's a memory problem.
When you open the Memory line with the alarm, which sensor shows the error?
The top alarm is a rollup, which the script didn't show.
This is what it shows me. And when I scroll down to check all the DIMMs none of them has an alert.
Ok, that seems to be a roll-up issue then.
Since the original script skipped the roll-ups, you didn't see it.
Yes, but now it doesn't skip the rollups, right?
And again, the output of the script shows no indications that there is a problem with the server's memory.
Could this be an issue that only manifests itself in the vSphere Client?
Did you already restart your vSphere Client?
Or try with the Web Client?
I get the same thing in the Web Client.
This is definitely an issue, but not really PowerCLI related as I see it.
Would a reset of the sensors be an option?
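For reference, a reset can also be triggered from PowerCLI. This is a minimal sketch, assuming a connected session; '<faulty-vmhost>' is the same placeholder used earlier. ResetSystemHealthInfo and RefreshHealthStatusSystem are methods of the HostHealthStatusSystem object in the vSphere API.

```powershell
# Assumes PowerCLI is loaded and Connect-VIServer has been run.
$esx = Get-VMHost -Name '<faulty-vmhost>'   # placeholder, replace with the real host name
$hs  = Get-View -Id $esx.ExtensionData.ConfigManager.HealthStatusSystem

$hs.ResetSystemHealthInfo()       # clear sensor states that have latched
$hs.RefreshHealthStatusSystem()   # re-read the hardware health information
```

A reset clears latched sensor states on the host, so it is worth recording the current alarms before running it.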
The target here is that I want to replace the manual checks with this PowerCLI script.
If vCenter shows an error and PowerCLI doesn't pick it up, then PowerCLI is not a reliable solution and can't be used.
I've reset the sensors and vSphere client still shows the faulty memory.
The other thing I've noticed is that vSphere client tells me that there are 768 sensors, but the output of the script only lists 395 lines. Could this be relevant?
This is not a PowerCLI issue, especially since we obtain the sensor data directly from the vSphere API.
The other way of obtaining the sensor readouts, via CIM SMASH, returns approximately the same number of sensors.
In fact, if you export the Hardware Status sensors to an XML file, you will notice that the number of entries is also approximately the same as the number returned by the earlier script.
To me it looks as if the number of sensors shown on the page is off, or they calculate that number in a different way.
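To compare the numbers, here is a minimal sketch that counts what NumericSensorInfo returns, broken down by sensor type. Get-SensorTypeCount is a made-up helper name; SensorType is a property of each numeric sensor entry in the vSphere API.

```powershell
# Hypothetical helper (the name is illustrative): counts sensors per SensorType.
function Get-SensorTypeCount {
    param(
        [Parameter(Mandatory)][string]$HostName,
        [object[]]$Sensor
    )
    if (-not $Sensor) { return }   # host returned no sensor data
    $Sensor | Group-Object -Property SensorType | ForEach-Object {
        [pscustomobject]@{
            Host       = $HostName
            SensorType = $_.Name
            Count      = $_.Count
        }
    }
}

# Live usage (needs PowerCLI and an open Connect-VIServer session):
# $esx = Get-VMHost -Name '<faulty-vmhost>'
# $hs  = Get-View -Id $esx.ExtensionData.ConfigManager.HealthStatusSystem
# Get-SensorTypeCount -HostName $esx.Name -Sensor $hs.Runtime.SystemHealthInfo.NumericSensorInfo
```

If the per-type counts add up to the script's total but not to the figure in the client, the difference comes from entries that are not numeric sensors.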
Could there be other parameters that can be added to the script that will display additional data?
Any chance we're just not querying all the hardware on the server?
OK, it seems there were other elements that were not included in the script, and that's why it isn't displaying the faulty memory module. I tried the below, and it showed me that it can actually detect the problem.
Can you help me add this to the script, and anything else it could be missing?
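A minimal sketch of how such elements could be added, assuming they are the ones under Runtime.HardwareStatusInfo. The vSphere API exposes MemoryStatusInfo, CpuStatusInfo and StorageStatusInfo there, separately from the NumericSensorInfo list that the earlier scripts read. Get-HardwareElementStatus is a made-up helper name.

```powershell
# Hypothetical helper (the name is illustrative): flattens the memory, CPU and
# storage element lists of a HardwareStatusInfo object into report rows.
function Get-HardwareElementStatus {
    param(
        [Parameter(Mandatory)][string]$HostName,
        $HardwareStatusInfo
    )
    foreach ($category in 'MemoryStatusInfo', 'CpuStatusInfo', 'StorageStatusInfo') {
        # A category that is empty or missing simply yields no rows.
        foreach ($element in $HardwareStatusInfo.$category) {
            [pscustomobject]@{
                Host     = $HostName
                Category = $category -replace 'StatusInfo$'
                Name     = $element.Name
                Health   = $element.Status.Label
            }
        }
    }
}

# Live usage (needs PowerCLI and an open Connect-VIServer session):
# foreach ($esx in Get-VMHost) {
#     $hs = Get-View -Id $esx.ExtensionData.ConfigManager.HealthStatusSystem
#     Get-HardwareElementStatus -HostName $esx.Name -HardwareStatusInfo $hs.Runtime.HardwareStatusInfo |
#         Where-Object { $_.Health -ne 'Green' }
# }
```

These rows have the same Host, Name and Health columns as the sensor rows, so both sets can go into one report.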