Good morning,
When I run esxcli storage core device stats get, I get the following output:
mpx.vmhba1:C2:T2:L0
Device: mpx.vmhba1:C2:T2:L0
Successful Commands: 1879424953
Blocks Read: 8194120138
Blocks Written: 7770849709
Read Operations: 946209367
Write Operations: 933147753
Reserve Operations: 3
Reservation Conflicts: 0
Failed Commands: 631
Failed Blocks Read: 6534
Failed Blocks Written: 0
Failed Read Operations: 484
Failed Write Operations: 0
Failed Reserve Operations: 0
The stats show failed commands and failed blocks read. However, when I run esxcli storage core device smart get on the same device, it reports the health status as OK.
Is this an indication of drive failure? The iLO reports the drive as healthy.
esxcli storage core device smart get -d mpx.vmhba1:C2:T2:L0
Parameter                     Value  Threshold  Worst
----------------------------  -----  ---------  -----
Health Status                 OK     N/A        N/A
Media Wearout Indicator       N/A    N/A        N/A
Write Error Count             N/A    N/A        N/A
Read Error Count              130    39         130
Power-on Hours                100    0          100
Power Cycle Count             N/A    N/A        N/A
Reallocated Sector Count      100    1          100
Raw Read Error Rate           130    39         130
Drive Temperature             100    1          100
Driver Rated Max Temperature  N/A    N/A        N/A
Write Sectors TOT Count       N/A    N/A        N/A
Read Sectors TOT Count        N/A    N/A        N/A
Initial Bad Block Count       N/A    N/A        N/A
Thanks in advance
Hello Seamus,
"There are indications of failed commands and failed blocks read"
These can be benign: some commands from PSA or other layers are not supported by the end device's driver/firmware and/or return a response which *may* increment these counters. A basic way of validating this is to check whether you see similar counters on the other devices.
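To make that comparison less manual, here is a rough Python sketch that parses saved output of esxcli storage core device stats get and prints the failed-command ratio per device. It assumes the output has the "Counter: value" layout posted above; the function names (parse_device_stats, failure_ratio) are made up for illustration and are not part of any VMware tooling.

```python
def parse_device_stats(text):
    """Parse saved `esxcli storage core device stats get` output.

    Assumes each device block contains a "Device: <name>" line followed by
    "Counter Name: <integer>" lines, as in the output posted in this thread.
    Returns {device_name: {counter_name: int}}.
    """
    devices = {}
    current = None
    for raw in text.splitlines():
        key, sep, value = raw.strip().partition(": ")
        if not sep:
            continue  # blank line or bare device header (no "key: value" pair)
        if key == "Device":
            current = devices.setdefault(value, {})
        elif current is not None and value.isdigit():
            current[key] = int(value)
    return devices


def failure_ratio(counters):
    """Fraction of all commands (successful + failed) that failed."""
    failed = counters.get("Failed Commands", 0)
    total = failed + counters.get("Successful Commands", 0)
    return failed / total if total else 0.0


# Counters copied from the device posted above.
SAMPLE = """\
mpx.vmhba1:C2:T2:L0
   Device: mpx.vmhba1:C2:T2:L0
   Successful Commands: 1879424953
   Failed Commands: 631
   Failed Blocks Read: 6534
   Failed Read Operations: 484
"""

stats = parse_device_stats(SAMPLE)
for name, counters in stats.items():
    print(f"{name}: failed-command ratio {failure_ratio(counters):.2e}")
```

Run against the output for every device in the host, this gives one ratio per device. If they are all in the same ballpark, the counters are more likely the benign kind; one device standing well clear of the rest would be worth a closer look.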
"when I run esxcli storage core device smart get on the same device it says the health status is ok"
Again, what these show depends on the end device, but for a lot of devices (I'm going to hazard a guess at HPE servers here) these are fairly 'painted-on' numbers and don't provide useful information for troubleshooting.
Is there a particular reason you are looking at this device 'mpx.vmhba1:C2:T2:L0'?
More valid indications of issues with a device would be any non-benign sense codes (e.g. it is hitting medium errors, getting reset, or aborting IOs) or latency spikes.
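As a rough illustration of picking out sense codes, the Python sketch below looks for the "Valid sense data: <key> <ASC> <ASCQ>" triplet that vmkernel.log typically prints for failed SCSI commands and maps the sense key to its standard SCSI name. The sample line is made up for illustration (it is not from this environment), and the choice of which keys to flag is an assumption, not an official rule.

```python
import re

# Standard SCSI sense keys (subset).
SENSE_KEYS = {
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0xB: "ABORTED COMMAND",
}

# Assumption: these keys deserve attention, while e.g. ILLEGAL REQUEST is
# often just a command the device does not support.
WORTH_A_LOOK = {"MEDIUM ERROR", "HARDWARE ERROR", "ABORTED COMMAND"}

SENSE_RE = re.compile(
    r"Valid sense data: (0x[0-9a-fA-F]+) (0x[0-9a-fA-F]+) (0x[0-9a-fA-F]+)"
)


def classify(line):
    """Return (sense_key_name, asc, ascq) for a log line, or None if absent."""
    match = SENSE_RE.search(line)
    if match is None:
        return None
    key, asc, ascq = (int(part, 16) for part in match.groups())
    return SENSE_KEYS.get(key, "OTHER"), asc, ascq


# Illustrative fragment only; real vmkernel.log lines carry more fields.
sample_line = "failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0"

result = classify(sample_line)
if result is not None and result[0] in WORTH_A_LOOK:
    print(f"worth a look: {result[0]} (ASC 0x{result[1]:02x}, ASCQ 0x{result[2]:02x})")
```

Feeding each line of a saved vmkernel.log through classify and counting the results per device gives a quick picture of whether a device is only returning unsupported-command responses or is actually hitting medium or hardware errors.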
Bob
Thanks for replying, TheBobkin.
I just chose that drive because we were seeing alerts in vSAN.
My suspicion is that they are related to driver issues, as the environment has not been patched for well over a year, but we cannot patch it due to a COVID change freeze, and our call centre runs on it.
It is running 6.5 (build 7388607) and the firmware is also well out of date, so I was going to wait until the change freeze was over and get it updated to 6.7.
We are seeing the error below:
LSOMEventNotify:6956: Virtual SAN device 52dabb19-aa43-e70d-3aab-0e4f63bb13c7 has gone offline.
but it clears
Hello Seamus,
Yes indeed, it could be controller issues if the device is being power-on reset etc. Can you share/PM the vmkernel.log from that time so I can take a look?
Bob