Smart Information from Device used by vSAN

seamusobr1 · ‎04-29-2020

Good Morning

When I run esxcli storage core device stats get, I get the following output:

mpx.vmhba1:C2:T2:L0

Device: mpx.vmhba1:C2:T2:L0

Successful Commands: 1879424953

Blocks Read: 8194120138

Blocks Written: 7770849709

Read Operations: 946209367

Write Operations: 933147753

Reserve Operations: 3

Reservation Conflicts: 0

Failed Commands: 631

Failed Blocks Read: 6534

Failed Blocks Written: 0

Failed Read Operations: 484

Failed Write Operations: 0

Failed Reserve Operations: 0

There are indications of failed commands and failed blocks read; however, when I run esxcli storage core device smart get on the same device it says the health status is ok.

Just wondering if this is an indication of drive failure, as the iLO reports the drive as healthy.

esxcli storage core device smart get -d mpx.vmhba1:C2:T2:L0

Parameter                     Value  Threshold  Worst
----------------------------  -----  ---------  -----
Health Status                 OK     N/A        N/A
Media Wearout Indicator       N/A    N/A        N/A
Write Error Count             N/A    N/A        N/A
Read Error Count              130    39         130
Power-on Hours                100    0          100
Power Cycle Count             N/A    N/A        N/A
Reallocated Sector Count      100    1          100
Raw Read Error Rate           130    39         130
Drive Temperature             100    1          100
Driver Rated Max Temperature  N/A    N/A        N/A
Write Sectors TOT Count       N/A    N/A        N/A
Read Sectors TOT Count        N/A    N/A        N/A
Initial Bad Block Count       N/A    N/A        N/A

Thanks in advance

TheBobkin · ‎04-29-2020

Hello Seamus,

"There are indications of failed commands and failed blocks read"

These can be benign, as some commands from PSA or other layers are not supported by the end device driver/firmware and/or return a response which *may* increment these counters. A basic way of validating this is to check whether you see similar counters on the other devices.
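As a rough way to do that comparison across devices, here is a minimal Python sketch that parses the text of esxcli storage core device stats get and reports failed commands per million successful ones for each device. The function names are hypothetical, the parsing assumes the "Name: value" layout shown in the post above, and the sample text is just an excerpt of that output; treat it as a starting point rather than a supported tool.

```python
def parse_device_stats(text):
    """Parse 'esxcli storage core device stats get' text into {device: {counter: int}}.

    Assumes each counter is printed as 'Name: value' under a 'Device: <id>' line,
    as in the output posted above.
    """
    devices = {}
    current = None
    for raw in text.splitlines():
        key, sep, value = raw.strip().partition(": ")
        if not sep:
            continue  # blank lines and the bare device header line
        if key == "Device":
            current = devices.setdefault(value.strip(), {})
        elif current is not None and value.strip().isdigit():
            current[key] = int(value)
    return devices


def failed_per_million(counters):
    """Failed commands per million successful commands (0.0 if none succeeded)."""
    ok = counters.get("Successful Commands", 0)
    failed = counters.get("Failed Commands", 0)
    return failed / ok * 1_000_000 if ok else 0.0


# Excerpt of the output from the original post.
SAMPLE = """\
mpx.vmhba1:C2:T2:L0
   Device: mpx.vmhba1:C2:T2:L0
   Successful Commands: 1879424953
   Failed Commands: 631
   Failed Read Operations: 484
"""

if __name__ == "__main__":
    for name, counters in parse_device_stats(SAMPLE).items():
        print(f"{name}: {failed_per_million(counters):.3f} failed per million")
```

If every device on the host shows a similar small rate, that points towards the benign explanation; one device standing well apart from its neighbours would be more interesting.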

"when I run esxcli storage core device smart get on the same device it says the health status is ok"

Again, what these show depends on the end-device, but for a lot of devices (going to hazard a guess at HPE servers here) these are fairly 'painted-on' numbers and don't provide useful information for troubleshooting.

Is there a particular reason you are looking at this device 'mpx.vmhba1:C2:T2:L0'?

More valid indications of issues with devices would be any non-benign sense codes (e.g. it is hitting medium errors, getting reset, aborting IOs etc.) or latency spikes.
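To get a quick feel for which sense codes a device is returning, a minimal Python sketch like the one below can tally sense keys from log lines. It assumes the log prints the sense triplet as "sense data: 0x<key> 0x<asc> 0x<ascq>"; the sample lines are made-up fragments in that shape, not taken from this host, and the function name is hypothetical. The sense key names themselves come from the SCSI standard.

```python
import re
from collections import Counter

# Standard SCSI sense key names (subset).
SENSE_KEY_NAMES = {
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0xB: "ABORTED COMMAND",
}

# Assumed line shape: '... sense data: 0x3 0x11 0x0.'
SENSE_RE = re.compile(
    r"sense data: (0x[0-9a-fA-F]+) (0x[0-9a-fA-F]+) (0x[0-9a-fA-F]+)"
)


def tally_sense_keys(lines):
    """Count how often each sense key appears in the given log lines."""
    counts = Counter()
    for line in lines:
        match = SENSE_RE.search(line)
        if match:
            key = int(match.group(1), 16)
            counts[SENSE_KEY_NAMES.get(key, f"key 0x{key:x}")] += 1
    return counts


# Hypothetical fragments for illustration only.
SAMPLE_LINES = [
    "failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0.",
    "failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.",
    "failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.",
    "a line without any sense data",
]

if __name__ == "__main__":
    for name, count in tally_sense_keys(SAMPLE_LINES).most_common():
        print(f"{name}: {count}")
```

ILLEGAL REQUEST entries are often the unsupported-command case mentioned earlier, whereas MEDIUM ERROR or HARDWARE ERROR entries against one device deserve a closer look.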

Bob

seamusobr1 · ‎04-29-2020

Thanks for replying TheBobkin

I just chose that drive as we were seeing alerts in vSAN

My suspicion is that they are related to driver issues, as the environment has not been patched for well over a year, but we cannot patch it due to a COVID change freeze and our call centre runs on it.

It is running 6.5 7388607, and the firmware is also well out of date, so I was going to wait until the change freeze is over and get it updated to 6.7.

We are seeing the error below

LSOMEventNotify:6956: Virtual SAN device 52dabb19-aa43-e70d-3aab-0e4f63bb13c7 has gone offline.

but it clears

TheBobkin · ‎04-29-2020

Hello Seamus,

Yes indeed, it could be controller issues if the device is being Power-on reset etc. Can you share/PM the vmkernel.log from that time so I can take a look?

Bob
