VMware Cloud Community
seamusobr1
Enthusiast

Smart Information from Device used by vSAN

Good Morning

When I run esxcli storage core device stats get, I get the following output:

mpx.vmhba1:C2:T2:L0
   Device: mpx.vmhba1:C2:T2:L0
   Successful Commands: 1879424953
   Blocks Read: 8194120138
   Blocks Written: 7770849709
   Read Operations: 946209367
   Write Operations: 933147753
   Reserve Operations: 3
   Reservation Conflicts: 0
   Failed Commands: 631
   Failed Blocks Read: 6534
   Failed Blocks Written: 0
   Failed Read Operations: 484
   Failed Write Operations: 0
   Failed Reserve Operations: 0

The counters show failed commands and failed blocks read. However, when I run esxcli storage core device smart get on the same device, it reports the health status as OK.

Is this an indication of drive failure? The iLO also reports the drive as healthy.

esxcli storage core device smart get -d mpx.vmhba1:C2:T2:L0

Parameter                     Value  Threshold  Worst
----------------------------  -----  ---------  -----
Health Status                 OK     N/A        N/A
Media Wearout Indicator       N/A    N/A        N/A
Write Error Count             N/A    N/A        N/A
Read Error Count              130    39         130
Power-on Hours                100    0          100
Power Cycle Count             N/A    N/A        N/A
Reallocated Sector Count      100    1          100
Raw Read Error Rate           130    39         130
Drive Temperature             100    1          100
Driver Rated Max Temperature  N/A    N/A        N/A
Write Sectors TOT Count       N/A    N/A        N/A
Read Sectors TOT Count        N/A    N/A        N/A
Initial Bad Block Count       N/A    N/A        N/A

Thanks in advance

3 Replies
TheBobkin
Champion

Hello Seamus,

"There are indications of failed commands and failed blocks read"

These can be benign: some commands from the PSA or other layers are not supported by the end device's driver/firmware, and/or return a response which *may* increment these counters. A basic way of validating this is to check whether the other devices show similar counters.
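As a rough illustration of that comparison, here is a minimal Python sketch that parses the text output of esxcli storage core device stats get and reports failed commands per million successful commands for each device. The function names (parse_device_stats, failed_per_million) are made up for this example, the parser assumes the "Key: value" layout shown in the original post, and the sample contains only the one device posted above. In practice you would feed it the full output for all devices and compare the rows.

```python
def parse_device_stats(text):
    """Parse `esxcli storage core device stats get` output into {device: {counter: int}}.

    Assumes the layout from the post: a device header line, then indented
    "Key: value" lines. Other layouts are not handled.
    """
    devices = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        key, sep, value = line.partition(": ")
        if not sep:
            # No "key: value" pair, so treat the line as a device header.
            current = devices.setdefault(line, {})
        elif key == "Device":
            current = devices.setdefault(value, {})
        elif current is not None and value.isdigit():
            current[key] = int(value)
    return devices


def failed_per_million(counters):
    """Failed commands per million successful commands (0.0 if none succeeded)."""
    ok = counters.get("Successful Commands", 0)
    failed = counters.get("Failed Commands", 0)
    return failed * 1000000 / ok if ok else 0.0


# Counters copied from the original post (single device only).
SAMPLE = """\
mpx.vmhba1:C2:T2:L0
   Device: mpx.vmhba1:C2:T2:L0
   Successful Commands: 1879424953
   Read Operations: 946209367
   Write Operations: 933147753
   Failed Commands: 631
   Failed Blocks Read: 6534
   Failed Read Operations: 484
   Failed Write Operations: 0
"""

if __name__ == "__main__":
    for name, counters in parse_device_stats(SAMPLE).items():
        print(name, counters["Failed Commands"], failed_per_million(counters))
```

If every disk behind the same controller shows a similar small rate, that points towards unsupported commands rather than one failing drive. A single device that stands well apart from its neighbours would be more interesting.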

"when I run esxcli storage core device smart get on the same device it says the health status is ok"

Again, what these show depends on the end device, but for a lot of devices (I'm going to hazard a guess at HPE servers here) these are fairly 'painted-on' numbers and don't provide useful information for troubleshooting.

Is there a particular reason you are looking at this device 'mpx.vmhba1:C2:T2:L0'?

More valid indications of device issues would be non-benign sense codes (e.g. the device hitting medium errors, getting reset, or aborting IOs) or latency spikes.
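For anyone wanting to tally sense codes quickly, below is a small Python sketch that counts SCSI sense keys found in log lines. Treat it as a starting point under stated assumptions: it only matches lines containing the "H:0x.. D:0x.. P:0x.. Valid sense data: 0x.. 0x.. 0x.." pattern, the two example lines are made up and abbreviated purely to exercise the code (they are not from this host), and treating ILLEGAL REQUEST as likely benign is a rule of thumb rather than a guarantee. The sense key names themselves are the standard SCSI ones.

```python
import re
from collections import Counter

# Standard SCSI sense key names (subset).
SENSE_KEYS = {
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0xB: "ABORTED COMMAND",
}

# Assumption for this sketch: ILLEGAL REQUEST is usually just an unsupported command.
LIKELY_BENIGN = {"ILLEGAL REQUEST"}

FAILED_CMD = re.compile(
    r"H:(0x[0-9a-fA-F]+) D:(0x[0-9a-fA-F]+) P:(0x[0-9a-fA-F]+) "
    r"Valid sense data: (0x[0-9a-fA-F]+) (0x[0-9a-fA-F]+) (0x[0-9a-fA-F]+)"
)


def summarize_sense(lines):
    """Count failed commands per sense key name across the given log lines."""
    counts = Counter()
    for line in lines:
        match = FAILED_CMD.search(line)
        if not match:
            continue
        # Groups are host, device and plugin status, then sense key, ASC, ASCQ.
        sense_key = int(match.group(4), 16)
        counts[SENSE_KEYS.get(sense_key, "sense key 0x%x" % sense_key)] += 1
    return counts


def suspicious(counts):
    """Drop the sense keys this sketch assumes to be benign."""
    return {name: n for name, n in counts.items() if name not in LIKELY_BENIGN}


# Made-up, abbreviated lines for illustration only.
EXAMPLE_LINES = [
    'Cmd 0x28 to dev "mpx.vmhba1:C2:T2:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0.',
    'Cmd 0x1a to dev "mpx.vmhba1:C2:T2:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.',
    "an unrelated log line",
]

if __name__ == "__main__":
    print(suspicious(summarize_sense(EXAMPLE_LINES)))
```

To use it on real data, pass the lines of a copied vmkernel.log instead of EXAMPLE_LINES and check which sense keys dominate for the device in question.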

Bob

seamusobr1
Enthusiast

Thanks for replying TheBobkin

I just chose that drive as we were seeing alerts in vSAN

My suspicion is that they are related to driver issues, as the environment has not been patched for well over a year. We cannot patch it at the moment due to a COVID change freeze, and our call centre runs on it.

It is running 6.5 (build 7388607) and the firmware is also well out of date, so I was going to wait until the change freeze is over and then get it updated to 6.7.

We are seeing the error below:

LSOMEventNotify:6956: Virtual SAN device 52dabb19-aa43-e70d-3aab-0e4f63bb13c7 has gone offline.

but it clears

TheBobkin
Champion

Hello Seamus,

Yes indeed, it could be controller issues if the device is being power-on reset etc. Can you share/PM the vmkernel.log from that time so I can take a look?

Bob
