ozzproject
Contributor
Contributor

ESXI 6.7 very slow disk performance (extrem high latency)

Since the last update, the hard drive is extremely slow (6.7.0 Update 3 (Build 17700523)).
At the beginning the latency is still ok, then the values go extremely high.

ozzproject_0-1619982840613.png

 

See screenshot (light blue line!). Yesterday I had values up to 354000 ms.
I have already tried to deactivate the wmw-ahci driver, because it has already caused such problems.
Unfortunately, then my hard drive is no longer recognized.
Therefore I searched for alternative ahci drivers, unfortunately without success.
So I downgraded the AHCI driver to the previous version (VMW_bootbank_vmw-ahci_2.0.5-2vmw.670.3.132.17167734).
Still no better.

Any Suggestions?

Labels (2)
0 Kudos
9 Replies
paudieo
VMware Employee
VMware Employee

very hard to say what the cause is I would advise looking at the vmkernel logs  to see if you are seeing errors

also you can also look at esxtop and focus on storage performance metrics to understand any device latency or or kernel latency on the ESXi 

 

 

0 Kudos
ozzproject
Contributor
Contributor

Thanks for your help, paudieo.

I checked the log and found this error almost every 20 minutes. What does it mean?

2021-05-04T11:04:43.067Z cpu0:2097183)NMP: nmp_ThrottleLogForDevice:3872: Cmd 0x85 (0x459a40dee600, 2098995) to dev "mpx.vmhba33:C0:T0:L0" on path "vmhba33:C0:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x0 0x0. Act:NONE
2021-05-04T11:04:43.067Z cpu0:2097183)ScsiDeviceIO: 3483: Cmd(0x459a40dee600) 0x85, CmdSN 0x1416 from world 2098995 to dev "mpx.vmhba33:C0:T0:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x0 0x0.
2021-05-04T11:04:43.067Z cpu2:2097179)ScsiDeviceIO: 3483: Cmd(0x459a40dee600) 0x85, CmdSN 0x1417 from world 2098995 to dev "mpx.vmhba33:C0:T0:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x0 0x0.
2021-05-04T11:04:43.123Z cpu0:2097181)NMP: nmp_ThrottleLogForDevice:3872: Cmd 0x85 (0x459a40dee600, 2098995) to dev "mpx.vmhba33:C0:T0:L1" on path "vmhba33:C0:T0:L1" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x0 0x0. Act:NONE
2021-05-04T11:04:43.123Z cpu0:2097181)ScsiDeviceIO: 3483: Cmd(0x459a40dee600) 0x85, CmdSN 0x1418 from world 2098995 to dev "mpx.vmhba33:C0:T0:L1" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x0 0x0.
2021-05-04T11:04:43.124Z cpu0:2097181)ScsiDeviceIO: 3483: Cmd(0x459a40dee600) 0x85, CmdSN 0x1419 from world 2098995 to dev "mpx.vmhba33:C0:T0:L1" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x0 0x0.
2021-05-04T11:04:43.128Z cpu0:2097183)NMP: nmp_ThrottleLogForDevice:3872: Cmd 0x85 (0x459a40dee600, 2098995) to dev "mpx.vmhba32:C0:T0:L0" on path "vmhba32:C0:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x0 0x0. Act:NONE
2021-05-04T11:04:43.128Z cpu0:2097183)ScsiDeviceIO: 3483: Cmd(0x459a40dee600) 0x85, CmdSN 0x141a from world 2098995 to dev "mpx.vmhba32:C0:T0:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x0 0x0.
2021-05-04T11:04:43.128Z cpu0:2097183)ScsiDeviceIO: 3483: Cmd(0x459a40dee600) 0x85, CmdSN 0x141b from world 2098995 to dev "mpx.vmhba32:C0:T0:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x0 0x0.
2021-05-04T11:05:28.030Z cpu7:2097640)NMP: nmp_ResetDeviceLogThrottling:3586: last error status from device mpx.vmhba33:C0:T0:L1 repeated 1 times
2021-05-04T11:05:28.030Z cpu7:2097640)NMP: nmp_ResetDeviceLogThrottling:3586: last error status from device mpx.vmhba33:C0:T0:L0 repeated 1 times
2021-05-04T11:05:28.030Z cpu7:2097640)NMP: nmp_ResetDeviceLogThrottling:3586: last error status from device mpx.vmhba32:C0:T0:L0 repeated 1 times
2021-05-04T11:07:28.028Z cpu7:2097695)DVFilter: 6068: Checking disconnected filters for timeouts
2021-05-04T11:17:28.017Z cpu5:2097695)DVFilter: 6068: Checking disconnected filters for timeouts
2021-05-04T11:27:28.006Z cpu5:2097695)DVFilter: 6068: Checking disconnected filters for timeouts

 

I also found another error regarding the other hard disk in the /var/log/vmkwarning.log file.

The error occurs very often and seems to occur more and more frequently and also much longer.

2021-05-01T17:07:56.340Z cpu2:2097181)WARNING: ScsiDeviceIO: 1564: Device t10.ATA_____KingFast________________________________AA000000000000000612 performance has deteriorated. I/O latency increased from average value of 797 microseconds
2021-05-01T17:07:56.340Z cpu2:2097181)WARNING: to 16610 microseconds.

0 Kudos
ozzproject
Contributor
Contributor


Thank you for your help. I have found some entries that occur often lately and also longer. What exactly does it mean?

2021-05-04T12:34:45.079Z cpu7:2097181)NMP: nmp_ThrottleLogForDevice:3872: Cmd 0x85 (0x459a40bd4c80, 2098995) to dev "mpx.vmhba33:C0:T0:L0" on path "vmhba33:C0:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x0 0x0. Act:NONE
2021-05-04T12:34:45.079Z cpu7:2097181)ScsiDeviceIO: 3483: Cmd(0x459a40bd4c80) 0x85, CmdSN 0x144c from world 2098995 to dev "mpx.vmhba33:C0:T0:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x0 0x0.
2021-05-04T12:34:45.079Z cpu7:2097181)ScsiDeviceIO: 3483: Cmd(0x459a40bd4c80) 0x85, CmdSN 0x144d from world 2098995 to dev "mpx.vmhba33:C0:T0:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x0 0x0.
2021-05-04T12:34:45.092Z cpu7:2097181)NMP: nmp_ThrottleLogForDevice:3872: Cmd 0x85 (0x459a40bd4c80, 2098995) to dev "mpx.vmhba33:C0:T0:L1" on path "vmhba33:C0:T0:L1" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x0 0x0. Act:NONE
2021-05-04T12:34:45.092Z cpu7:2097181)ScsiDeviceIO: 3483: Cmd(0x459a40bd4c80) 0x85, CmdSN 0x144e from world 2098995 to dev "mpx.vmhba33:C0:T0:L1" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x0 0x0.
2021-05-04T12:34:45.092Z cpu7:2097181)ScsiDeviceIO: 3483: Cmd(0x459a40bd4c80) 0x85, CmdSN 0x144f from world 2098995 to dev "mpx.vmhba33:C0:T0:L1" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x0 0x0.
2021-05-04T12:34:45.096Z cpu7:2097181)NMP: nmp_ThrottleLogForDevice:3872: Cmd 0x85 (0x459a40bd4c80, 2098995) to dev "mpx.vmhba32:C0:T0:L0" on path "vmhba32:C0:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x0 0x0. Act:NONE
2021-05-04T12:34:45.096Z cpu7:2097181)ScsiDeviceIO: 3483: Cmd(0x459a40bd4c80) 0x85, CmdSN 0x1450 from world 2098995 to dev "mpx.vmhba32:C0:T0:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x0 0x0.
2021-05-04T12:34:45.096Z cpu7:2097181)ScsiDeviceIO: 3483: Cmd(0x459a40bd4c80) 0x85, CmdSN 0x1451 from world 2098995 to dev "mpx.vmhba32:C0:T0:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x0 0x0.
2021-05-04T12:35:27.933Z cpu7:2097640)NMP: nmp_ResetDeviceLogThrottling:3586: last error status from device mpx.vmhba33:C0:T0:L1 repeated 1 times
2021-05-04T12:35:27.933Z cpu7:2097640)NMP: nmp_ResetDeviceLogThrottling:3586: last error status from device mpx.vmhba33:C0:T0:L0 repeated 1 times
2021-05-04T12:35:27.933Z cpu7:2097640)NMP: nmp_ResetDeviceLogThrottling:3586: last error status from device mpx.vmhba32:C0:T0:L0 repeated 1 times


I also found another error regarding the other hard disk in the /var/log/vmkwarning.log file.
The error occurs very often and seems to occur more and more frequently and also much longer.


2021-01-29T17:26:48.021Z cpu0:2097738)WARNING: ScsiDeviceIO: 1564: Device t10.ATA_____ST2000LM0152D2E8174__________________________________WDZ1BAXV performance has deteriorated. I/O latency increased from average value of 14400
2021-01-29T17:26:48.021Z cpu0:2097738)WARNING: microseconds to 433749 microseconds.

0 Kudos
paudieo
VMware Employee
VMware Employee

2021-05-04T12:34:45.096Z cpu7:2097181)NMP: nmp_ThrottleLogForDevice:3872: Cmd 0x85 (0x459a40bd4c80, 2098995) to dev "mpx.vmhba32:C0:T0:L0" on path "vmhba32:C0:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x0 0x0. Act:NONE

means ESXi is sending a SCSI cmd 0x85 (which means its checking for if its capabile of supporting ATA PASS-THROUGH cmds)  to the device mpx.vmhba32:C0:T0:L0 , (this could be a USB / CDROM or other such virtual device)

The device, did not like the command and in return responded back with  check condition D0x2, saying its an illegal command  sense code 0x5

In summary ESXi is probing the device and the device says I can't support that command

nothing to worry about this is common log entry on esxi

see KB https://kb.vmware.com/s/article/289902 about interpreting more SCSI codes

 

However this is not good

2021-01-29T17:26:48.021Z cpu0:2097738)WARNING: ScsiDeviceIO: 1564: Device t10.ATA_____ST2000LM0152D2E8174__________________________________WDZ1BAXV performance has deteriorated. I/O latency increased from average value of 14400
2021-01-29T17:26:48.021Z cpu0:2097738)WARNING: microseconds to 433749 microseconds.

 

This means the reported device latency from this device went from 14ms to a whopping 433milliseconds 

again this is just confirming what we know

What kind of device is this t10.ATA_____ST2000LM0152D2E8174__________________________________WDZ1BAXV ?

 

 

 

 

0 Kudos
ozzproject
Contributor
Contributor

The disk ST2000LM015-2E81 is an 2000GB Seagate BarraCuda ST2000LM015 128MB 2.5" (6.4cm) SATA 6Gb/s. This run quite good in the past.

0 Kudos
paudieo
VMware Employee
VMware Employee

I see magnetic disk then

ok so there should be a SCSI controller managing the device like vmhba0 or vmhba1, to confirm if what type of scsi controller you're using

You mentioned earlier you changed the ahci driver. 

you should take a look at esxtop as well to find out what workload you are generating and the latency you're getting

https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.monitoring.doc/GUID-2F960309-58FA-4...

 

0 Kudos
ozzproject
Contributor
Contributor

I can not find a scsi controler. There only this 3 ones.

vmhba0
* is listed as Sunrise Point-LP AHCI Controller --> driver vmw_ahci

vmhba32 and vmhba33
* listed as USB Storage Controller --> driver vmkusb


vmhba0 is listed as ahci. During the last update of ESXi 6.7 the ahci driver was changed. The symptoms were the same as with the update within ESXXi 6.5, where the driver also had a problem. That was the reason to replace the driver with the previous version.
without success.
Currently I use the current ESXi 6.7 version with the included vib.


The machine on this hard disk unfortunately crashed and I have to rebuild it. But this was not possible so far because of the latency. It is a mysql server with 20 database.
The current esxtop are not meaningful.

 

DEVICE PATH/WORLD/PARTITION DQLEN WQLEN ACTV QUED %USD LOAD CMDS/s READS/s WRITES/s MBREAD/s MBWRTN/s DAVG/cmd KAVG/cmd GAVG/cmd QAVG/cmd
mpx.vmhba32:C0:T0:L0 - 1 - 0 0 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
mpx.vmhba33:C0:T0:L0 - 1 - 0 0 0 0.00 9.14 5.83 3.30 0.52 0.07 4.77 0.41 5.18 0.39
mpx.vmhba33:C0:T0:L1 - 1 - 0 0 0 0.00 101.66 0.00 101.66 0.00 1.00 0.14 0.09 0.23 0.08
t10.ATA_____KingFast________________ - 31 - 0 0 0 0.00 0.39 0.19 0.19 0.00 0.00 0.03 0.02 0.05 0.00
t10.ATA_____ST2000LM0152D2E8174_____ - 31 - 5 0 16 0.16 45.10 13.22 31.88 0.12 4.10 183.70 0.01 186.71 0.00

 

0 Kudos
paudieo
VMware Employee
VMware Employee

Hi

Looking at the HCL 

doing a search for Sunrise Point-LP

https://www.vmware.com/resources/compatibility/detail.php?deviceCategory=io&productid=42130&deviceCa...

Reading the footnotes for 6.7 U3 there was a patch see https://docs.vmware.com/en/VMware-vSphere/6.7/rn/esxi670-202103001.htm

see if it's applicable

from the esxtop output there does not seem to be much IO activity to really judge, so its not really useful

DEVICE PATH/WORLD/PARTITION DQLEN WQLEN ACTV QUED %USD LOAD CMDS/s READS/s WRITES/s MBREAD/s MBWRTN/s DAVG/cmd KAVG/cmd GAVG/cmd QAVG/cmd
t10.ATA_____ST2000LM0152D2E8174_____ - 31 - 5 0 16 0.16 45.10 13.22 31.88 0.12 4.10 183.70 0.01 186.71 0.00

however from what I see there seems to be minimal latency introduced by the ESXi itself (KAVG/cmd) while DAVG/cmd shows quite high latency (186.70)

 

I suggest is to test the performance on 6.5  on  gather the performance data (see KB https://kb.vmware.com/s/article/653
Then repeat on 6.7 U3 and gather more perf data to compare between the two versions (with the patch) 
If you have a support contract , you could engage Global Services to see if they are aware of any issues

0 Kudos
ozzproject
Contributor
Contributor

Hi paudieo,
thank you very much for the help.
I have already tried the patch...
Unfortunately also without success.

I wasted too much time in the search in the meantime.
At the beginning I had Windows in suspicion, because here some updates make problems, but the Linux machine had no abnormalities. Later, my second Esxi has become so slow that I have taken it out of service.
The other machine still runs very slow with the described problems. This is the second time I have had trouble with the drivers on Esxi. It was the same with version 6.5 U2 and now again.
I am more than disappointed.
Therefore, I have switched to ProXmox and move all the machines.

0 Kudos