esxtop output with huge number -- error?

Matthiker · ‎03-30-2011

I have 4 ESX 4.0 hosts running around 100 VMs with vCenter and have been experiencing performance issues for a couple of weeks.

All VMs are very slow during certain period about 1 or 2 hours every day, and it changes sometimes.

After having analysized with SAN tools, esxtop tools and some scripts inside VMs, I identified the problem. It's about High Disk Avg. Latency on a Dell PS6000E Storage.

the Latency pattern presented in esxtop tools matched SAN Utility tools, sometime with high IOPS, sometime with high MB Read/sec and MB Written/sec, but not easy to identify which VM or group of VMs caused the problem.

I wrote a script on every ESX host, collecting esxtop every 10 seconds for 1 hour whenever a High "Physical Disk\(vmhba36\)\\Average Driver MilliSec" detected.

And, now I found out when it becomes slow, the output of esxtop got something interesting every time. I awk the output for Human read, only taking several measurements.

Machine Commands Reads Writes Read MBs Written MBs Latency ESX Host

VM-name1 5462836927972282368 5462836927972308992 5462836927972466688 5.20977e+12 5.20977e+12 17318.8 ESX01

VM-name2 3518437297766401024 3518437297766401024 3518437297766402048 3.35544e+12 3.35544e+12 12120.1 ESX02

helper 7.04 2334.75 2.89 2331.87 0.01 103.46 ESX01

VM-name3 49458.6 41820.9 7637.62 689.17 274.14 722.07 ESX03

VM-name4 17584.6 11340 6244.58 463.47 342 12301.6 ESX04

............

Does anyone see this before?

Matthiker · ‎03-30-2011

Please see the attached picture for formatted detailed output.

Matthiker · ‎03-30-2011

I also found the followin logs in vmkernel, generally it appears there, sometimes not.

Mar 31 08:45:59 ESXHOST01 vmkernel: 107:02:15:15.314 cpu4:4229)NMP: nmp_CompleteCommandForPath: Command 0x28 (0x4100010d1bc0) to NMP device "naa.6090a0684007fb15e7bb94010000f075" fai
led on physical path "vmhba36:C0:T0:L0" H:0x5 D:0x0 P:0x0 Possible sense data: 0x2 0x3a 0x0.
Mar 31 08:45:59 ESXHOST01 vmkernel: 107:02:15:15.314 cpu4:4229)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe: NMP device "naa.6090a0684007fb15e7bb94010000f075" state in doubt; reque
sted fast path state update...
Mar 31 08:45:59 ESXHOST01 vmkernel: 107:02:15:15.314 cpu4:4229)ScsiDeviceIO: 747: Command 0x28 to device "naa.6090a0684007fb15e7bb94010000f075" failed H:0x5 D:0x0 P:0x0 Possible sens
e data: 0x2 0x3a 0x0.
Mar 31 08:45:59 ESXHOST01 vmkernel: 107:02:15:15.315 cpu3:4119)J3: 2158: Aborting txn 0x4100b711c010 due to failure pre-committing: Optimistic lock acquired by another host
Mar 31 08:46:01 ESXHOST01 vmkernel: 107:02:15:17.685 cpu7:4229)NMP: nmp_CompleteCommandForPath: Command 0x28 (0x4100010ef6c0) to NMP device "naa.6090a0684007fb15e7bb94010000f075" fai
led on physical path "vmhba36:C0:T0:L0" H:0x5 D:0x0 P:0x0 Possible sense data: 0x2 0x3a 0x0.
Mar 31 08:46:01 ESXHOST01 vmkernel: 107:02:15:17.685 cpu7:4229)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe: NMP device "naa.6090a0684007fb15e7bb94010000f075" state in doubt; reque
sted fast path state update...
Mar 31 08:46:01 ESXHOST01 vmkernel: 107:02:15:17.685 cpu7:4229)ScsiDeviceIO: 747: Command 0x28 to device "naa.6090a0684007fb15e7bb94010000f075" failed H:0x5 D:0x0 P:0x0 Possible sens
e data: 0x2 0x3a 0x0.
Mar 31 08:46:01 ESXHOST01 vmkernel: 107:02:15:17.686 cpu3:4119)J3: 2158: Aborting txn 0x4100b711c010 due to failure pre-committing: Optimistic lock acquired by another host

idle-jam · ‎03-30-2011

perhaps the value is too much to be display on screen. what not use software like esxtop to look in more graphical manner? http://labs.vmware.com/flings/esxplot

Matthiker · ‎03-30-2011

Have tried esxplot before, this value looks like a straight line down from sky on the chart.

Matthiker · ‎03-30-2011

The problem is not how to analysing the logs, but the huge number in the output of esxtop.

Matthiker · ‎03-31-2011

nobody had encountered this?

Matthiker · ‎05-12-2011

After having upgraded to ESX 4.1, everything is fine.

Matthiker · ‎05-12-2011

DONE