I have 4 ESX 4.0 hosts running around 100 VMs with vCenter and have been experiencing performance issues for a couple of weeks.
All VMs are very slow during certain period about 1 or 2 hours every day, and it changes sometimes.
After having analysized with SAN tools, esxtop tools and some scripts inside VMs, I identified the problem. It's about High Disk Avg. Latency on a Dell PS6000E Storage.
the Latency pattern presented in esxtop tools matched SAN Utility tools, sometime with high IOPS, sometime with high MB Read/sec and MB Written/sec, but not easy to identify which VM or group of VMs caused the problem.
I wrote a script on every ESX host, collecting esxtop every 10 seconds for 1 hour whenever a High "Physical Disk\(vmhba36\)\\Average Driver MilliSec" detected.
And, now I found out when it becomes slow, the output of esxtop got something interesting every time. I awk the output for Human read, only taking several measurements.
Machine Commands Reads Writes Read MBs Written MBs Latency ESX Host
VM-name1 5462836927972282368 5462836927972308992 5462836927972466688 5.20977e+12 5.20977e+12 17318.8 ESX01
VM-name2 3518437297766401024 3518437297766401024 3518437297766402048 3.35544e+12 3.35544e+12 12120.1 ESX02
helper 7.04 2334.75 2.89 2331.87 0.01 103.46 ESX01
VM-name3 49458.6 41820.9 7637.62 689.17 274.14 722.07 ESX03
VM-name4 17584.6 11340 6244.58 463.47 342 12301.6 ESX04
............
Does anyone see this before?
I also found the followin logs in vmkernel, generally it appears there, sometimes not.
Mar 31 08:45:59 ESXHOST01 vmkernel: 107:02:15:15.314 cpu4:4229)NMP: nmp_CompleteCommandForPath: Command 0x28 (0x4100010d1bc0) to NMP device "naa.6090a0684007fb15e7bb94010000f075" fai
led on physical path "vmhba36:C0:T0:L0" H:0x5 D:0x0 P:0x0 Possible sense data: 0x2 0x3a 0x0.
Mar 31 08:45:59 ESXHOST01 vmkernel: 107:02:15:15.314 cpu4:4229)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe: NMP device "naa.6090a0684007fb15e7bb94010000f075" state in doubt; reque
sted fast path state update...
Mar 31 08:45:59 ESXHOST01 vmkernel: 107:02:15:15.314 cpu4:4229)ScsiDeviceIO: 747: Command 0x28 to device "naa.6090a0684007fb15e7bb94010000f075" failed H:0x5 D:0x0 P:0x0 Possible sens
e data: 0x2 0x3a 0x0.
Mar 31 08:45:59 ESXHOST01 vmkernel: 107:02:15:15.315 cpu3:4119)J3: 2158: Aborting txn 0x4100b711c010 due to failure pre-committing: Optimistic lock acquired by another host
Mar 31 08:46:01 ESXHOST01 vmkernel: 107:02:15:17.685 cpu7:4229)NMP: nmp_CompleteCommandForPath: Command 0x28 (0x4100010ef6c0) to NMP device "naa.6090a0684007fb15e7bb94010000f075" fai
led on physical path "vmhba36:C0:T0:L0" H:0x5 D:0x0 P:0x0 Possible sense data: 0x2 0x3a 0x0.
Mar 31 08:46:01 ESXHOST01 vmkernel: 107:02:15:17.685 cpu7:4229)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe: NMP device "naa.6090a0684007fb15e7bb94010000f075" state in doubt; reque
sted fast path state update...
Mar 31 08:46:01 ESXHOST01 vmkernel: 107:02:15:17.685 cpu7:4229)ScsiDeviceIO: 747: Command 0x28 to device "naa.6090a0684007fb15e7bb94010000f075" failed H:0x5 D:0x0 P:0x0 Possible sens
e data: 0x2 0x3a 0x0.
Mar 31 08:46:01 ESXHOST01 vmkernel: 107:02:15:17.686 cpu3:4119)J3: 2158: Aborting txn 0x4100b711c010 due to failure pre-committing: Optimistic lock acquired by another host
perhaps the value is too much to be display on screen. what not use software like esxtop to look in more graphical manner? http://labs.vmware.com/flings/esxplot
Have tried esxplot before, this value looks like a straight line down from sky on the chart.
The problem is not how to analysing the logs, but the huge number in the output of esxtop.
nobody had encountered this?
After having upgraded to ESX 4.1, everything is fine.
DONE