VMware Cloud Community
Argyle
Enthusiast

Path thrashing. How to identify it? Slow disk performance with file copy.

Hi.

Where should I start looking to identify I/O issues, and what commands and tools are available in ESX? I've checked the vmkwarning and vmkernel logs and haven't found any specific I/O errors there.
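
For reference, this is roughly how I've been scanning the logs from the service console so far (just a sketch; the paths are the ESX 3.x defaults):

# count SCSI-related lines in each log
grep -c -i "scsi" /var/log/vmkernel /var/log/vmkwarning

# show the most recent SCSI-related lines
grep -i "scsi" /var/log/vmkernel | tail -20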

I suspect I have an issue with path thrashing or some other SAN- or HBA-related problem.

My problem is that when I copy, say, a 500 MB file from one LUN to another, or even from and to the exact same LUN (within a Windows 2000 or 2003 VM), the copy process freezes for a few seconds now and then and the copy is slow. The entire guest OS freezes too; you can't click on other Explorer folders, etc. I've done the file copy within the ESX service console and the process is slow there as well.

On some LUNs the copy process is OK, but not on others, which is why I suspect path thrashing; I'm just not sure how to identify whether it's occurring. Tests with sqlio.exe show a noticeable difference in sequential reads, while random and smaller reads look about the same.

System:

Hardware:    6 x HP ProLiant BL25p G1
CPU:         2 x 2.4 GHz AMD Opteron 280 (dual core)
RAM:         16 GB
Enclosure:   Enhanced p-Class Server Blade Enclosure
HBA:         2 x QLogic Corp QLA2312/2340 (rev 02)
ESX:         3.0.1 with patches up to 2006-12-28
SAN:         HP EVA5000 with VCS 3.028, active/passive, with "CUSTOM type 000000002200282E"
Multipath:   MRU

esxcfg-mpath -l output:

----


Disk vmhba0:0:0 /dev/cciss/c0d0 (69459MB) has 1 paths and policy of Fixed

Local 3:2.0 vmhba0:0:0 On active preferred

RAID Controller (SCSI-3) vmhba1:0:0 (0MB) has 4 paths and policy of Most Recently Used

FC 6:1.0 50060b0000885b52<->50001fe150027d3c vmhba1:0:0 On active preferred

FC 6:1.0 50060b0000885b52<->50001fe150027d38 vmhba1:1:0 Standby

FC 6:1.1 50060b0000885b53<->50001fe150027d3d vmhba2:0:0 Standby

FC 6:1.1 50060b0000885b53<->50001fe150027d39 vmhba2:1:0 Standby

Disk vmhba1:0:1 /dev/sda (102400MB) has 4 paths and policy of Most Recently Used

FC 6:1.0 50060b0000885b52<->50001fe150027d3c vmhba1:0:1 On active preferred

FC 6:1.0 50060b0000885b52<->50001fe150027d38 vmhba1:1:1 Standby

FC 6:1.1 50060b0000885b53<->50001fe150027d3d vmhba2:0:1 On

FC 6:1.1 50060b0000885b53<->50001fe150027d39 vmhba2:1:1 Standby

Disk vmhba1:0:2 /dev/sdb (153600MB) has 4 paths and policy of Most Recently Used

FC 6:1.0 50060b0000885b52<->50001fe150027d3c vmhba1:0:2 On active preferred

FC 6:1.0 50060b0000885b52<->50001fe150027d38 vmhba1:1:2 Standby

FC 6:1.1 50060b0000885b53<->50001fe150027d3d vmhba2:0:2 On

FC 6:1.1 50060b0000885b53<->50001fe150027d39 vmhba2:1:2 Standby

Disk vmhba1:0:3 /dev/sdc (102400MB) has 4 paths and policy of Most Recently Used

FC 6:1.0 50060b0000885b52<->50001fe150027d3c vmhba1:0:3 On active preferred

FC 6:1.0 50060b0000885b52<->50001fe150027d38 vmhba1:1:3 Standby

FC 6:1.1 50060b0000885b53<->50001fe150027d3d vmhba2:0:3 On

FC 6:1.1 50060b0000885b53<->50001fe150027d39 vmhba2:1:3 Standby

Disk vmhba1:0:4 /dev/sdd (102400MB) has 4 paths and policy of Most Recently Used

FC 6:1.0 50060b0000885b52<->50001fe150027d3c vmhba1:0:4 On active preferred

FC 6:1.0 50060b0000885b52<->50001fe150027d38 vmhba1:1:4 Standby

FC 6:1.1 50060b0000885b53<->50001fe150027d3d vmhba2:0:4 On

FC 6:1.1 50060b0000885b53<->50001fe150027d39 vmhba2:1:4 Standby

Disk vmhba1:0:5 /dev/sde (153600MB) has 4 paths and policy of Most Recently Used

FC 6:1.0 50060b0000885b52<->50001fe150027d3c vmhba1:0:5 On active preferred

FC 6:1.0 50060b0000885b52<->50001fe150027d38 vmhba1:1:5 Standby

FC 6:1.1 50060b0000885b53<->50001fe150027d3d vmhba2:0:5 On

FC 6:1.1 50060b0000885b53<->50001fe150027d39 vmhba2:1:5 Standby

Disk vmhba1:0:6 /dev/sdf (153600MB) has 4 paths and policy of Most Recently Used

FC 6:1.0 50060b0000885b52<->50001fe150027d3c vmhba1:0:6 On active preferred

FC 6:1.0 50060b0000885b52<->50001fe150027d38 vmhba1:1:6 Standby

FC 6:1.1 50060b0000885b53<->50001fe150027d3d vmhba2:0:6 On

FC 6:1.1 50060b0000885b53<->50001fe150027d39 vmhba2:1:6 Standby

There is no load on the servers or on the LUNs dedicated to this new cluster. All blades experience the same issue. The SAN is used by other systems.

I've run SQLIO.exe from Microsoft to test performance against an 8 GB file within a Windows 2003 VM. The last four columns are latency figures: min/avg/max in ms and the percentage of I/Os taking longer than 24 ms.
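
For anyone who wants to reproduce the numbers, a typical sqlio.exe run for one of the rows below looks something like this (a sketch; the file name, duration and thread count are placeholders, not necessarily the exact parameters I used):

REM 64KB sequential reads, 8 outstanding I/Os, 60 seconds, latency stats, no buffering
sqlio -kR -fsequential -b64 -o8 -s60 -t1 -LS -BN testfile.dat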

Performance some days ago, which was OK:

Test Type               IO/s   MB/s   Min(ms)   Avg(ms)   Max(ms)   >24ms (%)
Read 8KB random         7218     56         0         4       596           1
Read 64KB random        2840    179         0        10       416           3
Read 128KB random       1424    178         2        21       575          27
Read 256KB random        708    177         4        44       223          99
Read 8KB sequential    11775     92         0         2       213           1
Read 64KB sequential    2761    173         0        11       240           3
Read 128KB sequential   1366    170         2        22       128          17
Read 256KB sequential    665    166        12        47       158         100

Performance now with slow file copy:

Test Type               IO/s   MB/s   Min(ms)   Avg(ms)   Max(ms)   >24ms (%)
Read 8KB random         6177     48         0         4       926           3
Read 64KB random        2444    152         0        12      1355           5
Read 128KB random       1302    162         2        24      1105          35
Read 256KB random        250     64         4       123      1949          98
Read 8KB sequential     2450     19         0        12      1706          12
Read 64KB sequential    1130     70         0        27      1187          67
Read 128KB sequential    482     60         2        65       818         100
Read 256KB sequential     92     73        34       108       901         100

The most noticeable differences are the slower sequential reads, the much higher latencies, and the far larger percentage of waits longer than 24 ms.

Any suggestions on what counters or logs to look at when it comes to the SAN array controllers, switches and HBAs?

bretti
Expert

Path thrashing is hard to catch from a host point of view. What does the SAN side of things see? Do you have any directors or supervisors failing on your SAN?

You might also try esxtop; switch to the disk mode by pressing "d".
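
If you want to keep the numbers for later, esxtop also has a batch mode (a sketch; -b is batch mode, -d the delay in seconds, -n the number of samples):

# capture 30 samples, 5 seconds apart, for offline review
esxtop -b -d 5 -n 30 > /tmp/esxtop-capture.csv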

KnowItAll
Hot Shot

Have you called HP to find out if they have a newer version of firmware for your EVA?

Are you using 4Gb or 2Gb switches?

Check vmkernel and vmkwarning for 24/0/0x0/0x00/0x00 scsi reservation warnings.
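
Something like this from the service console should turn them up if they are being logged (just a sketch):

# look for reservation-related warnings and the 24/0 status in both logs
egrep -i "reservation|24/0" /var/log/vmkernel /var/log/vmkwarning | tail -20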

christianZ
Champion

Just a simple hint -

Are these VMs in production? Have you checked the fragmentation?

Argyle
Enthusiast

Path thrashing is hard to catch from a host point of view. What does the SAN side of things see? Do you have any directors or supervisors failing on your SAN?

Not sure I know what directors and supervisors are when it comes to SANs. What I can see on the SAN side, via the performance monitor counters, is that all traffic comes in on the same storage processor; it is not switching between the two that exist.

You might also try esxtop; switch to the disk mode by pressing "d".

I'll be going through all the ESX hosts to see if I can identify any high usage there.

----


Have you called HP to find out if they have a newer version of firmware for your EVA?

A newer version exists (it would switch the controllers over to active/active instead of active/passive), but it has worked fine so far with the current version.

Are you using 4Gb or 2Gb switches?

We have 2Gb switches

Check vmkernel and vmkwarning for 24/0/0x0/0x00/0x00 scsi reservation warnings.

I can't see any of those, but I do find this in vmkernel:

WARNING: SCSI: 5663: vmhba2:0:4:1 status = 0/2 0x0 0x0 0x0

According to: http://www.vmprofessional.com/index.php?content=resources

it means:

Host Status: Host BUS busy: BUS stayed busy through timeout period
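
To get a feel for how often it happens I'm counting the occurrences and watching the log while a copy runs (a quick sketch, nothing fancy):

# how many of these warnings have been logged so far
grep -c "status = 0/2" /var/log/vmkernel

# watch for new SCSI warnings live during a file copy
tail -f /var/log/vmkernel | grep --line-buffered "SCSI"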

----


Just a simple hint - Are these VMs in production? Have you checked the fragmentation?

These are new VMs that are defragged.

Thanks for the help so far.

BUGCHK
Commander

A director is a big, highly redundant Fibre Channel switch.

A control module in a Cisco switch is called a supervisor.

bretti
Expert

Sorry, that was EMC and Cisco terminology. You got it when you mentioned the storage processor.

I'm not familiar with HP EVAs. Can you use fixed multipathing instead of MRU?

The EMC CLARiiON line prefers MRU, and it can cause problems if set to Fixed. Does the HP EVA have the same issue?

VirtualKenneth
Virtuoso

You can with HP EVAs, but which one to use depends on whether the storage controllers are active/active or active/passive.

bretti
Expert

It looks like Argyle is using active/passive. How easy is it to trespass or switch everything over to the passive controller on the EVA?

That may be a good troubleshooting step to eliminate the active controller.

VirtualKenneth
Virtuoso

You can set a preferred path on the EVA, but not a preferred controller, AFAIK.

VirtualKenneth
Virtuoso

I've been rereading this thread and noticed that you are using the Fixed policy while running on active/passive controllers; that is obviously not the way to go. You can't use Fixed policies with active/passive.

You can enable it and set the fixed path via the command line, but ESX doesn't actually use it. The option is greyed out in VirtualCenter, so it's sort of a bug that you can enable it via the CLI and not via VC.
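
A quick way to double-check what policy each LUN is actually running with from the service console (a sketch; it just filters the same listing posted above):

# show the current path policy per device
esxcfg-mpath -l | grep -i "policy"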

BUGCHK
Commander

You can set a preferred path on the EVA, but not a preferred controller, AFAIK.

The preference setting is per-Vdisk.

I noticed that you are using the Fixed Policy

Look again ;)

Disk vmhba0:0:0 /dev/cciss/c0d0 (69459MB) has 1 paths and policy of Fixed

That is a SmartArray PCI backplane or on-motherboard RAID controller.

VirtualKenneth
Virtuoso

The preference setting is per-Vdisk.

True

Look again ;)

Uhm... ahem... *oops*, I only looked at the DAS storage line...

Argyle
Enthusiast

Yeah, it's MRU. We're still not close to solving the issue.

File copy at the ESX level is slow as well. A simple cp of a 500 MB ISO (from and to the same LUN) takes 130 seconds when it would normally take about 10.
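
For anyone who wants to compare, this is roughly how I'm timing it at the ESX level (a sketch; the ISO name is a placeholder, the volume name is one of ours):

# time a copy from and to the same VMFS volume
time cp /vmfs/volumes/System-Disk-01/some.iso /vmfs/volumes/System-Disk-01/some-copy.iso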

I have a case open with VMware. We will try reloading the HBA driver and, if that doesn't help, reinstalling one of the ESX servers with a clean 3.0.1 in case any patch has reset some drivers or settings related to storage.

Argyle
Enthusiast

This is now solved, with help from VMware. A reinstall without any 3.0.1 ESX patches fixed the issue. It had nothing to do with path thrashing.

On January 4th we installed four patches:

ESX-2066306 13:13:49 01/04/07 Patch for VM crashes and possible freeze

ESX-6921838 13:16:12 01/04/07 hot removal of a virtual disk thru SDK

ESX-8173580 13:22:54 01/04/07 Fix COS Oops running Dell OM5 w/ QLogic

ESX-9986131 13:30:05 01/04/07 Updated openssh, python, and openssl

These issues also started to occur on January 4th.

This week we will try to identify which patch caused this, or whether it was a combination with something else on the server. It might not be one of the patches alone, but a patch seems the likely cause.

There is an easy way to identify the issue: run "vmkfstools -i" to copy a VMDK file. Note that vmkfstools has to be used to actually test the copy via the VMkernel; a standard "cp" command will not work.

Command:

----

[root@myesxhost01 /]# vmkfstools -i /vmfs/volumes/System-Disk-01/MYVM01/MYVM01.vmdk -d 2gbsparse /vmfs/volumes/System-Disk-01/MYVM01/test.vmdk

Destination disk format: sparse with 2GB maximum extent size
Cloning disk '/vmfs/volumes/System-Disk-01/MYVM01/MYVM01.vmdk'...

Clone: 10% done.

In another SSH session, run "tail -f -s2 /var/log/vmkernel".

There you will see lots of errors in the format:

Feb 2 11:52:23 myesxhost01 vmkernel: 29:03:06:57.614 cpu2:1034)SCSI: 8021: vmhba1:0:7:1 status = 8/0 0x0 0x0 0x0

Feb 2 11:52:23 myesxhost01 vmkernel: 29:03:06:57.614 cpu2:1034)SCSI: 8040: vmhba1:0:7:1 Retry (busy)

Feb 2 11:52:23 myesxhost01 vmkernel: 29:03:06:57.814 cpu3:1027)LinSCSI: 2604: Forcing host status from 2 to SCSI_HOST_OK

Feb 2 11:52:23 myesxhost01 vmkernel: 29:03:06:57.814 cpu3:1027)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY

Argyle
Enthusiast

Hmm, I tried to mark multiple replies in this thread as helpful but no longer have the option after marking two posts.

VirtualKenneth
Virtuoso

Correct, you can give out 2x helpful (6 points) and 1x correct (10 points).

richie_harcourt
Contributor

Hi Argyle,

I've got a customer who is experiencing the exact same SCSI busy errors and guest OS I/O pauses. I'm very curious whether you actually pinned this fault of yours down to a specific patch or some other cause in the end?

Thanks, Richard Harcourt

Argyle
Enthusiast

No, we could not reproduce the error after the reinstall.

The only difference I could see was that the old servers were 3.0.0 installs that had been upgraded to 3.0.1 and then patched.

On the reinstall we used a 3.0.1 ISO and then patched.

With the vmkfstools command I posted above you can check whether you have this specific issue, though.
