VMware Cloud Community
computerguy7
Enthusiast

HP VSA 11.0 (2014) - Bad storage performance

2 hosts: Dell PowerEdge R720xd running vSphere 5.5 Update 1

10 disks per host: 10K RPM, 1.2 TB, 2.5-inch.

One HP VSA per host, with 2 vCPUs and 8 GB of memory per VSA.

All the latest updates for the VSA, and almost all the latest VMware updates (they keep releasing more after my update cycle).

[screenshot attachment: 2014-05-02 16_42_43-pe720-host1 - PuTTY.png]

Disk latency for a VM (when producing a significant storage load) is very bad. My physical HP StoreVirtual hardware functions just fine under a similar load, so it seems to be something related to VSA + VMware. vmhba0 (the Dell PERC card) seems to perform fine.

KAVG seems high. This article says KAVG is the time the VMkernel (VMware) spends processing a command:

VMware KB: Using esxtop to identify storage performance issues for ESX / ESXi (multiple versions)  http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=100820...
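To put numbers on that, here is a rough sketch that scans an esxtop batch capture (made with something like esxtop -b -d 5 -n 60 > capture.csv) for high kernel-side latency. The counter name below is what I see in 5.x batch captures and may differ on other builds; the ~2 ms threshold is the usual rule of thumb from the KB.

import csv
import sys

# Scan an esxtop batch capture and report the peak kernel latency (KAVG) per device column.
# "Average Kernel MilliSec/Command" is the batch-mode counter name I see on 5.x builds.
def worst_kavg(path, threshold_ms=2.0):
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        cols = [i for i, name in enumerate(header)
                if "Average Kernel MilliSec/Command" in name]
        peak = {i: 0.0 for i in cols}
        for row in reader:
            for i in cols:
                try:
                    peak[i] = max(peak[i], float(row[i]))
                except (ValueError, IndexError):
                    pass
    for i in cols:
        if peak[i] >= threshold_ms:  # sustained KAVG above ~2 ms suggests queuing in the vmkernel
            print(f"{header[i]}: peak KAVG {peak[i]:.1f} ms")

if __name__ == "__main__":
    worst_kavg(sys.argv[1])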

Thoughts on what I may have configured wrong or how I should proceed with troubleshooting?

Also, if I use IOmeter and test with high IOPS but small writes, everything performs well. If I choose large writes (4 MB), everything performs terribly.
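For anyone who wants to reproduce the small-vs-large write difference without IOmeter, here is a minimal probe along the same lines. This is a sketch assuming a Linux guest (O_SYNC is not available on Windows); the path is a placeholder, so point it at a file on the VSA-backed datastore.

import os
import time

# Time synchronous writes of two block sizes against the same disk, mirroring
# the two IOmeter runs. O_SYNC makes each write wait for the storage stack.
def probe(path, block_size, count=64):
    buf = os.urandom(block_size)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o600)
    worst = 0.0
    start = time.perf_counter()
    try:
        for _ in range(count):
            t = time.perf_counter()
            os.write(fd, buf)
            worst = max(worst, time.perf_counter() - t)
    finally:
        os.close(fd)
    mb_s = block_size * count / (time.perf_counter() - start) / 2**20
    print(f"{block_size >> 10:>5} KiB writes: {mb_s:6.1f} MB/s, worst {worst * 1000:.0f} ms")

for size in (4 * 2**10, 4 * 2**20):  # 4 KB vs 4 MB
    probe("/mnt/vsa-datastore/probe.bin", size)  # placeholder path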

18 Replies
computerguy7
Enthusiast

My HP VSA's vmware.log file has this:

2014-05-07T21:57:25.574Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xf6
2014-05-07T21:58:35.563Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xf9
2014-05-07T21:58:35.563Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xd0
2014-05-07T21:58:35.563Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xda
2014-05-07T21:58:35.563Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xe4
2014-05-07T21:58:35.563Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xe6
2014-05-07T21:58:35.563Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xd4
2014-05-07T21:58:35.564Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xdb
2014-05-07T21:58:35.564Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xf6
2014-05-07T21:58:35.564Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xfb
2014-05-07T21:58:35.564Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xf4
2014-05-07T21:58:35.564Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xdf
2014-05-07T21:58:35.564Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xe1
2014-05-07T21:58:35.564Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xfe
2014-05-07T21:58:35.564Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xe7
2014-05-07T21:58:35.564Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xff
2014-05-07T21:58:35.564Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xe9
2014-05-07T21:58:35.565Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xef
2014-05-07T21:58:35.565Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xf5
2014-05-07T21:58:35.565Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xfa
2014-05-07T21:58:35.565Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xdc
2014-05-07T21:58:35.565Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xe2
2014-05-07T21:58:35.565Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xe5
2014-05-07T21:58:35.565Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xd5
2014-05-07T21:58:35.565Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xf8
2014-05-07T21:58:35.565Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xeb
2014-05-07T21:58:35.565Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0x100
2014-05-07T21:58:35.566Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xd6
2014-05-07T21:58:35.566Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xd1
2014-05-07T21:58:35.566Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xee
2014-05-07T21:58:35.566Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xcf
2014-05-07T21:58:35.566Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xe0
2014-05-07T21:58:35.566Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xd3
2014-05-07T21:58:35.566Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xea
2014-05-07T21:58:35.566Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xf0
2014-05-07T21:58:35.566Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xfc
2014-05-07T21:58:35.566Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xe3
2014-05-07T21:58:35.567Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xfd
2014-05-07T21:58:35.567Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xec
2014-05-07T21:58:35.567Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xf1
2014-05-07T21:58:35.567Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xde
2014-05-07T21:58:35.567Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xe8
2014-05-07T21:58:35.567Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xd9
2014-05-07T21:58:35.567Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xd2
2014-05-07T21:58:35.567Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xd7
2014-05-07T21:58:35.567Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xd8
2014-05-07T21:58:35.567Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xf3
2014-05-07T21:58:35.567Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xf7
2014-05-07T21:58:35.568Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xf2

Definitely related to my issue.

Related HP VSA LeftHand logs:

store.info:May  2 18:25:21 (VM-NAME) dbd_store[2982]: store_0::DBD_STORE:device='/dev/disk/by-id/scsi-36000c299a9e43ad9c456c1e50781790b-part2' recent_max_lat=0.000s  ops_out=82 oldest_op_out=10.653s write (excessive)
Event Log (369) [01000000]: 2014 Apr 28 20:49:50 (VM-NAME): E00060100 SAN/iQ System Storage system '(VM-NAME)' latency = 66.376, exceeds 60.000.
Event Log (370) [01000000]: 2014 Apr 28 20:51:48 (VM-NAME): E40060100 SAN/iQ System Storage device name '/dev/disk/by-id/scsi-36000c299a9e43ad9c456c1e50781790b-part2': I/O ERROR - error count of 1.
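If someone wants to line the aborts up with the SAN/iQ latency events, here is a small sketch that tallies the "aborting cmd" entries in vmware.log per second (the regex matches the log format shown above):

import re
import sys
from collections import Counter

# Matches e.g. "2014-05-07T21:58:35.563Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xf6"
ABORT = re.compile(r"^(\S+)\|.*PVSCSI: scsi\d+:\d+: aborting cmd")

def tally(path):
    per_second = Counter()
    with open(path, errors="replace") as f:
        for line in f:
            m = ABORT.match(line)
            if m:
                per_second[m.group(1)[:19]] += 1  # bucket by whole second
    for ts, n in sorted(per_second.items()):
        print(ts, n)

if __name__ == "__main__":
    tally(sys.argv[1])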

Hopefully adding all these logs will help someone someday.

admin
Immortal

Try updating the NIC driver; it might fix your issue.

vAnswers
Contributor

I am still using an older version and planning to upgrade. I will check for the issue then.

computerguy7
Enthusiast

I've learned that we can best reproduce the errors and storage delays by running a robocopy task that generates a lot of storage throughput while the host is also running many other VMs. The problem only happens under high load.

The issue is still unacceptable.

VMware support just suggested changing the PVSCSI adapter to LSI, but I have not tried it yet.

computerguy7
Enthusiast

I just finished some more testing. The LSI SAS adapter is not compatible with the VSA (it won't boot); the LSI Parallel adapter does work, but it does not resolve the storage issues.

[screenshot attachment: 2014-05-21 13_16_05-saturn.co.douglas.or.us - PuTTY.png]

Robocopy can still bring the VSA cluster to its knees and drive latency sky-high for all volumes on the cluster.

Braack
Contributor

Any news on this one?

I'm on 10.5 yet and I'm seeing similar behaviour.

2 hosts: HP DL380 running vSphere 5.1

16 disks per host: 10K RPM, 450 GB, 2.5-inch.

One HP VSA per host, with 2 vCPUs and 5 GB of memory per VSA.

Latency is usually very high for VMs using the VSA storage. Latency on the physical HBAs is very moderate.

[screenshot attachment: Unbenannt.JPG]

computerguy7
Enthusiast

I still have open support tickets with Dell, VMware, and HP. At this point, my best guess is that the problem is a VMware bug.

HP has already indicated that they believe the issue is outside of the VSA, and they will probably say so again after their second review (they are looking at the issue a second time now). Dell cannot see anything wrong with the hardware, and I have applied all the newest firmware updates I could find.

computerguy7
Enthusiast

Do you get PVSCSI errors?

Braack
Contributor

No, there are no PVSCSI errors in my VSA's vmware.log.

This latency issue has been there from the very first start. I tried it with 9.5, an upgrade to 10.0, and a fresh installation of 10.5.

HP assigned the case to some 3rd-level engineers; they collected loads of logs, but nothing came out of it.

We then agreed that the 1 Gb uplink might be the bottleneck. But last week we upgraded everything to 10 Gb, and the performance didn't improve at all.

I noticed one thing: if we use just one node (plus the FOM), then performance is very good. As soon as the second node is up again, latency rises to more than 500 ms.

computerguy7
Enthusiast

Interesting. It sounds like the VSA cluster's Network RAID traffic going through the VMware network stack is introducing the latency.

computerguy7
Enthusiast

Braack,

When you are having latency issues, what does your device queue look like?

[screenshot attachment: 2014-06-27 09_49_00-disk queue sample.txt - Notepad.png]

I am curious how large your DQLEN is and how high your %USD gets for the iSCSI LUN under high load. It would also be interesting to see your local RAID controller's queue stats under that same load. What model of RAID card do you have?
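For context, my understanding is that %USD is just the share of the device queue occupied by active commands, which is why DQLEN matters so much here (the ACTV numbers below are invented):

# %USD in esxtop is (as I understand it) active commands over device queue depth.
def pct_used(actv, dqlen):
    return 100.0 * actv / dqlen

print(pct_used(30, 32))   # 93.75   -> a 32-deep queue saturates under modest load
print(pct_used(30, 128))  # 23.4375 -> the same load barely touches a 128-deep queue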

Braack
Contributor

I did some tests but was not able to generate the latencies, so it's probably a completely different issue.

However, I was able to verify that the throughput increases notably if I prevent Network RAID sync by shutting down one node: it goes from 35 MB/s to 54 MB/s, which is still poor.

Test: robocopy a 20 GB folder with a mix of data.

Physical adapter (HP Smart Array P420i):

naa...074
naa...69a
naa...106
naa...b52

Datastore LUNs:

naa...0070
naa...0062

[screenshot attachment: robo.png]

computerguy7
Enthusiast

Thank you for sharing this info. How come your first host has a queue depth of 32 for "naa...0070" while the second host has a queue depth of 128 for the same device? Are these screenshots with Network RAID on or off?
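In case it is easier than screenshots, here is a hypothetical way to compare both hosts' view of the same LUN from a workstation, assuming SSH is enabled on the hosts. The host names and the device ID are placeholders; substitute the full naa. identifier.

import subprocess

# Ask each host for its view of the same LUN over SSH and print the
# queue-depth lines from "esxcli storage core device list".
def queue_depth_lines(host, device_id):
    out = subprocess.run(
        ["ssh", host, "esxcli storage core device list -d " + device_id],
        capture_output=True, text=True, check=True,
    ).stdout
    return [l.strip() for l in out.splitlines() if "Queue Depth" in l]

for host in ("esx-host1", "esx-host2"):  # placeholder host names
    print(host, queue_depth_lines(host, "naa.xxxx"))  # placeholder device ID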


I have a phone call scheduled with VMware today.

Braack
Contributor

Queue depth of 32 - good question; I will investigate.

Network RAID was on when the screenshot was made.

computerguy7
Enthusiast

I just added another RAID-1 to my physical RAID adapter so that I can test without the VSA in the picture. However, the new datastore also shows a queue depth of 32.

[screenshot attachment: 2014-07-02 08_44_06-jupiter.co.douglas.or.us - PuTTY.png]

computerguy7
Enthusiast

Without the VSA (just VMware to a Dell PERC local datastore), I get throughput between 70 and 232 MB/s. The VSA normally ran around 50 to 100 MB/s.

hoangn
Contributor

Did you ever figure out what was wrong?

MCVMH
Contributor

Very old post, but we have similar issues. Latency from the hosts to the VSA cluster was fixed by disabling LRO on the VSA virtual machines. HP needs to do this for you, and they also need to set it so it stays disabled after reboots.

Our average latency dropped from 30 ms to 2 ms.

We have seen this across a number of customers; it seems like LRO should really be disabled by default.
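For reference, the guest-side change is the standard ethtool toggle. A sketch of what that amounts to (the VSA appliance is locked down, so HP has to apply and persist the change for you; "eth0" is a placeholder interface name):

import subprocess

# Check and disable large receive offload inside a Linux guest via ethtool.
def lro_state(iface="eth0"):
    out = subprocess.run(["ethtool", "-k", iface],
                         capture_output=True, text=True, check=True).stdout
    return next(l.strip() for l in out.splitlines()
                if "large-receive-offload" in l)

def disable_lro(iface="eth0"):
    subprocess.run(["ethtool", "-K", iface, "lro", "off"], check=True)

print(lro_state())  # e.g. "large-receive-offload: on"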

Our other issue is with SCSI aborts; these normally coincide with an alert from the VSAs reporting E00060100 EID_LATENCY_STATUS_EXCESSIVE.

We ran consistency checks on the arrays and found no problems. The issue seems to happen at particular times of day, currently 3:15 AM on a Sunday morning, which is nice.

We did have problems before with storage snapshots causing this, so we dropped the snapshot frequency from every hour to every six hours. This one doesn't seem to be related to storage snapshots though; it is probably workload from a virtual machine. A lot of the machines are out of our control, so it is difficult to track.

HP just blames the underlying hardware; Dell can't find any issues though.

We are currently using vDisks; is it worth migrating to RDMs?
