2 hosts: Dell PowerEdge R720xd running vSphere 5.5 Update 1
10 disks per host: 10K RPM, 1.2 TB, 2.5-inch
One HP VSA per host, 2 vCPUs and 8 GB memory per VSA.
All the latest updates for the VSA. Almost all the latest VMware updates (they keep releasing more after my update cycle).
Disk latency for a VM (when generating a significant storage load) is very bad. My physical HP StoreVirtual hardware functions just fine under a similar load, so this seems to be something related to VSA + VMware. vmhba0 (the Dell PERC card) seems to perform fine.
KAVG seems high, and this article says KAVG is the time the VMkernel (VMware) spends processing a command:
VMware KB: Using esxtop to identify storage performance issues for ESX / ESXi (multiple versions) http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=100820...
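For reference (in case it helps anyone else reading this), KAVG/DAVG/GAVG can be watched live in esxtop's disk views, or captured in batch mode for later review. The interval and iteration counts below are just examples:
esxtop    (press d for adapters, u for devices, v for VM disks)
esxtop -b -d 2 -n 150 > /tmp/esxtop-capture.csv    (2-second samples, 150 iterations; open the CSV in perfmon or Excel)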
Thoughts on what I may have configured wrong or how I should proceed with troubleshooting?
Also, if I use IOmeter and test with high IOPS but small writes, everything performs well. If I choose large writes (4 MB), everything performs like crap.
My HP VSA's vmware.log file has this:
2014-05-07T21:57:25.574Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xf6
2014-05-07T21:58:35.563Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xf9
2014-05-07T21:58:35.563Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xd0
2014-05-07T21:58:35.563Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xda
2014-05-07T21:58:35.563Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xe4
2014-05-07T21:58:35.563Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xe6
2014-05-07T21:58:35.563Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xd4
2014-05-07T21:58:35.564Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xdb
2014-05-07T21:58:35.564Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xf6
2014-05-07T21:58:35.564Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xfb
2014-05-07T21:58:35.564Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xf4
2014-05-07T21:58:35.564Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xdf
2014-05-07T21:58:35.564Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xe1
2014-05-07T21:58:35.564Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xfe
2014-05-07T21:58:35.564Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xe7
2014-05-07T21:58:35.564Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xff
2014-05-07T21:58:35.564Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xe9
2014-05-07T21:58:35.565Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xef
2014-05-07T21:58:35.565Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xf5
2014-05-07T21:58:35.565Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xfa
2014-05-07T21:58:35.565Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xdc
2014-05-07T21:58:35.565Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xe2
2014-05-07T21:58:35.565Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xe5
2014-05-07T21:58:35.565Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xd5
2014-05-07T21:58:35.565Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xf8
2014-05-07T21:58:35.565Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xeb
2014-05-07T21:58:35.565Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0x100
2014-05-07T21:58:35.566Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xd6
2014-05-07T21:58:35.566Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xd1
2014-05-07T21:58:35.566Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xee
2014-05-07T21:58:35.566Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xcf
2014-05-07T21:58:35.566Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xe0
2014-05-07T21:58:35.566Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xd3
2014-05-07T21:58:35.566Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xea
2014-05-07T21:58:35.566Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xf0
2014-05-07T21:58:35.566Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xfc
2014-05-07T21:58:35.566Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xe3
2014-05-07T21:58:35.567Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xfd
2014-05-07T21:58:35.567Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xec
2014-05-07T21:58:35.567Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xf1
2014-05-07T21:58:35.567Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xde
2014-05-07T21:58:35.567Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xe8
2014-05-07T21:58:35.567Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xd9
2014-05-07T21:58:35.567Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xd2
2014-05-07T21:58:35.567Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xd7
2014-05-07T21:58:35.567Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xd8
2014-05-07T21:58:35.567Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xf3
2014-05-07T21:58:35.567Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xf7
2014-05-07T21:58:35.568Z| vcpu-1| I120: PVSCSI: scsi1:0: aborting cmd 0xf2
This is definitely related to my issue.
Related HP VSA (LeftHand) logs:
store.info:May 2 18:25:21 (VM-NAME) dbd_store[2982]: store_0::DBD_STORE:device='/dev/disk/by-id/scsi-36000c299a9e43ad9c456c1e50781790b-part2' recent_max_lat=0.000s ops_out=82 oldest_op_out=10.653s write (excessive)
Event Log (369) [01000000]: 2014 Apr 28 20:49:50 (VM-NAME): E00060100 SAN/iQ System Storage system '(VM-NAME)' latency = 66.376, exceeds 60.000.
Event Log (370) [01000000]: 2014 Apr 28 20:51:48 (VM-NAME): E40060100 SAN/iQ System Storage device name '/dev/disk/by-id/scsi-36000c299a9e43ad9c456c1e50781790b-part2': I/O ERROR - error count of 1.
Hopefully adding all these logs will help someone someday.
Try updating the NIC driver; it might fix your issue.
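If it helps, you can check which driver and firmware each uplink is currently running before and after the update (vmnic0 below is just a placeholder; pick the uplinks carrying the VSA/iSCSI traffic):
esxcli network nic list
esxcli network nic get -n vmnic0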
I am still using an older version and planning to upgrade. I will check the issue then.
I've learned that we can best reproduce the errors and storage delays by running a robocopy task that generates a lot of storage throughput while the host is also running many other VMs. The problem only happens under high load.
The issue is still unacceptable.
VMware support just suggested changing the PVSCSI adapter to LSI, but I have not tried it yet.
I just finished some more testing. The LSI Logic SAS adapter is not compatible with the VSA (it won't boot), and the LSI Logic Parallel adapter does work, but it does not resolve the storage issues.
Robocopy can still bring the VSA cluster to its knees and create crazy-high latency for all volumes on the cluster.
Any news on this one?
I'm still on 10.5 and I'm seeing similar behaviour.
2 hosts: HP DL380 running vSphere 5.1
16 disks per host: 10K RPM, 450 GB, 2.5-inch
One HP VSA per host, 2 vCPUs and 5 GB memory per VSA.
Latency is usually very high for VMs using the VSA storage. Latency on the physical HBAs is very moderate.
I still have open support tickets with Dell, VMware, and HP. At this point in time, my best guess is that the problem is a VMware bug.
HP has already indicated that they believe the issue is outside of the VSA, and they will probably state this again after their second review (they are reviewing the issue a second time now). Dell cannot see anything wrong with the hardware, and I have applied all the newest firmware updates I could find.
Do you get PVSCSI errors?
No, there are no PVSCSI errors in my VSA's vmware.log.
This latency issue has been there from the very start. I tried it with 9.5, upgraded to 10.0, and did a fresh installation of 10.5.
HP assigned the case to some third-level engineers; they collected loads of logs, but nothing came out of it.
We agreed then that the 1 Gb uplink might be the bottleneck, but last week we upgraded everything to 10 Gb and performance didn't improve at all.
I noticed one thing: if we use just one node (and the FOM), performance is very good. As soon as the second node is up again, latency rises to more than 500 ms.
Interesting. It sounds like the VSA cluster's Network RAID traffic going through the VMware network stack is introducing the latency.
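One quick sanity check might be to verify the path between the hosts' storage vmkernel ports, including MTU if jumbo frames are in use (the IP below is just a placeholder for the partner host's storage vmkernel address):
vmkping -d -s 8972 192.168.10.12    (9000-byte MTU minus IP/ICMP headers, with don't-fragment set)
vmkping -d -s 1472 192.168.10.12    (same test for a standard 1500 MTU)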
braack,
When you are having latency issues, what does your device queue look like?
I am curious how large your DQLEN is and how high your %USD gets for the iSCSI LUN under high load. It would also be interesting to see your local RAID controller's queue stats under that same load. What model RAID card do you have?
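(For anyone following along: DQLEN, ACTV, QUED and %USD show up in esxtop's device view once the queue statistics fields are enabled, and the configured queue depth can also be read per device. The naa ID below is just a placeholder:)
esxtop    (press u for the device view, then f to enable the queue statistics fields)
esxcli storage core device list -d naa.60000000000000000000000000000000 | grep -i queue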
I did some tests but was not able to reproduce the latencies.
So it's probably a completely different issue.
However, I was able to verify that throughput increases noticeably if I prevent Network RAID synchronization by shutting down one node.
It goes from 35 MB/sec to 54 MB/sec.
Which is still poor.
Test: robocopy of a 20 GB folder with a mix of data.
Physical adapter: HP Smart Array P420i:
naa...074
naa...69a
naa....106
naa...b52
Datastore LUNs:
naa...0070
naa...0062
Thank you for sharing this info. How come your first host has a queue depth of 32 for "naa...0070" while the second host has a queue depth of 128 for the same device? Are these screenshots with Network RAID on or off?
I have a phone call scheduled with VMware today.
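(If the 32 turns out to be the software iSCSI initiator's per-LUN queue depth, it can be checked and adjusted per host; the 128 below is only an example and should follow HP's recommendation for the VSA, and a change requires a host reboot:)
esxcli system module parameters list -m iscsi_vmk | grep -i LunQDepth
esxcli system module parameters set -m iscsi_vmk -p iscsivmk_LunQDepth=128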
Queue depth of 32: good question, I will investigate. :smileyconfused:
Network RAID was on when the screenshots were made.
I just added another RAID-1 to my physical RAID adapter so that I can test without the VSA in the picture. However, the new datastore also shows a queue depth of 32.
Without the VSA, just VMware to a Dell PERC local datastore, I get throughput between 70 and 232 MB/s. The VSA was normally around 50 to 100 MB/s.
Did you ever figure out what was wrong?
Very old post, but we have similar issues. Latency from the hosts to the VSA cluster was fixed by disabling LRO on the VSA virtual machines. HP needs to do this for you, and they also need to set it so it stays disabled after reboots.
Our average latency dropped from 30 ms to 2 ms.
We have seen this across a number of customers; it seems like LRO should really be disabled by default.
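For anyone who wants to check their own setup: inside the VSA appliance, HP would typically turn LRO off per vNIC with something like "ethtool -K eth0 lro off" (eth0 is a placeholder, and HP has to make it persistent across reboots). On the ESXi host side, VMXNET3 LRO can also be disabled globally with the advanced settings below; confirm the option names and impact for your version first, since they affect all VMs on the host:
esxcli system settings advanced set -o /Net/Vmxnet3HwLRO -i 0
esxcli system settings advanced set -o /Net/Vmxnet3SwLRO -i 0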
Our other issue is with SCSI aborts; these normally coincide with an alert from the VSAs reporting E00060100 EID_LATENCY_STATUS_EXCESSIVE.
We ran consistency checks on the arrays and found no problems. The issue seems to happen at particular times of the day. Currently 3:15 am on a Sunday morning, which is nice.
We did have problems before with storage snapshots causing this, so we dropped the frequency from every hour to every 6 hours. This one doesn't seem to be related to storage snapshots though; it's probably workload from a virtual machine. A lot of the machines are out of our control, so it's difficult to track.
HP just blames the underlying hardware; Dell can't find any issues though.
We are currently using vDisks; is it worth migrating to RDMs?