I'm having trouble finding the root cause of the extreme I/O latency several of our VMs are seeing. Example:
Our virtual Exchange 2003 server shows Avg. Disk sec/Read and Avg. Disk sec/Write latencies of around 60-80 ms. Normal peaks are 200 ms, with the occasional peak over 1000 ms.
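For anyone who wants to see the raw numbers, the counters are easy to log from inside the guest with the built-in Windows tools; roughly like this (interval and sample count are just examples):
typeperf "\PhysicalDisk(_Total)\Avg. Disk sec/Read" "\PhysicalDisk(_Total)\Avg. Disk sec/Write" -si 15 -sc 240 -o c:\disklat.csv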
A backup attempt on a virtual file server (Windows Server 2003) was aborted because total throughput never rose above 1.2 MB/s.
The vmkernel logs show a lot of abort and async I/O error messages:
Jan 16 19:01:27 esx1 vmkernel: 4:04:58:31.887 cpu0:1028)SCSI: 3731: AsyncIO timeout (5000); aborting cmd w/ sn 133580, handle 3044/0x7a02a40
Jan 16 19:01:27 esx1 vmkernel: 4:04:58:31.887 cpu0:1028)LinSCSI: 3616: Aborting cmds with world 1024, originHandle 0x7a02a40, originSN 133580 from vmhba0:2:3
Jan 16 19:01:27 esx1 vmkernel: 4:04:58:31.887 cpu0:1028)LinSCSI: 3632: Abort failed for cmd with serial=133580, status=bad0001, retval=bad0001
These messages indicate that ESX waits 5000 ms for a response from the storage. When it gets none, it sends an abort to the storage, but that fails as well. ESX then sends a reset, which finally succeeds.
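To get a feel for how often this happens, a quick count on the service console works; something like this (log location and rotation may differ on your build):
grep -c "AsyncIO timeout" /var/log/vmkernel
grep "Abort failed" /var/log/vmkernel | tail -20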
Our virtual environment consists of two HP DL servers with QLA4052 iSCSI HBAs connected to the SAN through an HP Ethernet switch. The SAN consists of 11 7200 rpm FC-ATA disks. Most VMFS volumes are RAID 5 across 9 disks, or RAID 10.
We also have a test environment (put together to resolve this issue) of one HP DL server with both an iSCSI and an FC HBA. Both test and production run ESX 3.0.2. All hardware is on the VMware HCL.
I know the storage is not enterprise class, but we have a standard diskless Windows 2003 server that boots from the SAN and has all its disks there, and it performs on par with or better than the servers that have internal 10K rpm RAID 10 SAS disks. A review by the vendor also shows that the SAN is lightly loaded, with peaks at about 80% of the IOPS it can handle.
When monitoring file transfers between the VMs and physical servers, we see behaviour we don't understand and don't see on physical-to-physical file transfers:
# Smaller files (less than 400 MB) use about 12% of the available 1 Gb bandwidth on the NIC throughout the transfer.
# Larger files (1 GB+) start out at the same level, but within a short time (30-120 sec) the transfer drops to 1%. Transfer speed gradually picks up but never gets close to 12%.
To me this looks like an issue with the combination of ESX 3.0.2 and the SAN, since 3.0.2 VMs perform very well when connected to something other than the SAN, and the SAN performs well with servers not running ESX 3.0.2. Am I the only one who has seen this behaviour?
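If anyone wants to check where the latency is introduced, esxtop's disk view should show whether it is coming from the array or from the ESX side (roughly: DAVG/cmd is device/SAN latency, KAVG/cmd is time spent in the VMkernel); something like:
esxtop   (press d for the disk view and watch DAVG/cmd and KAVG/cmd for vmhba0 during a transfer)
esxtop -b -d 15 -n 240 > esxtop-disk.csv   (batch mode, if you want to log it for later)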
What HP switch do you have there?
And do you have both ports on your HBAs connected to the one switch?
How many NIC ports have you connected on the iSCSI SAN?
These are almost always network issues. Does your switch have flow control enabled? Jumbo frame support enabled? Any unicast storm protection enabled?
I have also seen issues where the ESX host and the iSCSI SAN get their paths mixed up, causing constant hopping from path to path. Try disconnecting one of your HBA paths and test the performance again.
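A rough way to see whether the active path is flapping is to watch the multipath listing over time on the service console, e.g. (adjust the target name to match yours; the output format differs between builds):
while true; do date; esxcfg-mpath -l | grep -A4 "vmhba0:2:3"; sleep 60; done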
The switch is the 2900-24. We only use one port from each HBA (yes, it's not redundant). After we changed the old switch to this new one and installed the latest firmware, we have zero errors on the switch. We have flow control enabled and jumbo frames disabled. Removing iSCSI and adding FC did not help on the test VM.
Dude, that switch is a 10/100!
You MUST have a high quality switch (HP 2848, etc) - these things list for over $3,000/ea, but it's because they can handle the load. Your run-of-the-mill switches can't.
----
Carter Manucy
Can you post here:
esxcfg-mpath -l
fdisk -l
esxcfg-vmhbadevs -q
Are you using 2 x 1Gb switches?
What storage is there (HP wasn't certified with QLogic iSCSI HBAs)?
How fast is VM cloning (e.g. on one LUN)?
>Dude, that switch is a 10/100!
Don't think so - http://www.hp.com/rnd/products/switches/ProCurve_Switch_2900_Series/overview.htm
>Removing iSCSI and adding FC did not help on the test-vm
Can you explain it more?
>Dude, that switch is a 10/100!
>Don't think so - http://www.hp.com/rnd/products/switches/ProCurve_Switch_2900_Series/overview.htm
I'm thinking "2900-24" as in Cisco 2900, 24-port switch, not HP.
----
Carter Manucy
The ProCurve 2900-24G is a gigabit switch. I know there are people advocating Cisco only, but I don't think the problem is the iSCSI network. As previously mentioned, the physical 2003 server is performing OK with all its disks on the SAN.
I'll have to get back to you with the results of the commands. I'm trying to take the weekend off (the first one in 2008) :). But could you explain what you mean when you say "HP wasn't certified with QLogic HBAs"?
The DL385 and the QLogic QLA4052c are surely on the HCL!
>But could you explain what you mean when you say "HP wasn't certified with QLogic HBAs"?
>The DL385 and the QLogic QLA4052c are surely on the HCL!
You are right, both are on the HCL - but the problem is HP support. The QLogic iSCSI HBA is not certified by HP, so you can't get any support when you have an issue with it (we discussed that here in the past - http://communities.vmware.com/message/679019#679019). That doesn't mean it won't work - only that support is problematic.
Hi there - can you view the "dropped frames" counters on the switch? Could be a buffering problem - see this doc for more info http://www.vmware.com/pdf/iscsi_storage_esx.pdf
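On a ProCurve the per-port counters are visible from the CLI; something along these lines (exact commands and counter names vary a bit by model and firmware):
show interfaces brief
show interfaces 1   (check the Drops and Errors counters on the ports the iSCSI HBAs and the SAN are plugged into)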
We had a lot of packet drops on the previous switch. That might have to do with the fact that its firmware had not been updated since we bought it, and we had flow control off and jumbo frames on. We have since replaced it with a similar ProCurve 2900-24G, now with the latest firmware, flow control on and jumbo frames off. We have had no dropped packets (at least not in the two weeks it's been running), but performance is only marginally better.
However, we have tested with Fibre Channel, bypassing both the ProCurve switch and the iSCSI HBA, and there was no noticeable increase in performance.
I have to agree with what you say; it seems like a buffering problem. But I believe it has to be buffers either in software or hardware on the ESX server, or maybe on the SAN controller.
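If the bottleneck were queuing inside ESX rather than on the array, I'd expect it to show up as high KAVG/QUED values in esxtop's disk view; the VMkernel's per-LUN outstanding-request limit can at least be read with the command below (just a thought on my part, I don't know yet whether that limit matters here):
esxcfg-advcfg -g /Disk/SchedNumReqOutstanding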
I'm using DL385 G2s with 4052Cs connected to ProCurve 5406s with no issues. You definitely want to keep flow control on and jumbos off when using ProCurves. I saw performance issues when both were enabled, but great improvements with just flow control.
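For reference, that setup is only a couple of lines on a ProCurve (the port range and VLAN ID below are just examples; check the exact syntax for your firmware):
configure
interface 1-24 flow-control
no vlan 1 jumbo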
If you found this or any other post helpful, please consider using the Helpful/Correct buttons to award points.
Previous posts have all been good and valid, but not really helpful with respect to my issue. Do you run Exchange as a VM on your 385 G2s? What kind of performance do you get?