VMware Cloud Community
RichieHall
Contributor

High Disk Latency

On all hosts in our vSphere cluster we are seeing "Physical Device Write Latency" peaks on several VMFS datastores of between 200ms and 3000ms (yes, 3 seconds). As you would expect, our users are seeing poor VM performance, guest disk errors, non-responsive VMs, and database disconnections. But the NetApp storage statistics show performance is OK.

In some cases, switching the path for a VMFS datastore to the other HBA makes the latency drop to 0 instantly. In other cases this has no effect, but if we vMotion the affected guest to another host, the latency and performance are fine, at least for a short time.

We first saw this issue when we upgraded the cluster to ESX 4.1 U1, but we have since downgraded back to ESX 4.0 U2 and still see the same problems. Around the same time we also upgraded the NetApp from ONTAP 7.3.3 to 8.0.1P3 7-Mode.

Has anybody seen similar issues before?

Environment:

3 Clusters in the Datacenter. Cluster A has 4 x ESX 3.5 U5 hosts. Cluster B has 5 x ESX 3.5 U5 hosts. Cluster C has 10 x ESX 4.0 U2 hosts.

The SAN is a pair of NetApp FAS3140 filers in Active/Active mode running ONTAP 8.0.1P3 7-Mode. Some aggregates are FC disks, some are SATA disks.

All hosts in all 3 clusters are connected to all VMFS datastores. However, we intend to separate out the 3.5 and 4.0 clusters to have their own dedicated datastores so that we can enable ALUA and Round Robin for the vSphere hosts.

ALUA is currently disabled on all NetApp iGroups.

HA and DRS are enabled and DRS is Fully Automated.

We have run the NetApp Virtual Storage Console 2.1 to set the recommended settings (timeouts and MPIO) on the ESX hosts. All paths are Fixed and are set to the correct filer.

The fibre switches were recently upgraded to the latest firmware in an attempt to fix this issue, with no effect.

6 Replies
FredPeterson
Expert

What does your aggregate throughput look like in terms of both IOs and used bandwidth?  Was this problem gradual or sudden?

When IO gets slow, it's a result of the disks not being able to keep up with the number of IOs being asked of them. Are you running the head as one giant RAID-DP aggregate, or mostly? Is the cache enabled?

What does read latency look like?  If you go into ESXTOP and look at the disk stats, do you see any errors?  Is there anything logged in the vmkernel logs that gives an indication of a problem?

Are the heads shared with anything else that might be chewing up bandwidth or IO?

RichieHall
Contributor

Hi FredPeterson. Thanks for the detailed response.

Aggregate throughput: we have 6 SATA aggregates on Filer A and 5 FC aggregates on Filer B. Each is made up of 13-14 disks (except the most recent SATA aggregate, which has 23 disks). The SATA aggregates usually run between 250-750 IOPS, though they occasionally peak over 1000 IOPS for short periods. The FC aggregates usually run between 500-1000 IOPS, though they can sometimes peak around 1500.

If my calculations are correct:

SATA aggs with RAID DP should be capable of (13 - 2) * 70 IOPS = 770.

FC aggs should be capable of (13 - 2) * 125 = 1375 (for aggs with 10K disks), or (13 - 2) * 175 = 1925 (for 15K disks).

OK, so it seems that we could well be hitting the max IOPS for our SATA aggs and possibly for the 10K FC aggs too. Am I right that we can expect performance degradation if we go beyond 90% of the theoretical max IOPS?
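For what it's worth, here's a rough Python sketch of the arithmetic above, using the usual rule-of-thumb per-disk IOPS figures (70 for SATA, 125 for 10K FC, 175 for 15K FC) and two parity disks per RAID-DP group. These are estimates, not measured numbers:

# Back-of-the-envelope IOPS ceiling for a RAID-DP aggregate.
# Per-disk IOPS figures are rule-of-thumb estimates, not measured values.
PER_DISK_IOPS = {"SATA": 70, "FC_10K": 125, "FC_15K": 175}
RAID_DP_PARITY_DISKS = 2  # RAID-DP dedicates two parity disks per RAID group

def aggregate_max_iops(disk_count, disk_type):
    """Theoretical random-IOPS ceiling for a single RAID-DP group."""
    data_disks = disk_count - RAID_DP_PARITY_DISKS
    return data_disks * PER_DISK_IOPS[disk_type]

print(aggregate_max_iops(13, "SATA"))    # 770
print(aggregate_max_iops(13, "FC_10K"))  # 1375
print(aggregate_max_iops(13, "FC_15K"))  # 1925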

The problem appeared relatively suddenly (within the last 4 weeks). One change we have made is deploying AV clients to VMs - this could explain a big jump in IOPS requirements.

Read latency has almost the same pattern as write latency, just the peaks are not quite as severe.

Which disk stats in particular should I look for in ESXTOP that will show me errors? Usually I use the vSphere client disk stats.

VMkernel logs do show frequent "fast path state in doubt" messages, though we think we've had these for a long time, and looking on forums we've not found anyone who has found a fix.
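To see whether those messages actually line up with the latency spikes, a quick tally per day is probably enough. Rough sketch only - the log path and the "Mon DD HH:MM:SS" timestamp layout are assumptions (on classic ESX 4.x the vmkernel log is usually /var/log/vmkernel), so adjust to wherever your logs live:

#!/usr/bin/env python
# Tally "fast path state in doubt" messages per day from a vmkernel log.
# Sketch only: the log path and timestamp layout are assumptions.
import re
import sys

log_path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/vmkernel"
pattern = re.compile(r"fast path state in doubt", re.IGNORECASE)

counts = {}
with open(log_path) as log:
    for line in log:
        if pattern.search(line):
            # Bucket by the leading "Mon DD" portion of the timestamp.
            day = " ".join(line.split()[:2]) or "unknown"
            counts[day] = counts.get(day, 0) + 1

for day, hits in sorted(counts.items()):
    print("%-8s %d" % (day, hits))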

The heads are also used for our main physical Exchange 2010 servers.

FredPeterson
Expert

Going above the calculated max IOPS on SATA will result in big latency regardless of IO size.  If you never really reach the "limit", I would not expect any degradation, and the degradation you are talking about is serious.  Even going slightly above it is only going to result in a slight bump in latency (depending on the size of the IO requests, of course), not something this severe.  That's what cache is for.  I'd also make sure the cache is being shared nicely and that your Exchange LUNs aren't hogging it all, even though Exchange 2010 is less demanding on disk.

I also forgot to ask, do you have SIOC turned on?  Can you even turn it on?  It rather sucks that you have to have Enterprise Plus for something so handy.

For ESXTOP - when you go to the disk screen (press d) and then press f, you'll see the available fields; one of them is error stats, alongside the general latency stats.  Generally you want to watch the GAVG field for latency - this is the latency as perceived by the virtual machine and is the sum of KAVG + DAVG.  KAVG is any latency caused by the vSphere vmkernel, and DAVG is the response latency of the SAN fabric.  I say fabric rather than disk because it's possible a bad fibre cable, GBIC, switch, or service processor is contributing to DAVG rather than the disk itself.  You should be able to see the disk latencies from the filer's perspective anyway, which will generally tell you whether it is servicing requests in a timely fashion in the typical NetApp microsecond counters (we have two heads here also).
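If you want to watch those counters over a longer window rather than the live screen, you can capture a batch run (something like "esxtop -b -d 5 -n 120 > esxtop.csv") and pull out the high-latency samples. Rough sketch below - the "MilliSec/Command" counter names are from memory and may differ between versions, so treat the column filter as an assumption:

#!/usr/bin/env python
# Flag high per-device latency samples in an esxtop batch-mode CSV.
# Assumption: latency counters (GAVG/KAVG/DAVG) show up as "... MilliSec/Command"
# columns in the batch header; adjust the filter if yours differ.
import csv
import sys

THRESHOLD_MS = 50.0  # arbitrary cut-off; pick whatever you consider "bad"

with open(sys.argv[1]) as fh:
    reader = csv.reader(fh)
    header = next(reader)
    latency_cols = [i for i, name in enumerate(header) if "MilliSec/Command" in name]
    for row in reader:
        if not row:
            continue
        sample_time = row[0]  # first column of each sample is the timestamp
        for i in latency_cols:
            try:
                value = float(row[i])
            except (ValueError, IndexError):
                continue
            if value > THRESHOLD_MS:
                print("%s  %7.1f ms  %s" % (sample_time, value, header[i]))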

RichieHall
Contributor

We've found a workaround for this issue: setting the FC switch ports to 2Gb fixed rather than 4Gb auto-negotiate.

We do still have a problem with SATA aggregates going over a desirable IOPS threshold, but that doesn't explain all the latency peaks we were seeing.

Now that the ESX hosts are on 2Gb, latency rarely peaks over 50ms and the average is below 10ms. Nice. Our users are reporting that VM performance has improved. The individual ports never actually get anywhere near 2Gb each, so there's no desperate need to increase them back to 4Gb just yet.
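As a rough sanity check on that: a 2Gb FC link gives roughly 200 MB/s of usable bandwidth per direction, and even our ~1500 IOPS peaks would only approach that if the average IO size were well over 100KB. Quick sketch (the link figures are the usual approximations and the IO sizes are made-up examples, not our measured averages):

# Does our peak IOPS come anywhere near a 2Gb FC link?
# Link throughput figures are the usual approximations; IO sizes are examples.
FC_LINK_MB_PER_S = {"1Gb": 100, "2Gb": 200, "4Gb": 400, "8Gb": 800}

def bandwidth_mb_per_s(iops, io_size_kb):
    """Approximate throughput generated by a given IOPS rate and IO size."""
    return iops * io_size_kb / 1024.0

for io_kb in (8, 32, 64, 128):
    used = bandwidth_mb_per_s(1500, io_kb)  # 1500 IOPS = our observed peak
    print("%3d KB IOs: %6.1f MB/s of %d MB/s on a 2Gb link"
          % (io_kb, used, FC_LINK_MB_PER_S["2Gb"]))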

However, we should be able to run at 4Gb. The FC switches are capable of up to 8Gb, and the NetApp filer heads are running happily at 4Gb since we upgraded the heads a couple of months ago.

So it looks like we still have a fault with either our HBAs or the fibre switches. We're going to engage HP on this to help us find the root cause. But I'm really happy that we finally found a workaround!

Regarding hitting the IOPS limits on SATA aggregates, we've added more storage with larger aggregates (23 disks each) and we'll be moving high I/O VMs away from the overloaded SATA stores and onto FC stores.

Thanks again for your help.

mhost
Enthusiast

Hello Richie,

Sorry to bump this old post.

We are seeing the same symptoms after upgrading to 8.0.1P3 7-Mode - and with roughly the same environment.

FAS3140 with a mix of FC and SATA disks

26 ESXi hosts running 4.1u1

We can actually recreate the problem just by performing a simple storage vMotion.

Did you eventually get an official solution for your problem?
Or are you still using the "workaround" with 2Gb setting on the SAN switch ports?

Best regards

Martin

RichieHall
Contributor

We finally found the root cause of this: the FC pass-through modules in our HP c-Class enclosure. HP sent us a replacement and it works like a dream. No more errors logged on the FC switch, and performance is great at 4Gb. We're just waiting for HP to send us another replacement so that both modules are replaced.

It's odd that BOTH modules had the same issue; I can only guess it's due to an old hardware revision.
