AndyR8939
Enthusiast

Serious latency problems - 17000ms!

Hey,

I'm currently working on a new project where our company brought a 3rd party in to set up a new data center for us.  It's fairly small at the moment but consists of:

HP c7000 Blade Chassis

2x BL490c G7 Blades

HP P6500 EVA FC SAN

Now, we've installed ESXi 5 on both blades, got vCenter running, and have a bunch of VMs created; all seemed to be working fine.  We have another setup running in DR, same equipment, with replication going between the two with Veeam, and all seemed to be working OK.

Over the last couple of days we have had intermittent problems with response times, VMs corrupting, etc., and during troubleshooting we noticed extremely high latency on our VMFS, the highest being 17000ms, with an average of 2100ms!

I noticed the SAN, a P6500, had been set up on the hosts with the MRU PSP, which is not what HP recommend, so I changed this to RR and also set Disk.DiskMaxIOSize to 128 KB as it was set at 32 MB.  Same problems.
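For reference, the two changes above can be made from the ESXi shell roughly like this (a sketch; the device ID is a placeholder, and Disk.DiskMaxIOSize takes its value in KB):

```shell
# Switch the path selection policy for the EVA LUN to Round Robin.
# "naa.xxxx" is a placeholder; list your device IDs first with:
#   esxcli storage nmp device list
esxcli storage nmp device set --device naa.xxxx --psp VMW_PSP_RR

# Cap the largest IO ESXi will issue to the array. The value is in KB,
# so 128 = 128 KB (the default is 32767 KB, i.e. roughly 32 MB).
esxcli system settings advanced set --option /Disk/DiskMaxIOSize --int-value 128
```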

I tried rebooting hosts, patching them, etc., with no fix.  I have just rebooted one host and moved one VM onto it, a base Linux OS, nothing more, and boom, straight away I'm getting write latency of 1900ms.


We have 2 dedicated Oracle blades in the same blade chassis which connect to the same SAN, but different LUNs, and these don't have any performance issues, so I'm thinking it must be VMware related.

Any ideas as I'm lost at the moment.

Andy

mcowger
Immortal

Are you sure you aren't just overloading the disks?

Check the esxtop counters, specifically DAVG/cmd.  If that's high, your latency is coming from the array.
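If it's easier than watching the screen, esxtop can also log those counters to a CSV in batch mode (a sketch; filename and sample counts are arbitrary):

```shell
# -b = batch mode, -d 2 = 2 second samples, -n 150 = 150 samples (~5 min).
# The resulting CSV can be loaded into Windows perfmon or a spreadsheet
# to graph DAVG/cmd per device over time.
esxtop -b -d 2 -n 150 > esxtop_latency.csv
```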

--Matt VCDX #52 blog.cowger.us
AndyR8939
Enthusiast

Don't see how it can be the array overloaded; we only have 4 VMs running at the moment and they are just application servers, not even in production yet.

Have esxtop running on my hosts at the moment on a 2 second refresh and DAVG/cmd doesn't drop below 0.11 but then spikes to 29387.23.  That's with a single VM running on a host, and it's the only VM on a 100GB VMFS.

See a grab here -

http://imageshack.us/a/img641/7012/latencyesx02.png

That's one of my hosts.  At the moment I have 1 VM running on it, just an OS doing nothing, and it's on the store that generated the 324.72 DAVG/cmd.  The other store, which generated the 14916.63, has no VMs on that host, but does have VMs on another host, yet it just spikes like that.

mcowger
Immortal

Don't forget that on most EVAs, the backend disks are generally shared among all workloads, so even if a given LUN has no workload of its own, the array and its disks are still working.

The high DAVG/cmd clearly points the finger at the array.  You need to talk to HP about what's causing the performance issue, as it's not VMware.

DAVG/cmd measures the time from when the IO request leaves the host's HBA until it comes back.  ESXi is not involved in any of that, so a high DAVG/cmd value is the result of SAN or array performance.

--Matt VCDX #52 blog.cowger.us


AndyR8939
Enthusiast

Ah OK, thanks for the pointers on DAVG/cmd.

We were thinking it was ESXi related because the 2 Oracle blades, which use the same SAN and blade chassis, are working fine; it's only the ESXi hosts having the latency issues.  Will pass it back to HP though.

mcowger
Immortal

Oracle may just be weathering the spikes by writing to SGA cache.  I suspect if you actually watch iostat on those hosts you'll see the same spikes.
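Assuming the Oracle blades are running Linux (the thread doesn't say), something like this would show whether they're seeing the same device latency:

```shell
# Extended per-device stats every 2 seconds; the "await" column is the
# average time (ms) an IO waits including service time, so spikes there
# would line up with the DAVG/cmd spikes seen from the ESXi side.
iostat -x 2
```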

--Matt VCDX #52 blog.cowger.us
AndyR8939
Enthusiast

Strange point too.

I have a VM which has a single VMDK on SCSI0:1 and then a physical RDM on SCSI0:2.

When I look at the stats for this VM sitting idle, SCSI0:1 (the VMDK) has a LAT/wr of 1497, but SCSI0:2 (the pRDM), doing a 10GB file copy on the same drive, has a LAT/wr of 322.  So the idle VMDK has higher latency than the active pRDM?

mcowger
Immortal

I don't think that's a fair comparison on an array that's already showing performance spikes.

--Matt VCDX #52 blog.cowger.us
AndyR8939
Enthusiast

No worries, wasn't sure if it pointed to anything, that's all.  Thanks!

AndyR8939
Enthusiast

Seems we might have found the root cause.


We spotted that a change had been made on one of the FC switches an hour before we started seeing the issues.  One of the other guys had zoned in a tape drive, so we rolled this back, but it didn't have any effect, so that appeared to be a red herring.  However, while in the FC switch we noticed that 1 of the 4 ports was showing a huge and increasing number of errors.  We disabled this port and instantly the latency dropped to between 2-7ms.
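For anyone hitting the same thing: if the fabric is Brocade-based (common in c-Class installs, though the switch model isn't stated here), the bad port shows up in the error counters, e.g.:

```shell
# Summarise error counters for every port; a port whose crc_err or
# enc_out counts keep climbing between runs usually has a failing
# cable or SFP.
porterrshow

# Per-port detail, then take the suspect port out of the fabric.
# Port 12 is illustrative only.
portshow 12
portdisable 12
```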

Have now put the issue over to HP to either replace the faulty cable/SFP or investigate a fault with the FC switch port.


Thanks for the help!
