VMware Cloud Community
aarondovetail
Contributor

Periods of high disk latency

Hello -- I just moved to a new company and I'm working on virtualizing most of the servers here. It's a smaller IT shop with 20-30 servers. I purchased 3 new HP DL380 G7 servers and have about half of the 20 physical servers converted. I've noticed an issue that pops up at what seem like random times.

There are 2 arrays (a 13-disk RAID 6 and a 5-disk RAID 5), 2 LUNs each, one VMFS datastore per LUN, shared between the 3 servers. This is on an HP P2000 G3 FC SAN (already in place before I arrived). Every once in a while I'll notice very high disk latency, seemingly only on the LUNs on the RAID 5 array.

I've attached a screenshot of ESXTOP running on all 3 ESX servers showing the high latency. I could understand it if there were very high usage, but as you can see from the screenshots there really isn't anything going on at the time. The only tweak I've made so far is increasing the queue depth to 64.
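(For anyone following along, the same counters can be captured non-interactively with esxtop's batch mode -- a sketch; the interval, sample count, and output path below are just examples:)

```shell
# esxtop batch mode logs all counters (including per-device DAVG/KAVG/GAVG
# latency) to CSV for offline review. Run on each host during a suspect window.
esxtop -b -d 5 -n 120 > /tmp/esxtop-latency.csv

# Interactively: run esxtop and press 'u' for the disk-device view.
# DAVG/cmd = latency at the array, KAVG/cmd = time queued in the VMkernel,
# GAVG/cmd = DAVG + KAVG, the total latency the guest sees.
```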

Using Brocade 8Gb HBAs hooked to HP (Brocade) switches. Everything is running at 8Gb, and all of the drives in the SAN are 6Gb SAS.

I'm new to this and to HP SANs (I came from an IBM DS4xxx world), so I'm not even sure how to monitor performance on the SAN itself; the web interface doesn't seem to have any monitoring options whatsoever.

Any help is appreciated --

Thanks

1 Solution

Accepted Solutions
mcowger
Immortal

I very much agree it was a LUN ownership issue.

As far as maxing out controllers, there's more to the question than just throughput. It's actually very common on these low-end arrays that, unless you are doing something trivial like a streaming read, the controller CPU will not be able to keep up with a high (or even moderate) IOPS workload. It's not the bandwidth, it's the CPU on the controller. It's actually even more common than you might expect on the large enterprise arrays (HDS USP, EMC Symm, etc.).

--Matt

VCP, VCDX #52, Unix Geek, Storage Nerd

6 Replies
jpdicicco
Hot Shot

I'm certainly not sure if this is what you're facing, but I have seen a similar issue with a bad port on a switch. It was causing dropped FC frames, but only intermittently. Fortunately for that company, they had HP servers, Brocade switches purchased through HP, and an HP SAN, so only HP and VMware were needed to troubleshoot... It only took 1 week for the engineers to analyze the logs and figure it out. :)

If it's intermittent like this and you have support, I would engage the vendor early in the process. The sooner they have logs to review, the sooner they become useful. :)



Happy virtualizing!

JP

Please consider awarding points to helpful or correct replies.

aarondovetail
Contributor

I tried one thing last night, but I don't have enough data yet to see if it resolved the issue. I noticed that the RAID 6 array that was working fine was owned by Controller B, while the RAID 5 was on Controller A. So last night I moved ownership of that vdisk and its LUNs to Controller B. I haven't seen any high latency since, but I'll have to wait a day or so to see if it holds up.

I was watching last night after I made the change, and while one of our nightly SQL updates was running I didn't see anything higher than 20 ms or so, but I'm not even sure whether what's happening is due to high load or is just random.

On a side note, does anyone know how to easily monitor I/O performance on this SAN (HP P2000)? Is it some 3rd-party tool, or hidden in the built-in HP tool?

Thanks

aarondovetail
Contributor

Since moving the RAID 5 array to Controller B I haven't seen any other incidents. I did see the latency spike to 150 ms once during heavy load, which isn't great, but it's a whole lot better than 500-1500 ms.

The only other array on Controller A is a 5 TB array for a pretty big physical SQL server that we have here. So is it really possible that one heavily used server is able to max out a whole controller in this P2000 SAN? Is this thing really that weak? Everything to the controllers is 8Gb, and I don't think it's even coming close to maxing out the bandwidth.

rsingler
Enthusiast

Yes, it is possible that the SQL server is causing the other servers attached to the SAN to suffer. I'm not really sure about the layout of the entire disk subsystem here, but you can use some simple math to figure out what you can do with what you have.

I usually use 150 IOPS as the number for what a 15K FC/SAS disk can handle. If you look at the two RAID groups the VMware cluster is using, you can effectively get about 1950 IOPS out of your RAID 6 group and 750 IOPS out of your RAID 5 group. That's not really very much during the heavy times, e.g. patch updates, virus scans, etc. You would probably be better served by taking all 18 of those disks and making a single RAID group to increase the overall IOPS you can get out of it. Of course, that assumes they are all the same size and there is no limitation imposed on you by the array.
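As a rough sketch of that math (150 IOPS per spindle is a rule of thumb for 15K drives, and this simple version ignores the RAID write penalty):

```shell
DISK_IOPS=150      # rule-of-thumb for a 15K FC/SAS spindle, reads
RAID6_DISKS=13
RAID5_DISKS=5

echo "RAID 6 group:        $((RAID6_DISKS * DISK_IOPS)) IOPS"
echo "RAID 5 group:        $((RAID5_DISKS * DISK_IOPS)) IOPS"
echo "One 18-disk group:   $(( (RAID6_DISKS + RAID5_DISKS) * DISK_IOPS )) IOPS"
```

which gives the 1950 and 750 figures above, and about 2700 IOPS if all 18 spindles were pooled into one group.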

As far as the SQL server goes, you need to look at the processor utilization on your SAN controller. I would bet it's up around 100% during the SQL server's peak loads. That would affect all other I/O going through that controller.
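If the web GUI doesn't show it, the P2000 G3 also has an SSH CLI with built-in statistics commands -- a sketch below; the command names are from the P2000 G3 CLI reference, so double-check them against your firmware revision (the host name and login are placeholders):

```shell
# SSH to the array's management IP (default admin account is 'manage').
ssh manage@p2000-mgmt 'show controller-statistics'   # per-controller CPU load, IOPS, MB/s
ssh manage@p2000-mgmt 'show vdisk-statistics'        # per-vdisk IOPS and throughput
ssh manage@p2000-mgmt 'show host-port-statistics'    # per-FC-port counters
```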

It sounds like you have some capacity planning to do across your entire environment. Who knows, you might be able to justify a new disk purchase with what you find. Hope this helps...

aarondovetail
Contributor

I haven't seen any further issues. I'm just surprised that one box could max out anything (CPU or bandwidth) on a SAN controller. I could go crazy on the DS4500s and DS4800s I've worked on and not max them out. Thanks for the help.

Now I just have to figure out how to even watch the CPU and utilization on this thing. The built-in HP tool is just horrible, and I thought IBM Storage Manager was bad...
