chris_delaney
Enthusiast

EqualLogic High Latency to ESX 4.1

Dear All,

We have a production EqualLogic group containing a PS3600 and a PS5000 connected to IBM x3650 M3 ESX 4.1 hosts via Cisco 3750 switches.  Everything had been running fine until last Friday morning, when we began to see excessive disk latency (occasionally approaching 1200 ms) on all ESX hosts, causing all our VMs and direct-attached NTFS iSCSI volumes to run extremely slowly.

As far as I know there aren't any issues with the Cisco equipment and nothing has changed on the hardware side (plus there are no hardware errors on either the EqualLogic arrays or the VMware hosts), so it's a bit of a puzzler.  I've been monitoring the network with various tools and there are no obvious problems with excessive broadcasts or flooding.  In fact, the network appears to be absolutely fine.

Has anyone else seen this kind of behaviour and, if so, where would be a good place to start in resolving it?  I've found an article about disabling delayed ACK, but I'm nervous in case it causes more long-term network issues - does anyone think this is a sensible place to start?

Many thanks.


Chris

7 Replies
vmroyale
Immortal

Hello.

How about your VMs?  Have you checked to see that one or more of them aren't issuing a ton of disk commands?

Good Luck!

Brian Atkinson | vExpert | VMTN Moderator | Author of "VCP5-DCV VMware Certified Professional-Data Center Virtualization on vSphere 5.5 Study Guide: VCP-550" | @vmroyale | http://vmroyale.com
chris_delaney
Enthusiast

Hello,

What's the best metric to use?  I'm looking at 'Read Requests' at the moment.  Would a different one be more useful, as the figures I'm seeing seem quite low (a maximum of 375 and a highest average of 5.6)?

Many thanks.

Chris

vmroyale
Immortal

I prefer to use esxtop and look at the CMDS/s.  In the vSphere client, I use the counters with a Rollup of "Summation" under Disk.
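If you capture esxtop in batch mode (something like "esxtop -b -d 5 -n 120 > stats.csv"), a quick script can pull the peaks out of the CSV rather than scrolling through it by hand.  This is only a rough sketch - the exact column naming in the batch output varies between builds, so treat the "Commands/sec" match below as an assumption and adjust it to whatever your header actually shows.

import csv

# Scan an esxtop batch-mode CSV and report the peak of every column whose
# header mentions "Commands/sec" (disk adapter/device command rates).
with open("stats.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    cols = {i: name for i, name in enumerate(header) if "Commands/sec" in name}
    peaks = {i: 0.0 for i in cols}
    for row in reader:
        for i in cols:
            try:
                peaks[i] = max(peaks[i], float(row[i]))
            except (ValueError, IndexError):
                continue  # skip blank or short rows

# Highest command rates first - compare these against the latency spikes.
for i in sorted(cols, key=lambda i: -peaks[i]):
    print(f"{peaks[i]:10.1f}  {cols[i]}")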

Brian Atkinson | vExpert | VMTN Moderator | Author of "VCP5-DCV VMware Certified Professional-Data Center Virtualization on vSphere 5.5 Study Guide: VCP-550" | @vmroyale | http://vmroyale.com
chris_delaney
Enthusiast

Thanks for that - I'll monitor them alongside the latency figures.

I've been able to disable delayed ACK on one of the ESX hosts' iSCSI initiators, but sadly it doesn't seem to have made any difference.

I've also logged a call with EqualLogic to see if they can shed any light on what might be happening.

Thanks again.


Chris

chris_delaney
Enthusiast

Dear All,

Following some delays in getting EqualLogic third-level engineers to look at this issue, it turns out that one of the drives in one of the arrays had a hardware fault which had seemingly gone unreported (by an alert) in both the main web GUI and SAN-HQ.  If you manually sifted through enough data, the queue depth for that drive showed a significantly higher figure at exactly the time the latency issues started, and it subsequently spiked in step with the latency.

EqualLogic have sent out a replacement drive but, in the interim, to improve performance we have simply pulled the faulty drive so that the array has failed over to a spare.  So far things appear to be OK - average latency has dropped from over 200 ms to about 14 ms.

It makes me question whether the EqualLogic alerts ought to flag this kind of issue up.  One drive failure pretty much brought the entire virtual infrastructure to a halt, which is contrary to what we've come to expect (and rely on) from our EqualLogic kit, which up to this point has been excellent.

We've requested that an alert based on disk queue depth be included in either the firmware or the next release of SAN-HQ, as I imagine we aren't the first (and certainly won't be the last) people to come across this problem.
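In the meantime, a small script along these lines could do the check against a CSV exported from SAN-HQ rather than sifting through the GUI.  It's only a sketch - the file name, the "Member"/"Disk"/"Queue Depth" column names and the threshold are all placeholders, so adjust them to whatever your export actually contains.

import csv

QUEUE_DEPTH_THRESHOLD = 8  # arbitrary example value - tune it to your own baseline

# Flag any drive whose reported queue depth looks abnormally high.
with open("sanhq_disk_export.csv", newline="") as f:
    for row in csv.DictReader(f):
        depth = float(row["Queue Depth"])       # hypothetical column name
        if depth > QUEUE_DEPTH_THRESHOLD:
            print(f"Member {row['Member']} disk {row['Disk']}: queue depth {depth}")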

Thanks for all your suggestions.

Chris

alvinswim
Hot Shot

Chris,

At the point you had high disk latency, did you see any of your VMs lose network connectivity or exhibit any kind of communications issues on the VM networks?

We have a wacky issue on our side that doesn't seem to be related to EQ, but Dell keeps pushing us towards an EQ latency issue.

chris_delaney
Enthusiast

Hi Alvinswim,

The main latency points coincided with the 09:30 mass logon and the 17:00 mass logoff as users came and went from work.  The main file store VM uses the Windows iSCSI initiator to connect directly to EQ volumes and then shares them out as NTFS shares.  Whilst Windows profiles were being moved around, things slowed down a LOT.

Generally speaking, though, the latency was consistently high throughout the whole time except at night, when most activity (apart from backups) stopped.  We never had any network dropouts on the VM networks per se (the corporate LAN as opposed to the SAN), but things were so slow that it was basically unusable.

It depends largely, I imagine, on how you are moving the iSCSI traffic around (i.e. is it VLAN'd on the corporate network or do you have a physically separate network for iSCSI traffic?) and whether you have separate virtual switches specifically for the connections between the ESX hosts and the EQ arrays, as per EqualLogic best practice.

I tend to separate the iSCSI traffic both physically and on the virtual side, just so it can't affect our aging corporate LAN!

From what you describe it sounds like you have the iSCSI traffic VLAN'd, so it might be worth putting a machine running Wireshark on that segment to see if there's anything weird happening on the network anywhere.
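If you'd rather script that check than stare at Wireshark, something like the pyshark sketch below gives a rough count of TCP retransmissions on the iSCSI port.  The interface name is a placeholder, tshark needs to be installed on the capture machine, and the retransmission attribute is how pyshark usually exposes tshark's analysis flags - so treat the whole thing as a starting point rather than anything definitive.

import pyshark

# Capture iSCSI traffic (TCP 3260) and count suspected retransmissions.
capture = pyshark.LiveCapture(interface="eth1", bpf_filter="tcp port 3260")

total = 0
retrans = 0
for pkt in capture.sniff_continuously(packet_count=5000):
    total += 1
    # tshark flags suspected retransmissions under tcp.analysis; pyshark
    # normally exposes that as the attribute below, but check on your version.
    if hasattr(pkt.tcp, "analysis_retransmission"):
        retrans += 1

print(f"{retrans} suspected retransmissions out of {total} iSCSI packets")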

My advice would be to have a look at SAN-HQ (if you haven't already) and check whether the latency shows up there alongside the latency being recorded by vCenter at the ESX host and VM level.  I can't emphasise enough how useful SAN-HQ is - even though it does look a little overwhelming the first time you open it up.
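For the vCenter side, if you want to pull the numbers out programmatically rather than clicking through the performance charts, a rough pyVmomi sketch along these lines would let you line the host latency peaks up against what SAN-HQ shows.  The hostname and credentials are placeholders, and I'm assuming the standard "disk.maxTotalLatency.latest" counter here - check it against the counters your environment actually exposes.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.local", user="administrator",
                  pwd="password", sslContext=ctx)
try:
    content = si.RetrieveContent()
    perf = content.perfManager

    # Map counter ids to dotted names such as "disk.maxTotalLatency.latest".
    names = {c.key: f"{c.groupInfo.key}.{c.nameInfo.key}.{c.rollupType}"
             for c in perf.perfCounter}
    wanted = [k for k, n in names.items() if n == "disk.maxTotalLatency.latest"]

    # Query the last ~5 minutes of real-time samples (20-second interval) per host.
    view = content.viewManager.CreateContainerView(content.rootFolder,
                                                   [vim.HostSystem], True)
    for host in view.view:
        spec = vim.PerformanceManager.QuerySpec(
            entity=host,
            metricId=[vim.PerformanceManager.MetricId(counterId=k, instance="")
                      for k in wanted],
            intervalId=20,
            maxSample=15)
        for metric in perf.QueryPerf(querySpec=[spec]):
            for series in metric.value:
                print(f"{host.name}: peak disk latency {max(series.value)} ms "
                      f"over the last five minutes")
finally:
    Disconnect(si)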

Cheers.

Chris
