I'm seeing higher latencies than I would have expected for access/write times between the ESXi 4.1 U1 server and the Infortrend DS S16E-G1140-4 SAN. The value consistently floats around 15 ms for the LUN that is in active use. There are 2 virtual machines currently on the ESXi host: one is idle, the other is a monitoring server. The machines are on separate LUNs. When the backup runs it clones the machines to another LUN, and the latencies jump up further during this.
I find the latencies high given that the SAN is directly attached to the ESXi server via 2 cables, plugged from the interfaces on the storage into the interfaces on the ESXi server. Jumbo frames have been enabled, and the ESXi server itself is idle.
Before I put more guest machines onto this environment I would like to be sure that there are no issues with the storage configuration.
I have attached the exported SAN configuration.
The SAN is configured as follows:
14 disks in RAID 60 giving 9 TB of usable space. This is presented as 4 x 2 TB LUNs and 1 x 1 TB LUN.
Drives are Hitachi HUA72201.
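(For reference, assuming the 14 disks are the 1 TB HUA722010 models split into two 7-disk RAID 6 sets that are striped together, the space works out as 2 x (7 - 2) x 1 TB = 10 TB raw, which is roughly the 9 TB usable once formatted.)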
The ESXi server configuration:
ESXi 4.1 update 1
Manufacturer: Supermicro
Model: X8DT3
Processors: 12 CPU x 2.666 GHz
Processor Type: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
Hyperthreading: Active
Total Memory: 23.99 GB
Number of NICs: 4
State: Connected
Network:
Name PCI Driver Link Speed Duplex MAC Address MTU Description
vmnic0 0000:02:00.00 igb Up 1000Mbps Full 00:1b:21:8c:72:60 9000 Intel Corporation 82576 Gigabit Network Connection
vmnic1 0000:02:00.01 igb Up 1000Mbps Full 00:1b:21:8c:72:61 9000 Intel Corporation 82576 Gigabit Network Connection
vmnic2 0000:08:00.00 igb Up 1000Mbps Full 00:25:90:29:cc:5e 1500 Intel Corporation 82576 Gigabit Network Connection
vmnic3 0000:08:00.01 igb Up 1000Mbps Full 00:25:90:29:cc:5f 1500 Intel Corporation 82576 Gigabit Network Connection
/vmfs/volumes/4e4e6cb7-3b8e90fc-a964-001b218c7261/scripts/ghettoVCB # esxcfg-vswitch -l
Switch Name Num Ports Used Ports Configured Ports MTU Uplinks
vSwitch0 128 3 128 1500 vmnic2
PortGroup Name VLAN ID Used Ports Uplinks
VM Network 0 0 vmnic2
Management Network 0 1 vmnic2
Switch Name Num Ports Used Ports Configured Ports MTU Uplinks
vSwitch1 128 4 128 1500 vmnic3
PortGroup Name VLAN ID Used Ports Uplinks
Guest Network 0 2 vmnic3
Switch Name Num Ports Used Ports Configured Ports MTU Uplinks
vSwitch2 128 3 128 9000 vmnic0
PortGroup Name VLAN ID Used Ports Uplinks
iSCSI_0 0 1 vmnic0
Switch Name Num Ports Used Ports Configured Ports MTU Uplinks
vSwitch3 128 3 128 9000 vmnic1
PortGroup Name VLAN ID Used Ports Uplinks
iSCSI_1 0 1 vmnic1
Interface Port Group/DVPort IP Family IP Address Netmask Broadcast MAC Address MTU TSO MSS Enabled Type
vmk0 Management Network IPv4 192.168.XX.XX 255.255.255.0 192.168.XX.XX 00:25:90:29:cc:5e 1500 65535 true STATIC
vmk1 iSCSI_0 IPv4 10.10.10.21 255.255.255.0 10.10.10.255 00:50:56:72:13:82 9000 65535 true STATIC
vmk2 iSCSI_1 IPv4 10.10.11.21 255.255.255.0 10.10.11.255 00:50:56:7c:8b:76 9000 65535 true STATIC
Can you please advise where we can tweak things to get better performance?
Thank you.
Are the system logs on the ESX host showing any kind of storage errors or warnings?
Your iSCSI configuration seems to be in line with best practice.
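One quick thing worth doing is to confirm that jumbo frames actually pass end to end from the host to the SAN, to rule out an MTU mismatch somewhere on the path. A minimal check from the ESXi console, where the target address is just a placeholder for the SAN's iSCSI portal on the 10.10.10.x network:

vmkping -d -s 8972 10.10.10.1

The -d flag sets the don't-fragment bit and 8972 is the largest ICMP payload that fits in a 9000 byte MTU, so if this fails while a plain vmkping works, jumbo frames are not configured consistently along the path.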
When you check esxtop and break the latency down by LUN, are you seeing it on all LUNs or just some specific ones?
(esxtop -> d -> u -> L 40)
Sometimes a specific LUN or management LUN may be reporting the high latency.
In that case you need to check which VMs are running on that LUN, as they may be contributing to it.
You also need to see what tier (RAID level) the LUN is running on, in case it is contributing to the issue.
Also, is the latency DAVG or KAVG?
Hi.
Thanks for taking the time to advise.
I have attached a file with some captures of the esxtop output as specified. Is there a way to capture these results in this format at regular intervals?
The KAVG is very low, mostly 0.00, whereas the DAVG moves from 17 to as high as 55 for one LUN, and from as low as 0.28 up to 623.65 for the other.
The storage is 1 x RAID 60 array that I have divided up into 5 LUNs (4 x 2 TB, 1 x 1 TB).
I have tried the following as well...
I have removed the jumbo frames and tried each of the possible path configurations; the results were all the same.
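To be specific, by path configurations I mean switching each LUN between Fixed, Most Recently Used and Round Robin, either in the vSphere client or with something along these lines (the naa identifier below is just a placeholder for the LUN's device ID):

esxcli nmp device setpolicy --device naa.xxxxxxxxxxxxxxxx --psp VMW_PSP_RR

None of the policies made a noticeable difference.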
With the DAVG being so high, would this indicate that the issue is most likely with the storage?
Thanks
Hi,
Yes, DAVG relates to latency from the storage array. (If you experience high latency times, investigate the current performance metrics and running configuration of the switches and the SAN targets.)
Refer to http://kb.vmware.com/kb/1008205 for more info
It certainly looks like the storage array is the source of the high latency.
As a rule of thumb, an average DAVG below 10 ms indicates a well-performing SAN.
10-30 ms indicates serious performance issues (you will see slowdowns in VMs).
At 30+ ms you are in trouble.
Brief high peaks that disappear after a few seconds are nothing to worry about.
Your lowest latency on busy LUNs is still 10+ ms, which is still too high.
I would recommend focusing your attention on the array as the source of the issue.
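On your other question - yes, esxtop can log the same counters at regular intervals using batch mode, and the output opens in a spreadsheet or Windows perfmon. A rough example, where the 10 second interval and 360 samples are just illustrative (roughly an hour of data):

esxtop -b -d 10 -n 360 > /tmp/esxtop-capture.csv

The disk device columns in the capture include the per-LUN DAVG/KAVG values, so you can graph them over time.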
I have trashed the configuration that was on the SAN when I got it.
The original configuration was RAID 60: 2 RAID 6 arrays of 7 disks + 1 hot spare.
As a test I set up 1 x RAID 1 array of 4 disks + 1 hot spare. The result of this was write latencies generally not going past 14 ms, with an average of 7 ms, and this is while a rebuild of the new array was running in the background.
I would have expected the performance of the RAID 60 array to be better than that of the smaller RAID 1 array, given the vast difference in the number of spindles.
What would cause this ?
What tool can I use under Linux to benchmark the storage so that I have better comparable statistics?
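For example, I was thinking of something along the lines of fio. A rough sketch of the kind of run I have in mind, where the test file path is just a placeholder on the LUN being measured:

fio --name=randwrite-test --filename=/mnt/testlun/fio.tmp --size=1g --bs=4k --rw=randwrite --direct=1 --ioengine=libaio --runtime=60 --time_based

The latency figures fio reports would then at least be directly comparable between the RAID 1 and RAID 60 sets.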
Thanks
I thought RAID 60 requires at least 8 disks to work correctly... 4 in each of two RAID 6 arrays, which are then striped together. Have you tried configuring the RAID 60 with 8 disks?
Good Luck
True, 8 disks is the minimum requirement for RAID 60 (http://en.wikipedia.org/wiki/Nested_RAID_levels, and the same here: http://www.thinkmate.com/Storage/What_is_Raid).
The array was configured as 2 x RAID 6 with 2 hot spares. Thus the device was misconfigured.
It is a pity that the interface will allow one to misconfigure it.
Once I have tested the RAID 5 and RAID 1 configurations I will test RAID 60 with 8 disks and see if it works.
Thanks
G
Hi.
As a last resort I downgraded to ESXi 4.0 U1, and the performance doubled.
I originally started out with ESXi 4.1 U1, then upgraded to ESXi 5.0 when I thought it was the SAN and I needed support for LUNs larger than 2 TB.
I validated this by keeping the SAN configuration and reinstalling ESXi 5.0 again, and the performance reverted to the degraded state. I kept the same configuration on the SAN for both the ESXi 4.0 U1 and the new ESXi 5.0 installs.
What would cause ESXi 4.0 U1 to work correctly but ESXi 4.1 U1 and ESXi 5.0 to experience degraded performance?
Thanks
G