I'm seeing higher latencies than I would have expected for access/write times between the ESXi 4.1 U1 server and the Infortrend DS S16E-G1140-4 SAN. The value consistently floats around 15 ms for the LUN that is in active use. There are 2 virtual machines currently on the ESXi host: one is idle, the other is a monitoring server. The machines are on separate LUNs. When the backup runs it clones the machines to another LUN, and the latencies jump up further during this.
I find the latencies high given that the SAN is directly attached to the ESXi server via 2 cables, plugged from the interfaces on the storage into the interfaces on the ESXi server. Jumbo frames have been enabled, and the ESXi server itself is idle.
Before I put more guest machines onto this environment I would like to be sure that there are no issues with the storage configuration.
I have attached the exported SAN configuration.
The SAN is configured as follows:
14 disks in RAID 60 giving 9 TB of usable space. This is presented as 4 x 2 TB LUNs and 1 x 1 TB LUN.
Drives are Hitachi HUA72201.
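(For reference, assuming the 14 disks are the 1 TB HUA722010 models split into two 7-disk RAID 6 sets that are striped together, the space works out as 2 x (7 - 2) x 1 TB = 10 TB raw, which is roughly the 9 TB usable once formatted.)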
The ESXi server configuration:
ESXi 4.1 update 1
Manufacturer: Supermicro
Model: X8DT3
Processors: 12 CPU x 2.666 GHz
Processor Type: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
Hyperthreading: Active
Total Memory: 23.99 GB
Number of NICs: 4
State: Connected
Network:
Name PCI Driver Link Speed Duplex MAC Address MTU Description
vmnic0 0000:02:00.00 igb Up 1000Mbps Full 00:1b:21:8c:72:60 9000 Intel Corporation 82576 Gigabit Network Connection
vmnic1 0000:02:00.01 igb Up 1000Mbps Full 00:1b:21:8c:72:61 9000 Intel Corporation 82576 Gigabit Network Connection
vmnic2 0000:08:00.00 igb Up 1000Mbps Full 00:25:90:29:cc:5e 1500 Intel Corporation 82576 Gigabit Network Connection
vmnic3 0000:08:00.01 igb Up 1000Mbps Full 00:25:90:29:cc:5f 1500 Intel Corporation 82576 Gigabit Network Connection
/vmfs/volumes/4e4e6cb7-3b8e90fc-a964-001b218c7261/scripts/ghettoVCB # esxcfg-vswitch -l
Switch Name Num Ports Used Ports Configured Ports MTU Uplinks
vSwitch0 128 3 128 1500 vmnic2
PortGroup Name VLAN ID Used Ports Uplinks
VM Network 0 0 vmnic2
Management Network 0 1 vmnic2
Switch Name Num Ports Used Ports Configured Ports MTU Uplinks
vSwitch1 128 4 128 1500 vmnic3
PortGroup Name VLAN ID Used Ports Uplinks
Guest Network 0 2 vmnic3
Switch Name Num Ports Used Ports Configured Ports MTU Uplinks
vSwitch2 128 3 128 9000 vmnic0
PortGroup Name VLAN ID Used Ports Uplinks
iSCSI_0 0 1 vmnic0
Switch Name Num Ports Used Ports Configured Ports MTU Uplinks
vSwitch3 128 3 128 9000 vmnic1
PortGroup Name VLAN ID Used Ports Uplinks
iSCSI_1 0 1 vmnic1
Interface Port Group/DVPort IP Family IP Address Netmask Broadcast MAC Address MTU TSO MSS Enabled Type
vmk0 Management Network IPv4 192.168.XX.XX 255.255.255.0 192.168.XX.XX 00:25:90:29:cc:5e 1500 65535 true STATIC
vmk1 iSCSI_0 IPv4 10.10.10.21 255.255.255.0 10.10.10.255 00:50:56:72:13:82 9000 65535 true STATIC
vmk2 iSCSI_1 IPv4 10.10.11.21 255.255.255.0 10.10.11.255 00:50:56:7c:8b:76 9000 65535 true STATIC
Can you please advise where we can tweak things to get better performance?
Thank you.
Are the system logs on the ESX host showing any kind of storage errors or warnings?
Your iSCSI configuration seems to be in line with best practice.
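One quick thing worth doing is to confirm that jumbo frames actually pass end to end from the host to the SAN, to rule out an MTU mismatch somewhere on the path. A minimal check from the ESXi console, where the target address is just a placeholder for the SAN's iSCSI portal on the 10.10.10.x network:

vmkping -d -s 8972 10.10.10.1

The -d flag sets the don't-fragment bit and 8972 is the largest ICMP payload that fits in a 9000 byte MTU, so if this fails while a plain vmkping works, jumbo frames are not configured consistently along the path.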
When you check esxtop and break the latency down by LUN, are you seeing it on all LUNs or just some specific ones?
(esxtop -> d -> u -> L 40)
Sometimes a specific LUN or management LUN may be reporting the high latency.
In that case you need to check which VMs are running on that LUN, as they may be contributing to it.
You also need to see what tier (RAID level) the LUN is running on, in case it is contributing to the issue.
Also, is the latency DAVG or KAVG?
Hi.
Thanks for taking the time to advise.
I have attached a file with some captures of the esxtop output as specified. Is there a way to capture these results in this format at regular intervals?
The KAVG is very low, mostly 0.00, whereas the DAVG moves from 17 to as high as 55 for one LUN, and from as low as 0.28 up to 623.65 for the other.
The storage is 1 x RAID 60 array that I have divided up into 5 LUNs (4 x 2 TB, 1 x 1 TB).
I have tried the following as well...
I have removed the jumbo frames and tried each of the possible path configurations; the results were all the same.
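To be specific, by path configurations I mean switching each LUN between Fixed, Most Recently Used and Round Robin, either in the vSphere client or with something along these lines (the naa identifier below is just a placeholder for the LUN's device ID):

esxcli nmp device setpolicy --device naa.xxxxxxxxxxxxxxxx --psp VMW_PSP_RR

None of the policies made a noticeable difference.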
With the DAVG being so high, would this indicate that the issue is most likely with the storage?
Thanks
Hi,
Yes, DAVG relates to latency from the storage array. (If you experience high latency times, investigate the current performance metrics and running configuration of the switches and the SAN targets.)
Refer to http://kb.vmware.com/kb/1008205 for more info
It certainly looks like the storage array is the source of the high latency.
As a rule of thumb, an average DAVG below 10 ms indicates a well-performing SAN.
10-30 ms indicates serious performance issues (you will see slowdowns in VMs).
At 30+ ms you are in trouble.
Brief high peaks that disappear after a few seconds are nothing to worry about.
Your lowest latency on busy LUNs is still 10+ ms, which is still too high.
I would recommend focusing your attention on the array as the source of the issue.
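On your other question - yes, esxtop can log the same counters at regular intervals using batch mode, and the output opens in a spreadsheet or Windows perfmon. A rough example, where the 10 second interval and 360 samples are just illustrative (roughly an hour of data):

esxtop -b -d 10 -n 360 > /tmp/esxtop-capture.csv

The disk device columns in the capture include the per-LUN DAVG/KAVG values, so you can graph them over time.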
I have trashed the configuration that was on the SAN when I got it.
The original configuration was RAID 60: 2 RAID 6 arrays of 7 disks + 1 hot spare.
As a test I set up 1 x RAID 1 array of 4 disks + 1 hot spare. The result of this was write latencies generally not going past 14 ms, with an average of 7 ms, and this is while a rebuild of the new array was running in the background.
I would have expected the performance of the RAID 60 array to be better than that of the smaller RAID 1 array, given the vast difference in the number of spindles.
What would cause this ?
What tool can I use under Linux to benchmark the storage so that I have better comparable statistics?
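For example, I was thinking of something along the lines of fio. A rough sketch of the kind of run I have in mind, where the test file path is just a placeholder on the LUN being measured:

fio --name=randwrite-test --filename=/mnt/testlun/fio.tmp --size=1g --bs=4k --rw=randwrite --direct=1 --ioengine=libaio --runtime=60 --time_based

The latency figures fio reports would then at least be directly comparable between the RAID 1 and RAID 60 sets.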
Thanks
I thought RAID 60 requires at least 8 disks to work correctly... 4 in each of two RAID 6 arrays, which are then striped together. Have you tried configuring the RAID 60 with 8 disks?
Good Luck
True, 8 disks is the minimum requirement for RAID 60 (http://en.wikipedia.org/wiki/Nested_RAID_levels, and the same here: http://www.thinkmate.com/Storage/What_is_Raid).
The array was configured as 2 x RAID 6 with 2 hot spares. Thus the device was misconfigured.
It is a pity that the interface will allow one to misconfigure it.
Once I have tested the RAID 5 and RAID 1 configurations I will test RAID 60 with 8 disks and see if it works.
Thanks
G
Hi.
As a last resort I downgraded to ESXi 4.0 U1, and the performance doubled.
I originally started out with ESXi 4.1 U1, then upgraded to ESXi 5.0 when I thought it was the SAN and I needed support for LUNs larger than 2 TB.
I validated this by keeping the SAN configuration and reinstalling ESXi 5.0 again, and the performance reverted to the degraded state. I kept the same configuration on the SAN for both the ESXi 4.0 U1 and the new ESXi 5.0 installs.
What would cause ESXi 4.0 U1 to work correctly but ESXi 4.1 U1 and ESXi 5.0 to experience degraded performance?
Thanks
G