VMware Cloud Community
ChrisVLXM
Contributor

Accelerating and Optimizing Server Performance and Capacity: 5 Free Methods for VMware Servers

Since Veloxum makes software geared toward server optimization to increase the performance and capacity of physical and virtual servers, people often ask us, “Do you have any free advice on how we can increase the performance and capacity of our servers?” For these people I often provide the following five (5) often-overlooked free methods.


1.      Make sure your VMware Tools are up to date on all guests. VMware’s engineers, like many others, continually improve VMware’s performance release-to-release. You cannot take advantage of those improvements if you run older versions.

2.      Place identical operating systems on the same host. This simple tip can increase memory sharing significantly.

3.      Using iSCSI? Have you set up your environment for jumbo MTUs? If your environment supports it, you can expect a 40% or greater improvement in sustainable I/O.

4.      Using shared storage? Use the vStorage APIs for Array Integration (VAAI) if possible. Review http://kb.vmware.com/kb/1005009 and http://kb.vmware.com/kb/1021976 for more details.

5.      Another shared storage tip: perform regular virtual machine snapshot maintenance. Keep only the snapshots you need, and remove them entirely if at all possible.
Lastly, one monitoring tip I always recommend: enable logging level “3” in vCenter. Several latency (delay) metrics are not available unless you do this. As you search for your own ways to increase performance, these metrics will help you spot trouble areas.

If you want more tips and methods: veloxum

11 Replies
rickardnobel
Champion

Chris C. wrote:

3.      Using iSCSI? Have you set up your environment for jumbo MTUs? If your environment supports it, you can expect a 40% or greater improvement in sustainable I/O.

40% or more just from jumbo frames? That seems like quite a high promise. Why would you see such an increase in performance, assuming good hardware to begin with?

My VMware blog: www.rickardnobel.se
KevinCornell
Contributor

Increasing the load capacity of the frame means you're moving more data per packet. That means you have increased the efficiency of your command-to-data ratio: less time doing handshaking per byte moved.

But keep in mind that even though you have increased the amount of data being transmitted, you still need a balanced transmit/transport/receive flow. If you push too much data onto the network you can cause an overflow condition in the receive buffer. If this happens, your transmit buffer will receive a halt command, and this may trigger a restart in a slow-start condition. This will generate a saw-tooth graph of network throughput with an average that is far less than 50% of potential.
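The overflow-and-restart behavior described here is essentially additive-increase/multiplicative-decrease. As a rough illustration (a toy model, not a real TCP simulation; the capacity and step values are made up), the saw-tooth shape can be sketched like this:

```python
def aimd_sawtooth(link_capacity=100, rounds=60):
    """Toy AIMD sender: ramp the rate up linearly, halve it on overflow."""
    rate, rates = 10, []
    for _ in range(rounds):
        rates.append(rate)
        if rate > link_capacity:   # receive buffer overflowed: back off
            rate //= 2             # multiplicative decrease ("restart")
        else:
            rate += 5              # additive increase per round trip
    return rates

rates = aimd_sawtooth()
average = sum(rates) / len(rates)
# The rate repeatedly climbs past capacity and collapses, so the
# long-run average sits well below the peak rate.
```

Plotting `rates` gives exactly the saw-tooth curve described above, with an average below what the link could sustain.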

So, network management, as in life, needs a measured, balanced approach.

rickardnobel
Champion

KevinCornell wrote:

Increasing the load capacity of the frame means you're moving more data per packet. That means you have increased the efficiency of your command-to-data ratio: less time doing handshaking per byte moved.

Yes, there is less protocol overhead with jumbo frames compared to standard-sized frames (I have written about this in some detail here), but the difference is more like 4% than 40%.

My VMware blog: www.rickardnobel.se
mlazar2000
Contributor

As with all optimization discussions, how you optimize and what your assumptions are will affect the optimization’s success. Theory aside, with regard to iSCSI and increased throughput, you can gain faster speeds with jumbo frames, but you need to take the following into consideration:

The number of spindles available and the RAID level in use may be more of a factor than anything else. If you have few spindles, or have created RAID (disk) groups that only use a few spindles, your limiting factor will likely have nothing to do with the frame size. I have seen companies create RAID-6 groups with four spindles and then wonder why performance is so poor.

Do your switches properly support jumbo frames (do they have large enough buffers)? Some switches officially support jumbo frames (and VLANs), but their buffers are simply not adequate; there is “support,” and then there are products designed for jumbo frames.

TCP and SCSI considerations: what type of TCP congestion control is your SAN vendor using versus what ESX is using? (It is not as straightforward as you may think.)

How many NICs are connected to your iSCSI SAN? Are you using MPIO? Has MPIO been set up to change IOPS per round-robin session? There are some excellent articles on iSCSI (and MPIO) in the following places that discuss these issues (I’ve included the older posts because the diagrams and explanations are helpful):

http://tools.ietf.org/html/rfc3720

http://virtualgeek.typepad.com/virtual_geek/2009/01/a-multivendor-post-to-help-our-mutual-iscsi-cust...

http://virtualgeek.typepad.com/virtual_geek/2009/09/a-multivendor-post-on-using-iscsi-with-vmware-vs...

Now, assuming you have many spindles available and your switches are properly handling the larger MTU, what difference can it make?

It potentially reduces packet-processing operations by a factor of six. An MTU size of 9000 is large enough to accommodate 8K of data plus overhead. If your VM is a database server, this can make a big difference.
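The “factor of six” is easy to sanity-check with a little arithmetic, assuming a single 8 KB database I/O and 40 bytes of IP/TCP header per packet (illustrative values, not measurements):

```python
import math

io_size = 8192               # one 8 KB database block
mss_standard = 1500 - 40     # usable TCP payload per standard frame
mss_jumbo = 9000 - 40        # usable TCP payload per jumbo frame

# Packets needed to carry one 8 KB I/O at each frame size.
packets_standard = math.ceil(io_size / mss_standard)
packets_jumbo = math.ceil(io_size / mss_jumbo)
print(packets_standard, packets_jumbo)  # 6 1
```

Six packets per I/O at MTU 1500 versus one at MTU 9000, which is where the factor-of-six reduction in packet-processing operations comes from.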

Keep in mind that on top of the pure CPU overhead of packet assembly and disassembly, there is also the latency introduced by the operation itself.

Depending on your iSCSI vendor, delayed TCP ACKs may not be a good setting. Additionally, consider that “something” (the CPU) needs to keep track of the delayed ACKs. You should check with your SAN vendor for their recommended setting.

Vendors also confirm that jumbo MTU has a large impact. Here is a reference: http://www.netapp.com/us/library/technical-reports/tr-3409.html (it claims up to a 30% throughput improvement; a bit dated but still relevant).

A seminal paper by Matthew Mathis, Jeffrey Semke, Jamshid Mahdavi, and Teunis Ott, found here: http://www.psc.edu/networking/papers/model_abstract.html, models TCP throughput with the following formula:

Throughput <= ~0.7 * MSS / (rtt * sqrt(packet_loss))

The equation above tells us that, everything else being equal, you can double your throughput by doubling the packet size. Of course, you need enough spindles and aggregate bandwidth available to handle it.
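As a sketch, the formula can be evaluated directly; the rtt and packet_loss values below are illustrative, not measurements:

```python
import math

def mathis_throughput(mss, rtt, packet_loss):
    """Mathis et al. upper bound on TCP throughput."""
    return 0.7 * mss / (rtt * math.sqrt(packet_loss))

# Standard (MTU 1500) vs. jumbo (MTU 9000) TCP payload sizes.
standard = mathis_throughput(mss=1460, rtt=0.1, packet_loss=0.001)
jumbo = mathis_throughput(mss=8960, rtt=0.1, packet_loss=0.001)

# The bound scales linearly with MSS, all else being equal:
print(round(jumbo / standard, 2))  # 6.14
```

Because the bound is linear in MSS, doubling the segment size doubles the bound; whether real traffic sees that gain depends on the rtt and loss terms actually dominating.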

I happen to be helping a client this weekend convert to jumbo MTU. They are on ESX 4.1 using Broadcom NICs and Dell MD3200 storage systems. Assuming they do not mind, I will post the results next week.

rickardnobel
Champion

mlazar2000 wrote:

A seminal paper by Matthew Mathis, Jeffrey Semke, Jamshid Mahdavi, and Teunis Ott, found here: http://www.psc.edu/networking/papers/model_abstract.html, models TCP throughput with the following formula:

Throughput <= ~0.7 * MSS / (rtt * sqrt(packet_loss))

The equation above tells us that, everything else being equal, you can double your throughput by doubling the packet size.

The link you provide is a bit old (from 1997) and might not really apply in this case. The throughput formula is valid, but mostly on WANs with very high latency. On local networks at gigabit speeds and sub-millisecond latencies, that principle will perhaps not matter as much as on WAN links.

My point is that I think jumbo frames are very good, but assuming that you already have good-quality switches that can deliver wire-speed throughput and have enough queue depth on their ports, plus good physical NICs with hardware offload, the increase in throughput would be quite small. Essentially, you should already be able to get quite close to 1 Gbit/s of throughput, which leaves less room for bandwidth improvements.

My VMware blog: www.rickardnobel.se
mlazar2000
Contributor

It really does apply in most cases, including LAN links. The spindle count may still be the largest factor, but this is another critical piece. Try this with your system(s):


Use vmkping to test the latency of your iSCSI connections. Try it with little or no traffic, then try it while normal production traffic or synthetic traffic (Iometer, etc.) is running. Here is a very simple spreadsheet (example):


Throughput <= ~0.7 * MSS / (rtt * sqrt(packet_loss))

Potential Throughput    0.7    MSS     rtt    packet_loss
323184.8                0.7    1460    0.1    0.001
876583.4                0.7    3960    0.1    0.001
1762021.1               0.7    7960    0.1    0.001
1983380.5               0.7    8960    0.1    0.001



TCP window size also plays a role, and if the NICs can handle all the packet processing, it will certainly help.


If you have a disk subsystem that has the spindles, and the RAID groups are set up in a reasonable fashion, this will make a big difference. NetApp and other vendors (http://www.netapp.com/us/library/technical-reports/tr-3409.html) also support the claim that jumbo MTU makes a big difference in the proper environment.

rickardnobel
Champion

mlazar2000 wrote:

Potential Throughput    0.7    MSS     rtt    packet_loss
323184.8                0.7    1460    0.1    0.001
876583.4                0.7    3960    0.1    0.001
1762021.1               0.7    7960    0.1    0.001
1983380.5               0.7    8960    0.1    0.001


Could you explain the throughput numbers in your spreadsheet?

Do you mean that with MSS 1460 (i.e., the default MTU of 1500) you would get 323 KB/s of throughput? And with jumbo frames enabled we would reach 1.9 MB/s?

I still think the formula does not apply on modern LANs. :) I do agree that jumbo frames are good, but they will perhaps not always give a 40% increase in throughput.

My VMware blog: www.rickardnobel.se
mlazar2000
Contributor

The numbers represent the theoretical throughput when expressed in MSS (segment size). Since the MSS is already in bytes, the numbers would be 323 Mbit/s and 1.9 Gbit/s. Please keep in mind that TCP window size would also play a significant role here (when discussing TCP throughput).

The customer was able to schedule the changeover to jumbo MTU on their production systems. Fortunately the workload is repetitive, which makes for a nice comparison chart.

[Attachment: Screen shot 2011-12-02 at 8.41.16 AM.png]

Above is an eight (8) hour average plot of the VM's write rate. The last two full peaks (and third partial peak) on the right show the impact of moving to MTU 9000. The eight-hour average has moved from ~1800 Kbps to ~2600 Kbps. The only thing changed was the MTU.

rickardnobel
Champion

mlazar2000 wrote:

The numbers represent the theoretical throughput when expressed in MSS (segment size). Since the MSS is already in bytes, the numbers would be 323 Mbit/s and 1.9 Gbit/s.

I still believe the formula is not directly usable on modern gigabit LANs. The throughput limit is not 323 Mbit/s; I have seen many occasions where single servers (both Windows and ESXi, for example) have used almost the full 1000 Mbit/s, without any tweaking and with ordinary frame sizes. As before, I do like jumbo frames and think they are a good thing to set up.

The eight-hour average has moved from ~1800 Kbps to ~2600 Kbps. The only thing changed was the MTU.

It is interesting to see that you got an increase in throughput when changing the MTU, but the network usage is still very low. Even though you went from 1.8 Mbit/s to 2.6 Mbit/s (out of 1000 Mbit/s available), this is a very small share of what Gigabit Ethernet can deliver; you are only using about 0.2% of the bandwidth.
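The utilization figure is simple arithmetic (using the ~2600 Kbps average from the chart and a 1 Gbit/s link):

```python
avg_throughput_mbit = 2.6    # ~2600 Kbps average after the MTU change
link_speed_mbit = 1000       # Gigabit Ethernet

# Fraction of the available bandwidth actually in use.
utilization_pct = 100 * avg_throughput_mbit / link_speed_mbit
print(f"{utilization_pct:.2f}%")  # 0.26%
```

Roughly a quarter of one percent of the link, in line with the ~0.2% figure above.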

My VMware blog: www.rickardnobel.se
ChrisVLXM
Contributor

Looking at ricnob and mlazar2000, it seems both are “right.” Ricnob points out that many modern systems can achieve near-perfect throughput results. Mlazar2000 doesn't question this, but talks about how you need the backend systems set up so that you can generate the traffic necessary to fill the pipe (e.g., a correctly sized SAN with disk subsystems that can support full-rate traffic). I think mlazar2000's point is that once you optimize your server to get the traffic rate up, you can then use jumbo frames to get the fastest throughput. That's my reading, at least. Any comments?

rickardnobel
Champion

Hello Chris,

What you wrote is something I think is true. You will need to tune both the SAN array and the Ethernet network carrying the storage traffic to get the best results. As I have stated before, I do like jumbo frames and find them interesting to configure and troubleshoot, so I do not really disagree with your colleague on this.

As with everything performance related, it is interesting to discuss from different points of view and to be aware that the results of enabling jumbo frames can vary.

My VMware blog: www.rickardnobel.se