7 Replies Latest reply on Sep 21, 2008 12:32 PM by devzero

    The struggle for disk IO scalability in Server 1.x and RC2

    davidyon Lurker

       

      I've got a production 1.04 VMWare Server that performed "ok" at first, but I've since gotten steadily increasing complaints from users about sluggishness.  The server hosts six XP guests and three Linux guests (one FC4 + two RHEL4).  The XP guests are replacements for physical workstations, which folks RDP into to get their work done.  The RHEL4 machines host a live web server and the development sandbox for that server.  Most of the time only one or two of the XP workstations are active, and the web server sees very, very light use.

       

       

      The 1.x VMWare Server is deployed on the following:

       

      • Dell PowerEdge 1800

      • Dual 3GHz Xeon (hyperthreaded)

      • 8G RAM

      • 320G SATA in RAID1 using software RAID

      • OpenSuSE 10.1 (x64)

       

      Given that I had made some design mistakes in the original deployment that were certainly causing performance issues, and that we really needed a warm spare for this machine, I built a whitebox on which I am running tests and benchmarks:

       

      • ASUS P5BV-E/SAS

      • 2.5GHz Xeon Quad-Core

      • 8G RAM

      • 750G SATA in RAID1

       

      The drives are on the LSI MPT/1064e controller on the mainboard.  So far, my testing has shown Ubuntu 8.04 to be the performance leader among the supported x64 host OSes for RC2.  Software RAID seems to beat the MPT controller's own RAID by a noticeable margin, at least on block writes/rewrites (bonnie++ 1.03).
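
      For what it's worth, the software RAID side of that comparison is just a plain mdadm mirror; the setup looks something like this (device names and mount point are examples, not my exact layout):

        # build the RAID1 mirror out of two partitions (example devices)
        mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2

        # put a filesystem on it and mount it where the VM's will live
        mkfs.ext3 /dev/md0
        mkdir -p /vmstore
        mount /dev/md0 /vmstore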

       

       

      I took a recent backup of all the VM's from the deployed 1.x server and performed the following optimizations on the XP guests:

       

      • Converted all vmdk files to preallocated---they were originally growable.

      • Switched from IDE to LSI SCSI on the vmdk file

      • Defragged the virtual disk using JkDefrag within the guests

      • Installed the RC2 VMWare Tools (did not upgrade from Virtual Hardware 4 though)

      • Did the standard oft-cited vmx tweaks (MemTrimRate=0, etc)
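
      For completeness, the vmx tweaks I mean are the ones repeated all over these forums; treat this as a sketch of the usual suspects rather than an authoritative list, and double-check each one against your own setup:

        MemTrimRate = "0"
        sched.mem.pshare.enable = "FALSE"
        MemAllowAutoScaleDown = "FALSE"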

       

      On the host system, I did the following optimizations:

       

      • The standard /etc/sysctl.conf VMWare tweaks (vm.swappiness=0, et al)

      • The partition with the VM's is separate from the OS, and is on the faster part of the disk (i.e., the start)

      • The VM partition is ext3 with data=writeback, and set for noatime.

      • After the above vmdk tunings were complete, I loaded the VM's onto the VM partition fresh, so host fragmentation should be minimal

      • Kernel is booted with "elevator=deadline nohz=off"
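
      Spelled out, those host-side settings look roughly like this (device names, mount point, and the exact GRUB line are examples only):

        # /etc/sysctl.conf  (the "et al" varies from guide to guide; swappiness is the key one)
        vm.swappiness = 0

        # /etc/fstab entry for the dedicated VM partition
        /dev/md1   /vmstore   ext3   noatime,data=writeback   0   2

        # kernel line in GRUB's menu.lst
        kernel /boot/vmlinuz root=/dev/md0 ro elevator=deadline nohz=off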

       

      So ok, here I am at the end of the road in terms of knobs I can tweak.  I have a fast host machine, with the exception that it doesn't have a $5000 SCSI disk subsystem.  Oodles of RAM.  But the performance still sucks.

       

       

      I didn't go whole hog on getting every disk metric, but the basics tell enough of the story.  Run against the VM partition on the host, here are the numbers:

       

      • bonnie++ shows block I/O at 79MB/s write, 40MB/s rewrite, and 102MB/s read

      • hdparm -t shows 107MB/s read
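
      (For reference, those figures come from invocations along these lines - typical flags rather than my exact command lines - with the bonnie++ file size set to roughly twice RAM so the page cache can't flatter the results:)

        # run as a non-root user, or add -u <user> when running as root
        bonnie++ -d /vmstore -s 16384
        hdparm -t /dev/md1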

       

      Pretty decent, don't you think?  Well, inside a Linux guest, the bonnie++ numbers are very different:

       

      • 16MB/s write, 7MB/s rewrite, 9MB/s read

       

      Results are similar on the done-everything-I-can XP guests:

       

      • DiskTT shows 11MB/s write, 10MB/s read on a 2G test file

       

      I know that there's some apples/oranges going on with bonnie vs DiskTT, but that can't explain the vast difference.  Even worse, while DiskTT is running, the host's load average spikes into the 6-8 range, mostly due to iowait.  Now everything on the server, including the other guests, is all but stopped until DiskTT is done.

       

       

      I factored out contention between the guests by doing this test with only one XP guest running.  Nope, same speeds, same suspended animation of the server during the test. So this isn't a server that is suffering from overload.

       

       

      So I tweak, Google, tweak, Google, tweak, and Google some more.  I get the XP guests to work reasonably well, but any time the disk gets even moderate usage the entire server slows to a crawl.  I even completely disabled the paging file on the XP guests, which helps mitigate the really-slow-after-some-idle-time problem, since it limits what Windows can swap out to the ultra-ultra-slow disk it's dealing with.

       

       

      Finally, FINALLY, I hit this posting:

       

       

      Tips for Improving Performance on Linux Host 

       

       

      And that, my friends, was the magic bullet.  Sort of.  I took a lot of info from that thread and added one extra step:

       

      • mainMem.useNamedFile = FALSE (puts the vmem file in /tmp)

      • /tmp is mounted as tmpfs with a 12G limit
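
      Concretely, the change comes down to one line per guest plus one fstab line on the host (a sketch with the sizes and paths as described above; adapt to your own layout):

        # in each guest's .vmx file
        mainMem.useNamedFile = "FALSE"

        # /etc/fstab on the host: make /tmp a RAM-backed tmpfs, capped at 12G
        tmpfs   /tmp   tmpfs   size=12g   0   0

        # after mounting it (or a reboot), sanity-check:
        df -h /tmp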

       

      As the poster correctly surmises, the problem seems to lie in the memory-mapped file that VMWare Server insists on creating and keeping up to date.  The original poster seemed to eke out decent performance by optimizing disk writes to prevent the kernel from getting saturated, but that didn't work well for me.  Nope, I had to go and force the damned memory-mapped file back into RAM.  Once I did that, DiskTT gave me this happy news:

       

      • 64MB/s write, 97MB/s read

       

      Even better, the load average stayed in the low 1's, with iowait in the teens during writes and negligible during reads.  Performance on the workstations was near-native, even with the full complement of guests booted up.

       

       

      While I'm happy that I think I finally have a solution, the fact that this underlying problem exists has been driving me a bit batty for the past 24 hours.  As stated in the aforementioned posting: 

       

      I cannot claim to understand the reason for this - my host machine's RAM is not disk-backed - so why, when I tell VMWare to give a guest 2GB of my RAM, and not to put any in swap, does it back it with a file on disk?  I have yet to see a convincing explanation of this apparent madness.

      That's putting it rather nicely, if you ask me.  At this point I'd have some choicer words for it, but yes, madness indeed.

       

       

      I'm now really glad I put 8G of RAM in this box.  On the production server, the reported numbers from the MUI almost never get above 1G total usage for all running VM's, with a top spike of around 3-4G. 8G was beginning to look like a waste.

       

       

      With the tmpfs solution (the only solution that has worked for me), I now have a situation where VMWare forces gratuitously wasteful RAM usage.  Why?  Because while VMWare is extremely good at minimizing how much host RAM the guest's working set occupies, the vmem file is always allocated at the full RAM size of the guest.  So while a 1G guest might be eating less than 100MB while it's idle, the file up in /tmp (and because that's tmpfs, it's likely to be in RAM) is still going to be a full 1G in size.  On top of that, a page update within the guest can result in numerous writes on the host: at least one to copy the page into the RAM used by tmpfs, plus all the overhead of getting it out of the VM and through the filesystem layer.

       

       

      Madness!

       

       

      Observations welcome.  I'd love a solution that isn't so byzantine and distasteful, so I'm all ears.  But this is looking like the way I'll be forced to go.

       

       

      (apologies for the uneven formatting, I can't seem to get the Rich Text tab to do the right thing with paragraph spacing...)

       

        • 1. Re: The struggle for disk IO scalability in Server 1.x and RC2
          Expert

          Observations welcome.  I'd love a solution that isn't so byzantine and distasteful, so I'm all ears.  But this is looking like the way I'll be forced to go.

           

          Thanks for posting the results of some thorough and detailed investigation.

           

          We're aware of the .vmem backing store problem. In a nutshell, the problem is that even though we use an unlinked file in /tmp to back guest main memory, the Linux kernel insists on periodically flushing out writes to the .vmem file to disk. Unfortunately I'm not personally aware of the current status of this issue since I'm not working on it.

          • 2. Re: The struggle for disk IO scalability in Server 1.x and RC2
            davidyon Lurker

             

            Thanks for the quick response.

             

             

             Yes, I've read in many places that the Linux version simply uses an unlinked file in /tmp, but that raises the question of why the guest memory has to be disk-backed at all.  Is there some specific reason a memory-mapped file is set up?

             

             

             

            • 3. Re: The struggle for disk IO scalability in Server 1.x and RC2
              Expert
              davidyon wrote:

               

              Thanks for the quick response.

               

               

               Yes, I've read in many places that the Linux version simply uses an unlinked file in /tmp, but that raises the question of why the guest memory has to be disk-backed at all.  Is there some specific reason a memory-mapped file is set up?

               

              Unfortunately I don't work on the guest memory subsystem at all. The short answer to your question is: "yes, there are a number of specific reasons why we do this", but I don't know enough to explain them to you in more detail. I believe that all of this is part of how we implement support for large amounts of guest memory, e.g. an 8GB guest, which is a lot more than could fit inside the address space of a single process on e.g. a 32-bit host. I also know that we take great pains to give the kernel/OS hints that it should not be writing out changes to the memory back to disk, e.g. by locking the pages in memory and using an unlinked file in the first place (there's no point in writing out pages that will just be lost anyway once the file handle is closed).

               

              Anyway, hopefully one of the folks working on this problem who knows a lot more about the memory subsystem than I do will comment on this thread, though I can't promise that.

              • 4. Re: The struggle for disk IO scalability in Server 1.x and RC2
                ksc Expert
                VMware Employees

                Hey.  I do work on the memory subsystem, so I'll try to answer.

                 

                 Why disk-backed and not RAM-backed?  Disk-backed files have better properties at the edge cases.  First, suspend would be slower because of the copy, but let's ignore that case.  With a RAM-backed file, we would actually consume that entire amount of memory (exactly); underestimate and the VM dies at the wrong time, overestimate and we waste host memory.  Disk-backed means the estimate can move fluidly back and forth under the host OS memory manager's control.  More importantly, the host needs memory too - with disk-backed memory, the host can swap out the VM's memory to save itself, but with RAM-backed memory the host ends up using the OOM killer.  (Usually on the largest apps, e.g. your VMs).  The size isn't easy to estimate either - besides main memory, we need some memory for graphics, for the virtualization process itself, and for several other bookkeeping sources (which vary in size by number of VCPUs and number of devices, and add up to 5-25% of memory size).

                 

                 Linux also has two quirks that get in our way.  First, there is no interface to use swap-backed memory directly (Windows does have such an API, and so Windows usually doesn't have this particular performance problem).  Swap-backed memory would be better than file-backed memory because the kernel is less aggressive about pushing out dirty pages.  (For that matter, even an ioctl/fcntl hint that the page will be dirtied again soon and should be flushed lazily would be great; Linux has no such hint).  Second, Linux likes to immediately page out any dirty + unmapped memory.  (XP does the same; Vista and Mac OS X flush lazily, which gives better disk performance but risks data corruption if the host OS crashes).  We have some workarounds to do less unmapping of dirty pages, but ultimately there will always be unmap traffic, and they have to be marked dirty or the host will give us back different data.

                 

                For what it's worth, I agree that a tmpfs is about the best possible solution - the Linux kernel simply does not support the better options.  (You could also tweak the config option "tmpDirectory = /path/to/tmpfs" if you want to keep the VM's unlinked files separate from everybody else using /tmp).  Another option is "MemTrimRate", which has to do with us unmapping memory every so often to reduce memory usage when the guest is not touching memory, but I don't think this will help enough.  The other solution many people use to good effect is to put /tmp and the VM directories on separate physical drives so that each drive has a distinct access pattern instead of a single messy combined pattern; this is slightly less effective than a tmpfs, but could be cheaper than more RAM.  (Use a small, cheap, non-RAID drive - swap drives don't really need the long-term data integrity guarantees.)  Also note that your RAID-1 setup makes disk writes particularly expensive; for a write-heavy workload like a swap file, another drive is a really good idea.
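
                 To make the separate-spindle option concrete, a minimal sketch (device name and path here are just examples) would be:

                   # dedicate a cheap drive to the VMs' memory/scratch files
                   mkfs.ext3 /dev/sdc1
                   mkdir -p /vmtmp
                   mount -o noatime /dev/sdc1 /vmtmp

                   # then point the unlinked memory files there via the config option mentioned above
                   tmpDirectory = "/vmtmp"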

                 

                 

                The big-picture answer is that the Linux virtual memory subsystem is simply not tuned for running VMs.  No OS can be the best at everything.  If you really want to be at the higher ends of transfer rates for your storage devices, ESX is a much better option - ESX's virtual memory subsystem knows all about typical VM workloads and avoids the eager flush of dirty memory.  I'm guessing that ESX is not your ideal choice because you are here in the Server forum, but if you intend to run at a sustained disk utilization above ~25%, you should take another very serious look at ESX.  (Especially now that ESXi 3.5U2 is free.)

                • 5. Re: The struggle for disk IO scalability in Server 1.x and RC2
                  davidyon Lurker

                   

                   Thanks for the detailed and informative response.

                    

                   So in a nutshell, this is about memory usage optimization.  To paraphrase your answer, you use memory-mapped files (backed by a file in Linux, a file or swap in Windows) to allow you to allocate a large address space without that address space necessarily being located entirely in RAM.  I've seen how this is done in Win32 (and it's a useful part of the API), and I agree I haven't seen a good way to do it in Linux other than your memory-mapped file technique.

                   

                   

                   But if this is about memory optimization, and the best performance cure is tmpfs, then I would submit that the cure is worse than the disease.  Assuming you have enough RAM to avoid swapping, the tmpfs solution ends up costing up to 2x the RAM allocated to the VM.  Not to mention inefficient use of memory bandwidth: many memory writes can result in a second block write (to the RAM in tmpfs) in addition to the first write within the guest.  If you don't have enough RAM (and the tmpfs solution makes that a lot more likely), then you raise the chance that portions of tmpfs end up in swap, and you're back to the same risk of saturating the kernel with IO.  Not to mention the fact that there's been a tremendous dent put into the kernel's ability to cache.

                   

                   

                  Agreed that some of this could be helped by forcing the vmem onto another spindle. It would probably raise the threshold at which the server becomes IO saturated, but clearly that would still be much lower than with tmpfs. I'd caution against the notion that you wouldn't need RAID on that extra spindle. It would make writes less expensive, and yes, you care less about data integrity since this is just a scratch area. But one of the reasons for RAID is uptime, and not RAIDing the extra spindle puts disks back on the list of single points of failure. Personally I'd rather just throw RAM at the problem.

                   

                   

                   Look, these days, RAM is cheap.  x64 Linux distributions are widely available, and choices are plentiful.  The 8G I bought for the new whitebox was just over $200.  Big whoop.  Why not give the user the option to just throw RAM at the problem?  I.e., have a VMX tunable that says: "Really, really, just allocate the RAM from system memory".  Does it restrict your ability to optimize the guest's RAM footprint on the host?  You betcha.  But there are scenarios where that is perfectly acceptable, so what's the problem with giving the user one more tool in their belt?  As I've pointed out in several ways, the tmpfs workaround is less memory efficient than just opting to turn off the VM's memory size optimizations.

                   

                   

                  So how about it?  Let's have a VMX option useMemoryMappedFile=FALSE.

                   

                   

                   

                  • 6. Re: The struggle for disk IO scalability in Server 1.x and RC2
                    ksc Expert
                    VMware Employees
                    davidyon wrote:

                     But if this is about memory optimization, and the best performance cure is tmpfs, then I would submit that the cure is worse than the disease.  Assuming you have enough RAM to avoid swapping, the tmpfs solution ends up costing up to 2x the RAM allocated to the VM.  Not to mention inefficient use of memory bandwidth: many memory writes can result in a second block write (to the RAM in tmpfs) in addition to the first write within the guest.  If you don't have enough RAM (and the tmpfs solution makes that a lot more likely), then you raise the chance that portions of tmpfs end up in swap, and you're back to the same risk of saturating the kernel with IO.  Not to mention the fact that there's been a tremendous dent put into the kernel's ability to cache.

                     

                     The 2x usage is true for read()/write(), which have one copy in the file cache and another copy on the heap.  We use mmap(MAP_SHARED), so we need only the single page and no memcpy.  tmpfs uses memory in the host OS file cache, and with that mmap call we end up mapping directly to the underlying memory page, which is subsequently pinned in the page cache.  Since the kernel already allocated this memory to the tmpfs, we aren't impacting the kernel's cache any further.

                     

                    Properly used (which we do in the special case of this main memory file), tmpfs is only a 1x cost.

                     

                     Look, these days, RAM is cheap.  x64 Linux distributions are widely available, and choices are plentiful.  The 8G I bought for the new whitebox was just over $200.  Big whoop.  Why not give the user the option to just throw RAM at the problem?  I.e., have a VMX tunable that says: "Really, really, just allocate the RAM from system memory".  Does it restrict your ability to optimize the guest's RAM footprint on the host?  You betcha.  But there are scenarios where that is perfectly acceptable, so what's the problem with giving the user one more tool in their belt?  As I've pointed out in several ways, the tmpfs workaround is less memory efficient than just opting to turn off the VM's memory size optimizations.

                     

                    Well, I disagree about it being less efficient, but agree it would be nice to do this automatically.  Noted for the future :-).  And we are taking a hard look at why high amounts of virtual disk I/O cause so much traffic on the memory swap file - that is an unexpected effect.

                     

                     For an implementation that just uses RAM, it's not exactly trivial - kernels don't usually have large heaps from which we could get that much memory (only file caches and Windows' AWE support that much memory), so the file cache is a much better place to go.  Raw RAM is doable, but not convincingly better than tmpfs.  And tmpfs has very strong appeal because the kernel already has machinery to swap it out to reclaim memory.  Sure, that swapping has a performance cost, but I'd rather be swapping than crashing VMs.

                    • 7. Re: The struggle for disk IO scalability in Server 1.x and RC2
                      devzero Master

                       ksc, this was some very interesting and valuable information about memory usage on linux. thanks for sharing your knowledge

                       

                       anyway, from my personal experience, i have seen quite a noticeable number of issues when "mainmem.usenamedfile=false" was not in place - i.e. i have seen VMs running amok with regard to I/O behaviour after being powered on for some days, and it has been a good rule of thumb for me to always set this param to false.

                       

                      so, mainMem.useNamedFile = FALSE alone (i.e. without using tmpfs) has always been the magic bullet for me.

                       

                       i heard more than once that this should make no difference regarding I/O behaviour - if /tmp and /vmwarestore are on the same device, it should be irrelevant whether the vmem is stored unlinked in /tmp or as a named file in /vmwarestore/vmdir.

                       

                      i have no real explanation, but from my personal experience this param alone makes a real difference.

                       

                      any explanation for that?