Just over a week ago I had the privilege of riding along with VMware's Professional Services Organization as they piloted a possible performance offering. We are considering two possible services: one for performance troubleshooting and another for infrastructure optimization. During this trip we piloted the troubleshooting service, focusing on the customer's disappointing experience with SQL Server's performance on vSphere.
If you have read my blog entries (SQL Server Performance Problems Not Due to VMware) or heard me speak, you know that SQL performance is a major focus of my work. SQL Server is the most common source of performance discontent among our customers, yet 100% of the problems I have diagnosed were not due to vSphere. When this customer described the problem, I knew this SQL Server issue was stereotypical of my many engagements:
"We virtualized our environment nearly a year ago and quickly determined that virtualization was not right for our SQL Servers. Performance dropped by 75% and we know this is VMware's fault because we virtualized on much newer hardware on the exact same SAN. We have since moved the SQL instance back to native."
Most professionals in the industry stop here, incorrectly bin this problem as a deficiency of virtualization, and move on with their deployments. But I know that vSphere's abilities with SQL Server are phenomenal, so I expect to make every user happy with their virtual SQL deployment. I start by challenging the assumptions and trust nothing that I have not seen for myself. Here are my first steps on the hunt for the source of the problem:
Instrument the SQL instance that has been moved back to native to profile its resource utilization. Do this by running Perfmon to collect stats on the database's memory, CPU, and disk usage.
Audit the infrastructure and document the SAN configuration. Primarily I will need RAID group and LUN configuration and an itemized list of VMDKs on each VMFS volume.
Use esxtop and vscsiStats to measure resource utilization of important VMs under peak production load.
There are about a dozen other things that I could do here, but in my experience I can find 90% of all performance problems with just these three steps. Let me start by showing you the two RAID groups that were most important to the environment. I have greatly simplified the process of estimating these groups' performance, but the rough estimate will serve for this example:
RAID group A: RAID5 using 4 15K disks
4 x 200 = 800 IOPS
RAID group B: RAID5 using 7 10K disks
7 x 150 = 1050 IOPS
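The rough estimate above is nothing more than spindle count multiplied by a per-disk IOPS figure. Here is a minimal Python sketch of that arithmetic, using the per-disk numbers quoted above (200 IOPS for a 15K disk, 150 for a 10K disk). The function and dictionary names are hypothetical, and like the estimate itself the sketch deliberately ignores the RAID5 write penalty, array cache, and controller limits:

```python
# Rough per-disk IOPS figures taken from the estimate in the text.
PER_DISK_IOPS = {"15K": 200, "10K": 150}


def estimate_raid_group_iops(disk_count, disk_type):
    """Back-of-the-envelope IOPS: spindles times per-disk IOPS.

    Ignores the RAID5 write penalty, cache, and controller limits,
    so treat the result as a rough ceiling, not a measurement.
    """
    return disk_count * PER_DISK_IOPS[disk_type]


# The two groups from this engagement.
group_15k = estimate_raid_group_iops(4, "15K")  # 4-disk 15K group
group_10k = estimate_raid_group_iops(7, "10K")  # 7-disk 10K group
print(group_15k, group_10k)
```

A real sizing exercise would also weigh the read/write mix, but for spotting a gross mismatch between demand and capacity this level of detail is usually enough.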
We found two SQL instances in their environment that were generating significant IO: one that had been moved back to native and one that remained in a virtual machine. By using Perfmon for the native instance and vscsiStats for the virtual one, we documented the following demands during a one-hour window:
In the customer's first implementation of the virtual infrastructure, both SQL Servers, X and Y, were placed on RAID group A. But in the native configuration SQL Server X was placed on RAID group B. This meant that the storage bandwidth of the physical configuration was approximately 1850 IOPS. In the virtual configuration the two databases shared a single 800 IOPS RAID volume.
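To make the comparison concrete, here is a small Python sketch of the two layouts. The group capacities are the rough estimates from above; the helper names are hypothetical, and the even split of a shared group between its tenants is an assumption for illustration only, since real workloads rarely divide a RAID group evenly:

```python
# Estimated capacity of each RAID group, from the rough figures in the text.
RAID_GROUP_IOPS = {"A": 800, "B": 1050}


def available_iops(placement):
    """Total estimated IOPS across the distinct RAID groups in use."""
    return sum(RAID_GROUP_IOPS[g] for g in set(placement.values()))


def even_share(placement, server):
    """IOPS for one server, ASSUMING its group is split evenly among tenants."""
    group = placement[server]
    tenants = sum(1 for g in placement.values() if g == group)
    return RAID_GROUP_IOPS[group] / tenants


# Server -> RAID group for each layout described in the text.
physical_layout = {"X": "B", "Y": "A"}
virtual_layout = {"X": "A", "Y": "A"}

print(available_iops(physical_layout), available_iops(virtual_layout))
print(even_share(physical_layout, "X"), even_share(virtual_layout, "X"))
```

Under that even-split assumption, SQL Server X drops from the full 1050 IOPS of group B to roughly half of group A's 800.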
It does not take a rocket scientist to realize that users are going to complain when a critical SQL Server instance goes from 1050 IOPS to 400. And this was not news to the VI admin on-site, either. What we found as we investigated further was that virtual disks requested by the application owners were used in unexpected and undocumented ways and frequently demanded more throughput than originally estimated. In fact, through vscsiStats analysis (Using vscsiStats for Storage Performance Analysis), my contact and I were able to identify an "unused" VMDK with moderate sequential IO that we immediately recognized as log traffic. Inspection of the application's configuration confirmed this.
Despite the explosion of VMware into the data center, we remain the new kid on the block. As soon as performance suffers, the first reaction is to blame the new kid. But next time you see a performance problem in your production environment, I urge you to look at the issue as a consolidation challenge, and not a virtualization problem. Follow the best practices you have been using for years and you can correct this problem without needing to call me and my colleagues to town.
Of course, if you want to fly us out to help you correct a specific problem or optimize your design, I promise we will make it worth your while.