VMware Cloud Community
TomR_CAW
Contributor
Contributor

SQL Server Performance issues as a VM

Since converting an MSSQL server to a virtual machine we have seen severe performance drops and an array of unusual errors. Here's the scenario:

3 Dell PowerEdge 2950 servers each configured as follows:

2 PROC, Quad core Intel E5430 2.66Ghz

8GB RAM

2 onboard Broadcom GB NICs (used for service console and admin network only)

1 Intel GB 4 port NIC (2 for VMotion, 2 for Production VMs)

2 QLogic QLA2432 4GB HBAs

EMC CX300 SAN

Admin, VMotion and Production networks are physically separate, each with it's own GB switch.

We are running ESX 3.5 on all hosts with VC 2.5 setup as a VM. HA, DRS, VMotion all enabled. There are six VMs running currently arranged like so:

Host1

WebServer VM (4 vCPU, 3GB RAM, Win2K)

Host2

WebServer2 VM (1 vCPU, 1GB RAM, Win2K3), secondary sites, minimal traffic

WebServer3 VM (1 vCPU, 1GB RAM, Win2K3), not in production yet, no traffic

VirtualCenter VM (1 vCPU, 2GB RAM, Win2K3)

DataImport VM (1 vCPU, 1GB RAM, Win2K), processes data files to load into SQLServer

Host3

SQLServer VM (4 VCPU, 3GB RAM, Win2K, MSSQL2000 sp4) primary backend database

The WebServer, DataImport and SQLServer VMs were all converted using Converter Enterprise.

All VMs are stored on the SAN in LUN A, except SQLServer in LUN B.

All SQLServer databases (.mdf files) are in two RDMs to LUNs C & D.

All SQLServer log files are in the VM.

We operate approximately 200 SQL databases ranging in size from 20MB to 20GB. Most are around 750MB. There is a lot of SQL activity. We receive daily update files from clients to load into their databases. These files are stored in a separate physical server and processed by the DataImport VM which does most of the insert/update/delete functions. The web portals themselves are mostly just queries. Very few inserts or updates. The past couple weeks since converting everything to VMware we have seen a lot of performance issues. Web Portal requests routinely return ASP timeouts while waiting for responses from SQL. We have also seen a lot of other intermittent errors since the conversion like these:

"The RPC Server is unavailable" when accessing some pages.

and

"Microsoft OLE DB Provider for SQL Server error '80004005'

DBNETLIB ConnectionOpen (PreLoginHandshake()). General network error. Check your network documentation."

Both of these are intermittent and will often go away after refreshing a page. We've also seen these creep up on our web server: "Out of process application '/LM/W3SVC/1/ROOT/xxxxx' terminated unexpectedly."

After some monitoring with esxtop, VI client Performance, perfmon, and our switches the network, SAN fabric and memory do not appear to be dragging. The only thing that is jumping out at me is the SQLServer CPU usage. Since converting the VI client performance tab and esxtop is showing consistently above 90% CPU usage when our DataImport programs are running. If I shut them down the importers, usage drops to 30% or below. Also perfmon and taskman in the guest are showing barely 25% usage at the same time esxtop shows 100%.

I'm at a loss of where to look next. SQL Server was able to handle all site and importer requests perfectly prior to the conversion and that was with lesser hardware. Any help would be greatly appreciated.

Thanks.

Tom

Tags (1)
0 Kudos
48 Replies
spex
Expert
Expert

Tom, is your storage setup completely different o your former physical setup. Are you sharing the physical disks for lun's of sql server with other one's?

Are you using load balacing withing your virtual switch (which policy are you using?) Try using only one nic for production and see what happens....

Did you check the following article (irq sharing) http://www.vmware.com/resources/techresources/1061 ?

Best wishes

Spex

0 Kudos
TomR_CAW
Contributor
Contributor

We were using the CX300 in a direct attach to an old PowerEdge 2650. The databases were the only thing stored on the SAN at that time and those LUNs have not changed at all. Those are the RDMs setup on the new SQLServer VM. We added another drive cage to the CX300 when we converted. The new cage houses all the drives for the VMs. The two LUNs each encompass a 6 disk RAID5 array so the SQLServer VM is residing on it's own disk array.

I am using load balancing on each of my virtual switches based on Port ID. I'll check the article you've linked.

Thanks.

Tom

0 Kudos
TomR_CAW
Contributor
Contributor

Any chance a mod can move this thread to the ESX 3.5 forum? Looks like I posted it in the wrong area.

0 Kudos
Anders_Gregerse
Hot Shot
Hot Shot

You have only mentioned cpu, what about memory, network and disc load? You can use esxtop to see outstanding I/O in the hba queue and also nic and memory. I have a virtual MS SQL 2005 2 vcpu and 3Gb memory used for MOSS 2007 and I would also like it to run better. I had to add an extra vcpu due to bad performance, but as you might know adding vcpu's will not always improve performance due to the running in-sync (or almost in-sync).

0 Kudos
Anders_Gregerse
Hot Shot
Hot Shot

Is it a 32 or 64 bit SQL Server? Ours is a 64 bit

0 Kudos
TomR_CAW
Contributor
Contributor

I can't really find any statistics on memory, network, or disk that makes me think there is a problem there. I've used esxtop and the Performance graphs in VI Client and nothing looks excessive. Network usage on the host will spike at 2.5Mbps. Memory usage maxes out at 1.8GB (3Gb allocated to the VM) and that is just because SQL will use everything it's given. Zero ballooning as this is the only VM on the host. Disk I/O is barely touched from what I can see. When running under the heaviest load I still saw nothing in any of the hba (0, 1 or 2) queues when monitoring the host with esxtop. Monitoring the traffic on our fiber switches even showed minimal activity. The only thing that seems excessive is the CPU on the SQLServer VM. Even then the usage according to taskman and perfmon is very low.

It is a 32bit SQL Server.

0 Kudos
williambishop
Expert
Expert

Any chance you can take it back to 2 cpus, and then test again>?

--"Non Temetis Messor."
0 Kudos
TomR_CAW
Contributor
Contributor

I can try tonight. I tried reducing the number of PROCs SQL Server utiized to just 2 hoping to free up some of the resources for the OS or whatever else needed it so bad. This only made things worse. But that still had 4 vCPUs on the VM, just with 2 assigned to SQL rather than all 4.

0 Kudos
TomR_CAW
Contributor
Contributor

Reducing the SQLServer VM to 2 vCPUs resulted in even worse performance. This time I was seeing timeouts and general errors even without our data processing programs running. I set the VM back the way it was and reconfigured the WebServer1 VM to 2 vCPUs and 2GB of RAM just in case it was having an effect. I'm now at a loss to the real source. Before I was certain it was within the SQLServer VM. Now I was able to run numerous queries from Query Analyzer and most of them had no problems running. Those that did run slower were not unacceptable. However, our websites themselves are still seeing a lot of timeouts and other errors such as the RPC error and the General network error I referenced earlier.

0 Kudos
Anders_Gregerse
Hot Shot
Hot Shot

So increasing cpu's at both client and server, increase performance to an acceptable level, but before virtualizing the SQL everything ran fine? We are missing something in the SQL Server performance problem. I've read an article where they wrote about the query optimizer and that virtualizing SQL would invalidate all the performance tuning rules in the query optimizer, but that would also be true whenever technology is improved (replacing 10K rpm disks with 15K rpm disks). Have you tried using process monitor to see where all the power is going? I think I will try it when I'm at work in an hour or so. 64 bit is performing better when you have a high load and 4 vcpus (see )

0 Kudos
spex
Expert
Expert

Did you try to disable loadbalancing? Traffic between web frontend and sql db shouln'd need that...

Regards

Spex

0 Kudos
Anders_Gregerse
Hot Shot
Hot Shot

In my installation I see a match between the load in the SQL vm and what VirtualCenter is saying. I noticed that I have Data Execution Protection enabled and I'm not running fibres, but they can reduce context switching, which might help in a virtualized environment.

0 Kudos
williambishop
Expert
Expert

That is really peculiar. Like you, I'm not really certain your issue is really within the sql itself now. Try reducing the vcpu's for the web server again to 1 and see what it does. We always give our databases multiple cpu's (normally never above 2 even for our heavy hitters as peformance seems to be best at 2) but our webservers always get 1 because they seem to do better with a single vcpu.

--"Non Temetis Messor."
0 Kudos
TomR_CAW
Contributor
Contributor

Anders, that is exactly correct. Prior to virtualization the performance was excellent on lesser hardware. I have not used Process Monitor before so I'm not sure what I can extract out of the information. I tried it and found 250,000 entries to the screen within moments and cancelled the capture. According to the list of running processes in task manager sqlserver.exe is chewing up about 25%, sqlagent.exe is using 15% and System Idle Process is using 60%. This is when operating under load with our data importers running. Without load the figures are around 15%, 5% and 80% respectively.

I don't know that upgrading to a 64-bit OS and instance of SQL Server is an option at this time.

How can I make sure I have Data Execution Protection enabled and what do you mean by "not running on fibres"?

Spex, what load balancing are you referring to? NIC load balancing on the host?

william, I'll see what I can do about changing it to 1 vCPU. The real reason I suspected the SQL Server to begin with was of course the various SQL related errors (RPC and preloginhandshake as noted above) and the strange discrepency between taskman and VI client performance counters on the CPU usage. Also when the system started presenting these problems it was difficult to even execute queries from Query Analyzer directly from the SQLServer VM. I had trouble even pulling down the list of DBs within Query Analyzer. This does not seem to be the case anymore. I'm starting to wonder if it's somewhere in the communication between the WebServer VM and the SQLServer VM. Performance from my data importers on their VM is fine (even better in some cases). My WebServer2 I listed above has three sites that use small SQL databases on the SQLServer VM. These have no problems at all with performance to my knowledge. I'm going to do some further testing by duplicating a large database with access from a freshly installed Windows system and see if it incurs the performance hits or errors when the SQLServer is under load like the existing sites do.

Thanks for all your input guys. Keep it coming please.

0 Kudos
Anders_Gregerse
Hot Shot
Hot Shot

Have you made sure that all hardware specific drivers and vendor specific application (management agent, e.g.) have been removed from the virtualized server? These can also make life difficult. The 64-bit is just an option if the current performance is unacceptable and we can't find any other solution.

Data Execution Protection is set on the system (Right-click My Computer, properties, somewhere on one of the tabs) and in SQL 2005 it is also mentioned on the SQL Server properties.

In SQL Server you can run threads or fibres (disables CLR if you're running SQL2005). On the SQL Server properties, you can enable and disable fibres. Enabling fibres might improve your performance (based on the helptext saying it reduces context switching).

Have you checked duplex setting on the network links on both the hosts and switches, they can also cause some nasty problems and it can look like your symptoms as well. Just to make sure.

0 Kudos
TomR_CAW
Contributor
Contributor

I could not find any setup in Windows 2000 for Data Execution Protection. I did see it in Windows 2003 where you described. I have not tried fibers yet. All NICs and switches are running at 1000MB Full duplex.

I created a new VM with Windows Server 2003 Enterprise R2, fully patched with VMTools installed. 1 vCPU, 1024MB RAM. I then setup IIS 6 and put a dummy site up with access to a duplicate database on the existing SQL Server. When our data importers are running this system also experienced issues with timeouts, and all of the errors described at the beginning. So I'm back to thinking it's the SQLServer VM.

0 Kudos
williambishop
Expert
Expert

I may be hazy(in fact I know I am) but is this a new buildout of vmware, or are other vm's behaving as expected?

--"Non Temetis Messor."
0 Kudos
Anders_Gregerse
Hot Shot
Hot Shot

Perhaps it would be worth creating a 2003 with SQL 2000 and copy a database to see if the problems persist in a clean installation.

0 Kudos
TomR_CAW
Contributor
Contributor

william, this is a new install ov vmware with most of the VMs converted from physical machines using Converter Enterprise. Other VMs appear to be working fine, but I just may not have noticed any problems since they are not under heavy load. We are getting strange errors on our Web server once in a while. "Out-of-process" errors. The exact message is on the first post of this thread.

0 Kudos