SAP/SQL DB Poor Performance - OS/Hypervisor Problem??

INF Platform : SAP, SQL 2008 ENT, Windows 2008 R2, Cisco UCS, Nexus, NetApp Cluster

VMware Platform : ESXi 5.5U2, iSCSI VMFS Datastores

I am stuck trying to troubleshoot our database performance issues - consider this a challenge. During heavy database use, we're seeing very high latency in SAP and the OS. The trouble is that the infrastructure stats say otherwise: esxtop shows low latency and little queuing at the VM (LAT/rd, LAT/wr), the LUN/datastore (DQLEN, ACTV, QUED, LOAD) and the iSCSI/HBA adapter (KAVG/cmd, GAVG/cmd). The NetApp shows normal latency and correlates almost directly with the VM host's esxtop stats, which tells me the host stats are legitimate as far as the round trip from the VM host to the backend storage is concerned. Based on the latency SAP/SQL is reporting, I would have expected to see high latency in the host, network and/or storage as well - that is NOT the case. The SAP DEV team maintains that it's simply a case of poor IO, but IMO the INF numbers do not support that conclusion.

From within the OS, the perfmon stats correlate with what the SQL database (and consequently SAP) shows. In other words, the SQL and OS disk stats are consistent with each other, and the VM host and the rest of the infrastructure latency stats are consistent with each other (and acceptable), but the two groups disagree. My conclusion is that there is an issue between the OS and the hypervisor layer, but I'm not completely sure whether this is correct or how it can be addressed. During periods of poor performance under heavy IO, perfmon shows large disk queue lengths (100+), which suggests to me that the hypervisor cannot keep up with the IO requests, effectively causing a backlog at the OS layer rather than at the lower INF layer.
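One way to sanity-check this backlog reasoning is Little's Law (average IOs in flight = IOPS x latency). Below is a minimal Python sketch. The function names are hypothetical and the sample numbers are made up for illustration, not measurements from this environment. It estimates how many IOs the host-side latency can account for and how many would therefore be waiting above the hypervisor.

```python
def outstanding_io(iops: float, latency_ms: float) -> float:
    """Little's Law: average IOs in flight = arrival rate * time in system."""
    return iops * (latency_ms / 1000.0)


def implied_latency_ms(queue_length: float, iops: float) -> float:
    """Little's Law rearranged: latency implied by a queue length at a given IOPS."""
    if iops <= 0:
        raise ValueError("iops must be positive")
    return queue_length / iops * 1000.0


# Hypothetical sample values (assumptions, not real measurements).
iops = 5000.0               # sustained IO rate during the slow period
host_latency_ms = 2.0       # e.g. GAVG/cmd reported by esxtop
guest_queue_length = 100.0  # e.g. perfmon Avg. Disk Queue Length

host_in_flight = outstanding_io(iops, host_latency_ms)
guest_latency_ms = implied_latency_ms(guest_queue_length, iops)
waiting_above_host = guest_queue_length - host_in_flight

print(f"IOs in flight at the host: {host_in_flight:.1f}")
print(f"Latency implied by the guest queue: {guest_latency_ms:.1f} ms")
print(f"IOs waiting above the host: {waiting_above_host:.1f}")
```

With these sample numbers, the host accounts for only about 10 of the 100 queued IOs, so the remaining ~90 would be waiting somewhere between the guest OS and the point where esxtop starts measuring. Plugging in real perfmon and esxtop values for the same time window would show whether the gap is that large here.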

A couple of ideas I've come across: increase the number of VM SCSI controllers (max of 4, using PVSCSI for the data files); increase VM RAM in order to reduce disk IO; add additional datastores (for the data files) in order to multi-thread the IO requests between the iSCSI adapter(s) and the storage; and finally, tune the database and storage itself (e.g. reorg, table stats).
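To illustrate why more controllers and virtual disks might help, here is a rough Python sketch of a deliberately simplified model: each virtual disk gets a fixed number of queue slots, each controller caps the total across its disks, and anything beyond that waits in the guest. The default depths (64 per disk, 254 per adapter) are assumptions I'd want to verify against the VMware documentation for the driver in use, and the function names are hypothetical.

```python
def guest_queue_slots(num_controllers: int,
                      disks_per_controller: int,
                      per_disk_depth: int = 64,       # assumed default, verify
                      per_adapter_depth: int = 254):  # assumed default, verify
    """Total IOs the guest can have in flight before they back up in the OS."""
    per_controller = min(disks_per_controller * per_disk_depth, per_adapter_depth)
    return num_controllers * per_controller


def guest_backlog(outstanding: int, slots: int) -> int:
    """IOs that cannot be issued yet and so wait in the guest OS queue."""
    return max(0, outstanding - slots)


# Hypothetical workload: 100 IOs outstanding, as seen in perfmon.
outstanding = 100

one_disk = guest_queue_slots(num_controllers=1, disks_per_controller=1)
four_disks = guest_queue_slots(num_controllers=4, disks_per_controller=1)

print(f"1 controller, 1 disk:   {one_disk} slots, "
      f"backlog {guest_backlog(outstanding, one_disk)}")
print(f"4 controllers, 4 disks: {four_disks} slots, "
      f"backlog {guest_backlog(outstanding, four_disks)}")
```

Under these assumptions, a single disk on a single controller leaves part of a 100-deep workload waiting in the guest, while spreading the data files over four controllers does not. This ignores queue limits further down the stack (datastore/LUN and adapter queues), so it only describes the guest side.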

Has anyone else run into this with heavy IO databases? Do you have any other suggestions, or did you come to other conclusions? Did you find that the OS-to-hypervisor layer was the culprit? Obviously upgrading the storage, network or other INF could help, but if the problem lies between the OS and the hypervisor and is simply a matter of not being able to push enough IOPS through VMware, then that's not a great use of money.

