BjornJohansson
Enthusiast

Help me understand why SIOC and shares did not protect my storage

Hi all,

Today we had a SQL server go bananas on our storage. It affected basically all VMs; when we killed the server, the problem went away.


While the problem is now solved (some dev messed up), I would like to understand why neither SIOC nor modifying the disk shares had any effect.


Environment:

  • ESXi 6.0 hosts running on HP BL460c Gen9 blade servers
  • NetApp MetroCluster running in Active/Active over Fibre Channel (with some aggregates/datastores not being replicated)
  • Datastores via NFS
  • SIOC enabled with 25 ms latency setting

To begin with:

We know that SIOC is not supported on metro storage. We also know that mixing workloads on the same disks (for example CIFS shares alongside VMware workloads) or internal jobs like deduplication may affect SIOC. The point here is that the problematic VM resides on a non-replicated datastore, and no dedup jobs, backup jobs, snapshots etc. were running during the problem. Still, the only SIOC event we could see on the datastore was: "An unmanaged I/O workload is detected on a SIOC-enabled datastore".


Problem:

While the problem was ongoing we could see VM write latency between 10 and 1000 ms. Read latency also jumped. The NetApp showed lower values, but had 100% drive utilization and back-to-backs.

Since we had found the bully VM, we let it run and started modifying the disk shares on the DB disk, but nothing happened. We also capped the disk at 500 IOPS, without any effect.

Looking at performance on the NetApp:

  • 300-350 MB/s write throughput
  • 3000 IOPS
  • 10-100 ms write latency
  • 15-70 ms read latency

There are 24 physical disks backing this datastore.
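
For what it's worth, here is a rough back-of-envelope calculation from the numbers above. It is only a sketch under assumptions that may not hold: that 1 MB = 1,000,000 bytes, that the 3000 IOPS are mostly the writes behind the throughput figure, and that the load is spread evenly over all 24 disks (ignoring parity disks, spares and caching). The function names are just for illustration.

```python
# Back-of-envelope figures derived from the NetApp numbers in this post.
# Assumptions (not measured): 1 MB = 1,000,000 bytes, the IOPS figure is
# dominated by writes, and I/O is spread evenly over all 24 disks.

def avg_io_size_kb(throughput_mb_s: float, iops: float) -> float:
    """Average I/O size in KB implied by a throughput and an IOPS figure."""
    return throughput_mb_s * 1000 / iops

def iops_per_disk(iops: float, disks: int) -> float:
    """IOPS each disk would see if the load were perfectly even."""
    return iops / disks

low = avg_io_size_kb(300, 3000)    # lower bound of observed throughput
high = avg_io_size_kb(350, 3000)   # upper bound of observed throughput
per_disk = iops_per_disk(3000, 24)

print(f"average I/O size: {low:.0f}-{high:.0f} KB")
print(f"IOPS per disk: {per_disk:.0f}")
```

Under those assumptions this works out to roughly 100-117 KB per I/O and about 125 IOPS per disk, so the workload looks like fairly large writes rather than many small ones.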


Can someone with more experience please help me understand this better? I know that some of these questions should be directed to NetApp, but I think SIOC and shares are relevant here.

Thank you.

/BL