BjornJohansson
Enthusiast

Help me understand why SIOC and shares did not protect my storage

Hi all,

Today we had a SQL server that went bananas on our storage. It affected basically all VMs; when we killed the server, the problem went away.


While the problem is now solved (some dev messed up), I would like to understand why SIOC and modifying disk shares had no effect.


Environment:

  • ESXi 6.0 hosts running on HP BL460c Gen9 blade servers
  • NetApp MetroCluster running in Active/Active over Fibre Channel (with some aggregates/datastores not being replicated)
  • Datastores via NFS
  • SIOC enabled with 25 ms latency setting

To begin with:

Metro storage does not support SIOC, we know. Also, mixing workloads on the same disks (for example CIFS shares mixed with VMware workloads) or internal jobs like deduplication may affect SIOC. The point here is that the problematic VM resides on a non-replicated datastore. No dedup jobs, backup jobs, snapshots etc. were running during the problem. Still, the only SIOC event we could see on the datastore was: "An unmanaged I/O workload is detected on a SIOC-enabled datastore".


Problem:

While the problem was ongoing we could see VM write latency between 10 and 1000 ms. Read latency also jumped. NetApp showed lower values, but had 100% drive utilization and back-to-backs.

Since we had found the bully VM, we let it run and started modifying disk shares on the DB disk, but nothing happened. We also capped the IOPS to 500 without any effect.

Looking at performance on the NetApp:

  • 300-350 MB/s write throughput
  • 3000 IOPS
  • 10-100 ms write latency
  • 15-70 ms read latency

There are 24 physical disks backing this datastore.


Can someone with better experience please help me understand this better? I know that some questions should be directed to NetApp, but SIOC and shares are relevant I think.

Thank you.

/BL

Accepted Solutions
MattiasN81
Hot Shot

There are several factors that come into play here.

SIOC itself only acts on ESXi workloads, not on workloads generated by other storage operations such as RAID rebuilds, CIFS workloads and so on. In your case, according to the message "An unmanaged I/O workload is detected on a SIOC-enabled datastore", SIOC detected latency above the specified threshold (25 ms), but because ESXi identified the workload as a non-ESXi workload, SIOC couldn't do anything with it other than report it.

Here is where the tricky part comes in.

In this case it actually was a VM that caused the high latency, which we mere mortals would read as "Hey, a VM caused it, so it's sure as hell an ESXi workload". Well, that's not entirely true.

The type of workload and how the storage array handles it both play a part in how SIOC reacts to it. Therefore it is crucial to have an array/solution that is supported with SIOC.

I can take an example from my own experience with SIOC and an EMC array running an unsupported setup with auto-tiering and FAST Cache.

The problem was the same as yours: a VM did some stuff that resulted in high latency. The problem wasn't the VM's workload per se. When the VM started to do its thing, the storage array did what it was supposed to do: place hot data in the cache, move cold data to disks and kick in a tiering job. Due to the extremely high workload on the VM, the array couldn't keep up, and the result, from VMware's perspective, was high latency on that datastore. SIOC couldn't do anything about it, because it was never the VMs that caused the latency but storage operations in the backend.

I hope this clarifies a little how SIOC operates.

VMware Certified Professional 6 - DCV
VMware VTSP Software Defined Storage
Dell Blade Server Solutions - EMEA Certified
Dell PowerEdge Server Solutions - EMEA Certified
Dell Certified Storage Deployment Professional
Dell EMC Proven Professional
If you found my answers useful please consider marking them as Helpful or Correct

BjornJohansson
Enthusiast

Hi Mattias, and thank you for your reply!

OK, sounds logical, but I can't figure out what the NetApp controller did in response that got the I/O identified as a non-ESXi workload. It does have a read cache that could probably interfere, but the VM was constantly writing. Will check with NetApp.

Any idea why shares didn't work? Did I set them too high?

Would be nice to be able to cap a VM without killing it, if this happens again.

Thanks,

/Bjorn

MattiasN81
Hot Shot

Did you set a share value on all the virtual disks on the SIOC-enabled datastore?

[Screenshot attachment: Capture.JPG]

If you change the shares to Low or set an IOPS limit on a VM, you can at least have some control if a VM starts writing/reading like a lunatic.
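To illustrate why the share value on every disk matters, here is a minimal Python sketch of proportional shares. It is only a simplified model and not SIOC's actual algorithm: it assumes the datastore is under contention and that each disk's portion of the available I/O is simply its shares divided by the sum of all shares. The Low/Normal/High values are the vSphere disk share presets; the VM names are made up for the example.

```python
from typing import Dict

# vSphere disk share presets (Low / Normal / High).
SHARE_PRESETS = {"low": 500, "normal": 1000, "high": 2000}


def share_fractions(disk_shares: Dict[str, int]) -> Dict[str, float]:
    """Return each disk's fraction of the total shares.

    Simplified model: under contention a disk gets shares / sum(shares).
    This illustrates the idea only; it is not how ESXi schedules I/O.
    """
    total = sum(disk_shares.values())
    if total <= 0:
        raise ValueError("total shares must be positive")
    return {name: shares / total for name, shares in disk_shares.items()}


# Hypothetical datastore: one bully VM set to Low, three neighbours on Normal.
disks = {
    "sql-vm": SHARE_PRESETS["low"],
    "app-vm-1": SHARE_PRESETS["normal"],
    "app-vm-2": SHARE_PRESETS["normal"],
    "app-vm-3": SHARE_PRESETS["normal"],
}

fractions = share_fractions(disks)
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1%} of the contended I/O")
```

In this model the bully VM is entitled to 500 of 3500 shares, about one seventh. Shares are relative, so the value on one disk only means something compared with the values on the other disks of the same datastore, and they only come into play once SIOC actually throttles.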

BjornJohansson
Enthusiast

Indeed, I did change those values, without any impact. I could only test for a short while since we had complaints from other customers. If I recall correctly, the VM only generated ~480 IOPS but was constantly writing 300 MB/s to disk. Setting the IOPS limit to 100 would possibly have been better than the 500 cap I tried.
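A rough back-of-the-envelope check of those numbers in Python. This is only a sketch under simple assumptions: 1 MB is treated as 1024 KiB, every I/O is taken to be the average size, and the limit is assumed to count each I/O once regardless of size. How ESXi actually accounts for large I/Os against a limit is not covered here.

```python
def avg_io_size_kib(throughput_mib_s: float, iops: float) -> float:
    """Average size of one I/O in KiB for a given throughput and IOPS."""
    return throughput_mib_s * 1024 / iops


def throughput_at_cap_mib_s(iops_limit: float, io_size_kib: float) -> float:
    """Throughput (MiB/s) if the VM issues I/Os of io_size_kib at the limit."""
    return iops_limit * io_size_kib / 1024


def cap_is_binding(observed_iops: float, iops_limit: float) -> bool:
    """A limit only throttles if it is below what the VM already does."""
    return iops_limit < observed_iops


observed_iops = 480        # approximate value from the post
observed_throughput = 300  # MB/s, from the post

io_size = avg_io_size_kib(observed_throughput, observed_iops)
print(f"average I/O size: {io_size:.0f} KiB")

for limit in (500, 100):
    print(
        f"limit {limit}: binding={cap_is_binding(observed_iops, limit)}, "
        f"throughput at limit ~{throughput_at_cap_mib_s(limit, io_size):.1f} MB/s"
    )
```

Under these assumptions the average I/O is about 640 KiB, so a 500 IOPS cap sits above the ~480 IOPS the VM was doing and would not throttle it, while a 100 IOPS cap would hold it to roughly 62.5 MB/s.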

Thanks
