timalexanderINV
Enthusiast

Finding IO hog in vSAN

We have recently built a 10-node vSAN 6.2 cluster for our development and test environment. We have deployed SexiGraf and so far everything in the cluster looks OK. That said, I have one host that is showing very strange IO:

[Attached screenshot: pastedImage_0.png]

Is there any way to see what VM is behind this IO, or what objects this ESXi host is the owner of? From my understanding it does not have to be a VM running on this host, as the data could reside on any host in the cluster.

2 Replies
TheBobkin
Champion

Hello Tim,

Correct, it is not necessarily a VM running on this host; it could be any VM that has data components residing on this host's disk-groups.
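
If you want to see exactly which objects have components on this host's disk-groups, one rough way to check (a sketch, assuming you have RVC access on the vCenter Server Appliance; the cluster path and disk UUID below are example placeholders, not real values) is:

# From RVC, list each host's disks with their component counts and note the UUID of a busy disk
vsan.disks_stats ~/computers/YourCluster

# Then list the objects that have components on that disk
vsan.disk_object_info ~/computers/YourCluster 52e1d1a3-xxxx-xxxx-xxxx-xxxxxxxxxxxx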

So, as with any performance issue, a few things need to be established:

- Do these readings vary greatly from a longer-term workload baseline?

- Is this increased load negatively impacting the performance of other VMs, or does it just stand out on a graph?

- Do these high readings correlate with any specific times or activities? (e.g. backup jobs, provisioning/creating VMs, 9AM login/boot storms, large resyncs, huge file-server transfers). A quick way to rule out resyncs is sketched below.
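
For the resync question in particular, a quick check (a sketch, again assuming RVC access; the cluster path is an example) is the resync dashboard:

# Shows any objects currently resyncing and the bytes left to sync
vsan.resync_dashboard ~/computers/YourCluster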

You could start identifying the VMs that are causing the increased load by looking at:

- VM metrics in vCenter:

pubs.vmware.com/vsphere-60/topic/com.vmware.ICbase/PDF/vsphere-esxi-vcenter-server-60-monitoring-performance-guide.pdf

- vSAN Observer (or SexiGraf, if it gives a drill-down of each disk) - specifically, if you can identify single disks/disk-groups that see huge IO and at the same time see similarly high usage on another host's disk/disk-group, then the data components of the VM responsible will be on both of these (assuming FTT=1 Objects). An example of launching vSAN Observer from RVC follows this list.

- esxtop (for both disk IO and VM IO)
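
For vSAN Observer, a minimal way to launch it (a sketch from RVC on the vCenter Server Appliance; the cluster path is an example, adjust to your inventory) would be something like:

# Start the Observer web UI (by default on https://<vcenter>:8010) and collect live stats
vsan.observer ~/computers/YourCluster --run-webserver --force

# Or collect for a fixed period and generate an offline HTML bundle instead
vsan.observer ~/computers/YourCluster --generate-html-bundle /tmp --max-runtime 1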

A great resource on esxtop from depping:

http://www.yellow-bricks.com/esxtop/

You can also set up a cron job to measure over a period of time:

kb.vmware.com/kb/1033346
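
For example, a one-off batch capture from the ESXi shell (a sketch; the flags and duration are just an example - this grabs roughly 10 minutes at 10-second intervals, and the output path should point at a datastore with free space):

# -b = batch mode, -a = all counters, -d = delay in seconds, -n = number of samples
esxtop -b -a -d 10 -n 60 > /vmfs/volumes/YourDatastore/esxtop-capture.csv

The resulting CSV can then be opened in Windows perfmon or esxplot to find the offending VM/world.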

Bob

-o- If you found this comment useful, please click the 'Helpful' button and/or select it as 'Answer' if you consider it so; please ask follow-up questions if you have any -o-

timalexanderINV
Enthusiast

So I had to use a combination of SexiGraf and vSAN Observer. I was able to show the "Top N vmdks" in SexiGraf and then collate those with the VM tab in vSAN Observer to pinpoint that the objects were indeed on the host in question. I asked the app owner what the server was doing, and the IO has miraculously stopped... Thanks for the pointers.
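
For anyone who finds this thread later: RVC can also confirm which hosts and disks a given VM's components sit on (a sketch; the VM path below is just an example from my inventory):

# Prints the object layout of the VM, including the host/disk each component lives on
vsan.vm_object_info ~/vms/SuspectVM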
