VMware Cloud Community
adamjg
Hot Shot

Windows disk space monitoring

Hi all, I'm looking to create an alert for disk space on Windows VMs. To put it as simply as possible: how do I generate a symptom for free space on individual drives on a Windows server (e.g. C:, D:, E:)?  Caveats:

- I'm not using the endpoint agent.  This is another discussion, but I've told VMware that as long as it is based on Java, we're not using it.

- vROps used to have a metric for Guest File System Free (GB) but they removed it for unknown reasons.

- Super metrics don't seem to work across all drives. Example: I can create a super metric for the C: drive, but the "evaluate on instanced metric" button in the symptom definition doesn't allow this to work across multiple drives.  We have VMs with 1 drive and VMs with 20 drives, so creating 20 different super metrics and figuring out what servers to apply what metric to is not an option.

Most other monitoring products do this OOTB. For some reason VMware removed this from vROps.  If anyone has an idea of how to do this, please let me know.  Thanks!

8 Replies
daphnissov
Immortal

I am bringing up some test VMs in my lab to see what solution I can come up with for you. Stand by.

daphnissov
Immortal

Ok, here are the results of the tests I performed. The TL;DR version: on a vROps 6.6.1 system upgraded directly from 6.3, using a single Windows 2012 R2 VM, I am seeing the expected behavior, whereby vROps alerts apply on a per-drive-letter basis and not cumulatively or on a globally summed basis. That is to say, if you have multiple volumes each mounted as a separate drive letter, as in your case, the alerts are per drive letter. Because of this, the same alert can be active multiple times at multiple severity levels for the same VM object. I will present a number of screenshots here, so be warned.

First thing here, the symptom definition.

pastedImage_0.png

This is only one of three definitions, but they're identical apart from the threshold level. We're looking at the "immediate" trigger, which is at 90%. I've highlighted the metric, which is very specifically called "Guest File System Usage (%)". The screenshot below shows the metrics available from the metric picker within this definition.

pastedImage_1.png

Note that you don't see that metric there.

Next, I have a VM which contains four drives lettered C, E, F, and G with sizes of 60GB, 20GB, 40GB, and 100GB, respectively.

pastedImage_7.png

All of the drives share the same PvSCSI controller, albeit on different IDs. You can see here that, after a data collection, each drive letter appears independently with its own corresponding set of metrics.

pastedImage_5.png

I then create an 18 GB file on drive E. This is intended to trip the "immediate" symptom definition which should then trip an alarm.

pastedImage_2.png

Once created and a data collection is performed, it does exactly that.

pastedImage_3.png

I've highlighted the metric at the very bottom of this image. The name matches the metric defined in the symptom definition, not any of the "Total"-based metrics.

Next, I want to test the hypothesis that guest file system space is considered cumulatively. I write a 35 GB file to F. The result is shown below.

pastedImage_4.png

After a collection cycle, we see the same alert has fired, but this time at a different symptom level. This is because the alarm's symptoms are combined with an OR operator.

pastedImage_6.png

Above you can see the alert definition name, and two separate symptoms being maintained; a new one for drive F has been added with the threshold breaching 85%.
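
The per-drive behavior in this test can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not how vROps evaluates symptoms internally. The function names, the "warning" label for the 85% level, and the `>=` comparison are assumptions; the third threshold is left out because its value isn't stated here; and the used-space figures for C and G are made up.

```python
# Illustrative sketch only: mirrors the per-drive percentage check from the test.
# Labels other than "immediate" and the >= comparison are assumptions.
THRESHOLDS = [("immediate", 90.0), ("warning", 85.0)]  # highest threshold first


def usage_percent(used_gb, capacity_gb):
    """Return the used space of one volume as a percentage of its capacity."""
    return 100.0 * used_gb / capacity_gb


def triggered_symptoms(drives):
    """Map each drive letter that breaches a threshold to (label, percent).

    `drives` maps a drive letter to a (used_gb, capacity_gb) tuple.
    Each drive is evaluated on its own; nothing is summed across drives.
    """
    active = {}
    for letter, (used_gb, capacity_gb) in drives.items():
        pct = usage_percent(used_gb, capacity_gb)
        for label, threshold in THRESHOLDS:
            if pct >= threshold:
                active[letter] = (label, pct)
                break  # keep only the most severe level for this drive
    return active


# Capacities and the 18 GB / 35 GB files come from the test above;
# the used space on C and G is an assumed placeholder.
drives = {
    "C": (20.0, 60.0),
    "E": (18.0, 20.0),
    "F": (35.0, 40.0),
    "G": (1.0, 100.0),
}

active = triggered_symptoms(drives)
alert_active = bool(active)  # OR across drives: any breaching drive fires the alert
print(active)
```

Summed across all four drives the VM is only about a third full, so a cumulative check would not fire; evaluating each drive separately is what lets E and F trigger at their own levels.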

I then delete the newly-created file from drive F which returns the drive to an essentially "empty" state.

pastedImage_8.png

After a single collection cycle, the alarm updates once again (note the start time value and the updated time value) to remove the active symptom for drive F.

pastedImage_9.png

========================================================

So as I hope I've shown, the alarm and symptom definition should do what you want. Here are two other points, however:

  1. You do not seem to be able to select the metric used in the symptom definition, as it does not appear in the metric picker.
  2. In the case where this definition does not exist on your end, possibly due to an upgrade path where you have not reset out-of-box default content, I can envision a workaround.
    • Create a temporary VM with the max number of drives you would ever encounter in your environment.
    • Give them all drive labels, format, mount, etc. to present to the OS.
    • After a data collection cycle, see if all those mount points show up in vROps.
    • Create a new symptom definition replacing the default one and use your 5 GB threshold or whatever but add each drive as an OR condition.
    • Use the metric picker but select an object rather than the default list. This will populate metrics specific to your test VM allowing you to select all the possible drive letters and the same metric for each one.

Let me know if this is of any help.

daphnissov
Immortal

Does this help? Is this not what you're seeing in your environment?

adamjg
Hot Shot

Sorry, I was out of town with no internet all last week. I didn't miss it as much as you might think!  Thank you for the extremely well detailed answer.  The % free part has been working for some time. What I'm looking for is exactly the same thing you did in the previous post, but on GB Free instead of %. Can you replicate the same thing with a metric of say 5GB free space?

daphnissov
Immortal

Yes, I have a solution for that but won't be able to write it up for maybe several days due to VMworld.

daphnissov
Immortal

Ok, I scratched out some time to illustrate this process. Basically, here's the high-level:

  1. Create a super metric (or import the one I'll provide you here)
  2. Duplicate it N times to cover all possible volumes (drives) you might ever have on Windows.
  3. Turn all of them on in your policy.
  4. Create a test VM with all possible drive letters mounted (can be done at any stage; doesn't have to be here but must be before step 5).
  5. Create a symptom definition using these new super metrics.
  6. Create an alert using the new symptom definition.

Here's the super metric configuration. Either re-create this manually or use the JSON I provide. Click individual images in the post to see them full sized if they appear cut off.

pastedImage_1.png

We're basically saying to take the capacity metric and subtract from it the usage metric on a per-volume basis. Call this "C GB remaining" or whatever you like. You'll need to duplicate it as many times as needed to cover all the volumes you might have mounted. Clone and edit the definition. Don't try to use the object and metric picker to find a VM; just edit the text. After saving them, set the object type to apply to VMs.

pastedImage_2.png
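
As a rough sketch of what each cloned super metric computes, here is the same capacity-minus-usage arithmetic in Python. This is not vROps super metric syntax, and the function names are hypothetical. The "C GB remaining" naming follows the suggestion above, and covering letters C through Z is an assumption about which drive letters you would ever mount.

```python
import string


def remaining_gb(capacity_gb, used_gb):
    """Free space on one volume: capacity minus usage, both in GB."""
    return capacity_gb - used_gb


def super_metric_names(letters=string.ascii_uppercase[2:]):
    """Build one name per drive letter, e.g. 'C GB remaining'.

    The default covers C through Z (an assumption); pass your own letters
    if your environment uses a smaller or different set.
    """
    return [f"{letter} GB remaining" for letter in letters]


names = super_metric_names()
print(len(names), names[0], names[-1])

# Example: a 20 GB volume with 18 GB used has 2 GB remaining.
print(remaining_gb(20.0, 18.0))
```

One clone per letter is needed because each definition is tied to a single volume instance, which is why the list of names grows with the number of drive letters you want to cover.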

Next, go into your policy and activate them.

pastedImage_3.png

Only change the ones where Object Type is a VM. You can opt to use them for KPIs and DTs if you wish. Make sure your policy is applied to your licensing group or whatever subset of your infra you want covered.

pastedImage_4.png

Go back to the super metrics and ensure they are tied to a policy.

pastedImage_5.png

Let a couple of collection cycles pass and check a VM to make sure the new super metric is collecting and is accurate.

pastedImage_6.png

Now create a new symptom based on them. This is the part that kind of sucks. In order to do this, you'll have to create a VM that has all drives possible. I'd suggest just creating a throw-away Windows VM like you see in the screenshot above, create a volume for every letter and carve out like a 1 GB drive or something. It doesn't matter, you just need the volume letters to be present so vROps can collect the metrics necessary. This is the only way you can expose those in the symptom definition as I'll show below.

When you go and create your symptom definition, you won't see those super metrics.

pastedImage_8.png

Click that button I've highlighted and select your test VM that has all those drives. Now you should see those super metrics.

pastedImage_9.png

Drag ALL of your super metrics onto the designer until they're stacked up. Fill out the forms to match the configuration you need. This is what I did.

pastedImage_10.png

Name it and save it. When you do, it creates not one combined definition but an individual symptom definition for each super metric. This saves time, because you don't have to save and re-enter the symptom definition designer for every drive.

Now, go create an alert definition comprised of those.

pastedImage_11.png

Drag them into the same symptom box so they look like that. Don't add another condition. Make sure the match is Any. Select your base object type, impact, etc. Save it. Go back to the policy and make sure the alert is active. It should be because it is in the default policy and that's the parent policy. Double-check just to be sure. Use that test VM and fill up a drive. Check the alerts after enough collection cycles have elapsed and you should see your alert.

pastedImage_12.png

If you expand the non-triggered symptoms, this should prove that other drives are being watched independently but the alert doesn't have those as active.
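
The "Any" match across the per-drive symptoms can be pictured with the short Python sketch below. It is only a model of the logic, not vROps code. The function name and the free-space figures are made up, the 5 GB threshold is the example value mentioned earlier in the thread, and the strict `<` comparison is an assumption.

```python
THRESHOLD_GB = 5.0  # example threshold from the thread; adjust to taste


def evaluate_alert(remaining_by_drive, threshold_gb=THRESHOLD_GB):
    """Model an alert whose symptom set uses an 'Any' match.

    `remaining_by_drive` maps a drive letter to its remaining GB (the value
    each per-drive super metric would report). The alert is active if at
    least one drive is below the threshold.
    """
    triggered = sorted(
        letter for letter, gb in remaining_by_drive.items() if gb < threshold_gb
    )
    not_triggered = sorted(set(remaining_by_drive) - set(triggered))
    return {
        "active": bool(triggered),
        "triggered": triggered,
        "not_triggered": not_triggered,
    }


# Hypothetical readings for a four-drive VM with drive E nearly full.
result = evaluate_alert({"C": 35.0, "E": 2.0, "F": 38.5, "G": 99.0})
print(result)
```

The `not_triggered` list corresponds to the non-triggered symptoms you see when expanding the alert: those drives are still being watched, they just aren't below the threshold.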

Now you have an alert that works on remaining capacity rather than remaining percentage of drive space. Also keep in mind that super metrics don't apply retroactively: until you create a super metric and vROps collects on it, its data doesn't exist in the system.

adamjg
Hot Shot

I wanted to reply to say thank you.  This obviously took a while to figure out and probably just as much time to document. I do appreciate you taking the time to do this. Unfortunately, I was afraid that this was the only way to do it. I don't get why something that should be as simple as one metric has to be so complicated on VMware's side. It's not like reporting on free disk space is a new thing in the Windows world lol. I'm working with my SE to figure out why this metric was dropped, and to hopefully get something added back. Also, unfortunately for us, we have some VMs with 20 drives on them. I'm wondering if getting this set up is even worth it when we own other products (SCOM and SolarWinds) that can do this in a matter of clicks.

Again, thanks for the reply.  I'm going to leave this thread as unanswered so I can point my SE to it and he can run it up the flagpole, but this is the best answer I've seen to a post in a long time.  I do really appreciate it.

daphnissov
Immortal

I agree that these metrics are so simple they should exist natively. The method I outlined is obviously cumbersome and less than ideal, but it does get the job done, and you only have to endure the setup once.
