I have a server that's alerting for "One or more virtual machine guest file systems are running CRITICALLY low on disk space". This condition was present about 12 hours ago for a brief time, but the disk has since been cleared. However, every time I cancel this alert, it comes back after the next 5-minute collection.
Any thoughts?
Check the symptom definitions of the specific alert which is being triggered. It may still be in breach if the symptom is based on a static usage amount, or on a percentage that the disk has not dropped back below.
Also check and make sure it isn't a different partition on the same VM which is causing the alert.
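To illustrate why the type of threshold matters, here is a minimal Python sketch. The function name, parameters and threshold values are made up for illustration; this is not how vROps evaluates symptoms internally, just the arithmetic behind a percentage check versus a static free-space check.

```python
def breached_thresholds(total_gb, used_gb, max_used_pct=None, min_free_gb=None):
    """Return which hypothetical thresholds a partition currently breaches.

    max_used_pct: breach when used space is at or above this percentage.
    min_free_gb:  breach when free space is at or below this static amount.
    """
    hits = []
    used_pct = 100.0 * used_gb / total_gb
    free_gb = total_gb - used_gb
    if max_used_pct is not None and used_pct >= max_used_pct:
        hits.append("percent")
    if min_free_gb is not None and free_gb <= min_free_gb:
        hits.append("static")
    return hits


# A 2000 GB partition with 1900 GB used is at 95%, which is under a 98%
# threshold, but it still breaches a static "at most 150 GB free" rule.
print(breached_thresholds(2000, 1900, max_used_pct=98, min_free_gb=150))
```

So a disk can look fine against the percentage and still trip a static rule, which is why it is worth reading the exact symptom definition and checking every partition on the VM.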
Hmmmm. OK. And the disk space you added on the OS was rescanned, and the partitions from the OS itself show adequate space, correct? What are the partition sizes/free space? Is VMware Tools running?
VMware Tools is running, yes.
This one has me stumped, but I'm looking to see if I can mimic this behavior. The VM has tools running, it appears to be collecting, and vROps seems aware of the disk space addition since it is reflected in the metrics. It appears that the symptom and message event definitions are default, so I'm really not sure. I'll keep digging around.
Thank you, much appreciated. Could this have to do with how far back in time this checks?
What policy is applied to this VM? Has that policy (even if it is Default Policy) been altered? You can go into the policy itself under Override Alert/Symptom Definitions and make sure that there is no override setting configured on the policy applied to the VM in question.
This is a copy of the default policy, with no override settings for any VMs.
OK, it was worth a quick check. I'm still looking at why this might be happening. I was able to add files to a VM to trigger this alert, then delete the files, and the alert cleared.
Really... I'm racking my brain here. Very odd.
Just noticed this, but in your image, why is one of your alerts marked as "do not inherit"?
Also, what is the definition of the one starting with (DCs)?
That's marked as do not inherit because by default it's set to alert on 'Any' of:
Guest file system space usage at immediate level
Guest file system space usage at warning level
Guest file system space usage at critical level
So we cloned that definition, and set it to alert on 'Only':
Guest file system space usage at critical level
The (DCs) definition is a modification for only our Domain Controller. It's applied to only our 'DC' custom group.
We are seeing multiple cases of the same problem too - vROps version 6.0.1 and also 6.1 affected. Metrics are showing disk space is clear and down below 25% usage, but critical alert > 98% still flagging.
We have manually cancelled the alert, restarted the vSphere adapters, and stopped and restarted collection on the VM (via Environment Overview), with no success on this occasion. (Sometimes, however, restarting the collection has cleared the alerts OK on other VMs.)
We have also stopped and restarted VMTools.
Any suggestions would be greatly appreciated!
Paul
I may have stumbled across something....
I placed the VMs into Maintenance using Environment Overview (Inventory Explorer) and the alerts cleared as expected. (Previously, stopping collection either by the solution/adapter or individual VM object did not clear the alert which confused me)
I then ended the Maintenance and the alerts have not resurfaced. Metrics all appear to be collecting OK, so I'm now hopeful that this has fixed the problem.
Let me know if this helps you as well!
Paul
Paul, I stumbled onto something too. It seems an alert for the server that was an issue had previously been 'assigned ownership'. Once I released that ownership and deleted the alert, it hasn't resurfaced.
Ah ok - that was not the case for us as we take ownership of alerts within our team to show that they are being investigated and to avoid engineers duplicating effort. (Annoying that ownership cannot be transferred unless original owner releases ownership themselves)
To us, it looks like vROps "forgets" to clear the alert (or thinks that it already has) and no number of collections or metrics seems to force the system to recheck against old alerts. We believe that placing into maintenance forces all alerts to clear (regardless of status) and therefore "flushes" out the system.
We've fixed this alerting problem for at least 4 VMs today successfully using this method.
Awesome Paul, nice job. I'm trying that as well.
If you are running a Windows NT-based operating system, you can configure Perfmon inside the VM and see whether any genuine high disk usage is reported.
It'll help you rule that possibility out!
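If Perfmon isn't handy, a quick cross-check from inside the guest can also be done with a few lines of Python. This is only a sketch: it assumes Python is installed in the VM, the helper names are made up, and the 98% figure is just the critical level mentioned earlier in this thread. It uses the standard library's shutil.disk_usage, which works on both Windows and Linux.

```python
import shutil


def usage_percent(path):
    """Percentage of the file system containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # avoid dividing by zero on pseudo file systems
    return 100.0 * usage.used / usage.total


def over_threshold(paths, threshold_pct=98.0):
    """Return {path: percent used} for every path at or above the threshold."""
    result = {}
    for path in paths:
        pct = usage_percent(path)
        if pct >= threshold_pct:
            result[path] = round(pct, 1)
    return result


if __name__ == "__main__":
    # Replace "." with the drives or mount points to check, e.g. "C:\\" or "/var".
    for p in ["."]:
        print(p, round(usage_percent(p), 1), "% used")
```

If nothing comes back at or above the threshold here while vROps still shows the critical alert, the guest itself is fine and the problem is on the alerting side.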