VMware Cloud Community
FishadrTMS
Contributor

vROps Incorrectly Reporting

I have been deploying vROps since February this year, but since late August I have noticed issues with many new deployments. They mainly relate to resource reporting for VMs and hosts/clusters. Some clusters report all information correctly, while others have all hosts missing or report on only certain hosts and VMs.

I have attached some examples of the issues that I am seeing regularly.

The following example (server names removed) shows that the CPU Recommendation and reclaimable information is missing:

Image1.png

On other clusters or for other customers it may only show one or two hosts in the cluster and misses the others completely.

Another problem I have seen recently is that it massively overestimates the CPU resources required for hosts and VMs. In the following example a small cluster has 56 cores and vROps recommends an upgrade to 149 cores:

Image2.png

However, if you look at the stress zone graph above, the requirements are well below anything that is being recommended. I am seeing this on lots of VMs and hosts, where the CPU recommendations are far above what the graphs show. When we look at the resource utilisation in depth, it does not match the recommendations.

The following is another example:

Image3.png

Previous assessments would have shown a recommendation of 5 or 6 CPUs. Suddenly we are seeing double that, up to 10, but the graphs and data don't add up. I am seeing this problem on most recent deployments.

This also affects host rightsizing and produces badly wrong information:

Image4.png

Suddenly we are going from saving 30% capex to increasing hardware requirements!

Again, this has only started since September, and the last vROps version I checked was running 6.0.2.27777062 Build 2777062.

I know there is a new version released and have been advised that this "SHOULD" fix this problem but this was all unofficial advice.

Questions:

  • Has anyone else experienced this problem and which releases are producing the problem?
  • Can anyone confirm that the latest release fixes these problems and is stable?
  • Once the update is applied, how long do we need to wait to receive consistent data? Will we need to wait 30 days until the old bad data has been replaced with accurate data, or does the update re-analyse the data and fix the reports? Or do we blow vROps away completely and do a fresh installation?

I believe that the product is awesome and we have just hit a bit of a hiccup. I just have a lot of customers to whom I suddenly cannot explain what is going on!

Thoughts and feedback on the above would be appreciated.

greco827
Expert

I am also seeing missing information here and there, mostly for certain metrics.  I am inquiring as to why this is, but you are not alone.

On the stress zone, you may be interpreting the data incorrectly.  Using your example, you have a VM which has 10 GHz worth of procs (4 x 2.5).  You have your stress zone set as >70%, and the stress is 134%.  The pink area of the chart is the stress zone.  Anything within the pink is a breach of the stress threshold you have set in the policy associated with this VM.  On top of breaching the stress zone, this VM has actually demanded more CPU than is even allocated to it, peaking at 11.2 GHz.
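The numbers in that example can be checked with some quick arithmetic. This is only a rough sketch: the function names are made up for illustration, and it does not reproduce how vROps itself arrives at the 134% stress figure. It only shows where the 70% line sits and how far the peak overshoots the allocation.

```python
def allocated_ghz(vcpus: int, ghz_per_vcpu: float) -> float:
    """Total CPU capacity allocated to the VM, in GHz."""
    return vcpus * ghz_per_vcpu


def stress_line_ghz(capacity_ghz: float, threshold_pct: float) -> float:
    """Demand level (GHz) above which the stress zone begins."""
    return capacity_ghz * threshold_pct / 100


capacity = allocated_ghz(4, 2.5)              # 4 vCPUs x 2.5 GHz
stress_line = stress_line_ghz(capacity, 70)   # 70% stress threshold
peak_demand = 11.2                            # peak from the chart, in GHz
peak_pct = peak_demand / capacity * 100       # peak as % of allocation

print(f"capacity:    {capacity:.1f} GHz")
print(f"stress line: {stress_line:.1f} GHz")
print(f"peak demand: {peak_demand:.1f} GHz ({peak_pct:.0f}% of allocation)")
```

Anything above the stress line counts as a breach, and here the peak is above the full allocation, not just the 70% line.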

Since your average demand is only 17.31%, you may want to take a look at your policy.  The stress threshold for a Virtual Machine is obviously set to 70%, but is the sliding analysis set to Any or Entire Range?  If Any, what peak period did you set?  If Entire Range, what amount of time did you set in the time field of the virtual machine policy?

StressZOne.jpg

How the policies are set is EVERYTHING.  vROps tells you what you tell it to tell you.  Change the threshold to 90%, and the recommendation will change.  Change the peak period, and it should change as well.

If you find this or any other answer useful please mark the answer as correct or helpful https://communities.vmware.com/people/greco827/blog
FishadrTMS
Contributor

Thanks for the reply. All the settings are out of the box, no changes, standard installation. These shots have come from the base Analysis tab and then the Stress zone tab. In the past, the peaks have always reflected average demand and the spikes reflect the recommendations. I have selected no ranges, base console information is from the last 30 days.

However, I have noticed over the last 6 weeks that the recommendations don't match the spikes. Something has changed in the analysis. Previously the spikes in the graph would reflect the GHz recommended. In this example they don't. The recommended value (the blue line) is 23.4 GHz but the spike demand is 11.2 GHz. With 4 CPUs configured, this would point to one more vCPU being required to accommodate the spike and stress, 2 at a push. Definitely not 10!
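To put numbers on that, here is a rough sketch of the conversion from GHz to whole vCPUs. It assumes 2.5 GHz per vCPU (taken from the 4 x 2.5 figure in the reply above), and the helper name is just for illustration. It is not how vROps calculates anything.

```python
import math

GHZ_PER_VCPU = 2.5  # assumed clock speed per vCPU, from the 4 x 2.5 GHz example


def vcpus_for(ghz: float) -> int:
    """Smallest whole number of vCPUs that covers the given GHz demand."""
    return math.ceil(ghz / GHZ_PER_VCPU)


configured_vcpus = 4
spike_vcpus = vcpus_for(11.2)        # vCPUs needed to cover the observed spike
recommended_vcpus = vcpus_for(23.4)  # vCPUs implied by the blue recommended line

print(f"spike needs {spike_vcpus} vCPUs "
      f"({spike_vcpus - configured_vcpus} more than configured)")
print(f"recommendation implies {recommended_vcpus} vCPUs")
```

The spike alone works out to 5 vCPUs, whereas 23.4 GHz works out to 10, which is the gap I cannot explain.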

Unless for some reason the system is reporting as if the settings have been changed. I will check this, but I have seen the same thing every week since the end of September, which is too much of a coincidence. It is also on every single deployment I have checked since then: same reports, same analysis, and it doesn't add up. I am also getting customers pointing these anomalies out to me.

The Host Right Sizing information is skewed massively when I see these anomalies. It really does seem as if some settings and thresholds have been changed, but they haven't.

I was advised that there are known issues with this version and that these should have been fixed in the latest release. I just haven't had the chance to verify the settings or to identify how long it will take to fix the anomalies. When I mentioned this to VMware, they were aware of "problems" but wouldn't go into much detail about what they know. They have advised that they will assist me personally, but real-world experience and support knowledge can usually be more beneficial.

greco827
Expert

"All the settings are out of the box" ... This is really the core of the problem.  vROps is not intended to be out of the box ready to go.  This is a common mistake that leads to people tossing vROps to the side stating it reports incorrectly.

To your specific issue: it is running at 131% of its 10 GHz.  So it needs about 13 GHz to be at 100%.  But the threshold is 70%.  So you need closer to 20 GHz to remain below 70% on spikes and peaks.  That's just using basic math, not the advanced algorithms that vROps uses.  I would say that it needs at least 8 vCPUs, based on the default policy that it is applying.
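That basic math can be written out as a short sketch. The helper names are hypothetical, the 2.5 GHz per vCPU comes from the 4 x 2.5 figure earlier in the thread, and this is back-of-the-envelope headroom arithmetic, not the algorithm vROps actually runs.

```python
import math

GHZ_PER_VCPU = 2.5      # assumed, from the 4 x 2.5 GHz VM in this thread
ALLOCATED_GHZ = 10.0    # current allocation
DEMAND_RATIO = 1.31     # running at 131% of allocation


def required_ghz(demand_ghz: float, threshold: float) -> float:
    """Capacity needed so that demand_ghz stays at or below the threshold."""
    return demand_ghz / threshold


def required_vcpus(demand_ghz: float, threshold: float) -> int:
    """Whole vCPUs needed to provide that capacity."""
    return math.ceil(required_ghz(demand_ghz, threshold) / GHZ_PER_VCPU)


demand = ALLOCATED_GHZ * DEMAND_RATIO  # roughly 13 GHz

for threshold in (0.70, 0.80, 0.90):
    print(f"threshold {threshold:.0%}: "
          f"{required_ghz(demand, threshold):.1f} GHz, "
          f"{required_vcpus(demand, threshold)} vCPUs")
```

At a 70% threshold this comes out just under 19 GHz, or 8 vCPUs. Raising the threshold to 90% drops it to 6, which is why the policy setting matters so much.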

My first recommendation would be to create a new policy which meets the criteria you want to apply.  I believe the default policy applies the 70% threshold across Any 60 minute period.  Perhaps you want to change it to the Entire Range, or to 80%.  Or maybe you don't want spikes and peaks considered, or maybe you want a 120 minute range.
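To see why Any versus Entire Range makes such a difference, here is a minimal sketch that compares the busiest window with the average over the whole range. The sample data and function names are invented for illustration, and I am assuming 5 minute samples, so 12 samples make up a 60 minute window. It is not the vROps implementation, just the general idea of a sliding peak versus an overall average.

```python
def entire_range_average(samples: list[float]) -> float:
    """Average demand across every sample in the range."""
    return sum(samples) / len(samples)


def peak_window_average(samples: list[float], window: int) -> float:
    """Highest average demand found in any run of `window` consecutive samples."""
    if window <= 0 or window > len(samples):
        raise ValueError("window must be between 1 and len(samples)")
    current = sum(samples[:window])
    best = current
    # Slide the window one sample at a time, updating the running sum.
    for i in range(window, len(samples)):
        current += samples[i] - samples[i - window]
        best = max(best, current)
    return best / window


# Hypothetical demand in GHz: mostly idle, with one busy hour in the middle.
demand = [1.5] * 60 + [11.0] * 12 + [1.5] * 60

overall = entire_range_average(demand)
busiest_hour = peak_window_average(demand, window=12)

print(f"entire range average: {overall:.2f} GHz")
print(f"busiest 60 minutes:   {busiest_hour:.2f} GHz")
```

A VM that looks nearly idle on average can still have one hour that sits far above the stress line, and a policy keyed to the busiest window will size for that hour.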

I would clone the default policy, play with it a bit and apply it to this VM as a test case (just create a group with this single VM being the lone member, and associate the policy to that group).  It's the easiest way to start understanding how certain settings impact vROps recommendations.
