VMware Cloud Community
ServiceOptimi
Hot Shot

BP for DT calcs

I have been having DT calculation issues lately, with CPU at 100%. I have an SR open and am working with support. In the meantime... I see my distribution tier (admin/support/status tab) shows the MQ resource queue and broker as my top root-cause problems. When I have higher enqueue, is that always bad? It seems to coincide with the broker's dequeue, and that should be normal activity, right? The pic shows the last 90 days and the changes to enqueue and dequeue.

3 Replies
IamTHEvilONE
Immortal

Please be aware that this explains only one of many possible situations that could be causing the symptoms you are seeing.

Enqueue and Dequeue are metrics for data going into and out of a queue.  As long as they track each other, with Dequeue time-shifted slightly later than Enqueue, then things are still moving.

The one metric that is easier to understand is DataQueue > Queue Size.  This is the total number of messages currently in the queue that have yet to be processed by the analytics engine.
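
To make that relationship concrete, here is a small, purely illustrative Python sketch (the sample counter values are invented, not taken from any real vCOps instance) showing how a running backlog builds whenever enqueue outpaces dequeue, which is exactly what Queue Size reports:

# Illustrative only: the (timestamp, enqueued, dequeued) samples below are
# invented; in practice they would come from the admin/support status page.
samples = [
    (0, 1000,  980),
    (1, 1050, 1040),
    (2, 1100, 1095),
    (3,  900,  910),
]

backlog = 0
for ts, enq, deq in samples:
    backlog += enq - deq   # messages added minus messages drained this interval
    print(f"t={ts}: enqueued={enq} dequeued={deq} running backlog={backlog}")

# A healthy system keeps the running backlog small and roughly flat; a backlog
# that keeps growing interval after interval means the analytics engine is
# falling behind.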

Assuming you are on 5.0.3, the following is a generalization of what will occur with DT processing.

What can happen is that DT runs and starts to consume CPU resources, and it will consume all the resources it can get.  In a 'large' scenario there are 6 DT threads by default, which will consume almost all of the CPU resources available in the VM.

Those CPU resources are then not available to process information from the DataQueue, so some messages will be left waiting to be consumed.

Once Queue Size hits a certain value, DT will throttle itself to ensure that we don't build a huge backlog.  So we are trading DT time-to-completion for the sake of staying up to date on incoming data.
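
The following minimal Python sketch shows the shape of that throttling behavior only; it is not the actual DT implementation, and the threshold and sleep values are invented for illustration:

import time

QUEUE_THROTTLE_THRESHOLD = 50_000   # hypothetical backlog limit, in messages
THROTTLE_SLEEP_SECONDS = 5          # hypothetical pause between DT work units

def run_dt(batches, current_queue_size):
    """Process DT work, yielding CPU whenever the data queue backs up."""
    for batch in batches:
        while current_queue_size() > QUEUE_THROTTLE_THRESHOLD:
            # Back off so incoming metric data can be consumed first; DT
            # finishes later, but the DataQueue stays current.
            time.sleep(THROTTLE_SLEEP_SECONDS)
        process(batch)

def process(batch):
    pass   # placeholder for the actual DT computation on one batch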

What is hard to tell:

1. What is the limiting factor?  Is it CPU resources, lack of memory, disk I/O availability, etc.?  Support can perform this evaluation (and see the quick triage sketch after this list).

2. Once we identify the limiting factor, we need to address it (add vCPUs, memory, disk, threads, etc.).

3. Then trigger DT again (this can be done from the Custom UI) and monitor.
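
As a rough first pass at item 1, something like the following Python snippet can be run on the Analytics VM while DT is executing.  It assumes the third-party psutil package is available there, which may not be the case, and it is only a triage aid, not a substitute for the evaluation Support performs:

import time
import psutil   # third-party package; assumed available for this sketch

def snapshot(interval=5):
    cpu = psutil.cpu_percent(interval=interval)   # % CPU averaged over the interval
    mem = psutil.virtual_memory()                 # RAM usage
    io_before = psutil.disk_io_counters()
    time.sleep(interval)
    io_after = psutil.disk_io_counters()
    read_mb = (io_after.read_bytes - io_before.read_bytes) / 1e6
    write_mb = (io_after.write_bytes - io_before.write_bytes) / 1e6
    print(f"CPU {cpu:.0f}%  RAM {mem.percent:.0f}%  "
          f"disk read {read_mb:.1f} MB / write {write_mb:.1f} MB over {interval}s")

if __name__ == "__main__":
    for _ in range(6):   # roughly one minute of samples
        snapshot()

If CPU sits pinned while memory and disk stay modest, the answer is usually more vCPUs; sustained heavy disk traffic with idle CPU points toward an I/O limit instead.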

It's a progressive troubleshooting process.

Things to remember:

1. We really only need one DT run every other day (every 48 hours), since DT processing predicts more than 48 hours into the future.  If we run once a day, we can adjust for new data; we do it once a day to make sure we have an optimal data set for alerting, anomalies, etc.

2. We may not be limited at the application layer; it might be the VM/OS layer providing resources, or even the ESX host being unable to provide the raw compute resources the VM itself requires.  This is part of the diagnosis process.

3. We need to know the number of metrics actively being collected.  This can be obtained from an Audit report in the Custom UI; look at the line for the number of metrics collecting.  This helps approximate how many resources are required (a back-of-the-envelope example follows this list).

4. This is a progressive troubleshooting effort.  We have only tested up to 2.5 million metrics collecting in a single instance of the vApp.  Have people gone over this?  Yes.  Will it work?  Probably, but I would try to stay under the 2.5 million mark for the sake of supportability.
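
To give item 3 some scale, here is a back-of-the-envelope calculation in Python.  The 5-minute collection interval is an assumption made for illustration; use the interval actually configured in your environment:

metrics_collecting = 2_500_000    # the tested ceiling mentioned in item 4
collection_interval_s = 300       # assumed 5-minute collection cycle

observations_per_second = metrics_collecting / collection_interval_s
print(f"~{observations_per_second:,.0f} metric observations per second")
# ~8,333 observations/second must be enqueued, dequeued, and analyzed just to
# keep up, before any DT run adds its own load on top of that steady state.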

ServiceOptimi
Hot Shot

Thanks for the excellent answer, and I will continue to work through the support team. I have had a ticket open for longer than I want to write down... partly the holiday... but things are progressing slowly. What concerns me is that VMware support doesn't have an approach to address this. I cannot be the only person with 2.5 million metrics on the 5.0.2 platform running ESX 4.

IamTHEvilONE
Immortal

I have seen more than a handful above 2.5 million metrics.  It's a question of which bottleneck you hit next.

The problem is that virtual hardware version 7 is limiting: 8 vCPUs is simply not enough power to drive more than 2.5 million metrics through the system.

This can be alleviated by going to vHardware 8 to get to 10/12 vCPUs (license permitting), then adding memory and making configuration changes to handle it.  Just know that the Analytics VM eats resources for breakfast and spits out IOPS for lunch.

Can you PM me the ticket number?  Maybe I can ping the tech to get a pulse on it.
