Matrix_1970
Enthusiast

DT Calculation too slow

Hi all,

every day, the DT Calculation Info reports that the job is still running (see the screenshots). It starts at 1:00 AM but takes many hours to finish, which makes vCOPS run very slowly...

Any idea? THX

Matrix

13 Replies
gabinun
Enthusiast

Try rebooting the vCops vApp

GN
Matrix_1970
Enthusiast

A reboot clears the issue only because it stops the DT calculation; at 1:00 AM the job starts again, so it's not a solution...

Thanks anyway

Matrix

jddias
VMware Employee

Hi Matrix,

  Can you provide more detail on your configuration?  Version? How many metrics are being collected?  What is the sizing of your vApp CPU and Mem?  What type of storage?

Visit my blog for vCloud Management tips and tricks - http://www.storagegumbo.com
Matrix_1970
Enthusiast

Hello jddias,

my configuration is this:

ver. 5.7.2

Analytics: 16vCPU/32GB RAM

UI: 8vCPU/16GB RAM

Full profile (metrics collected)

Metrics collected: 4725482

Storage EMC (6 disks/500GB)

Is that all you need?

Thank you

Matrix

showard1
Enthusiast

Hi

It's likely related to the storage.  What kind of disks are those 6, and what kind of array is it?

Thanks

Sean

Matrix_1970
Enthusiast

Hi,

they are VNX. But I think the problem lies elsewhere because, for example, Disk I/O (in the vSphere view of vCOPS) shows a value of only 1%. CPU and RAM usage on the Analytics VM is 80/81%.

I think it's important to specify that this vCOPS appliance controls 20 vCenters.

Thank you

Matrix

Jahnin
VMware Employee

Hi Matrix,

You can check if any of the java instances are having issues.

Look at the following VMware KB for more information, http://kb.vmware.com/kb/2032539

Also, vCops 5.7 should support about 5mil metrics without any issues. Does esxtop show normal cpu/memory/disk stats for the vCops vApp?

Thanks,
Jahnin

jddias
VMware Employee

Matrix_1970 wrote:

    • Hello jddias,

      my configuration is this:

      ver. 5.7.2

      Analytics: 16vCPU/32GB RAM

      UI: 8vCPU/16GB RAM

      Full profile (metrics collected)

      Metrics collected: 4725482

      Storage EMC (6 disks/500GB)

      It's all?

      Thank you

      Matrix

You are undersized.  Check the sizing guidelines: for Full Profile and 5 million metrics you are well short on both CPU and RAM.

The other option is to switch to the Balanced Profile.

Full Profile

  • Maximum number of objects: 12K, 5 million metrics
  • Memory
    • Analytics VM: 63GB
    • UI VM: 26GB
  • vCPU
    • Analytics VM: 24 vCPU
    • UI VM: 16 vCPU
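
For illustration, the gap between the configuration posted earlier in this thread and the Full Profile figures listed here can be worked out with a short Python sketch. It uses only the numbers quoted in this thread; `VmSize` and `shortfall` are hypothetical names for this example, not part of vC Ops.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmSize:
    vcpu: int
    ram_gb: int


# Figures quoted in this thread (current deployment vs. Full Profile guideline).
CURRENT = {"Analytics": VmSize(16, 32), "UI": VmSize(8, 16)}
FULL_PROFILE = {"Analytics": VmSize(24, 63), "UI": VmSize(16, 26)}


def shortfall(current, recommended):
    """Return, per VM, the vCPU and RAM still missing (0 if the target is met)."""
    return {
        name: VmSize(
            max(0, recommended[name].vcpu - current[name].vcpu),
            max(0, recommended[name].ram_gb - current[name].ram_gb),
        )
        for name in recommended
    }


for name, gap in shortfall(CURRENT, FULL_PROFILE).items():
    print(f"{name} VM: add {gap.vcpu} vCPU and {gap.ram_gb} GB RAM")
```

With these inputs, both VMs come out short on vCPU and on RAM.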
showard1
Enthusiast

Unless those drives are SSD, or you're using a lot of FAST Cache or something, you're likely undersized on IOPS too.  If those 6 drives are 15k and in RAID-10, the maximum IOPS they can support is something like 600.  If they are SATA disks in RAID-5, it's about 200.  The sizing recommendation for the analytics VM is 1500 IOPS minimum.
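
A rough way to reproduce this kind of estimate is the usual write-penalty rule of thumb. The sketch below is only an approximation: the per-disk IOPS values are assumed typical figures (not VNX specifications), the function name is hypothetical, and the result depends heavily on the read/write mix, so it will not match the numbers above exactly.

```python
# Rule-of-thumb IOPS per spindle. These are assumptions, not vendor figures.
DISK_IOPS = {"15k": 180, "10k": 140, "7.2k_sata": 80}

# Back-end I/Os generated by one front-end write (standard RAID write penalties).
WRITE_PENALTY = {"raid10": 2, "raid5": 4, "raid6": 6}


def usable_iops(disks, disk_type, raid, write_fraction):
    """Estimate front-end IOPS for a RAID group at a given write fraction (0..1)."""
    if not 0.0 <= write_fraction <= 1.0:
        raise ValueError("write_fraction must be between 0 and 1")
    raw = disks * DISK_IOPS[disk_type]
    read_fraction = 1.0 - write_fraction
    return raw / (read_fraction + write_fraction * WRITE_PENALTY[raid])


# Worst case (100% writes) for the two layouts mentioned in this post.
print(usable_iops(6, "15k", "raid10", 1.0))
print(usable_iops(6, "7.2k_sata", "raid5", 1.0))
```

Under these assumed per-disk figures, even a read-only workload on six 15k spindles stays below the 1500 IOPS minimum.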

Don't look at the throughput counters, look at the raw latency metrics for the analytics VM.

gradinka
VMware Employee

One more thing: did this slowdown occur overnight, or did it get gradually slower, or...?

Matrix_1970
Enthusiast

Hi,

I don't know... Every morning I see that this calculation is still in progress. At 10:00 AM, for example, it's at 25/27%...

The job starts every night at 1:00 AM.
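
As a rough illustration of what these numbers imply, the Python sketch below extrapolates the total run time, assuming (and this is only an assumption) that DT progress is roughly linear. The date is arbitrary, 26% is taken as the midpoint of the figures above, and `projected_finish` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def projected_finish(start, now, percent_done):
    """Linearly extrapolate the finish time and total duration of a running job."""
    if not 0 < percent_done <= 100:
        raise ValueError("percent_done must be in (0, 100]")
    elapsed = now - start
    total = elapsed * (100 / percent_done)
    return start + total, total


start = datetime(2014, 1, 1, 1, 0)   # job starts at 1:00 AM (date is arbitrary)
now = datetime(2014, 1, 1, 10, 0)    # progress observed at 10:00 AM
finish, total = projected_finish(start, now, 26)

print(f"Projected total run time: {total.total_seconds() / 3600:.1f} hours")
print(f"Overlaps the next 1:00 AM run: {total > timedelta(hours=24)}")
```

Under the linear assumption, 9 hours for about 26% works out to roughly 34.6 hours in total, which is longer than the 24 hours between scheduled runs.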

Thanks

Matrix

mark_j
Virtuoso

If you look beyond CPU and Mem, you're likely to hit a disk IO limitation as your next bottleneck. At your current metric collection levels, assuming the default 5-minute collection interval is still in use, you need about 14100 IOPS between the two VMs. This is a lot of IO.
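
To put that figure in perspective, here is a small Python sketch. The ratio of 3 IOPS per 1000 metrics is an assumption back-derived from the number in this post (14100 IOPS for roughly 4.7 million metrics); it is not an official sizing formula, and the function names are hypothetical. The 600 IOPS input is the upper estimate given earlier in the thread for six 15k disks in RAID-10.

```python
# Assumption: ratio back-derived from this post (14100 IOPS for ~4.7 million
# metrics at the default 5-minute interval). Not an official sizing formula.
IOPS_PER_1000_METRICS = 3.0


def required_iops(metric_count):
    """Estimate the combined IOPS needed by both VMs for a given metric count."""
    return metric_count / 1000 * IOPS_PER_1000_METRICS


def coverage(available_iops, metric_count):
    """Fraction of the estimated requirement that the storage can deliver."""
    return available_iops / required_iops(metric_count)


print(required_iops(4_700_000))
print(f"{coverage(600, 4_700_000):.1%} of the estimated requirement")
```

On these assumptions, the storage described in this thread covers only a few percent of the estimated requirement.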

Given the size of your deployment (very large), you can improve performance as follows:

1. Increase your CPU, Mem, and disk IOPS configuration. Disk IOPS will usually be the limiting factor in a deployment of this size.

2. Reduce your metric collection levels by collecting fewer metrics. This is done by switching to the optimized attribute package and/or the default attribute packages. Also, if you're collecting from GA adapters other than VMware vCenter, you can examine those areas to trim down unwanted resources/metrics using filters or optimized attribute packages.

If you want to reduce the load of DT processing on your system, another method is to reduce the quantity of data being processed. The more historical data you have, the more buckets it needs to process. So if you were to reduce your data retention from 6 months to 3 months, you'd have less load on the Analytics VM for DT processing, not to mention lower disk space demand. With such a large-scale deployment, you need to decide whether you want 3 months of data running properly or 6 months of data running improperly (incomplete DTs). Again, if your disk IO is this constrained, even that might not be enough.

Another suggestion is to upgrade to vC Ops 5.8 if you haven't already. 5.8 started implementing a newer DT calc algorithm that is a little more efficient. Every bit counts when you've got a large-scale deployment.

You could also change your DT processing time to begin the night before instead of early AM, to allow more time for DT processing. The default DT start time is 1 AM, which you could change to ~8-9 PM to allow a few extra hours of processing before you start using the system in the morning.

When was the last time the DT process actually 'finished' on its own? Look at the vC Ops resources, check the Analytics tier, and check the DT processing metrics to see when it was running and when it finished on its own.

I recommend opening a ticket with GSS to talk about this and get their thoughts. With a deployment your size, they'll bring a different perspective to the table and usually have a few engineering-sourced tips and tricks up their sleeves for tuning 5-million-metric deployments. However, before you pull in GSS, I'd suggest checking your own end to ensure you've got adequate resources provisioned (CPU, mem, disk IO).

If you find this or any other answer useful please mark the answer as correct or helpful.
Matrix_1970
Enthusiast

Hi mark_j,

thank you for your suggestions! We have opened an SR with VMware....

However, it seems that the problem is related to the ESX hosts where vCOPS runs (they are too "slow"). We must upgrade them and reconfigure the two VMs according to best practices.

After that we should be able to see the difference. For now, we need to carry out these changes and check whether things improve.

Thank you all for your patience. I'll update you asap!

Matrix
