VMware Cloud Community
ph2013
Enthusiast

Failed to send data: forward data queue is backing up, data is being lost!

vRealize Operations 6.2.1, build 3774215; large master node, 7 data nodes, and 5 large remote collectors.

Collection status on the data nodes changes from "Data Receiving" to blank. The collector.log on the data nodes shows:

[35782] 2016-06-29 14:47:35,743 WARN  [DataForwarder]  com.vmware.vcops.platform.common.DataForwarder.sendData - Failed to send forward data through channel : IRawDataForwarder.NotEnoughBufferSpace: com.gemstone.gemfire.cache.CacheWriterException: FORWARD_DATA_REGION region exceed the maximum number (20000) of entries.

[35783] 2016-06-29 14:47:36,516 FATAL [DataForwarder]  com.vmware.vcops.platform.common.DataForwarder.sendData - Failed to send data: forward data queue is backing up, data is being lost!
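
For reference, a rough Python sketch to count how often these DataForwarder messages show up per hour in collector.log (the path below is a placeholder for wherever you have copied the node's collector.log to):

import re
from collections import Counter

LOG_PATH = "collector.log"  # assumed: a local copy of the data node's collector.log

# Matches lines like:
# [35783] 2016-06-29 14:47:36,516 FATAL [DataForwarder]  ... data is being lost!
PATTERN = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}):\d{2}:\d{2},\d+\s+(WARN|FATAL)\s+\[DataForwarder\]")

def summarize(path):
    """Return a Counter keyed by (hour, level) for DataForwarder messages."""
    hits = Counter()
    with open(path, "r") as log:
        for line in log:
            match = PATTERN.search(line)
            if match:
                hits[(match.group(1), match.group(2))] += 1
    return hits

for (hour, level), count in sorted(summarize(LOG_PATH).items()):
    print("%s:00  %-5s %d" % (hour, level, count))

If the counts ramp up at a particular time of day, that narrows down what to look at on the receiving side.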

Has anyone seen this before?

sxnxr
Commander

I had this with a 4-node cluster. It was running slow and timing out, and after 4 months VMware narrowed it down to a problem with one node. I am convinced it started because a GSS engineer gave me a fix for a problem we didn't have. We didn't get to the bottom of it, as I just redeployed the entire cluster; losing the data was not a big problem for us. Below is the mail from VMware:

I received an update from Engineering regarding this issue; please find it below:

We looked at the logs and the GemFire statistics files, and it appears that node xxxxxxxxx is maxed out on CPU. CPU usage is close to 100 percent all the time, and the load average on this node is over 50 on average.

So, essentially, this node xxxxxxx is blocking every other node in the cluster from collecting data and also affecting UI performance. This is owing to the distributed design of vR Ops.

We looked at the CPU breakdown and see that system CPU usage is averaging around 40 percent. This typically happens if the environment is having issues doing I/O.
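
(A rough way to see that split on the node itself, assuming Python and psutil are available on it, is something like the sketch below; the 40 percent check just mirrors the number quoted above and is not an official threshold.)

import os
import psutil

cpu = psutil.cpu_times_percent(interval=1)   # sample the CPU time split over one second
load1, load5, load15 = os.getloadavg()
cores = psutil.cpu_count()

print("user %.1f%%  system %.1f%%  iowait %.1f%%"
      % (cpu.user, cpu.system, getattr(cpu, "iowait", 0.0)))
print("load average %.1f / %.1f / %.1f on %d cores" % (load1, load5, load15, cores))

# High system time together with high iowait usually points at the storage
# underneath the node rather than at the vROps workload itself.
if cpu.system > 40 or load1 > cores * 2:
    print("node looks unhealthy -- check datastore latency and snapshots")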

One scenario is that there are too many snapshots on disk for this VM. Can we ask the customer to check that and, if so, delete the snapshots? If that is not the issue, then we need to understand whether there are any problems with the underlying datastore that are causing the high system CPU usage.
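
(For illustration only, a pyVmomi sketch along these lines can list the snapshots on the node's VM; the vCenter address, credentials, and VM name are placeholders, and it assumes pyvmomi is installed.)

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

VCENTER, USER, PASSWORD = "vcenter.example.com", "administrator@vsphere.local", "secret"
NODE_VM_NAME = "vrops-data-node-01"   # hypothetical name of the affected node's VM

def walk(snapshots, depth=0):
    """Print a snapshot tree and return how many snapshots it contains."""
    count = 0
    for snap in snapshots:
        print("  " * depth + "%s  (created %s)" % (snap.name, snap.createTime))
        count += 1 + walk(snap.childSnapshotList, depth + 1)
    return count

ctx = ssl._create_unverified_context()   # lab use only; verify certificates in production
si = SmartConnect(host=VCENTER, user=USER, pwd=PASSWORD, sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    for vm in view.view:
        if vm.name != NODE_VM_NAME:
            continue
        if vm.snapshot is None:
            print("%s has no snapshots" % vm.name)
        else:
            total = walk(vm.snapshot.rootSnapshotList)
            print("%s has %d snapshot(s)" % (vm.name, total))
finally:
    Disconnect(si)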

Another thing to check is whether the xxxxxx node is deployed on a host that is overcommitted on CPU. If that is the case, other VMs on the host might be hogging all of the host's CPU.
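
(Again purely a sketch, reusing the pyVmomi session si from the previous example; the host name is a placeholder for the ESXi host the affected node runs on.)

HOST_NAME = "esx-host-01.example.com"   # hypothetical host of the slow node

def check_host(content, host_name):
    """Compare vCPUs of powered-on VMs against the host's physical cores."""
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    for host in view.view:
        if host.name != host_name:
            continue
        cores = host.summary.hardware.numCpuCores
        vcpus = sum(vm.config.hardware.numCPU for vm in host.vm
                    if vm.config is not None
                    and vm.runtime.powerState == vim.VirtualMachinePowerState.poweredOn)
        used_mhz = host.summary.quickStats.overallCpuUsage
        total_mhz = host.summary.hardware.cpuMhz * cores
        print("%s: %d vCPUs on %d cores (%.1fx overcommit), CPU %d/%d MHz"
              % (host_name, vcpus, cores, float(vcpus) / cores, used_mhz, total_mhz))

check_host(si.RetrieveContent(), HOST_NAME)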

Also note that all other nodes in the cluster have normal CPU usage, so we have to understand what is special about this xxxxxx node that is causing it to max out on CPU.
