Using logstash to push metrics into Wavefront

Version 1


    Note: This article is obsolete. We highly recommend you refer to Sending Log Data Metrics To Wavefront instead.


    I recently joined Wavefront and, as part of my learning process, I thought it would be cool to gather metrics from logs and ship them to Wavefront. This would allow me to take advantage of Wavefront's scale and powerful query language on the log data. I also wanted to see whether I could push point tags along with this log metric data and do interesting things with that metadata. But first I had to take baby steps, so I decided to write a series of articles on how to accomplish this task. I am doing this on the Ubuntu 14.04 desktop version.


    The first thing is to get familiarized with Logstash by reading its documentation. The example there covers a use case of parsing a static Apache access log file and then sending the output to either stdout or a file, which is pretty straightforward. The next step was to collect metrics about events, such as response codes, and send them to Wavefront. Logstash has a built-in plugin to summarize this data for us: metrics | Reference [2.3] | Elastic
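    To make the behavior of the metrics filter concrete, here is a minimal sketch (not the actual Logstash implementation) of what a "meter" tracks: a running event count plus a rate over the elapsed interval. The class and field names are illustrative only.

```python
import time

# Hypothetical sketch of a "meter" like the one in the Logstash
# metrics filter: it counts events and derives an event rate.
class Meter:
    def __init__(self):
        self.count = 0
        self.window_start = time.time()

    def mark(self):
        """Record one event (e.g. one log line with a 200 response)."""
        self.count += 1

    def rate(self):
        """Events per second since the window started."""
        elapsed = max(time.time() - self.window_start, 1e-9)
        return self.count / elapsed

m = Meter()
for _ in range(10):
    m.mark()
print(m.count)  # 10
```

    At each flush interval, Logstash emits an event containing fields derived from such counters, which is what we will forward to Wavefront below.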


    Once we have metrics collected from the logs, such as the count of 200 response codes, we need to send them to Wavefront. Logstash has an output plugin for writing data in Graphite format: graphite | Reference [2.3] | Elastic. This was appealing, since the Wavefront proxy can read Graphite-formatted data.
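    For reference, the Graphite plaintext protocol that the proxy accepts is one "<metric.path> <value> <unix-timestamp>" line per point. The helper below is a small sketch of that format; the metric name and values are hypothetical.

```python
import time

# Format a single point as a Graphite plaintext protocol line:
# "<metric.path> <value> <timestamp>\n"
def graphite_line(path, value, ts=None):
    ts = int(ts if ts is not None else time.time())
    return f"{path} {value} {ts}\n"

line = graphite_line("apache.response.myhost.200.count", 42, 1465850000)
print(line, end="")  # apache.response.myhost.200.count 42 1465850000
```

    The graphite output plugin produces lines of exactly this shape, which the Wavefront proxy then translates into Wavefront's own data format.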


    Enabling Wavefront Proxy for Graphite formatted data


    By default, the Wavefront proxy configuration file has the port that listens for Graphite data commented out. Before starting the proxy, we need to uncomment the relevant lines in the "wavefront.conf" file, installed by default under "/opt/wavefront/wavefront-proxy/conf":



    ## Which ports should listen for collectd/graphite-formatted data?
    ## If you uncomment graphitePorts, make sure to uncomment and set 'graphiteFormat' and 'graphiteDelimiters' as well.
    graphitePorts=2003

    ## Which fields (1-based) should we extract and concatenate (with dots) as the hostname?
    graphiteFormat=2

    ## Which characters should be replaced by dots in the hostname, after extraction?
    graphiteDelimiters=_
    Restart the Wavefront proxy by running "service wavefront-proxy stop" followed by "service wavefront-proxy start". The proxy will now listen for Graphite-formatted data on port 2003.

    Configure Logstash to send data to Wavefront

    Once we are done with the step above, we need to create a Logstash configuration file that will parse the Apache logs and then send the data to the Wavefront proxy. I am assuming here that you have installed Logstash on Ubuntu in the default location, /opt/logstash. Create a wavefront.conf file under /opt/logstash as shown:



    input {
        file {
            #path => "/home/parag/data/logstash-tutorial-dataset"
            path => "/var/log/apache2/access.log"
            start_position => "beginning"
            sincedb_path => "/dev/null"
        }
    }
    filter {
        grok {
            match => { "message" => "%{COMBINEDAPACHELOG}" }
        }
        geoip {
            source => "clientip"
        }
        metrics {
            # A counter field
            meter => "apache.response.%{host}.%{response}"
            add_tag => "metric"
            clear_interval => "30"
            flush_interval => "30"
        }
    }
    output {
        stdout { codec => rubydebug }
        graphite {
            host => "localhost"
            port => 2003
            fields_are_metrics => true
            # only send metrics collected in the filter
            include_metrics => ["^apache\.response\..*"]
        }
        #file { path => "/home/parag/logs/output_metric.log" }
    }


    Start Logstash by issuing the following command from the root directory of your Logstash install: "bin/logstash -f wavefront.conf". You should start seeing debug messages on your console.


    Charting it in Wavefront

    Log in to your Wavefront instance and create a dashboard to display the counts of "200" and "404" response codes. You can now use Wavefront's powerful query language to do pattern and shape matching, alerting, and correlation to monitor metrics from your log files.


    [Screenshot: Wavefront dashboard charting response code counts (Screen Shot 2016-06-13 at 1.51.04 PM.png)]


    Final Thoughts


    We have seen how easy it is to get log file data into Wavefront. In the next article, I want to explore how I can use the point tag features that Wavefront provides, applying point tags to my log data to make it more meaningful. Stay tuned.


    This is really cool stuff.


    Have you guys run any benchmarks to see what sort of volume of log lines a WF-Proxy instance can handle? Are there any new internal metrics generated by the agent that can allow us to monitor this?


    Thanks for your interest. We're working on direct ingestion over a TCP socket as well (this will open up more potential integrations, since it is standard and simple), but as you can see, we started with Filebeat.


    Users of this feature will notice new metrics of the form "~agent.logsharvesting.*". We count basic metrics on throughput in terms of log messages, e.g. java/ at master · wavefrontHQ/java · GitHub


    We don't have any benchmarks to share at the moment, but I can tell you a few notes and expectations:


    1) There are two important dimensions to the scalability of ingesting metrics from logs: how many distinct timeseries are in your metrics, and the volume of log data.

    2) Since we aggregate timeseries in memory, a log corpus with many distinct timeseries will use more memory.
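    A small sketch of why memory scales with distinct timeseries rather than log volume: one counter is kept per (host, response) combination, however many lines arrive. The hosts and status codes below are made up.

```python
from collections import Counter

# One in-memory counter per distinct timeseries name; total log
# volume only increments existing counters.
counts = Counter()
events = [("web1", "200"), ("web1", "200"), ("web2", "404"), ("web1", "404")]
for host, response in events:
    counts[f"apache.response.{host}.{response}"] += 1

print(len(counts))                          # 3 distinct timeseries
print(counts["apache.response.web1.200"])   # 2
```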

    3) The groks above use a regex engine to scan every log line. That means large volumes of log data can, in theory, cause us to lag behind, since the CPU on the machine running the wavefront-proxy could be overloaded.
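    To illustrate the per-line regex cost, here is a simplified stand-in for the COMBINEDAPACHELOG grok pattern (not the real grok definition): every incoming line is matched against a regex, so CPU cost grows linearly with log volume.

```python
import re

# Simplified Apache access-log pattern; a stand-in for the much
# richer COMBINEDAPACHELOG grok pattern used above.
LINE_RE = re.compile(
    r'(?P<clientip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<verb>\S+) (?P<request>\S+) \S+" (?P<response>\d{3}) (?P<bytes>\d+|-)'
)

line = '127.0.0.1 - - [13/Jun/2016:13:51:04 -0700] "GET /index.html HTTP/1.1" 200 2326'
m = LINE_RE.match(line)
print(m.group("response"))  # 200
```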


    Per the above observations, performance will depend on the hardware. Roughly speaking, though: if you have a ton of timeseries in your logs, memory pressure will be the limiting factor; if you have a large volume of log data, processing speed will be the limiting factor (for servicing the regexes and decompressing the payload). Network throughput can also come into play, but I'd expect CPU to be an issue first on most deployments.


    We'll be sure to update here with the new features and the benchmarks when they are available. Since we are just scanning through the input data using regexes and doing constant-time aggregations, there shouldn't be much overhead in the proxy application itself.


    This document was generated from a community discussion thread; the original item is no longer available.