Tuesday, February 9, 2021

Driving down cloud storage costs with Apache Solr's hybrid indexed and raw log analytics engine

Search engines are powerful tools for log analytics. They excel at slicing and dicing data over clusters and running distributed aggregations and statistical analysis. But maintaining a log analytics search index for terabytes or petabytes of log data involves running huge search clusters and incurs large cloud storage expenses to store the indexes. Often what's actually needed for historical data is a grep-like capability that includes aggregations and visualization, rather than the full power of a search index.

The next release of Apache Solr provides a hybrid approach to log analytics that supports both log analytics queries over a search cluster and the ability to grep, aggregate and visualize compressed log files. 

Solr's Streaming Expressions and Math Expressions are a powerful query language for analytics and visualization. You can read about them in Solr's Visual Guide (https://lucene.apache.org/solr/guide/8_8/math-expressions.html). If you haven't seen this guide, it's worth quickly reviewing the table of contents to see the power here and to compare it with what Elasticsearch offers.

In the next release of Solr, a subset of Streaming Expressions and Math Expressions can be applied to raw compressed log files using the cat function. The cat function reads files from a specific place in the filesystem and returns a stream of lines from the files. The cat function can then be wrapped by other functions to parse, filter, aggregate and visualize the data.

A simple example is the cat function wrapped by the parseCSV function, which parses a comma-separated file into tuples that can be immediately visualized in Zeppelin using the Zeppelin-Solr interpreter.




Reading GZipped Files

In the next release of Solr, the cat function will be able to read gzipped files in place without expanding them on disk. The cat function will automatically read gzipped files based on the .gz file extension. Gzip-compressed log files are often about 80% smaller than the originals.
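To make the idea of reading gzipped files "in place" concrete, here is a minimal Python sketch of the same behavior. It is not Solr's implementation; the function name cat_lines is hypothetical and the sketch simply assumes that the decision to decompress is made from the .gz extension, as described above.

```python
import gzip
from pathlib import Path
from typing import Iterator


def cat_lines(root: str) -> Iterator[str]:
    """Yield lines from a file, or from every file under a directory.

    Files ending in .gz are decompressed as they are streamed, so no
    expanded copy is ever written to disk.
    """
    base = Path(root)
    if base.is_file():
        paths = [base]
    else:
        paths = sorted(p for p in base.rglob("*") if p.is_file())
    for path in paths:
        # Pick the opener from the file extension alone.
        opener = gzip.open if path.suffix == ".gz" else open
        with opener(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                yield line.rstrip("\n")
```

Because the function is a generator, only one line is held in memory at a time, which is what makes a grep-style pass over large compressed archives practical.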

On Demand Indexing

One of the capabilities provided is on-demand indexing of historical data. There will be times when the grep and aggregate functions won't be enough to support the analytics requirement. In this scenario, Streaming Expressions provide a rich set of functions for on-demand indexing from raw compressed log data. For example, the cat function can be wrapped by the select function to rename fields in the tuples, and the update function then indexes the tuples to a SolrCloud collection.
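The shape of that rename-then-index flow can be sketched in Python. This is only an illustration of the data flow, not Solr code: select_fields and batches are hypothetical helpers, the field names in the usage note are made up, and the assumption that documents are sent to the collection in batches is mine rather than something stated above.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def select_fields(tuples: Iterable[Dict[str, str]],
                  mapping: Dict[str, str]) -> Iterator[Dict[str, str]]:
    """Keep only the mapped fields, emitting each under its new name."""
    for record in tuples:
        yield {new: record[old] for old, new in mapping.items()}


def batches(items: Iterable, size: int) -> Iterator[List]:
    """Group a stream into lists of at most `size` items for bulk indexing."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

A caller would feed parsed log tuples through select_fields, for instance with a mapping such as {"timestamp": "timestamp_dt", "qtime": "query_time_l"}, and hand each batch to whatever client posts documents to the collection.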




Once the data is indexed, the full power of Streaming Expressions and Math Expressions can be applied to it.

Aggregations Over Raw Compressed Log Files

The cat function can also be wrapped by the having function to perform regex filtering, the select function to transform tuples and the hashRollup function to perform aggregations directly over compressed log files. 
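The regex filtering step is the grep-like part of the pipeline. As a rough Python analogy (not Solr's implementation), the sketch below keeps only the tuples whose field matches a pattern. The helper name having_matches is hypothetical, and the sketch assumes "match anywhere in the value" semantics; how Solr anchors the pattern is not specified here.

```python
import re
from typing import Dict, Iterable, Iterator


def having_matches(tuples: Iterable[Dict[str, str]],
                   field: str,
                   pattern: str) -> Iterator[Dict[str, str]]:
    """Pass through only the tuples whose `field` matches `pattern`."""
    regex = re.compile(pattern)
    for record in tuples:
        value = record.get(field)
        # Tuples that lack the field are dropped rather than raising.
        if value is not None and regex.search(value):
            yield record
```

Filtering before transforming and aggregating keeps the later stages small, since every tuple that fails the pattern is discarded as soon as it is read.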

Let's build a time series aggregation one step at a time:

cat("2021/01")

The cat function reads all the files inside the 2021/01 directory. These are log files from January 2021. These log files are in CSV format and are gzipped individually. 

parseCSV(cat("2021/01"))

The parseCSV function wraps the cat function and parses each CSV-formatted log line into a tuple of name/value pairs.
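In Python terms, turning lines into tuples of name/value pairs looks roughly like the sketch below. It assumes the first line of the stream is a header row that supplies the field names, and it treats the input as a single file; parse_csv is a hypothetical name, not a Solr API.

```python
import csv
from typing import Dict, Iterable, Iterator


def parse_csv(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Turn CSV lines into dicts keyed by the header row."""
    reader = csv.reader(lines)
    header = next(reader, None)
    if header is None:
        return  # empty input, nothing to emit
    for row in reader:
        # Every value stays a string; casting happens in a later step.
        yield dict(zip(header, row))
```

Note that all values come out as strings, which is why the next step casts qtime before it is averaged.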

select(parseCSV(cat("2021/01")),
       trunc(timestamp, 10) as day,
       long(qtime) as query_time)

The select function transforms each tuple by truncating the timestamp field to its first 10 characters, which yields the yyyy-MM-dd part of the timestamp, and mapping the result to the "day" field. It also casts the qtime field to a long and maps it to the "query_time" field.
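The effect of this step on a single tuple can be shown with a small Python equivalent. The sample timestamp format in the usage note is an assumption (an ISO 8601 style string whose first 10 characters are the date), and select_day_and_query_time is a hypothetical name used only for illustration.

```python
from typing import Dict, Union


def select_day_and_query_time(record: Dict[str, str]) -> Dict[str, Union[str, int]]:
    """Mirror trunc(timestamp, 10) as day and long(qtime) as query_time."""
    return {
        "day": record["timestamp"][:10],       # keep the yyyy-MM-dd prefix
        "query_time": int(record["qtime"]),    # cast the string to an integer
    }
```

For a tuple such as {"timestamp": "2021-01-15T08:30:00Z", "qtime": "42"}, the result is {"day": "2021-01-15", "query_time": 42}.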

hashRollup(
    select(parseCSV(cat("2021/01")),
           trunc(timestamp, 10) as day,
           long(qtime) as query_time),
    over="day",
    avg(query_time))

Finally, the hashRollup function performs an aggregation over the truncated timestamp (the day field), averaging the query_time.
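A hash-based rollup needs no sorted input: it keeps one running sum and count per key in a map and emits the averages at the end. The Python sketch below illustrates that idea under stated assumptions. It is not Solr's code, hash_rollup_avg is a hypothetical name, the output field name avg(query_time) is assumed, and the results are sorted by key only to make the output easy to read.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def hash_rollup_avg(tuples: Iterable[Dict], over: str, field: str) -> List[Dict]:
    """Average `field` for each distinct value of `over` using a hash map."""
    totals = defaultdict(lambda: [0, 0])  # key -> [running sum, count]
    for record in tuples:
        bucket = totals[record[over]]
        bucket[0] += record[field]
        bucket[1] += 1
    return [
        {over: key, f"avg({field})": total / count}
        for key, (total, count) in sorted(totals.items())
    ]
```

Memory use grows with the number of distinct keys rather than the number of log lines, so a month of logs rolled up by day needs only about 31 buckets.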

The output of this expression is a time series which can be immediately visualized and shared in Apache Zeppelin using the Zeppelin-Solr interpreter.


Solr temporal graph queries for event correlation, root cause analysis and temporal anomaly detection

Temporal graph queries will be available in the 8.9 release of Apache Solr. Temporal graph queries are designed for key log analytics use ...