Monday, May 1, 2017

Exploring Solr's New Time Series and Math Expressions

In Solr 6.6 the Streaming Expression library has added support for time series and math expressions. This blog will walk through an example of how to use these exciting features.


Time Series


Time series aggregations are supported through the timeseries Streaming Expression. The timeseries expression uses the json facet api under the covers so the syntax will be familiar if you've used Solr date range syntax.

Here is the basic syntax:

timeseries(collection, 
                 field="test_dt", 
                 q="*:*",
                 start="2012-05-01T00:00:00Z",
                 end="2012-06-30T23:59:59Z",
                 gap="+1MONTH", 
                 count(*))

When sent to Solr this expression will return results that look like this:

{ "result-set": { "docs": [ { "test_dt": "2012-05-01T00:00:00Z", "count(*)": 247007 }, { "test_dt": "2012-06-01T00:00:00Z", "count(*)": 247994 }, { "EOF": true, "RESPONSE_TIME": 9 } ] } }

Solr takes care of the date math and builds the time range buckets automatically. Solr also fills in any gaps in the range with buckets automatically and adds zero aggregation values. Any Solr query can be used to select the records. 

The supported aggregations are: count(*), sum(field), avg(field), min(field), max(field).

The timeseries function is quite powerful on it's own, but it grows in power when combined with math expressions.


Math Expressions


In Solr 6.6 the Streaming Expression library also adds math expressions. This is a larger topic then one blog can cover, but I'll hit some of highlights by slowly building up a math expression.


Let and Get


The fun begins with the let and get expressions. let is used to assign tuple streams to variables and get is used to retrieve the stream later in the expression. Here is the most basic example:


let(a=timeseries(collection, field="test_dt", q="*:*",
                          start="2012-05-01T00:00:00Z",
                          end="2012-06-30T23:59:59Z",
                          gap="+1MONTH", 
                          count(*)),
      get(a))

In the example above the timeseries expression is being set to the variable a. Then the get expression is used to turn the variable a back into a stream.

The let expression allows you to set any number of variables, and assign a single Streaming Expression to run the program logic. The expression that runs the program logic has access to the variables. The basic structure of let is:

let(a=expr,
     b=expr,
     c=expr,
     expr)

The first three name/value pairs are setting variables and the final expression is the program logic that will use the variables.

If we send the let expression with the timeseries to Solr it returns with:

{ "result-set": { "docs": [ { "test_dt": "2012-05-01T00:00:00Z", "count(*)": 247007 }, { "test_dt": "2012-06-01T00:00:00Z", "count(*)": 247994 }, { "EOF": true, "RESPONSE_TIME": 9 } ] } }

This is the exact same response we would get if we sent the timeseries expression alone. Thats because all we did was assign the expression to a variable and use get to stream out the results.

Implementation Note: Under the covers the let expression sets each variable by executing the expressions and adding the tuples to a list. It then maps the variable name to the list in memory so that it can be retrieved by the variable name. So in memory Streams are converted to lists of tuples.


The Select Expression


The select expression has been around for a long time, but it now plays a central role in math expressions. The select expression wraps another expression and applies a list of Stream Evaluators to each tuple. Stream Evaluators perform operations on the tuples. 

The Streaming Expression library now includes a base set of numeric evaluators for performing math on tuples. Here is an example of select in action:

let(a=timeseries(collection, field="test_dt", q="*:*",
                          start="2012-05-01T00:00:00Z",
                          end="2012-06-30T23:59:59Z",
                          gap="+1MONTH", 
                          count(*)),
      b=select(get(a),  
                     mult(-1, count(*)) as negativeCount,
                     test_dt),
      get(b))

In the example above we've set a timeseries to variable a.

Then we are doing something really interesting with variable b. We are transforming the timeseries tuples stored in variable a with the select expression. 

The select expression is reading all the tuples from the get(a) expression and applying the mult stream evaluator to each tuple. The mult Streaming Evaluator is multiplying -1 to the value in the count(*) field of the tuples and assigning it to the field negativeCount. Select is also outputting the test_dt field from the tuples.

The transformed tuples are then assigned to variable b.

Then get(b) is used to output the transformed tuples. If you send this expression to Solr it outputs:

{ "result-set": { "docs": [ { "test_dt": "2012-05-01T00:00:00Z", "negativeCount": -247007 }, { "test_dt": "2012-06-01T00:00:00Z", "negativeCount": -247994 }, { "EOF": true, "RESPONSE_TIME": 9 } ] } }

Implementation Note: The get expression creates new tuples when it streams tuples from a variable. So you never have to worry about side effects. In the example above variable a was unchanged when the tuples were transformed and assigned to variable b.



The Tuple Expression


The basic data structure of Streaming Expressions is a Tuple. A Tuple is a set of name/value pairs. In the 6.6 release of Solr there is a Tuple expression which allows you to create your own output tuple. Here is the sample syntax:

let(a=timeseries(collection, field="test_dt", q="*:*",
                          start="2012-05-01T00:00:00Z",
                          end="2012-06-30T23:59:59Z",
                          gap="+1MONTH", 
                          count(*)),
      b=select(get(a),  
                     mult(-1, count(*)) as negativeCount,
                     test_dt),
      tuple(seriesA=a,
               seriesB=b))

The example above defines an output tuple with two fields: seriesA and seriesB, both of these fields have been assigned a variable. Remember that variables a and b are pointers to lists of tuples. This is exactly how they will be output by the tuple expression.

If you send the expression above to Solr it will respond with:

{ "result-set": { "docs": [ { "seriesA": [ { "test_dt": "2012-05-01T00:00:00Z", "count(*)": 247007 }, { "test_dt": "2012-06-01T00:00:00Z", "count(*)": 247994 } ], "seriesB": [ { "test_dt": "2012-05-01T00:00:00Z", "negativeCount": -247007 }, { "test_dt": "2012-06-01T00:00:00Z", "negativeCount": -247994 } ] }, { "EOF": true, "RESPONSE_TIME": 7 } ] } }

Now we have both the original time series and the transformed time series in the output.

The Col Evaluator


Lists of tuples are nice, but for performing many math operations what we need are columns of numbers. There is a special evaluator called col which can be used to pull out a column of numbers from a list of tuples.

Here is the basic syntax:

let(a=timeseries(collection, field="test_dt", q="*:*",
                          start="2012-05-01T00:00:00Z",
                          end="2012-06-30T23:59:59Z",
                          gap="+1MONTH", 
                          count(*)),
      b=select(get(a),  
                     mult(-1, count(*)) as negativeCount,
                     test_dt),
      c=col(a, count(*)),
      d=col(b, negativeCount),
      tuple(seriesA=a,
               seriesB=b,
               columnC=c,
               columnD=d))

Now we have two new variables c and d, both pointing to a col expression. The col expression takes two parameters. The first parameter is a variable pointing to a list of tuples. The second parameter is the field to pull the column data from.

Also notice that there are two new fields in the output tuple that output the columns. If you send this expression to Solr it responds with:

{ "result-set": { "docs": [ { "seriesA": [ { "test_dt": "2012-05-01T00:00:00Z", "count(*)": 247007 }, { "test_dt": "2012-06-01T00:00:00Z", "count(*)": 247994 } ], "seriesB": [ { "test_dt": "2012-05-01T00:00:00Z", "negativeCount": -247007 }, { "test_dt": "2012-06-01T00:00:00Z", "negativeCount": -247994 } ], "columnC": [ 247007, 247994 ], "columnD": [ -247007, -247994 ] }, { "EOF": true, "RESPONSE_TIME": 6 } ] } }

Now the columns appear in the output.

Performing Math on Columns


We've seen already that there are numeric Stream Evaluators that work on tuples in the select expression.

Some numeric evaluators also work on columns. An example of this is the corr evaluator which performs the Pearson product-moment correlation calculation on two columns of numbers.

Here is the sample syntax:

let(a=timeseries(collection, field="test_dt", q="*:*",
                          start="2012-05-01T00:00:00Z",
                          end="2012-06-30T23:59:59Z",
                          gap="+1MONTH", 
                          count(*)),
      b=select(get(a),  
                     mult(-1, count(*)) as negativeCount,
                     test_dt),
      c=col(a, count(*)),
      d=col(b, negativeCount),
      tuple(seriesA=a,
               seriesB=b,
               columnC=c,
               columnD=d,
               correlation=corr(c, d)))

Notice that the tuple now has a new field called correlation with the output of the corr function set to it. If you send this to Solr it responds with:

{ "result-set": { "docs": [ { "seriesA": [ { "test_dt": "2012-05-01T00:00:00Z", "count(*)": 247007 }, { "test_dt": "2012-06-01T00:00:00Z", "count(*)": 247994 } ], "seriesB": [ { "test_dt": "2012-05-01T00:00:00Z", "negativeCount": -247007 }, { "test_dt": "2012-06-01T00:00:00Z", "negativeCount": -247994 } ], "columnC": [ 247007, 247994 ], "columnD": [ -247007, -247994 ], "correlation": -1 }, { "EOF": true, "RESPONSE_TIME": 6 } ] } }


Opening the Door to the Wider World of Mathematics


The syntax described in this blog opens the door to more sophisticated mathematics. For example the corr function can be used as a building block for cross-correlation, auto-correlation and auto-regression functions. Apache Commons Math includes machine learning algorithms such as clustering and regression and data transformations such as Fourier transforms that work on columns of numbers.

In the near future the Streaming Expressions math library will include these functions and many more.

Time Series Cross-correlation and Lagged Regression With Solr Streaming Expresssions

One of the more interesting capabilities in Solr's new statistical library is cross-correlation . But before diving into cross-correlat...