Solr Analytics: Statistical programming with Solr Streaming Expressions

In the previous blog we explored the new timeseries function and introduced the syntax for math expressions. In this blog we'll dive deeper into math expressions and explore the statistical programming functions rolling out in the next release.

Let's first learn how the statistical expressions work and then look at how we can perform statistical analysis on retrieved result sets.

Array Math

The statistical functions create, manipulate and perform math on arrays. One of the basic things that we can do is create an array with the array function:

array(2, 3, 4, 3, 6)

The array function simply returns an array of numbers. If we send the array function above to Solr's stream handler it responds with:

{ "result-set": { "docs": [ { "return-value": [ 2, 3, 4, 3, 6 ] }, { "EOF": true, "RESPONSE_TIME": 1 } ] } }

Notice that the stream handler returns a single Tuple with the return-value field pointing to the array. This is how Solr responds when given a statistical function to evaluate.

This is a new behavior for Solr. In the past the stream handler always returned streams of Tuples. Now the stream handler can directly perform mathematical functions.

Let's explore a few more of the new array math functions. We can manipulate arrays in different ways. For example we can reverse the array like this:

rev(array(2, 3, 4, 3, 6))

Solr returns the following from this expression:

{ "result-set": { "docs": [ { "return-value": [ 6, 3, 4, 3, 2 ] }, { "EOF": true, "RESPONSE_TIME": 0 } ] } }

We can describe the array:

describe(array(2, 3, 4, 3, 6))

{ "result-set": { "docs": [ { "return-value": { "sumsq": 74, "max": 6, "var": 2.3000000000000003, "geometricMean": 3.365865436338599, "sum": 18, "kurtosis": 1.4555765595463175, "N": 5, "min": 2, "mean": 3.6, "popVar": 1.8400000000000003, "skewness": 1.1180799331493778, "stdev": 1.5165750888103102 } }, { "EOF": true, "RESPONSE_TIME": 31 } ] } }

Now we see our first bit of statistics. The describe function provides descriptive statistics for the array.

We can correlate arrays:

corr(array(2, 3, 4, 3, 6),
array(-2, -3, -4, -3, -6))

This returns:

{ "result-set": { "docs": [ { "return-value": -1 }, { "EOF": true, "RESPONSE_TIME": 2 } ] } }

The corr function performs the Pearson Product Moment correlation on the two arrays. In this case the arrays are perfectly negatively correlated.

We can perform a simple regression on the arrays:

regress(array(2, 3, 4, 3, 6),
array(-2, -3, -4, -3, -6))

{ "result-set": { "docs": [ { "return-value": { "significance": 0, "totalSumSquares": 9.2, "R": -1, "meanSquareError": 0, "intercept": 0, "slopeConfidenceInterval": 0, "regressionSumSquares": 9.2, "slope": -1, "interceptStdErr": 0, "N": 5 } }, { "EOF": true, "RESPONSE_TIME": 9 } ] } }

All statistical functions in the initial release are backed by Apache Commons Math. The initial release includes a core group of functions that support:

Rank transformations
Histograms
Percentiles
Simple regression and predict functions
One way ANOVA
Correlation
Covariance
Descriptive statistics
Convolution
Finding the delay in signals/time series
Lagged regression
Moving averages
Sequence generation
Calculating Euclidean distance between arrays
Data normalization and scaling
Array creation and manipulation functions

Statistical functions can be applied to:

Time series result sets
Random sampling result sets
SQL result sets (Solr's Internal Parallel SQL)
JDBC result sets (External JDBC Sources)
K-Nearest Neighbor results sets
Graph Expression result sets
Search result sets
Faceted aggregation result sets
MapReduce result sets

Array Math on Solr Result Sets

Let's now explore how we can apply statistical functions on Solr result sets. In the example below we'll correlate arrays of moving averages for two stocks:

let(stockA = sql(stocks, stmt="select closing_price from price_data where ticker='aaa' and ..."),
stockB = sql(stocks, stmt="select closing_price from price_data where ticker='bbb' and ..."),
pricesA = col(stockA, closing_price),
pricesB = col(stockB, closing_price),
movingA = movingAvg(pricesA, 30),
movingB = movingAvg(pricesB, 30),
tuple(correlation=corr(movingA, movingB)))

Let's break down how this expression works:

1) The let expression is setting variables and then returning a single output tuple.

2) The first two variables stockA and stockB contain result sets from sql expressions. The sql expressions return tuples with the closing prices for stock tickers aaa and bbb.

3) The next two variables pricesA and pricesB are created by the col function. The col function creates a numeric array from a list of Tuples. In this example pricesA contains the closing prices for stockA and pricesB contains the closing prices for stockB.

4) The next two variables movingA and movingB are created by the movingAvg function. In this example movingA and movingB contain arrays with the moving averages calculated from the pricesA and pricesB arrays.

5) In the final step we output a single Tuple containing the correlation of the movingA and movingB arrays. The correlation is computed using the corr function.

Solr Analytics

Tuesday, May 30, 2017

Statistical programming with Solr Streaming Expressions

Array Math

Array Math on Solr Result Sets

Solr temporal graph queries for event correlation, root cause analysis and temporal anomaly detection

Search This Blog