Tuesday, May 30, 2017

Statistical programming with Solr Streaming Expressions

In the previous blog we explored the new timeseries function and introduced the syntax for math expressions. In this blog we'll dive deeper into math expressions and explore the statistical programming functions rolling out in the next release.

Let's first learn how the statistical expressions work and then look at how we can perform statistical analysis on retrieved result sets.

Array Math

The statistical functions create, manipulate and perform math on arrays. One of the basic things that we can do is create an array with the array function:

array(2, 3, 4, 3, 6)

The array function simply returns an array of numbers. If we send the array function above to Solr's stream handler it responds with:

{
  "result-set": {
    "docs": [
      {
        "return-value": [2, 3, 4, 3, 6]
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 1
      }
    ]
  }
}

Notice that the stream handler returns a single Tuple with the return-value field pointing to the array. This is how Solr responds when given a statistical function to evaluate.

This is a new behavior for Solr. In the past the stream handler always returned streams of Tuples. Now the stream handler can directly perform mathematical functions.
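To make the response shape concrete, here is a minimal Python sketch of how a client might pull the result out of the JSON shown above. The helper name extract_return_value is hypothetical and not part of Solr or any Solr client; the response text is copied from the array example.

```python
import json

# Response text copied from the array(2, 3, 4, 3, 6) example above.
RESPONSE = (
    '{ "result-set": { "docs": [ { "return-value": [ 2, 3, 4, 3, 6 ] }, '
    '{ "EOF": true, "RESPONSE_TIME": 1 } ] } }'
)


def extract_return_value(response_text):
    """Return the return-value field of the first tuple that carries one.

    Hypothetical helper for illustration only.
    """
    docs = json.loads(response_text)["result-set"]["docs"]
    for doc in docs:
        if "return-value" in doc:
            return doc["return-value"]
    raise ValueError("no return-value tuple in response")


print(extract_return_value(RESPONSE))
```

The final tuple in docs is the EOF marker, so the sketch skips any tuple without a return-value field rather than assuming a fixed position.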

Let's explore a few more of the new array math functions. We can manipulate arrays in different ways. For example we can reverse the array like this:

rev(array(2, 3, 4, 3, 6))

Solr returns the following from this expression:

{
  "result-set": {
    "docs": [
      {
        "return-value": [6, 3, 4, 3, 2]
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 0
      }
    ]
  }
}

We can describe the array:

describe(array(2, 3, 4, 3, 6))

{
  "result-set": {
    "docs": [
      {
        "return-value": {
          "sumsq": 74,
          "max": 6,
          "var": 2.3000000000000003,
          "geometricMean": 3.365865436338599,
          "sum": 18,
          "kurtosis": 1.4555765595463175,
          "N": 5,
          "min": 2,
          "mean": 3.6,
          "popVar": 1.8400000000000003,
          "skewness": 1.1180799331493778,
          "stdev": 1.5165750888103102
        }
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 31
      }
    ]
  }
}

Now we see our first bit of statistics. The describe function provides descriptive statistics for the array.
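As a cross-check on what several of these fields mean, the sketch below recomputes them in plain Python using the textbook formulas. It is not how Solr or Commons Math implements describe; the function name is reused only for readability, and skewness and kurtosis are left out because their exact estimators are not stated here.

```python
import math


def describe(xs):
    """Recompute a subset of the descriptive statistics for a list of numbers."""
    n = len(xs)
    mean = sum(xs) / n
    # Sum of squared deviations from the mean, shared by var and popVar.
    ss = sum((x - mean) ** 2 for x in xs)
    return {
        "N": n,
        "min": min(xs),
        "max": max(xs),
        "sum": sum(xs),
        "sumsq": sum(x * x for x in xs),
        "mean": mean,
        "var": ss / (n - 1),         # sample variance
        "popVar": ss / n,            # population variance
        "stdev": math.sqrt(ss / (n - 1)),
        # Geometric mean via logs; requires strictly positive values.
        "geometricMean": math.exp(sum(math.log(x) for x in xs) / n),
    }


stats = describe([2, 3, 4, 3, 6])
for name, value in stats.items():
    print(name, value)
```

The only difference between var and popVar is the divisor: N - 1 for the sample variance and N for the population variance.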

We can correlate arrays:

corr(array(2, 3, 4, 3, 6),
       array(-2, -3, -4, -3, -6))

This returns:

{
  "result-set": {
    "docs": [
      {
        "return-value": -1
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 2
      }
    ]
  }
}

The corr function computes the Pearson product-moment correlation of the two arrays. In this case the second array is the exact negation of the first, so the arrays are perfectly negatively correlated and the result is -1.
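For readers who want to see the arithmetic, here is a small Python sketch of the Pearson coefficient using the standard definition (covariance divided by the product of the standard deviations). The function name pearson is illustrative; this is not Solr's implementation.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    if len(xs) != len(ys):
        raise ValueError("arrays must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross products and squared deviations around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    # Note: a constant array makes the denominator zero.
    return sxy / math.sqrt(sxx * syy)


r = pearson([2, 3, 4, 3, 6], [-2, -3, -4, -3, -6])
print(r)
```

Floating-point rounding can leave the printed value a hair away from -1, so compare with a tolerance rather than with exact equality.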

We can perform a simple regression on the arrays:

regress(array(2, 3, 4, 3, 6),
             array(-2, -3, -4, -3, -6))

{
  "result-set": {
    "docs": [
      {
        "return-value": {
          "significance": 0,
          "totalSumSquares": 9.2,
          "R": -1,
          "meanSquareError": 0,
          "intercept": 0,
          "slopeConfidenceInterval": 0,
          "regressionSumSquares": 9.2,
          "slope": -1,
          "interceptStdErr": 0,
          "N": 5
        }
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 9
      }
    ]
  }
}
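The slope, intercept and sum-of-squares fields can be reproduced with ordinary least squares. The Python sketch below assumes the first array is the predictor (x) and the second is the response (y), and it defines the mean square error as the residual sum of squares over N - 2; both are assumptions about the conventions rather than statements about Solr's internals. With these two arrays the result is the same whichever one is treated as x.

```python
def simple_regression(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares: what the fitted line fails to explain.
    residual_ss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return {
        "N": n,
        "slope": slope,
        "intercept": intercept,
        "totalSumSquares": syy,
        "regressionSumSquares": syy - residual_ss,
        "meanSquareError": residual_ss / (n - 2),
    }


fit = simple_regression([2, 3, 4, 3, 6], [-2, -3, -4, -3, -6])
for name, value in fit.items():
    print(name, value)
```

Because the second array is exactly the negation of the first, the line y = -x fits every point, which is why the error terms in the response are all 0.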

All statistical functions in the initial release are backed by Apache Commons Math. The initial release includes a core group of functions that support:

  • Rank transformations
  • Histograms
  • Percentiles
  • Simple regression and predict functions
  • One way ANOVA
  • Correlation
  • Covariance
  • Descriptive statistics
  • Convolution
  • Finding the delay in signals/time series
  • Lagged regression
  • Moving averages
  • Sequence generation
  • Calculating Euclidean distance between arrays
  • Data normalization and scaling
  • Array creation and manipulation functions

Statistical functions can be applied to:
  1.  Time series result sets
  2.  Random sampling result sets
  3.  SQL result sets (Solr's Internal Parallel SQL)
  4.  JDBC result sets (External JDBC Sources)
  5.  K-Nearest Neighbor result sets
  6.  Graph Expression result sets
  7.  Search result sets
  8.  Faceted aggregation result sets
  9.  MapReduce result sets 

Array Math on Solr Result Sets

Let's now explore how we can apply statistical functions on Solr result sets. In the example below we'll correlate arrays of moving averages for two stocks:

let(stockA = sql(stocks, stmt="select closing_price from price_data where ticker='aaa' and ..."),
      stockB = sql(stocks, stmt="select closing_price from price_data where ticker='bbb' and ..."),
      pricesA = col(stockA, closing_price),
      pricesB = col(stockB, closing_price),
      movingA = movingAvg(pricesA, 30),
      movingB = movingAvg(pricesB, 30),
      tuple(correlation=corr(movingA, movingB)))

Let's break down how this expression works:

1) The let expression sets variables and then returns a single output tuple.

2) The first two variables stockA and stockB contain result sets from sql expressions. The sql expressions return tuples with the closing prices for stock tickers aaa and bbb.

3) The next two variables pricesA and pricesB are created by the col function. The col function creates a numeric array from a list of Tuples. In this example pricesA contains the closing prices for stockA and pricesB contains the closing prices for stockB.

4) The next two variables movingA and movingB are created by the movingAvg function. In this example movingA and movingB contain arrays with the moving averages calculated from the pricesA and pricesB arrays.

5) In the final step we output a single Tuple containing the correlation of the movingA and movingB arrays. The correlation is computed using the corr function.
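The steps above can be mirrored in a few lines of Python to show the data flow from price arrays to moving averages to a single correlation value. This is only a sketch: the prices are made-up stand-ins for the sql result sets, the window is 3 instead of 30 to keep the example small, and moving_avg assumes a trailing window that emits a value only once the window is full, which may differ from how Solr's movingAvg handles the edges.

```python
import math


def moving_avg(values, window):
    """Trailing moving average; emits one value per full window."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up closing prices standing in for the stockA and stockB result sets.
prices_a = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
prices_b = [20.0, 19.0, 18.0, 17.0, 16.0, 15.0]

moving_a = moving_avg(prices_a, 3)
moving_b = moving_avg(prices_b, 3)

# Analogous to tuple(correlation=corr(movingA, movingB)).
result = {"correlation": pearson(moving_a, moving_b)}
print(result)
```

Both moving-average arrays must have the same length for the correlation to be defined, so the two queries need to cover the same number of trading days.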

Feature Scaling with Solr Streaming Expressions

Before performing machine learning operations it's often important to scale the feature vectors so they can be compared at the same scale. In...