Wednesday, July 12, 2017

Detrending Time Series Data With Linear Regression in Solr 7

Often when working with time series data there is a linear trend present in the data. For example, if a stock price has been gradually rising over a period of months, you'll see a positive slope in the time series data. This slope over time is the trend. Before performing statistical analysis on the time series data it's often necessary to remove the trend.

Why is a trend problematic? Consider an example where you want to correlate two time series that are trending on a similar slope. Because they both have a similar slope, they will appear to be correlated. But in reality they may be trending for entirely different reasons. To tell if the two time series are actually correlated, you would need to first remove the trends and then perform the correlation on the detrended data.
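
To make this concrete, here is a small sketch using the corr function (covered in the posts below) on two invented series that both trend upward. The month-to-month fluctuations are unrelated, yet the shared trend pushes the correlation close to 1:

let(a=array(10, 13, 15, 18, 22, 24, 27, 31),
    b=array(100, 104, 109, 111, 118, 121, 127, 130),
    tuple(correlation=corr(a, b)))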

Linear Regression 


Linear regression is a statistical tool used to measure the linear relationship between two variables. For example you could use linear regression to determine if there is a linear relationship between age and medical costs. If a linear relationship is found you can use linear regression to predict the value of a dependent variable based on the value of an independent variable.

Linear regression can also be used to remove a linear trend from a time series.

Removing a Linear Trend from a Time Series 


We can remove a linear trend from a time series using the following technique:

  1. Regress the dependent variable over a time sequence. For example if we have 12 months of time series observations the time sequence would be expressed as 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12.
  2. Use the regression analysis to predict a dependent value at each time interval. Then subtract the prediction from the actual value. The difference between actual and predicted value is known as the residual. The residuals array is the time series with the trend removed. You can now perform statistical analysis on the residuals.
This may sound complicated, but an example will make it clearer, and Solr makes it all very easy to do.
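
Here is a minimal, self-contained sketch of the technique using literal arrays (the numbers are invented purely for illustration); the full Solr example follows below:

let(e=sequence(12, 1, 1),
    c=array(100, 112, 119, 135, 140, 158, 162, 171, 185, 190, 204, 215),
    f=regress(e, c),
    g=residuals(f, e, c),
    tuple(residuals=g))

The sequence function generates the time sequence 1 through 12, and the residuals in variable g are the detrended series.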

Example: Exploring the linear relationship between marketing spend and site usage.


In this example we want to explore the linear relationship between marketing spend and website usage. The motivation for this is to determine if higher marketing spend causes higher website usage.

Website usage has been trending upwards for over a year. We have been varying the marketing spend throughout the year to experiment with how different levels of marketing spend impact website usage.

Now we want to regress the marketing spend and the website usage to build a simple model of how usage is impacted by marketing spend. But before we can build this model we must remove the trend from the website usage or the cumulative effect of the trend will mask the relationship between marketing spend and website usage.

Here is the streaming expression:

let(a=timeseries(logs,  
                           q="rec_type:page_view",  
                           field="rec_time", 
                           start="2016-01-01T00:00:00Z", 
                           end="2016-12-31T00:00:00Z", 
                           gap="+1MONTH",  
                           count(*)),
     b=jdbc(connection="jdbc:mysql://...", 
                  sql="select marketing_expense from monthly_expenses where ..."),
     c=col(a, count(*)),
     d=col(b, marketing_expense),
     e=sequence(length(c), 1, 1),
     f=regress(e, c),
     g=residuals(f, e, c),
     h=regress(d, g),
     tuple(regression=h))     

Let's break down what this expression is doing:

  1. The let expression is setting the variables a, b, c, d, e, f, g, h and returning a single result tuple.
  2. Variable a is holding the result tuples from a timeseries function that is querying the logs for monthly usage counts. 
  3. Variable b is holding the result tuples from a jdbc function which is querying an external database for monthly marketing expenses.
  4. Variable c is holding the output from a col function which returns the values in the count(*) field from the tuples stored in variable a. This is an array containing the monthly usage counts.
  5. Variable d is holding the output from a col function which returns the values in the marketing_expense field from the tuples stored in variable b. This is an array containing the monthly marketing expenses.
  6. Variable e holds the output of the sequence function which returns an array of numbers the same length as the array in variable c. The sequence starts from 1 and has a stride of 1. 
  7. Variable f holds the output of the regress function which returns a regression result. The regression is performed with the sequence in variable e as the independent variable and monthly usage counts in variable c as the dependent variable.
  8. Variable g holds the output of the residuals function which returns the residuals from applying the regression result to the data sets in variables e and c. The residuals are the monthly usage counts with the trend removed.
  9. Variable h holds the output of the regress function which returns a regression result. The regression is being performed with the marketing expenses (variable d) as the independent variable. The residuals from the monthly usage regression (variable g) are the dependent variable.  This regression result will describe the linear relationship between marketing expenses and site usage.
  10. The output tuple is returning the regression result.
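
As a follow-up, the regression result could be fed to the predict function (listed among the statistical functions in the posts below) to estimate the detrended usage for a given spend. Here is a hedged sketch with invented numbers and a hypothetical monthly spend of 50000:

let(d=array(30000, 42000, 35000, 50000, 28000, 45000),
    g=array(-1200, 800, -300, 1500, -1700, 900),
    h=regress(d, g),
    p=predict(h, 50000),
    tuple(regression=h, predictedResidual=p))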
     

Sunday, July 9, 2017

One-way ANOVA and Rank Transformation with Solr's Streaming Expressions

In the previous blog we explored the use of random sampling and histograms to pick a threshold for point-wise anomaly detection. Point-wise anomaly detection is a good place to start, but alerting based on a single anomalous point may lead to false alarms. What we need is a statistical technique that can help confirm that the problem goes beyond a single point.

Spotting Differences In Sets of Data


The specific example in the last blog dealt with finding individual log records with unusually high response times. In this blog we'll be looking for sets of log records with unusually high response times.

One approach to doing this is to compare the means of response times between different sets of data. For this we'll use a statistical approach called one-way ANOVA.

One-way ANOVA (Analysis of Variance)


The Streaming Expression statistical library includes the anova function. The anova function is used to determine if the difference in means between two or more sample sets is statistically significant.

In the example below we'll use ANOVA to compare two samples of data:

  1. A sample taken from a known period of normal response times.
  2. A sample taken before and after the point-wise anomaly.
If the difference in means between the two sets is statistically significant we have evidence that the data around the anomalous data point is also unusual.


Accounting For Outliers


We already know that sample #2 has at least one outlier point. A few large outliers could skew the mean of sample #2 and bias the ANOVA calculation.

In order to determine if sample set #2 as a whole has a higher mean than sample set #1, we need a way to decrease the effect of outliers on the ANOVA calculation.

Rank Transformation


One approach for smoothing outliers is to first rank transform the data sets before running the ANOVA. Rank transformation transforms each value in the data to an ordinal ranking.

The Streaming Expression function library includes the rank function which performs the rank transformation.
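
As a quick illustration with made-up numbers, the rank function can be applied directly to a literal array and returns the ordinal rank of each value (here it would return 2, 4, 1, 5, 3):

rank(array(203, 512, 180, 945, 230))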

In order to compare the data sets following the rank transform, we'll need to perform the rank transformation on both sets of data as if they were one contiguous data set. Streaming Expressions provides array manipulation functions that will allow us to do this.


The Streaming Expression


In the expression below we'll perform the ANOVA:

let(a=random(logs,
                       q="rec_time:[2017-05 TO 2017-06]",
                       fq="file_name:index.html",
                       fl="response_time",
                       rows="7000"),
     b=random(logs,
                       q="rec_time:[NOW-10MINUTES TO NOW]",
                       fq="file_name:index.html",
                       fl="response_time",
                       rows="7000"),
     c=col(a, response_time),
     d=col(b, response_time),
     e=addAll(c, d),
     f=rank(e),
     g=copyOfRange(f, 0, length(c)),
     h=copyOfRange(f, length(c), length(f)),
     i=anova(g, h),
     tuple(results=i))

Let's break down what this expression is doing:

  1. The let expression is setting the variables a, b, c, d, e, f, g, h, i and returning a single response tuple.
  2. The variable a holds the tuples from a random sample of response times from a period of normal response times (sample set #1).
  3. The variable b holds the tuples from a random sample of response times before and after the anomalous data point (sample set #2).
  4. Variables c and d hold results of the col function which returns a column of numbers from a list of tuples.  Sample set #1 is in variable c. Sample set #2 is in variable d.
  5. Variable e holds the result of the addAll function which is returning a single array containing the contents of variables c and d.
  6. Variable f holds the results of the rank function which performs the rank transformation on variable e.
  7. Variables g and h hold the values of copyOfRange functions. The copyOfRange function is used to separate the single rank transformed array back into two data sets. Variable g holds the rank transformed values of sample set #1. Variable h holds the rank transformed values of sample set #2.
  8. Variable i holds the result of the anova function which is performing the ANOVA on variables g and h.
  9. The response tuple has a single field called results that contains the results of the ANOVA on the rank transformed data sets.

Interpreting the ANOVA p-value


The response from the Streaming Expression above looks like this:

{ "result-set": { "docs": [ { "results": { "p-value": 0.0008137581457111631, "f-ratio": 38.4 } }, { "EOF": true, "RESPONSE_TIME": 789 } ] } }


The p-value of 0.0008 is the probability of seeing a difference in means at least this large if the two sample sets were actually drawn from populations with the same mean.

Based on this p-value we can say with a very high level of confidence that there is a statistically significant difference in the means between the two sample sets.

Wednesday, June 28, 2017

Random Sampling, Histograms and Point-wise Anomaly Detection In Solr

In the last blog we started to explore Streaming Expression's new statistical programming functions. The last blog described a statistical expression that retrieved two data sets with SQL expressions, computed the moving averages for the data sets and correlated the moving averages.

In this blog we'll explore random sampling, histograms and rule based point-wise anomaly detection.

Turning Mountains into Mole Hills with Random Sampling


Random sampling is one of the most powerful concepts in statistics. Random sampling involves taking a smaller random sample from a larger data set, which can be used to infer statistics about the larger data set.

Random sampling has been used for decades to deal with the problem of not having access to the entire data set. For example taking a poll of everyone in a large population may not be feasible. Taking a random sample of the population is likely much more feasible.

In the big data age we are often presented with a different problem: too much data. It turns out that random sampling helps solve this problem as well. Instead of having to process the entire massive data set we can select a random sample of the data set and infer statistics about the larger data set.

Note: It's important to understand that working with random samples does introduce potential statistical error. There are formulas for determining the margin of error given specific sample sizes. This link also provides a sample size table which shows margins of error for specific sample sizes.
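
For reference, the usual worst-case margin of error for a simple random sample of size n at 95% confidence is approximately 1.96 * sqrt(0.25 / n); for the 17,000 record samples used below that works out to roughly 0.75%.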


Solr is a Powerful Random Sampling Engine


Slicing, dicing and creating random samples from large data sets are some of the primary capabilities needed to tackle big data statistical problems. Solr happens to be one of the best engines in the world for doing this type of work.

Solr has had the ability to select random samples from search results for a long time. The new statistical syntax in Streaming Expressions makes this capability much more powerful. Now Solr has the power to select random samples from large distributed data sets and perform statistical analysis on the random samples.


The Random Streaming Expression


The random Streaming Expression retrieves a pseudo random set of documents that match a query. Each time the random expression is run it will return a different set of pseudo random records.

The syntax for the random expression is:

random(collection1,  q="solr query",  fl="fielda, fieldb", rows="17000")

This simple but powerful expression selects 17,000 pseudo random records that match the query from a Solr Cloud collection.

Understanding Data Distributions with Histograms


Another important statistical tool is the histogram. Histograms are used to understand the distribution of a data set. Histograms divide a data set into bins and provide statistics about each bin. By inspecting the statistics of each bin you can understand the distribution of the data set.

The hist Function


Solr's Streaming Expression library has a hist function which returns a histogram for an array of numbers.

The hist function has a very simple syntax:

hist(col, 10)

The function above takes two parameters:

  1. An array of numbers
  2. The number of bins in the histogram

Creating a Histogram from a Random Sample


Using the Streaming Expression statistical syntax we can combine random sampling and histograms to understand the distribution of large data sets.

In this example we'll work with a sample data set of log records. Our goal is to create a histogram of the response times for the home page.

Here is the basic syntax:

let(a=random(logs, q="file_name:index.html", fl="response_time", rows="17000"),
     b=col(a, response_time),
     c=hist(b, 10),
     tuple(hist=c))

Let's break down what this expression is doing:

1) The let expression is setting variables a, b and c and then returning a single response tuple.

2) Variable a stores the result tuples from the random streaming expression. The random streaming expression is returning 17000 pseudo random records from the logs collection that match the query file_name:index.html.

3) Variable b stores the output of the col function. The col function returns a column of numbers from a list of tuples. In this case the list of tuples is held in the variable a. The field name is response_time.

4) Variable c stores the output of the hist function. The hist function returns a histogram from a column of numbers. In this case the column of numbers is stored in variable b. The number of bins in the histogram is 10.

5) The tuple expression returns a single output tuple with the hist field set to variable c, which contains the histogram.

The output from this expression is a histogram with 10 bins describing the random sample of home page response times. Descriptive statistics are provided for each bin.

By looking at the histogram we can gain a full understanding of the distribution of the data. Below is a sample histogram. Note that N is the number of observations that are in the bin.

{ "result-set": { "docs": [ { "hist": [ { "min": 105.80360488681794, "max": 184.11423669457605, "mean": 158.07101244548903, "var": 676.6416949523991, "sum": 1106.4970871184232, "stdev": 26.012337360421864, "N": 7 }, { "min": 187.1450299482844, "max": 262.86798264568415, "mean": 235.8519937762809, "var": 400.7486779625581, "sum": 31368.315172245355, "stdev": 20.01870819914607, "N": 133 }, { "min": 263.6907639320808, "max": 341.7723630856346, "mean": 312.0580142849335, "var": 428.02686585995957, "sum": 259944.32589934967, "stdev": 20.688810160566497, "N": 833 }, { "min": 342.0007054044787, "max": 420.508689773685, "mean": 387.10102356966337, "var": 497.5116682425222, "sum": 1008398.166398972, "stdev": 22.30496958622724, "N": 2605 }, { "min": 420.5348042867488, "max": 499.173632576587, "mean": 461.5725595026505, "var": 505.85122370654324, "sum": 2267244.4122770214, "stdev": 22.491136558798964, "N": 4912 }, { "min": 499.23963590242806, "max": 577.8765472307315, "mean": 535.9950922008038, "var": 500.5743269892825, "sum": 2589928.2855142825, "stdev": 22.373518431156118, "N": 4832 }, { "min": 577.9106064943256, "max": 656.5613165857329, "mean": 611.5787667510084, "var": 481.60546877783116, "sum": 1647593.1976272168, "stdev": 21.945511358312686, "N": 2694 }, { "min": 656.5932936523765, "max": 734.7738394881361, "mean": 685.4426886363782, "var": 451.02322430952523, "sum": 573715.5303886493, "stdev": 21.237307369568423, "N": 837 }, { "min": 735.9448445737111, "max": 812.751632738434, "mean": 762.5240648996678, "var": 398.4721757713377, "sum": 102178.22469655548, "stdev": 19.961767851854646, "N": 134 }, { "min": 816.2895922221702, "max": 892.6066799061479, "mean": 832.5779161364087, "var": 481.68131277525964, "sum": 10823.512909773315, "stdev": 21.94723929735263, "N": 13 } ] }, { "EOF": true, "RESPONSE_TIME": 986 } ] } }

Point-wise Anomaly Detection


Point-wise anomaly detection deals with finding a single anomalous data point.

Based on the histogram we can devise a rule for detecting when an anomalous response time appears in the logs. For this example let's set a rule that any response time that falls within the last two bins is an anomaly. The specific rule would be:

response_time > 735

Creating an Alert With the Topic Streaming Expression


Now that we have a rule for detecting anomalous response times, we can use the topic expression to return all new records in the logs collection that match the anomaly rule. The topic expression would look like this:

topic(checkpoints,
         logs,
         q="file_name:index.html AND response_time:[735 TO *]",
         fl="id, response_time",
         id="response_anomalies")

The expression above provides one time delivery of all records that match the anomaly rule. Notice the anomaly rule is the query for the topic expression. This is a very efficient approach for retrieving just the anomaly records.

We can wrap the topic in an update and daemon expression to run the topic at intervals and store anomaly records in another collection. The collection of anomalies can then be used for alerting.
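
A minimal sketch of that wiring is shown below; the anomalies collection name and the 60 second run interval are assumptions made for illustration:

daemon(id="response_anomaly_alerts",
       runInterval="60000",
       update(anomalies,
              batchSize=100,
              topic(checkpoints,
                    logs,
                    q="file_name:index.html AND response_time:[735 TO *]",
                    fl="id, response_time",
                    id="response_anomalies")))

Each time the daemon runs, the topic returns only the new records that match the anomaly rule and the update expression indexes them into the anomalies collection.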

Tuesday, May 30, 2017

Statistical programming with Solr Streaming Expressions

In the previous blog we explored the new timeseries function and introduced the syntax for math expressions. In this blog we'll dive deeper into math expressions and explore the statistical programming functions rolling out in the next release.

Let's first learn how the statistical expressions work and then look at how we can perform statistical analysis on retrieved result sets.

Array Math


The statistical functions create, manipulate and perform math on arrays. One of the basic things that we can do is create an array with the array function:

array(2, 3, 4, 3, 6)

The array function simply returns an array of numbers. If we send the array function above to Solr's stream handler it responds with:

{ "result-set": { "docs": [ { "return-value": [ 2, 3, 4, 3, 6 ] }, { "EOF": true, "RESPONSE_TIME": 1 } ] } }

Notice that the stream handler returns a single Tuple with the return-value field pointing to the array. This is how Solr responds when given a statistical function to evaluate.

This is a new behavior for Solr. In the past the stream handler always returned streams of Tuples. Now the stream handler can directly perform mathematical functions.

Let's explore a few more of the new array math functions. We can manipulate arrays in different ways. For example we can reverse the array like this:

rev(array(2, 3, 4, 3, 6))

Solr returns the following from this expression:

{ "result-set": { "docs": [ { "return-value": [ 6, 3, 4, 3, 2 ] }, { "EOF": true, "RESPONSE_TIME": 0 } ] } }

We can describe the array:

describe(array(2, 3, 4, 3, 6))

{ "result-set": { "docs": [ { "return-value": { "sumsq": 74, "max": 6, "var": 2.3000000000000003, "geometricMean": 3.365865436338599, "sum": 18, "kurtosis": 1.4555765595463175, "N": 5, "min": 2, "mean": 3.6, "popVar": 1.8400000000000003, "skewness": 1.1180799331493778, "stdev": 1.5165750888103102 } }, { "EOF": true, "RESPONSE_TIME": 31 } ] } }

Now we see our first bit of statistics. The describe function provides descriptive statistics for the array.

We can correlate arrays:

corr(array(2, 3, 4, 3, 6),
       array(-2, -3, -4, -3, -6))

This returns:

{ "result-set": { "docs": [ { "return-value": -1 }, { "EOF": true, "RESPONSE_TIME": 2 } ] } }


The corr function performs the Pearson Product Moment correlation on the two arrays. In this case the arrays are perfectly negatively correlated.

We can perform a simple regression on the arrays:

regress(array(2, 3, 4, 3, 6),
             array(-2, -3, -4, -3, -6))

{ "result-set": { "docs": [ { "return-value": { "significance": 0, "totalSumSquares": 9.2, "R": -1, "meanSquareError": 0, "intercept": 0, "slopeConfidenceInterval": 0, "regressionSumSquares": 9.2, "slope": -1, "interceptStdErr": 0, "N": 5 } }, { "EOF": true, "RESPONSE_TIME": 9 } ] } }


All statistical functions in the initial release are backed by Apache Commons Math. The initial release includes a core group of functions that support:

  • Rank transformations
  • Histograms
  • Percentiles
  • Simple regression and predict functions
  • One-way ANOVA
  • Correlation
  • Covariance
  • Descriptive statistics
  • Convolution
  • Finding the delay in signals/time series
  • Lagged regression
  • Moving averages
  • Sequence generation
  • Calculating Euclidean distance between arrays
  • Data normalization and scaling
  • Array creation and manipulation functions
Statistical functions can be applied to:
  1.  Time series result sets
  2.  Random sampling result sets
  3.  SQL result sets (Solr's Internal Parallel SQL)
  4.  JDBC result sets (External JDBC Sources)
  5.  K-Nearest Neighbor result sets
  6.  Graph Expression result sets
  7.  Search result sets
  8.  Faceted aggregation result sets
  9.  MapReduce result sets 
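
For example, here is a hedged sketch applying a few of the functions listed above to literal arrays with invented numbers:

let(a=array(1, 2, 3, 4, 5),
    b=array(10, 20, 30, 40, 55),
    tuple(percentile95=percentile(b, 95),
          covariance=cov(a, b),
          movingAverage=movingAvg(b, 2)))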


Array Math on Solr Result Sets


Let's now explore how we can apply statistical functions on Solr result sets. In the example below we'll correlate arrays of moving averages for two stocks:

let(stockA = sql(stocks, stmt="select closing_price from price_data where ticker='aaa' and ..."),
      stockB = sql(stocks, stmt="select closing_price from price_data where ticker='bbb' and ..."),
      pricesA = col(stockA, closing_price),
      pricesB = col(stockB, closing_price),
      movingA = movingAvg(pricesA, 30),
      movingB = movingAvg(pricesB, 30),
      tuple(correlation=corr(movingA, movingB)))

Let's break down how this expression works:

1) The let expression is setting variables and then returning a single output tuple.

2) The first two variables stockA and stockB contain result sets from sql expressions. The sql expressions return tuples with the closing prices for stock tickers aaa and bbb.

3) The next two variables pricesA and pricesB are created by the col function. The col function creates a numeric array from a list of Tuples. In this example pricesA contains the closing prices for stockA and pricesB contains the closing prices for stockB.

4) The next two variables movingA and movingB are created by the movingAvg function. In this example movingA and movingB contain arrays with the moving averages calculated from the pricesA and pricesB arrays.

5) In the final step we output a single Tuple containing the correlation of the movingA and movingB arrays. The correlation is computed using the corr function.

Monday, May 1, 2017

Exploring Solr's New Time Series and Math Expressions

In Solr 6.6 the Streaming Expression library has added support for time series and math expressions. This blog will walk through an example of how to use these exciting features.


Time Series


Time series aggregations are supported through the timeseries Streaming Expression. The timeseries expression uses the JSON Facet API under the covers, so the syntax will be familiar if you've used Solr date range syntax.

Here is the basic syntax:

timeseries(collection, 
                 field="test_dt", 
                 q="*:*",
                 start="2012-05-01T00:00:00Z",
                 end="2012-06-30T23:59:59Z",
                 gap="+1MONTH", 
                 count(*))

When sent to Solr this expression will return results that look like this:

{ "result-set": { "docs": [ { "test_dt": "2012-05-01T00:00:00Z", "count(*)": 247007 }, { "test_dt": "2012-06-01T00:00:00Z", "count(*)": 247994 }, { "EOF": true, "RESPONSE_TIME": 9 } ] } }

Solr takes care of the date math and builds the time range buckets automatically. Solr also fills in any gaps in the range with buckets automatically and adds zero aggregation values. Any Solr query can be used to select the records. 

The supported aggregations are: count(*), sum(field), avg(field), min(field), max(field).

The timeseries function is quite powerful on its own, but it grows in power when combined with math expressions.


Math Expressions


In Solr 6.6 the Streaming Expression library also adds math expressions. This is a larger topic than one blog can cover, but I'll hit some of the highlights by slowly building up a math expression.


Let and Get


The fun begins with the let and get expressions. let is used to assign tuple streams to variables and get is used to retrieve the stream later in the expression. Here is the most basic example:


let(a=timeseries(collection, field="test_dt", q="*:*",
                          start="2012-05-01T00:00:00Z",
                          end="2012-06-30T23:59:59Z",
                          gap="+1MONTH", 
                          count(*)),
      get(a))

In the example above the timeseries expression is being set to the variable a. Then the get expression is used to turn the variable a back into a stream.

The let expression allows you to set any number of variables, and assign a single Streaming Expression to run the program logic. The expression that runs the program logic has access to the variables. The basic structure of let is:

let(a=expr,
     b=expr,
     c=expr,
     expr)

The first three name/value pairs are setting variables and the final expression is the program logic that will use the variables.

If we send the let expression with the timeseries to Solr it returns with:

{ "result-set": { "docs": [ { "test_dt": "2012-05-01T00:00:00Z", "count(*)": 247007 }, { "test_dt": "2012-06-01T00:00:00Z", "count(*)": 247994 }, { "EOF": true, "RESPONSE_TIME": 9 } ] } }

This is the exact same response we would get if we sent the timeseries expression alone. That's because all we did was assign the expression to a variable and use get to stream out the results.

Implementation Note: Under the covers the let expression sets each variable by executing the expressions and adding the tuples to a list. It then maps the variable name to the list in memory so that it can be retrieved by the variable name. So, in memory, streams are converted to lists of tuples.


The Select Expression


The select expression has been around for a long time, but it now plays a central role in math expressions. The select expression wraps another expression and applies a list of Stream Evaluators to each tuple. Stream Evaluators perform operations on the tuples. 

The Streaming Expression library now includes a base set of numeric evaluators for performing math on tuples. Here is an example of select in action:

let(a=timeseries(collection, field="test_dt", q="*:*",
                          start="2012-05-01T00:00:00Z",
                          end="2012-06-30T23:59:59Z",
                          gap="+1MONTH", 
                          count(*)),
      b=select(get(a),  
                     mult(-1, count(*)) as negativeCount,
                     test_dt),
      get(b))

In the example above we've set a timeseries to variable a.

Then we are doing something really interesting with variable b. We are transforming the timeseries tuples stored in variable a with the select expression. 

The select expression is reading all the tuples from the get(a) expression and applying the mult stream evaluator to each tuple. The mult evaluator is multiplying the value in the count(*) field of each tuple by -1 and assigning the result to the field negativeCount. Select is also outputting the test_dt field from the tuples.

The transformed tuples are then assigned to variable b.

Then get(b) is used to output the transformed tuples. If you send this expression to Solr it outputs:

{ "result-set": { "docs": [ { "test_dt": "2012-05-01T00:00:00Z", "negativeCount": -247007 }, { "test_dt": "2012-06-01T00:00:00Z", "negativeCount": -247994 }, { "EOF": true, "RESPONSE_TIME": 9 } ] } }

Implementation Note: The get expression creates new tuples when it streams tuples from a variable. So you never have to worry about side effects. In the example above variable a was unchanged when the tuples were transformed and assigned to variable b.



The Tuple Expression


The basic data structure of Streaming Expressions is a Tuple. A Tuple is a set of name/value pairs. In the 6.6 release of Solr there is a Tuple expression which allows you to create your own output tuple. Here is the sample syntax:

let(a=timeseries(collection, field="test_dt", q="*:*",
                          start="2012-05-01T00:00:00Z",
                          end="2012-06-30T23:59:59Z",
                          gap="+1MONTH", 
                          count(*)),
      b=select(get(a),  
                     mult(-1, count(*)) as negativeCount,
                     test_dt),
      tuple(seriesA=a,
               seriesB=b))

The example above defines an output tuple with two fields: seriesA and seriesB, both of these fields have been assigned a variable. Remember that variables a and b are pointers to lists of tuples. This is exactly how they will be output by the tuple expression.

If you send the expression above to Solr it will respond with:

{ "result-set": { "docs": [ { "seriesA": [ { "test_dt": "2012-05-01T00:00:00Z", "count(*)": 247007 }, { "test_dt": "2012-06-01T00:00:00Z", "count(*)": 247994 } ], "seriesB": [ { "test_dt": "2012-05-01T00:00:00Z", "negativeCount": -247007 }, { "test_dt": "2012-06-01T00:00:00Z", "negativeCount": -247994 } ] }, { "EOF": true, "RESPONSE_TIME": 7 } ] } }

Now we have both the original time series and the transformed time series in the output.

The Col Evaluator


Lists of tuples are nice, but for performing many math operations what we need are columns of numbers. There is a special evaluator called col which can be used to pull out a column of numbers from a list of tuples.

Here is the basic syntax:

let(a=timeseries(collection, field="test_dt", q="*:*",
                          start="2012-05-01T00:00:00Z",
                          end="2012-06-30T23:59:59Z",
                          gap="+1MONTH", 
                          count(*)),
      b=select(get(a),  
                     mult(-1, count(*)) as negativeCount,
                     test_dt),
      c=col(a, count(*)),
      d=col(b, negativeCount),
      tuple(seriesA=a,
               seriesB=b,
               columnC=c,
               columnD=d))

Now we have two new variables c and d, both pointing to a col expression. The col expression takes two parameters. The first parameter is a variable pointing to a list of tuples. The second parameter is the field to pull the column data from.

Also notice that there are two new fields in the output tuple that output the columns. If you send this expression to Solr it responds with:

{ "result-set": { "docs": [ { "seriesA": [ { "test_dt": "2012-05-01T00:00:00Z", "count(*)": 247007 }, { "test_dt": "2012-06-01T00:00:00Z", "count(*)": 247994 } ], "seriesB": [ { "test_dt": "2012-05-01T00:00:00Z", "negativeCount": -247007 }, { "test_dt": "2012-06-01T00:00:00Z", "negativeCount": -247994 } ], "columnC": [ 247007, 247994 ], "columnD": [ -247007, -247994 ] }, { "EOF": true, "RESPONSE_TIME": 6 } ] } }

Now the columns appear in the output.

Performing Math on Columns


We've seen already that there are numeric Stream Evaluators that work on tuples in the select expression.

Some numeric evaluators also work on columns. An example of this is the corr evaluator which performs the Pearson product-moment correlation calculation on two columns of numbers.

Here is the sample syntax:

let(a=timeseries(collection, field="test_dt", q="*:*",
                          start="2012-05-01T00:00:00Z",
                          end="2012-06-30T23:59:59Z",
                          gap="+1MONTH", 
                          count(*)),
      b=select(get(a),  
                     mult(-1, count(*)) as negativeCount,
                     test_dt),
      c=col(a, count(*)),
      d=col(b, negativeCount),
      tuple(seriesA=a,
               seriesB=b,
               columnC=c,
               columnD=d,
               correlation=corr(c, d)))

Notice that the tuple now has a new field called correlation with the output of the corr function set to it. If you send this to Solr it responds with:

{ "result-set": { "docs": [ { "seriesA": [ { "test_dt": "2012-05-01T00:00:00Z", "count(*)": 247007 }, { "test_dt": "2012-06-01T00:00:00Z", "count(*)": 247994 } ], "seriesB": [ { "test_dt": "2012-05-01T00:00:00Z", "negativeCount": -247007 }, { "test_dt": "2012-06-01T00:00:00Z", "negativeCount": -247994 } ], "columnC": [ 247007, 247994 ], "columnD": [ -247007, -247994 ], "correlation": -1 }, { "EOF": true, "RESPONSE_TIME": 6 } ] } }


Opening the Door to the Wider World of Mathematics


The syntax described in this blog opens the door to more sophisticated mathematics. For example the corr function can be used as a building block for cross-correlation, auto-correlation and auto-regression functions. Apache Commons Math includes machine learning algorithms such as clustering and regression and data transformations such as Fourier transforms that work on columns of numbers.

In the near future the Streaming Expressions math library will include these functions and many more.

Wednesday, April 19, 2017

Having a chat with Solr using the new echo Streaming Expression

In the next release of Solr, there is a new and interesting Streaming Expression called echo.

echo is a very simple expression with the following syntax:

echo("Hello World")

If we send this to Solr, it responds with:

{ "result-set": { "docs": [ { "echo": "Hello World" }, { "EOF": true, "RESPONSE_TIME": 0 } ] } }

Solr simply echoes the text back, but maybe it feels a bit like Solr is talking to us. Like there might be someone there.

Well it turns out that this simple exchange is the first step towards a more meaningful conversation.

Let's take another step:

classify(echo("Customer service is just terrible!"),
             model(models, id="sentiment"),
             field="echo",
             analyzerField="message_t")

Now we are echoing text to a classifier.  The classify function is pointing to a model stored in Solr that does sentiment analysis based on the text. Notice that the classify function has an analyzerField parameter. This is a Lucene/Solr analyzer used by the classify function to pull the features from the text (See this blog for more details on the classify function).

If we send this to Solr we may get a response like this:

{ "result-set": { "docs": [ { "echo": "Customer service is just terrible!",
"probability_d":0.94888 }, { "EOF": true, "RESPONSE_TIME": 0 } ] } }

The probability_d field is the probability that the text has a negative sentiment. In this case there was a 94% probability that the text was negative.

Now Solr knows something about what's being said. We can wrap other Streaming Expressions around this to take actions or begin to formulate a response.

But we really don't yet have enough information to make a very informed response.

We can take this a bit further.

Consider this expression:

select(echo("Customer service is just terrible!"),
           analyze(echo, analyzerField) as expr_s)

The expression above uses the select expression to echo the text to the analyze Stream Evaluator. The analyze Stream Evaluator applies a Lucene/Solr analyzer to the text and returns a token stream. But in this case it returns a single token which is a Streaming Expression.

(See this blog for more details on the analyze Stream Evaluator)

In order to make this work you would define the final step of the analyzer chain as a token filter that builds a Streaming Expression based on the natural language parsing done earlier in the analyzer chain.

Now we can wrap this construct in the new eval expression:

eval(select(echo("Customer service is just terrible!"),
                  analyze(echo, analyzerField) as expr_s))

The eval expression will compile and run the Streaming Expression created by the analyzer.  It will also emit the tuples that are emitted by the compiled expression. The tuples emitted are the response to the natural language request.

The heavy lifting is done in the analysis chain which performs the NLP and generates the Streaming Expression response.

Streaming Expressions as an AI Language

Before Streaming Expressions existed Dennis Gove shared an email with me containing his initial design for the Streaming Expression syntax. The initial syntax used Lisp-like S-Expressions. I took one look at the S-Expressions and realized we were building an AI language. I'll get into more detail about how this syntax ties into AI shortly, but first a little more history on Streaming Expressions.

The S-Expressions were replaced with the more familiar function syntax that Streaming Expressions has today. This decision was made by Dennis and Steven Bower. It turned out to be the right call because we now have a more familiar syntax than Lisp, but we also kept many of Lisp's most important qualities.

Dennis contributed the Streaming Expression parser and I began looking for something interesting to do with it. The very first thing I tried to do with Streaming Expressions was to re-write SQL queries as Streaming Expressions for the Parallel SQL interface. For this project a SQL parser was used to parse the queries and then a simple planner was built that generated Streaming Expressions to implement the physical query plan.

This was an important proving ground for Streaming Expressions for a number of reasons. It proved that Streaming Expressions could provide the functionality needed to implement the SQL query plans. It proved that Streaming Expressions could push functionality down into the search engine and also rise above the search engine using MapReduce when needed.

Most importantly from an AI standpoint it proved that we could easily generate Streaming Expressions programmatically. This was one of the key features that made Lisp a useful AI Language. The reason that Streaming Expressions are so easily generated is that the syntax is extremely regular. There are only nested functions. And because Streaming Expressions have an underlying Java object representation, we didn't have to do any String manipulation. We could work directly with the Object tree structure to build the expressions.

Why is code generation important for AI? One of the reasons is shown earlier in this blog. A core AI use case is to respond to natural language requests. One approach to doing this is to analyze the text request and then generate code to implement a response. In many ways it's similar to the problem of translating SQL to a physical query plan.

In a more general sense code generation is important in AI because you're dealing with many unknowns so it can be difficult to code everything up front. Sometimes you may need to generate logic on the fly.

Domain Specific Languages

Lisp has the capability of adapting its syntax for specific domains through its powerful macro feature. Streaming Expressions has this capability as well, but it does it in a different way.

Each Streaming Expression is implemented in Java under the covers. Each Streaming Expression is responsible for parsing its own parameters. This means you can have Streaming Expressions that invent their own little languages. The select expression is a perfect example of this.

The basic select expression looks like this:

select(expr, fielda as outField)

This select reads tuples from a stream and outputs fielda as outField. The Streaming Expression parser has no concept of the word "as". This is specific to the select expression and the select expression handles the parsing of "as".

The reason why this works is that under the covers each Streaming Expression sees its parameters as a list that it can manipulate any way it wants.

Embedded In a Search Engine

Having an AI language embedded in a search engine is a huge advantage. It allows expressions to leverage vast amounts of information in interesting ways. The inverted index already has important statistics about the text which can be used for machine learning. Search engines have strong facilities for working with text (tokenizers, filters etc..) and in recent years they've become powerful column stores for numeric calculations. They also have mature content ingestion and parallel query frameworks.

Now there is a language that ties it all together.

Thursday, March 30, 2017

Streaming NLP is coming in Solr 6.6

Solr 6.5 is out now, so it's time to start thinking about the next release. One of the interesting features coming in Solr 6.6 is Streaming NLP. This exciting new feature is already committed and waiting for release. This blog will describe how Streaming NLP works.

The analyze Stream Evaluator


One of the features added in Solr 6.5 was Stream Evaluators. Stream Evaluators perform operations on Tuples in the stream. There is already a rich set of math and boolean Stream Evaluators in Solr 6.5 and more are coming in Solr 6.6. The math and boolean Stream Evaluators allow you to build complex boolean logic and mathematical formulas on Tuples in the stream.

Solr 6.6 also has a new Stream Evaluator, called analyze, that works with text. The analyze evaluator applies a Lucene/Solr analyzer to a text field in the Tuples and returns a list of tokens produced by the analyzer. The tokens can then be used to annotate Tuples or streamed out as Tuples. We'll show examples of both approaches later in the blog.

But it's useful to talk about the power behind Lucene/Solr analyzers first. Lucene/Solr has a large set of analyzers that tokenize different languages and apply filters that transform the token stream. The "analyzer chain" design allows you to chain tokenizers and filters together to perform very powerful text transformations and extractions.

The analysis chain also provides a pluggable API for adding new NLP tokenizers and filters to Solr. New tokenizers and filters can be added and then layered with existing tokenizers and filters in interesting ways. New NLP analysis chains can then be used both during indexing and with Streaming NLP.

The cartesianProduct Streaming Expression


The cartesianProduct Streaming Expression is also new in Solr 6.6. The cartesianProduct expression emits a stream of Tuples from a single Tuple by creating a cartesian product from a multi-valued field or a text field. The analyze Stream Evaluator is used with the cartesianProduct Streaming Expression to create a cartesian product from a text field.

Here is a very simple example:

For this example we have indexed a single record in Solr with an id and text field called body:

id: 1
body: "c d e f g"

The following expression will create a cartesian product from this Tuple:

cartesianProduct(search(collection, q="id:1", fl="id, body", sort="id desc"),
                              analyze(body, analyzerField) as outField)

First let's look at what this expression is doing then look at the output.

The cartesianProduct expression is wrapping a search expression and an analyze Stream Evaluator. The cartesianProduct expression reads the Tuples returned by the search expression and applies the analyze Stream Evaluator to each Tuple. (Note that the cartesianProduct expression can read Tuples from any Streaming Expression.)

The analyze Stream Evaluator is taking the text from the body field in the Tuple and is applying an analyzer found in the schema which is pointed to by the analyzerField parameter.

The cartesianProduct function emits a single Tuple for each token produced by the analyzer. For example if we have a basic white space tokenizing analyzer the Tuples emitted would be:

id: 1
outField: c

id: 1
outField: d

id: 1
outField: e

id: 1
outField: f

id: 1
outField: g

Creating Entity Graphs


The Tuples emitted by the cartesianProduct and the analyze evaluator can be saved to another Solr Cloud collection with the update stream. This allows you to build graphs from extracted entities that can then be walked with Solr Graph Expressions.
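
A minimal sketch of that flow is below; the entityGraph collection name and the analyzerField value are assumptions made for illustration:

update(entityGraph,
       batchSize=250,
       cartesianProduct(search(collection, q="id:1", fl="id, body", sort="id desc"),
                        analyze(body, analyzerField) as outField))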

Annotating Tuples


The analyze Stream Evaluator can also be used with the select Streaming Expression to annotate Tuples with tokens extracted by an analyzer. Here is the sample syntax:

select(search(collection, q="id:1", fl="id, body", sort="id desc"),
          id,
          analyze(body, analyzerField) as outField)

This will add a field to each Tuple which will contain the list of tokens extracted by the analyzer. The update function can be used to save the annotated Tuples to another Solr Cloud collection.

Scaling Up


Solr's parallel batch and executor framework can be used to apply a massive amount of computing power to perform NLP on extremely large data sets. You can read about the parallel batch and the executor framework in these blogs:


