Wednesday, June 28, 2017

Random Sampling, Histograms and Point-wise Anomaly Detection In Solr

In the last blog we started to explore Streaming Expression's new statistical programming functions. The last blog described a statistical expression that retrieved two data sets with SQL expressions, computed the moving averages for the data sets and correlated the moving averages.

In this blog we'll explore random sampling, histograms and rule based point-wise anomaly detection.

Turning Mountains into Mole Hills with Random Sampling

Random sampling is one of the most powerful concepts in statistics. Random sampling involves taking a smaller random sample from a larger data set, which can be used to infer statistics about the larger data set.

Random sampling has been used for decades to deal with the problem of not having access to the entire data set. For example, polling everyone in a large population may not be feasible, while polling a random sample of that population usually is.

In the big data age we are often presented with a different problem: too much data. It turns out that random sampling helps solve this problem as well. Instead of having to process the entire massive data set we can select a random sample of the data set and infer statistics about the larger data set.

Note: It's important to understand that working with random samples introduces potential statistical error. There are formulas for determining the margin of error for a given sample size. This link also provides a sample size table which shows margins of error for specific sample sizes.
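As a rough illustration of the note above, the sketch below computes the margin of error for an estimated proportion with the standard normal-approximation formula z * sqrt(p * (1 - p) / n). The 95% confidence level (z = 1.96), the worst-case proportion p = 0.5 and the function name margin_of_error are assumptions chosen for illustration. The formula applies to proportions estimated from a simple random sample of a much larger population.

```python
import math


def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Approximate margin of error for a sampled proportion.

    Assumes a simple random sample from a much larger population,
    a normal approximation, and (by default) a 95% confidence level.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return z * math.sqrt(proportion * (1.0 - proportion) / sample_size)


# Larger samples shrink the margin of error, but only with the square root of n.
for n in (100, 1000, 17000):
    print(f"n={n:>6}  margin of error = +/-{margin_of_error(n) * 100:.2f}%")
```

Under these assumptions a sample of 17,000 records gives a margin of error of roughly three quarters of a percentage point, which is why a modest sample can stand in for a very large data set.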

Solr is a Powerful Random Sampling Engine

Slicing, dicing and creating random samples from large data sets are some of the primary capabilities needed to tackle big data statistical problems. Solr happens to be one of the best engines in the world for doing this type of work.

Solr has had the ability to select random samples from search results for a long time. The new statistical syntax in Streaming Expressions makes this capability much more powerful. Now Solr has the power to select random samples from large distributed data sets and perform statistical analysis on the random samples.

The Random Streaming Expression

The random Streaming Expression retrieves a pseudo random set of documents that match a query. Each time the random expression is run it will return a different set of pseudo random records.

The syntax for the random expression is:

random(collection1, q="solr query", fl="fielda, fieldb", rows="17000")

This simple but powerful expression selects 17,000 pseudo random records that match the query from a SolrCloud collection.
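To show how such an expression might be submitted, here is a minimal Python sketch that builds a request URL for Solr's /stream handler. The host and port (localhost:8983), the match-all query and the helper name build_stream_url are assumptions for illustration only. The sketch only builds and prints the URL, and the commented lines show one way it could be sent to a live node.

```python
from urllib.parse import urlencode

# Assumed values for illustration only; adjust them to your own cluster.
SOLR_BASE_URL = "http://localhost:8983/solr"
COLLECTION = "collection1"


def build_stream_url(expr, base_url=SOLR_BASE_URL, collection=COLLECTION):
    """Return a URL that passes a Streaming Expression in the expr parameter."""
    return f"{base_url}/{collection}/stream?{urlencode({'expr': expr})}"


# The match-all query "*:*" is a placeholder for a real Solr query.
random_expr = 'random(collection1, q="*:*", fl="fielda, fieldb", rows="17000")'
url = build_stream_url(random_expr)
print(url)

# To run the expression against a live Solr node (not executed here):
# import json
# from urllib.request import urlopen
# with urlopen(url) as response:
#     docs = json.load(response)["result-set"]["docs"]
```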

Understanding Data Distributions with Histograms

Another important statistical tool is the histogram. Histograms are used to understand the distribution of a data set. Histograms divide a data set into bins and provide statistics about each bin. By inspecting the statistics of each bin you can understand the distribution of the data set.

The hist Function

Solr's Streaming Expression library has a hist function which returns a histogram for an array of numbers.

The hist function has a very simple syntax:

hist(col, 10)

The function above takes two parameters:

  1. An array of numbers
  2. The number of bins in the histogram
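To make the idea concrete, here is a small Python sketch of what a histogram function of this kind computes. It is not Solr's implementation: the equal-width binning, the use of sample variance, the choice to skip empty bins and the function name histogram are all assumptions made for illustration. The per-bin field names mirror the ones that appear in the sample output later in this post.

```python
import random
import statistics


def histogram(values, bins):
    """Split values into equal-width bins and summarize each non-empty bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal
    buckets = [[] for _ in range(bins)]
    for v in values:
        # The maximum value would index one past the end; clamp it into the last bin.
        index = min(int((v - lo) / width), bins - 1)
        buckets[index].append(v)

    summary = []
    for bucket in buckets:
        if not bucket:
            continue  # this sketch skips empty bins
        many = len(bucket) > 1  # sample variance needs at least two observations
        summary.append({
            "min": min(bucket),
            "max": max(bucket),
            "mean": statistics.fmean(bucket),
            "var": statistics.variance(bucket) if many else 0.0,
            "sum": sum(bucket),
            "stdev": statistics.stdev(bucket) if many else 0.0,
            "N": len(bucket),
        })
    return summary


random.seed(42)  # fixed seed so the sketch is repeatable
col = [random.gauss(500, 100) for _ in range(17000)]  # synthetic stand-in data
for b in histogram(col, 10):
    print(b["N"], round(b["min"], 1), round(b["max"], 1))
```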

Creating a Histogram from a Random Sample

Using the Streaming Expression statistical syntax we can combine random sampling and histograms to understand the distribution of large data sets.

In this example we'll work with a sample data set of log records. Our goal is to create a histogram of the response times for the home page.

Here is the basic syntax:

let(a=random(logs, q="file_name:index.html", fl="response_time", rows="17000"),
     b=col(a, response_time),
     c=hist(b, 10),
     tuple(hist=c))

Let's break down what this expression is doing:

1) The let expression is setting variables a, b and c and then returning a single response tuple.

2) Variable a stores the result tuples from the random streaming expression. The random streaming expression is returning 17000 pseudo random records from the logs collection that match the query file_name:index.html.

3) Variable b stores the output of the col function. The col function returns a column of numbers from a list of tuples. In this case the list of tuples is held in the variable a. The field name is response_time.

4) Variable c stores the output of the hist function. The hist function returns a histogram from a column of numbers. In this case the column of numbers is stored in variable b. The number of bins in the histogram is 10.

5) The tuple expression returns a single output tuple with the hist field set to variable c, which contains the histogram.

The output from this expression is a histogram with 10 bins describing the random sample of home page response times. Descriptive statistics are provided for each bin.

By looking at the histogram we can gain a full understanding of the distribution of the data. Below is a sample histogram. Note that N is the number of observations that are in the bin.

{ "result-set": { "docs": [
  { "hist": [
    { "min": 105.80360488681794, "max": 184.11423669457605, "mean": 158.07101244548903, "var": 676.6416949523991, "sum": 1106.4970871184232, "stdev": 26.012337360421864, "N": 7 },
    { "min": 187.1450299482844, "max": 262.86798264568415, "mean": 235.8519937762809, "var": 400.7486779625581, "sum": 31368.315172245355, "stdev": 20.01870819914607, "N": 133 },
    { "min": 263.6907639320808, "max": 341.7723630856346, "mean": 312.0580142849335, "var": 428.02686585995957, "sum": 259944.32589934967, "stdev": 20.688810160566497, "N": 833 },
    { "min": 342.0007054044787, "max": 420.508689773685, "mean": 387.10102356966337, "var": 497.5116682425222, "sum": 1008398.166398972, "stdev": 22.30496958622724, "N": 2605 },
    { "min": 420.5348042867488, "max": 499.173632576587, "mean": 461.5725595026505, "var": 505.85122370654324, "sum": 2267244.4122770214, "stdev": 22.491136558798964, "N": 4912 },
    { "min": 499.23963590242806, "max": 577.8765472307315, "mean": 535.9950922008038, "var": 500.5743269892825, "sum": 2589928.2855142825, "stdev": 22.373518431156118, "N": 4832 },
    { "min": 577.9106064943256, "max": 656.5613165857329, "mean": 611.5787667510084, "var": 481.60546877783116, "sum": 1647593.1976272168, "stdev": 21.945511358312686, "N": 2694 },
    { "min": 656.5932936523765, "max": 734.7738394881361, "mean": 685.4426886363782, "var": 451.02322430952523, "sum": 573715.5303886493, "stdev": 21.237307369568423, "N": 837 },
    { "min": 735.9448445737111, "max": 812.751632738434, "mean": 762.5240648996678, "var": 398.4721757713377, "sum": 102178.22469655548, "stdev": 19.961767851854646, "N": 134 },
    { "min": 816.2895922221702, "max": 892.6066799061479, "mean": 832.5779161364087, "var": 481.68131277525964, "sum": 10823.512909773315, "stdev": 21.94723929735263, "N": 13 }
  ] },
  { "EOF": true, "RESPONSE_TIME": 986 }
] } }

Point-wise Anomaly Detection

Point-wise anomaly detection deals with finding a single anomalous data point.

Based on the histogram we can devise a rule for detecting when an anomalous response time appears in the logs. For this example let's set a rule that any response time that falls within the last two bins is an anomaly. The second-to-last bin starts at roughly 735.9, so the specific rule would be:

response_time > 735
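The threshold can also be derived programmatically from the histogram. The sketch below is one possible way to do it, using the min and N values of each bin copied from the sample histogram above. Rounding the cutoff down to a whole number and the variable names are assumptions for illustration.

```python
# (min, N) for each bin, copied from the sample histogram above.
bins = [
    (105.80360488681794, 7),
    (187.1450299482844, 133),
    (263.6907639320808, 833),
    (342.0007054044787, 2605),
    (420.5348042867488, 4912),
    (499.23963590242806, 4832),
    (577.9106064943256, 2694),
    (656.5932936523765, 837),
    (735.9448445737111, 134),
    (816.2895922221702, 13),
]

TAIL_BINS = 2  # the rule flags anything that falls in the last two bins

total = sum(n for _, n in bins)
tail = bins[-TAIL_BINS:]
threshold = int(tail[0][0])  # lowest value in the tail, rounded down
flagged = sum(n for _, n in tail)


def is_anomaly(response_time):
    """Apply the rule response_time > threshold."""
    return response_time > threshold


print(f"rule: response_time > {threshold}")
print(f"flagged {flagged} of {total} sampled records ({flagged / total:.2%})")
```

With this sample the rule flags 147 of the 17,000 sampled records, a little under one percent.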

Creating an Alert With the Topic Streaming Expression

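Now that we have a rule for detecting anomalous response times we can use the topic expression to return all new records in the logs collection that match the anomaly rule. The topic expression would look like this (the checkpoint collection name checkpoints and the topic id response_anomalies are placeholders):

topic(checkpoints,
         logs,
         q="file_name:index.html AND response_time:[735 TO *]",
         fl="id, response_time",
         id="response_anomalies")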

The expression above provides one-time delivery of all records that match the anomaly rule. Notice that the anomaly rule is the query for the topic expression. This is a very efficient approach for retrieving just the anomaly records.

We can wrap the topic in an update and daemon expression to run the topic at intervals and store anomaly records in another collection. The collection of anomalies can then be used for alerting.
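As a hedged sketch of that wrapping, the Python helper below composes a daemon(update(topic(...))) expression as a string. The daemon, update and topic functions and their id, runInterval and batchSize parameters are part of Solr's Streaming Expression library. The collection names anomalies and checkpoints, both ids, the interval and the batch size are hypothetical values chosen for illustration, as is the helper name build_alert_daemon.

```python
def build_alert_daemon(source, destination, checkpoints, query, fields,
                       topic_id, daemon_id, run_interval_ms=1000, batch_size=100):
    """Compose a daemon(update(topic(...))) Streaming Expression as a string."""
    # The topic returns only records it has not delivered before.
    topic = (f'topic({checkpoints}, {source}, q="{query}", '
             f'fl="{fields}", id="{topic_id}")')
    # The update writes the topic's tuples to the destination collection.
    update = f'update({destination}, batchSize={batch_size}, {topic})'
    # The daemon re-runs the wrapped expression every run_interval_ms milliseconds.
    return f'daemon(id="{daemon_id}", runInterval="{run_interval_ms}", {update})'


# All names below except logs and the query are hypothetical placeholders.
expr = build_alert_daemon(
    source="logs",
    destination="anomalies",
    checkpoints="checkpoints",
    query="file_name:index.html AND response_time:[735 TO *]",
    fields="id, response_time",
    topic_id="response_anomalies",
    daemon_id="anomalyDaemon",
)
print(expr)
```

The resulting string could then be sent to the /stream handler like any other expression.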
