## Sunday, July 9, 2017

### One-way ANOVA and Rank Transformation with Solr's Streaming Expressions

In the previous blog we explored the use of random sampling and histograms to pick a threshold for point-wise anomaly detection. Point-wise anomaly detection is a good place to start, but alerting based on a single anomalous point may lead to false alarms. What we need is a statistical technique that can help confirm that the problem goes beyond a single point.

### Spotting Differences In Sets of Data

The specific example in the last blog dealt with finding individual log records with unusually high response times. In this blog we'll be looking for sets of log records with unusually high response times.

One approach to doing this is to compare the means of response times between different sets of data. For this we'll use a statistical approach called one-way ANOVA.

### One-way ANOVA (Analysis of Variance)

The Streaming Expression statistical library includes the anova function. The anova function is used to determine if the difference in means between two or more sample sets is statistically significant.

In the example below we'll use ANOVA to compare two samples of data:

1. A sample taken from a known period of normal response times.
2. A sample taken before and after the point-wise anomaly.

If the difference in means between the two sets is statistically significant, we have evidence that the data around the anomalous data point is also unusual.
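To make the idea concrete, here is a minimal Python sketch of the F-ratio that a one-way ANOVA computes: the variance between the group means divided by the variance within the groups. This is not Solr's implementation. The function name `f_ratio` and the sample values are made up for illustration, and turning the F-ratio into a p-value additionally requires the F distribution, which is left out here.

```python
from statistics import fmean


def f_ratio(*groups):
    """One-way ANOVA F-ratio: between-group variance over within-group variance."""
    k = len(groups)                       # number of sample sets
    n = sum(len(g) for g in groups)       # total number of observations
    means = [fmean(g) for g in groups]
    grand_mean = fmean([x for g in groups for x in g])

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of each observation around its own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    return (ss_between / (k - 1)) / (ss_within / (n - k))


normal = [100, 110, 105, 95, 90]    # hypothetical sample set #1 (normal period)
recent = [150, 160, 155, 145, 140]  # hypothetical sample set #2 (around the anomaly)
print(f_ratio(normal, recent))
```

A large F-ratio means the gap between the group means is large relative to the spread inside each group. Two identical groups give an F-ratio of 0.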

### Accounting For Outliers

We already know that sample #2 has at least one outlier point. A few large outliers could skew the mean of sample #2 and bias the ANOVA calculation.

In order to determine if sample set #2 as a whole has a higher mean than sample #1, we need a way to decrease the effect of outliers on the ANOVA calculation.

### Rank Transformation

One approach for smoothing outliers is to first rank transform the data sets before running the ANOVA. Rank transformation transforms each value in the data to an ordinal ranking.

The Streaming Expression function library includes the rank function which performs the rank transformation.

In order to compare the data sets following the rank transform, we'll need to perform the rank transformation on both sets of data as if they were one contiguous data set. Streaming Expressions provides array manipulation functions that will allow us to do this.
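The combine, rank, and split steps can be sketched in plain Python. This is an illustration only, not the Solr code. The helper `rank` and the sample values are hypothetical, and giving tied values the average of their positions is an assumption about tie handling.

```python
def rank(values):
    """Return 1-based ranks; tied values share the average of their positions (assumed)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values equal to the one at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


normal = [100, 110, 105, 95]    # hypothetical sample set #1
recent = [120, 130, 5000, 125]  # hypothetical sample set #2, 5000 is the outlier

combined = normal + recent      # one contiguous data set
ranked = rank(combined)         # rank across both samples at once
g = ranked[:len(normal)]        # ranks belonging to sample set #1
h = ranked[len(normal):]        # ranks belonging to sample set #2
print(g, h)
```

After the transform the outlier of 5000 is simply the highest rank (8.0), one step above its neighbor, so it no longer dominates the mean of its sample.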

### The Streaming Expression

In the expression below we'll perform the ANOVA:

let(a=random(logs,
q="rec_time:[2017-05 TO 2017-06]",
fq="file_name:index.html",
fl="response_time",
rows="7000"),
b=random(logs,
q="rec_time:[NOW-10MINUTES TO NOW]",
fq="file_name:index.html",
fl="response_time",
rows="7000"),
c=col(a, response_time),
d=col(b, response_time),
e=addAll(c, d),
f=rank(e),
g=copyOfRange(f, 0, length(c)),
h=copyOfRange(f, length(c), length(f)),
i=anova(g, h),
tuple(results=i))

Let's break down what this expression is doing:

1. The let expression is setting the variables a, b, c, d, e, f, g, h, i and returning a single response tuple.
2. The variable a holds the tuples from a random sample of response times from a period of normal response times (sample set #1).
3. The variable b holds the tuples from a random sample of response times before and after the anomalous data point (sample set #2).
4. Variables c and d hold results of the col function which returns a column of numbers from a list of tuples.  Sample set #1 is in variable c. Sample set #2 is in variable d.
5. Variable e holds the result of the addAll function which is returning a single array containing the contents of variables c and d.
6. Variable f holds the results of the rank function which performs the rank transformation on variable e.
7. Variables g and h hold the values of copyOfRange functions. The copyOfRange function is used to separate the single rank transformed array back into two data sets. Variable g holds the rank transformed values of sample set #1. Variable h holds the rank transformed values of sample set #2.
8. Variable i holds the result of the anova function which is performing the ANOVA on variables g and h.
9. The response tuple has a single field called results that contains the results of the ANOVA on the rank transformed data sets.

### Interpreting the ANOVA p-value

The response from the Streaming Expression above looks like this:

{ "result-set": { "docs": [ { "results": { "p-value": 0.0008137581457111631, "f-ratio": 38.4 } }, { "EOF": true, "RESPONSE_TIME": 789 } ] } }

The p-value of 0.0008 is the probability of seeing a difference in means at least this large if the two sample sets actually had the same mean (the null hypothesis).

Because that probability is so small (about 0.08%), we can say with a very high level of confidence that there is a statistically significant difference in the means between the two sample sets.
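One way to act on this result in an alerting script is to parse the response and compare the p-value against a significance threshold. The sketch below uses the response shown above. The helper name `anova_results` and the threshold of 0.05 are assumptions chosen for illustration, not something Solr prescribes.

```python
import json

# The response body from the Streaming Expression, as shown above.
response = """
{ "result-set": { "docs": [
    { "results": { "p-value": 0.0008137581457111631, "f-ratio": 38.4 } },
    { "EOF": true, "RESPONSE_TIME": 789 } ] } }
"""

ALPHA = 0.05  # assumed significance threshold; pick one that suits your alerting


def anova_results(body):
    """Return the 'results' field from the first tuple that carries one."""
    for doc in json.loads(body)["result-set"]["docs"]:
        if "results" in doc:
            return doc["results"]
    raise ValueError("no results tuple in response")


results = anova_results(response)
significant = results["p-value"] < ALPHA
print(significant)
```

A smaller threshold makes the check stricter and reduces false alarms at the cost of missing weaker shifts.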
