## Wednesday, July 12, 2017

### Detrending Time Series Data With Linear Regression in Solr 7

Often when working with time series data there is a linear trend present in the data. For example if a stock price has been gradually rising over a period of months you'll see a positive slope in the time series data. This slope over time is the trend. Before performing statistical analysis on the time series data it's often necessary to remove the trend.

Why is a trend problematic? Consider an example where you want to correlate two time series that are trending on a similar slope. Because they both have a similar slope they will appear to be correlated. But in reality they may be trending for entirely different reasons. To tell if the two time series are actually correlated you would need to first remove the trends and then perform the correlation on the detrended data.

### Linear Regression

Linear regression is a statistical tool used to measure the linear relationship between two variables. For example you could use linear regression to determine if there is a linear relationship between age and medical costs. If a linear relationship is found you can use linear regression to predict the value of a dependent variable based on the value of an independent variable.

Linear regression can also be used to remove a linear trend from a time series.

### Removing a Linear Trend from a Time Series

We can remove a linear trend from a time series using the following technique:

1. Regress the dependent variable over a time sequence. For example if we have 12 months of time series observations the time sequence would be expressed as 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12.
2. Use the regression analysis to predict a dependent value at each time interval. Then subtract the prediction from the actual value. The difference between actual and predicted value is known as the residual. The residuals array is the time series with the trend removed. You can now perform statistical analysis on the residuals.
Sounds complicated, but an example will make this more clear and Solr makes this all very easy to do.

### Example: Exploring the linear relationship between marketing spend and site usage.

In this example we want explore the linear relationship between marketing spend and website usage. The motivation for this is to determine if higher marketing spend causes higher website usage.

Website usage has been trending upwards for over a year. We have been varying the marketing spend throughout the year to experiment with how different levels of marketing spend impacts website usage.

Now we want to regress the marketing spend and the website usage to build a simple model of how usage is impacted by marketing spend. But before we can build this model we must remove the trend from the website usage or the cumulative effect of the trend will mask the relationship between marketing spend and website usage.

Here is the streaming expression:

let(a=timeseries(logs,
q="rec_type:page_view",
field="rec_time",
start="2016-01-01T00:00:00Z",
end="2016-12-31T00:00:00Z",
gap="+1MONTH",
count(*)),
b=jdbc(connection="jdbc:mysql://...",
sql="select marketing_expense from monthly_expenses where ..."),
c=col(a, count(*)),
d=col(b, marketing_expense),
e=sequence(length(c), 1, 1),
f=regress(e, c),
g=residuals(f, e, c),
h=regress(d, g),
tuple(regression=h))

Let's break down what this expression is doing:

1. The let expression is setting the variables a, b, c, d, e, f, g, h and returning a single result tuple.
2. Variable a is holding the result tuples from a timeseries function that is querying the logs for monthly usage counts.
3. Variable b is holding the result tuples from a jdbc function which is querying an external database for monthly marketing expenses.
4. Variable c is holding the output from a col function which returns the values in the count(*) field from the tuples stored in variable a. This is an array containing the monthly usage counts.
5. Variable d is holding the output from a col function which returns the values in the marketing_expense field from the tuples stored in variable bThis is an array containing the monthly marketing expenses.
6. Variable e holds the output of the sequence function which returns an array of numbers the same length as the array in variable c. The sequence starts from 1 and has a stride of 1.
7. Variable f holds the output of the regress function which returns a regression result. The regression is performed with the sequence in variable e as the independent variable and monthly usage counts in variable c as the dependent variable.
8. Variable g holds the output of the residuals function which returns the residuals from applying the regression result to the data sets in variables e and c. The residuals are the monthly usage counts with the trend removed.
9. Variable h holds the output of the regress function which returns a regression result. The regression is being performed with the marketing expenses (variable d) as the independent variable. The residuals from the monthly usage regression (variable g) are the dependent variable.  This regression result will describe the linear relationship between marketing expenses and site usage.
10. The output tuple is returning the regression result.

### New York - Coronavirus Statistics (NYTimes Data Set)

As of 2020-04-09 New York City - Cumulative Cases By Day New York City - Cumulative Deaths By Day ...