Solr Analytics: time series

Showing posts with label time series. Show all posts

Thursday, October 5, 2017

How to Model and Remove Time Series Seasonality With Solr 7.1

Often when working with time series data there is a cycle that repeats periodically. This periodic cycle is referred to as seasonality. Seasonality may have a large enough effect on the data that it makes it difficult to study other features of the time series. In Solr 7.1 there are new Streaming Expression statistical functions that allow us to model and remove time series seasonality.

If you aren't familiar with Streaming Expressions new statistical programming functions you may find it useful to read a few of the earlier blogs which introduce the topic.

Seasonality

Often seasonality appears in the data as periodic bumps or waves. These waves can be expressed as sine-waves. For this example we'll start off by generating some smooth sine-waves to represent seasonality. We'll be using Solr's statistical functions to generate the data and Sunplot to plot the sine-waves.

Here is a sample plot using Sunplot:

In the plot you'll notice there are waves in the data occurring at regular intervals. These waves represent the seasonality.

The expression used to generate the sine-waves is:

let(smooth=sin(sequence(100, 0, 6)),
plot(type=line, y=smooth))

In the function above the let expression is setting a single variable called smooth. The value set to smooth is an array of numbers generated by the sequence function that is wrapped and transformed by the sin function.

Then the let function runs the plot function with the smooth variable as the y axis. Sunplot then plots the data.

This sine-wave is perfectly smooth so the entire time series consists only of seasonality. To make things more interesting we can add some noise to the sign-waves to represent another component of the time series.

Now the expression looks like this:

let(smooth=sin(sequence(100, 0, 6)),
noise=sample(uniformDistribution(-.25,.25),100),
noisy=ebeAdd(smooth, noise),
plot(type=line, y=noisy))

In the expression above we first generate the smooth sine-wave and set it to the variable smooth. Then we generate some random noise by taking a sample from a uniform distribution. The random samples will be uniformly distributed between -.25 and .25. The variable noise holds the array of random noise data.

Then the smooth and noise arrays are added together using the ebeAdd function. The ebeAdd function does an element-by-element addition of the two arrays and outputs an array with the results. This will add the noise to the sine-wave. The variable noisy holds this new noisy array of data.

The noisy array is then set to the y axis of the plot.

Now we have a time series that has both a seasonality component and noisy signal. Let's see how we can model and remove the seasonality so we can study the noisy component.

Modeling Seasonality

We can model the seasonality using the new polyfit function to fit a curve to the data. The polyfit function is a polynomial curve fitter which builds a function that models non-linear data.

Below is a screenshot of the polyfit function:

Notice that now there is a smooth red curve which models the noisy time series. This is the smooth curve that the polyfit function fit to the noisy time series.

Here is the expression:

let(smooth=sin(sequence(100, 0, 6)),
noise=sample(uniformDistribution(-.25,.25),100),
noisy=ebeAdd(smooth,noise),
fit=polyfit(noisy, 16),
x=sequence(100,0,1),
list(tuple(plot=line, x=x, y=noisy),
tuple(plot=line, x=x, y=fit)))

In the expression above we first build the noisy time series. Then the polyfit function is called on the noisy array with a polynomial degree. The degree describes the exponent size of the polynomial used in the curve fitting function. As the degree rises the function has more flexibility in the curves that it can model. For example a degree of 1 provides a linear model. You can try different degrees until you find the one that best fits your data set. In this example a 16 degree polynomial is used to fit the sine-wave.

Notice that when plotting two lines we use a slightly different plotting syntax. In the syntax above a list of output tuples is used to define the plot for Sunplot. When plotting two plots an x axis must be provided. The sequence function is used to generate an x axis.

Removing the Seasonality

Once we've fit a curve to the time series we can subtract it away to remove the seasonality. After the subtraction what's left is the noisy signal that we want to study.

Below is a screenshot showing the subtraction of the fitted curve:

Notice that the plot now shows the data that remains after the seasonality has been removed. This time series is now ready to be studied without the effects of the seasonality.

Here is the expressions:

let(smooth=sin(sequence(100, 0, 6)),
noise=sample(uniformDistribution(-.25,.25),100),
noisy=ebeAdd(smooth,noise),
fit=polyfit(noisy, 16),
stationary=ebeSubtract(noisy, fit),
plot(type=line, y=stationary))

In the expression above the fit array, which holds the fitted curve, is subtracted from the noisy array. The ebeSubtract function performs the element-by-element subtraction. The new time series with the seasonality removed is stored in the stationary variable and plotted on the y axis.

Wednesday, July 12, 2017

Detrending Time Series Data With Linear Regression in Solr 7

Often when working with time series data there is a linear trend present in the data. For example if a stock price has been gradually rising over a period of months you'll see a positive slope in the time series data. This slope over time is the trend. Before performing statistical analysis on the time series data it's often necessary to remove the trend.

Why is a trend problematic? Consider an example where you want to correlate two time series that are trending on a similar slope. Because they both have a similar slope they will appear to be correlated. But in reality they may be trending for entirely different reasons. To tell if the two time series are actually correlated you would need to first remove the trends and then perform the correlation on the detrended data.

Linear Regression

Linear regression is a statistical tool used to measure the linear relationship between two variables. For example you could use linear regression to determine if there is a linear relationship between age and medical costs. If a linear relationship is found you can use linear regression to predict the value of a dependent variable based on the value of an independent variable.

Linear regression can also be used to remove a linear trend from a time series.

Removing a Linear Trend from a Time Series

We can remove a linear trend from a time series using the following technique:

Regress the dependent variable over a time sequence. For example if we have 12 months of time series observations the time sequence would be expressed as 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12.
Use the regression analysis to predict a dependent value at each time interval. Then subtract the prediction from the actual value. The difference between actual and predicted value is known as the residual. The residuals array is the time series with the trend removed. You can now perform statistical analysis on the residuals.

Sounds complicated, but an example will make this more clear and Solr makes this all very easy to do.

Example: Exploring the linear relationship between marketing spend and site usage.

In this example we want explore the linear relationship between marketing spend and website usage. The motivation for this is to determine if higher marketing spend causes higher website usage.

Website usage has been trending upwards for over a year. We have been varying the marketing spend throughout the year to experiment with how different levels of marketing spend impacts website usage.

Now we want to regress the marketing spend and the website usage to build a simple model of how usage is impacted by marketing spend. But before we can build this model we must remove the trend from the website usage or the cumulative effect of the trend will mask the relationship between marketing spend and website usage.

Here is the streaming expression:

let(a=timeseries(logs,

q="rec_type:page_view",

field="rec_time",

start="2016-01-01T00:00:00Z",

end="2016-12-31T00:00:00Z",

gap="+1MONTH",

count(*)),

b=jdbc(connection="jdbc:mysql://...",

sql="select marketing_expense from monthly_expenses where ..."),

c=col(a, count(*)),

d=col(b, marketing_expense),

e=sequence(length(c), 1, 1),

f=regress(e, c),

g=residuals(f, e, c),

h=regress(d, g),

tuple(regression=h))

Let's break down what this expression is doing:

The let expression is setting the variables a, b, c, d, e, f, g, h and returning a single result tuple.
Variable a is holding the result tuples from a timeseries function that is querying the logs for monthly usage counts.
Variable b is holding the result tuples from a jdbc function which is querying an external database for monthly marketing expenses.
Variable c is holding the output from a col function which returns the values in the count(*) field from the tuples stored in variable a. This is an array containing the monthly usage counts.
Variable d is holding the output from a col function which returns the values in the marketing_expense field from the tuples stored in variable b. This is an array containing the monthly marketing expenses.
Variable e holds the output of the sequence function which returns an array of numbers the same length as the array in variable c. The sequence starts from 1 and has a stride of 1.
Variable f holds the output of the regress function which returns a regression result. The regression is performed with the sequence in variable e as the independent variable and monthly usage counts in variable c as the dependent variable.
Variable g holds the output of the residuals function which returns the residuals from applying the regression result to the data sets in variables e and c. The residuals are the monthly usage counts with the trend removed.
Variable h holds the output of the regress function which returns a regression result. The regression is being performed with the marketing expenses (variable d) as the independent variable. The residuals from the monthly usage regression (variable g) are the dependent variable. This regression result will describe the linear relationship between marketing expenses and site usage.
The output tuple is returning the regression result.

Monday, May 1, 2017

Exploring Solr's New Time Series and Math Expressions

In Solr 6.6 the Streaming Expression library has added support for time series and math expressions. This blog will walk through an example of how to use these exciting features.

Time Series

Time series aggregations are supported through the timeseries Streaming Expression. The timeseries expression uses the json facet api under the covers so the syntax will be familiar if you've used Solr date range syntax.

Here is the basic syntax:

timeseries(collection,
field="test_dt",
q="*:*",
start="2012-05-01T00:00:00Z",
end="2012-06-30T23:59:59Z",
gap="+1MONTH",
count(*))

When sent to Solr this expression will return results that look like this:

{ "result-set": { "docs": [ { "test_dt": "2012-05-01T00:00:00Z", "count(*)": 247007 }, { "test_dt": "2012-06-01T00:00:00Z", "count(*)": 247994 }, { "EOF": true, "RESPONSE_TIME": 9 } ] } }

Solr takes care of the date math and builds the time range buckets automatically. Solr also fills in any gaps in the range with buckets automatically and adds zero aggregation values. Any Solr query can be used to select the records.

The supported aggregations are: count(*), sum(field), avg(field), min(field), max(field).

The timeseries function is quite powerful on it's own, but it grows in power when combined with math expressions.

Math Expressions

In Solr 6.6 the Streaming Expression library also adds math expressions. This is a larger topic then one blog can cover, but I'll hit some of highlights by slowly building up a math expression.

Let and Get

The fun begins with the let and get expressions. let is used to assign tuple streams to variables and get is used to retrieve the stream later in the expression. Here is the most basic example:

let(a=timeseries(collection, field="test_dt", q="*:*",
start="2012-05-01T00:00:00Z",
end="2012-06-30T23:59:59Z",
gap="+1MONTH",
count(*)),
get(a))

In the example above the timeseries expression is being set to the variable a. Then the get expression is used to turn the variable a back into a stream.

The let expression allows you to set any number of variables, and assign a single Streaming Expression to run the program logic. The expression that runs the program logic has access to the variables. The basic structure of let is:

let(a=expr,
b=expr,
c=expr,
expr)

The first three name/value pairs are setting variables and the final expression is the program logic that will use the variables.

If we send the let expression with the timeseries to Solr it returns with:

{ "result-set": { "docs": [ { "test_dt": "2012-05-01T00:00:00Z", "count(*)": 247007 }, { "test_dt": "2012-06-01T00:00:00Z", "count(*)": 247994 }, { "EOF": true, "RESPONSE_TIME": 9 } ] } }

This is the exact same response we would get if we sent the timeseries expression alone. Thats because all we did was assign the expression to a variable and use get to stream out the results.

Implementation Note: Under the covers the let expression sets each variable by executing the expressions and adding the tuples to a list. It then maps the variable name to the list in memory so that it can be retrieved by the variable name. So in memory Streams are converted to lists of tuples.

The Select Expression

The select expression has been around for a long time, but it now plays a central role in math expressions. The select expression wraps another expression and applies a list of Stream Evaluators to each tuple. Stream Evaluators perform operations on the tuples.

The Streaming Expression library now includes a base set of numeric evaluators for performing math on tuples. Here is an example of select in action:

let(a=timeseries(collection, field="test_dt", q="*:*",
start="2012-05-01T00:00:00Z",
end="2012-06-30T23:59:59Z",
gap="+1MONTH",
count(*)),
b=select(get(a),
mult(-1, count(*)) as negativeCount,
test_dt),
get(b))

In the example above we've set a timeseries to variable a.

Then we are doing something really interesting with variable b. We are transforming the timeseries tuples stored in variable a with the select expression.

The select expression is reading all the tuples from the get(a) expression and applying the mult stream evaluator to each tuple. The mult Streaming Evaluator is multiplying -1 to the value in the count(*) field of the tuples and assigning it to the field negativeCount. Select is also outputting the test_dt field from the tuples.

The transformed tuples are then assigned to variable b.

Then get(b) is used to output the transformed tuples. If you send this expression to Solr it outputs:

{ "result-set": { "docs": [ { "test_dt": "2012-05-01T00:00:00Z", "negativeCount": -247007 }, { "test_dt": "2012-06-01T00:00:00Z", "negativeCount": -247994 }, { "EOF": true, "RESPONSE_TIME": 9 } ] } }

Implementation Note: The get expression creates new tuples when it streams tuples from a variable. So you never have to worry about side effects. In the example above variable a was unchanged when the tuples were transformed and assigned to variable b.

The Tuple Expression

The basic data structure of Streaming Expressions is a Tuple. A Tuple is a set of name/value pairs. In the 6.6 release of Solr there is a Tuple expression which allows you to create your own output tuple. Here is the sample syntax:

The example above defines an output tuple with two fields: seriesA and seriesB, both of these fields have been assigned a variable. Remember that variables a and b are pointers to lists of tuples. This is exactly how they will be output by the tuple expression.

If you send the expression above to Solr it will respond with:

{ "result-set": { "docs": [ { "seriesA": [ { "test_dt": "2012-05-01T00:00:00Z", "count(*)": 247007 }, { "test_dt": "2012-06-01T00:00:00Z", "count(*)": 247994 } ], "seriesB": [ { "test_dt": "2012-05-01T00:00:00Z", "negativeCount": -247007 }, { "test_dt": "2012-06-01T00:00:00Z", "negativeCount": -247994 } ] }, { "EOF": true, "RESPONSE_TIME": 7 } ] } }

Now we have both the original time series and the transformed time series in the output.

The Col Evaluator

Lists of tuples are nice, but for performing many math operations what we need are columns of numbers. There is a special evaluator called col which can be used to pull out a column of numbers from a list of tuples.

Here is the basic syntax:

Now we have two new variables c and d, both pointing to a col expression. The col expression takes two parameters. The first parameter is a variable pointing to a list of tuples. The second parameter is the field to pull the column data from.

Also notice that there are two new fields in the output tuple that output the columns. If you send this expression to Solr it responds with:

Now the columns appear in the output.

Performing Math on Columns

We've seen already that there are numeric Stream Evaluators that work on tuples in the select expression.

Some numeric evaluators also work on columns. An example of this is the corr evaluator which performs the Pearson product-moment correlation calculation on two columns of numbers.

Here is the sample syntax:

let(a=timeseries(collection, field="test_dt", q="*:*",
start="2012-05-01T00:00:00Z",
end="2012-06-30T23:59:59Z",
gap="+1MONTH",
count(*)),
b=select(get(a),
mult(-1, count(*)) as negativeCount,
test_dt),
c=col(a, count(*)),
d=col(b, negativeCount),
tuple(seriesA=a,
seriesB=b,
columnC=c,
columnD=d,
correlation=corr(c, d)))

Notice that the tuple now has a new field called correlation with the output of the corr function set to it. If you send this to Solr it responds with:

{ "result-set": { "docs": [ { "seriesA": [ { "test_dt": "2012-05-01T00:00:00Z", "count(*)": 247007 }, { "test_dt": "2012-06-01T00:00:00Z", "count(*)": 247994 } ], "seriesB": [ { "test_dt": "2012-05-01T00:00:00Z", "negativeCount": -247007 }, { "test_dt": "2012-06-01T00:00:00Z", "negativeCount": -247994 } ], "columnC": [ 247007, 247994 ], "columnD": [ -247007, -247994 ], "correlation": -1 }, { "EOF": true, "RESPONSE_TIME": 6 } ] } }

Opening the Door to the Wider World of Mathematics

The syntax described in this blog opens the door to more sophisticated mathematics. For example the corr function can be used as a building block for cross-correlation, auto-correlation and auto-regression functions. Apache Commons Math includes machine learning algorithms such as clustering and regression and data transformations such as Fourier transforms that work on columns of numbers.

In the near future the Streaming Expressions math library will include these functions and many more.

Solr Analytics