Tuesday, October 10, 2017

A Gentle Introduction to Monte Carlo Simulations in Solr 7.1

Monte Carlo simulations have been added to Streaming Expressions in Solr 7.1. This blog provides a gentle introduction to the topic of Monte Carlo simulations and shows how they are supported with the Streaming Expressions statistical function library.

Probability Distributions

Before diving into Monte Carlo simulations I'll briefly introduce Solr's probability distribution framework. We'll start slowly and cover just enough about probability distributions to support the Monte Carlo examples. Future blogs will go into more detail about Solr's probability distribution framework.

First let's start with a definition of what a probability distribution is. A probability distribution is a function which describes the probability of a random variable within a data set.

A simple example will help clarify the concept.

Uniform Integer Distribution

One commonly used probability distribution is the uniform integer distribution.

The uniform integer distribution is a function that describes a theoretical data set that is randomly distributed over a range of integers.

With the Streaming Expression statistical function library you can create a uniform integer distribution with the following function call:

uniformIntegerDistribution(1,6) 

The function above returns a uniform integer distribution with a range of 1 to 6.

Sampling the Distribution

The uniformIntegerDistribution function returns the mathematical model of the distribution. We can draw a random sample from the model using the sample function.

let(a=uniformIntegerDistribution(1, 6),
    b=sample(a))

In the example above the let expression is setting two variables:
  • a is set to output of the uniformIntegerDistribtion function, which is returning the uniform integer distribution model.
  • b is set to the output of the sample function which is returning a single random sample from the distribution.
Solr returns the following result from the expression above:

{ "result-set": { "docs": [ { "b": 4 }, { "EOF": true, "RESPONSE_TIME": 0 } ] } }

Notice in the output above the variable b = 4.  4 is the random sample taken from the uniform integer distribution. 

The Monte Carlo Simulation

We now know enough about probability distributions to run our first Monte Carlo simulation.

For our first simulation we are going to simulate the rolling of a pair of six sided dice.

Here is the code:

let(a=uniformIntegerDistribution(1, 6),
    b=uniformIntegerDistribution(1, 6),
    c=monteCarlo(add(sample(a), sample(b)), 10))

The expression above is setting three variables:
  • a is set to a uniform integer distribution with a range of 1 to 6.
  • b is also set to a uniform integer distribution with a range of 1 to 6.
  • c is set to the outcome of the monteCarlo function.
The monteCarlo function runs a function a specified number of times and collects the outputs into an array and then returns the array.

In the example above the function add(sample(a), sample(b)) is run 10 times.

Each time the function is called, a sample is drawn from the distribution models stored in the variables a and b.  The two random samples are then added together.

Each run simulates rolling a pair of dice. The results of the 10 rolls are gathered into an array and returned.

The output from the expression above looks like this:

{ "result-set": { "docs": [ { "c": [ 6, 6, 8, 8, 9, 7, 6, 8, 7, 6 ] }, { "EOF": true, "RESPONSE_TIME": 1 } ] } }

Counting the Results with a Frequency Table

The results of the dice simulation can be analyzed using a frequency table:

let(a=uniformIntegerDistribution(1, 6),
    b=uniformIntegerDistribution(1, 6),
    c=monteCarlo(add(sample(a), sample(b)), 100000), 
    d=freqTable(c))

Now we are running the simulation 100,000 times rather 10. We are then using the freqTable function to count the frequency of each value in the array. 

Sunplot provides a nice table view of the frequency table. The frequency table below shows the percent, count, cumulative frequency and cumulative percent for each value (2-12) in the simulation array.




Plotting the Results

Sunplot can also be used to plot specific columns from the frequency table.

In plot below the value column (2-12) from the frequency table is plotted on the x axis. The pct column (percent) from the frequency table is plotted on the y axis.



Below is the plotting expression:

let(a=uniformIntegerDistribution(1, 6),
    b=uniformIntegerDistribution(1, 6),
    c=monteCarlo(add(sample(a), sample(b)), 100000),
    d=freqTable(c),
    x=col(d, value),
    y=col(d, pct),
    plot(type=bar, x=x, y=y)) 

Notice that the x and y variables are set using the col function. The col function moves a field from a list of tuples into an array. In this case it's moving the the value and pct fields from the frequency table tuples into arrays.

We've just completed our first Monte Carlo simulation and plotted the results. As a bonus we've learned the probabilities of a craps game!

Simulations with Real World Data

The example above is using a theoretical probability distribution. There are many different theoretical distributions used in different fields. The first release of Solr's probability distribution framework includes some of the best known distributions including: the normal, log normal, poisson, uniform, binomial, gamma, beta, Wiebull and ZipF distributions.

Each of these distributions are designed to model a particular theoretical data set.

Solr also provides an empirical distribution function which builds a mathematical model based only on actual data.  Empirical distributions can be sampled in exactly the same way as theoretical distributions. This means we can mix and match empirical distributions and theoretical distributions in Monte Carlo simulations.

Let's take a very brief look at a Monte Carlo simulation using empirical distributions pulled from Solr Cloud collections.

In this example we are building a new product which is made up of steel and plastic. Both steel and plastic are bought by the ton on the open market. We have historical pricing data for both steel and plastic and we want to simulate the unit costs based on the historical data.

Here is our simulation expression:

let(a=random(steel, q="*:*", fl="price", rows="2000"),
    b=random(plastic, q="*:*", fl="price", rows="2000")
    c=col(a, price),
    d=col(b, price),
    steel=empiricalDistribtion(c),
    plastic=empiricalDistribtion(d),
    e=monteCarlo(add(mult(sample(steel), .0005), 
                     mult(sample(plastic), .0021)), 
                 100000),
    f=hist(e)

In the example above the let expression is setting the following variables:

  • a is set to the output of the random function. The random function is retrieving 2000 random tuples from the Solr Cloud collection containing steel prices.
  • is set to the output of the random function. The random function is retrieving 2000 random tuples from the Solr Cloud collection containing plastic prices.
  • c is set to the output of the col function, which is copying the price field from the tuples stored in variable a to an array. This is an array of steel prices.
  • is set to the output of the col function, which is copying the price field from the tuples stored in variable b to an array. This is an array of plastic prices.
  • The steel variable is set to the output of the empiricalDistribution function, which is creating an empirical distribution from the array of steel prices.
  • The plastic variable is set to the output of the empiricalDistribution function, which is creating an empirical distribution from the array of plastic prices.
  • e is set to the output of the monteCarlo function. The monteCarlo function runs the function with the formula for unit costs of steel and plastic. Random samples from the empirical distributions for steel and plastic are pulled for each run.
  • f is set to the output of the hist function. The hist function returns the histogram of the output from the pricing simulation. A histogram is used instead of the frequency table when dealing with floating point data.

Thursday, October 5, 2017

How to Model and Remove Time Series Seasonality With Solr 7.1

Often when working with time series data there is a cycle that repeats periodically. This periodic cycle is referred to as seasonality. Seasonality may have a large enough effect on the data that it makes it difficult to study other features of the time series. In Solr 7.1 there are new Streaming Expression statistical functions that allow us to model and remove time series seasonality.

If you aren't familiar with Streaming Expressions new statistical programming functions you may find it useful to read a few of the earlier blogs which introduce the topic.


Seasonality

Often seasonality appears in the data as periodic bumps or waves. These waves can be expressed as sine-waves. For this example we'll start off by generating some smooth sine-waves to represent seasonality. We'll be using Solr's statistical functions to generate the data and Sunplot to plot the sine-waves.

Here is a sample plot using Sunplot:



In the plot you'll notice there are waves in the data occurring at regular intervals. These waves represent the seasonality.

The expression used to generate the sine-waves is:

let(smooth=sin(sequence(100, 0, 6)),
    plot(type=line, y=smooth))        
 
In the function above the let expression is setting a single variable called smooth. The value set to smooth is an array of numbers generated by the sequence function that is wrapped and transformed by the sin function.

Then the let function runs the plot function with the smooth variable as the y axis. Sunplot then plots the data.

This sine-wave is perfectly smooth so the entire time series consists only of seasonality. To make things more interesting we can add some noise to the sign-waves to represent another component of the time series.



Now the expression looks like this:

let(smooth=sin(sequence(100, 0, 6)),
    noise=sample(uniformDistribution(-.25,.25),100),
    noisy=ebeAdd(smooth, noise),     
    plot(type=line, y=noisy))   


In the expression above we first generate the smooth sine-wave and set it to the variable smooth. Then we generate some random noise by taking a sample from a uniform distribution. The random samples will be uniformly distributed between -.25 and .25. The variable noise holds the array of random noise data.

Then the smooth and noise arrays are added together using the ebeAdd function. The ebeAdd function does an element-by-element addition of the two arrays and outputs an array with the results. This will add the noise to the sine-wave. The variable noisy holds this new noisy array of data.

The noisy array is then set to the y axis of the plot.

Now we have a time series that has both a seasonality component and noisy signal. Let's see how we can model and remove the seasonality so we can study the noisy component.

Modeling Seasonality 

We can model the seasonality using the new polyfit function to fit a curve to the data. The polyfit function is a polynomial curve fitter which builds a function that models non-linear data.

Below is a screenshot of the polyfit function:



Notice that now there is a smooth red curve which models the noisy time series. This is the smooth curve that the polyfit function fit to the noisy time series.

Here is the expression:

let(smooth=sin(sequence(100, 0, 6)),
    noise=sample(uniformDistribution(-.25,.25),100),
    noisy=ebeAdd(smooth,noise), 
    fit=polyfit(noisy, 16),
    x=sequence(100,0,1),          
    list(tuple(plot=line, x=x, y=noisy),
         tuple(plot=line, x=x, y=fit)))   

In the expression above we first build the noisy time series. Then the polyfit function is called on the noisy array with a polynomial degree. The degree describes the exponent size of the polynomial used in the curve fitting function. As the degree rises the function has more flexibility in the curves that it can model. For example a degree of 1 provides a linear model. You can try different degrees until you find the one that best fits your data set. In this example a 16 degree polynomial is used to fit the sine-wave.

Notice that when plotting two lines we use a slightly different plotting syntax. In the syntax above a list of output tuples is used to define the plot for Sunplot. When plotting two plots an x axis must be provided. The sequence function is used to generate an x axis.

 Removing the Seasonality

Once we've fit a curve to the time series we can subtract it away to remove the seasonality. After the subtraction what's left is the noisy signal that we want to study.

Below is a screenshot showing the subtraction of the fitted curve:


Notice that the plot now shows the data that remains after the seasonality has been removed. This time series is now ready to be studied without the effects of the seasonality.

Here is the expressions:

let(smooth=sin(sequence(100, 0, 6)),
    noise=sample(uniformDistribution(-.25,.25),100),
    noisy=ebeAdd(smooth,noise), 
    fit=polyfit(noisy, 16),
    stationary=ebeSubtract(noisy, fit),          
    plot(type=line, y=stationary))     
             
In the expression above the fit array, which holds the fitted curve, is subtracted from the noisy array. The ebeSubtract function performs the element-by-element subtraction. The new time series with the seasonality removed is stored in the stationary variable and plotted on the y axis.

Solr temporal graph queries for event correlation, root cause analysis and temporal anomaly detection

Temporal graph queries will be available in the 8.9 release of Apache Solr. Temporal graph queries are designed for key log analytics use ...