Tuesday, October 10, 2017

A Gentle Introduction to Monte Carlo Simulations in Solr 7.1

Monte Carlo simulations have been added to Streaming Expressions in Solr 7.1. This blog provides a gentle introduction to the topic of Monte Carlo simulations and shows how they are supported with the Streaming Expressions statistical function library.

Probability Distributions

Before diving into Monte Carlo simulations I'll briefly introduce Solr's probability distribution framework. We'll start slowly and cover just enough about probability distributions to support the Monte Carlo examples. Future blogs will go into more detail about Solr's probability distribution framework.

First let's start with a definition of what a probability distribution is. A probability distribution is a function which describes the probability of a random variable within a data set.

A simple example will help clarify the concept.

Uniform Integer Distribution

One commonly used probability distribution is the uniform integer distribution.

The uniform integer distribution is a function that describes a theoretical data set that is randomly distributed over a range of integers.

With the Streaming Expression statistical function library you can create a uniform integer distribution with the following function call:

uniformIntegerDistribution(1,6) 

The function above returns a uniform integer distribution with a range of 1 to 6.

Sampling the Distribution

The uniformIntegerDistribution function returns the mathematical model of the distribution. We can draw a random sample from the model using the sample function.

let(a=uniformIntegerDistribution(1, 6),
    b=sample(a))

In the example above the let expression is setting two variables:
  • a is set to output of the uniformIntegerDistribtion function, which is returning the uniform integer distribution model.
  • b is set to the output of the sample function which is returning a single random sample from the distribution.
Solr returns the following result from the expression above:

{ "result-set": { "docs": [ { "b": 4 }, { "EOF": true, "RESPONSE_TIME": 0 } ] } }

Notice in the output above the variable b = 4.  4 is the random sample taken from the uniform integer distribution. 

The Monte Carlo Simulation

We now know enough about probability distributions to run our first Monte Carlo simulation.

For our first simulation we are going to simulate the rolling of a pair of six sided dice.

Here is the code:

let(a=uniformIntegerDistribution(1, 6),
    b=uniformIntegerDistribution(1, 6),
    c=monteCarlo(add(sample(a), sample(b)), 10))

The expression above is setting three variables:
  • a is set to a uniform integer distribution with a range of 1 to 6.
  • b is also set to a uniform integer distribution with a range of 1 to 6.
  • c is set to the outcome of the monteCarlo function.
The monteCarlo function runs a function a specified number of times and collects the outputs into an array and then returns the array.

In the example above the function add(sample(a), sample(b)) is run 10 times.

Each time the function is called, a sample is drawn from the distribution models stored in the variables a and b.  The two random samples are then added together.

Each run simulates rolling a pair of dice. The results of the 10 rolls are gathered into an array and returned.

The output from the expression above looks like this:

{ "result-set": { "docs": [ { "c": [ 6, 6, 8, 8, 9, 7, 6, 8, 7, 6 ] }, { "EOF": true, "RESPONSE_TIME": 1 } ] } }

Counting the Results with a Frequency Table

The results of the dice simulation can be analyzed using a frequency table:

let(a=uniformIntegerDistribution(1, 6),
    b=uniformIntegerDistribution(1, 6),
    c=monteCarlo(add(sample(a), sample(b)), 100000), 
    d=freqTable(c))

Now we are running the simulation 100,000 times rather 10. We are then using the freqTable function to count the frequency of each value in the array. 

Sunplot provides a nice table view of the frequency table. The frequency table below shows the percent, count, cumulative frequency and cumulative percent for each value (2-12) in the simulation array.




Plotting the Results

Sunplot can also be used to plot specific columns from the frequency table.

In plot below the value column (2-12) from the frequency table is plotted on the x axis. The pct column (percent) from the frequency table is plotted on the y axis.



Below is the plotting expression:

let(a=uniformIntegerDistribution(1, 6),
    b=uniformIntegerDistribution(1, 6),
    c=monteCarlo(add(sample(a), sample(b)), 100000),
    d=freqTable(c),
    x=col(d, value),
    y=col(d, pct),
    plot(type=bar, x=x, y=y)) 

Notice that the x and y variables are set using the col function. The col function moves a field from a list of tuples into an array. In this case it's moving the the value and pct fields from the frequency table tuples into arrays.

We've just completed our first Monte Carlo simulation and plotted the results. As a bonus we've learned the probabilities of a craps game!

Simulations with Real World Data

The example above is using a theoretical probability distribution. There are many different theoretical distributions used in different fields. The first release of Solr's probability distribution framework includes some of the best known distributions including: the normal, log normal, poisson, uniform, binomial, gamma, beta, Wiebull and ZipF distributions.

Each of these distributions are designed to model a particular theoretical data set.

Solr also provides an empirical distribution function which builds a mathematical model based only on actual data.  Empirical distributions can be sampled in exactly the same way as theoretical distributions. This means we can mix and match empirical distributions and theoretical distributions in Monte Carlo simulations.

Let's take a very brief look at a Monte Carlo simulation using empirical distributions pulled from Solr Cloud collections.

In this example we are building a new product which is made up of steel and plastic. Both steel and plastic are bought by the ton on the open market. We have historical pricing data for both steel and plastic and we want to simulate the unit costs based on the historical data.

Here is our simulation expression:

let(a=random(steel, q="*:*", fl="price", rows="2000"),
    b=random(plastic, q="*:*", fl="price", rows="2000")
    c=col(a, price),
    d=col(b, price),
    steel=empiricalDistribtion(c),
    plastic=empiricalDistribtion(d),
    e=monteCarlo(add(mult(sample(steel), .0005), 
                     mult(sample(plastic), .0021)), 
                 100000),
    f=hist(e)

In the example above the let expression is setting the following variables:

  • a is set to the output of the random function. The random function is retrieving 2000 random tuples from the Solr Cloud collection containing steel prices.
  • is set to the output of the random function. The random function is retrieving 2000 random tuples from the Solr Cloud collection containing plastic prices.
  • c is set to the output of the col function, which is copying the price field from the tuples stored in variable a to an array. This is an array of steel prices.
  • is set to the output of the col function, which is copying the price field from the tuples stored in variable b to an array. This is an array of plastic prices.
  • The steel variable is set to the output of the empiricalDistribution function, which is creating an empirical distribution from the array of steel prices.
  • The plastic variable is set to the output of the empiricalDistribution function, which is creating an empirical distribution from the array of plastic prices.
  • e is set to the output of the monteCarlo function. The monteCarlo function runs the function with the formula for unit costs of steel and plastic. Random samples from the empirical distributions for steel and plastic are pulled for each run.
  • f is set to the output of the hist function. The hist function returns the histogram of the output from the pricing simulation. A histogram is used instead of the frequency table when dealing with floating point data.

Feature Scaling with Solr Streaming Expressions

Before performing machine learning operations its often important to scale the feature vectors so they can be compared at the same scale. In...