**Probability Distributions**

Before diving into Monte Carlo simulations, I'll briefly introduce Solr's probability distribution framework. We'll start slowly and cover just enough about probability distributions to support the Monte Carlo examples. Future blogs will go into more detail about Solr's probability distribution framework.

First let's start with a definition. A probability distribution is a function that describes the probability of a random variable taking a given value within a data set.

A simple example will help clarify the concept.

**Uniform Integer Distribution**

One commonly used probability distribution is the *uniform integer distribution*.

The uniform integer distribution is a function that describes a theoretical data set that is randomly distributed over a range of integers.

With the Streaming Expression statistical function library you can create a uniform integer distribution with the following function call:

uniformIntegerDistribution(1,6)

The function above returns a uniform integer distribution with a range of 1 to 6.
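For reference, the model behind this call can be written as a probability mass function. This is a sketch that assumes both endpoints of the range are inclusive, which is what the dice examples in this post imply. The symbols m and n are just labels for the lower and upper bounds:

```latex
P(X = k) = \frac{1}{n - m + 1}, \qquad k \in \{m, m+1, \dots, n\}
```

With m = 1 and n = 6, each of the six values has probability 1/6.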

**Sampling the Distribution**

The uniformIntegerDistribution function returns the mathematical model of the distribution. We can draw a random sample from the model using the *sample* function.

let(a=uniformIntegerDistribution(1, 6),

b=sample(a))

In the example above the *let* expression is setting two variables:

- *a* is set to the output of the uniformIntegerDistribution function, which is returning the uniform integer distribution model.

- *b* is set to the output of the *sample* function, which is returning a single random sample from the distribution.

Solr returns the following result from the expression above:

{
"result-set": {
"docs": [
{
"b": 4
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}

Notice that in the output above the variable *b* = 4. The value 4 is the random sample taken from the uniform integer distribution.
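To make the split between the model and the sample concrete, here is a rough Python analog. It is not Solr code: the class name `UniformIntegerDistribution` and its `sample` method are hypothetical stand-ins for the two Streaming Expression functions, and the range is assumed to be inclusive on both ends.

```python
import random


class UniformIntegerDistribution:
    """Hypothetical stand-in for the model returned by uniformIntegerDistribution."""

    def __init__(self, low, high):
        self.low = low
        self.high = high

    def sample(self, rng=random):
        # randint includes both endpoints, so 1..6 can return 6.
        return rng.randint(self.low, self.high)


# Build the model once, then draw a single random sample from it.
a = UniformIntegerDistribution(1, 6)
b = a.sample()
print({"b": b})
```

The model itself holds no data. It only describes the range, and every call to `sample` draws a fresh value from it.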

**The Monte Carlo Simulation**

We now know enough about probability distributions to run our first Monte Carlo simulation.

For our first simulation we are going to simulate the rolling of a pair of six-sided dice.

Here is the code:

let(a=uniformIntegerDistribution(1, 6),

b=uniformIntegerDistribution(1, 6),

c=monteCarlo(add(sample(a), sample(b)), 10))

The expression above is setting three variables:

- *a* is set to a uniform integer distribution with a range of 1 to 6.

- *b* is also set to a uniform integer distribution with a range of 1 to 6.

- *c* is set to the outcome of the monteCarlo function.

The monteCarlo function runs a function a specified number of times, collects the outputs into an array, and returns the array.

In the example above the function add(sample(a), sample(b)) is run 10 times.

Each time the function is called, a sample is drawn from the distribution models stored in the variables *a* and *b*. The two random samples are then added together. Each run simulates rolling a pair of dice. The results of the 10 rolls are gathered into an array and returned.

The output from the expression above looks like this:

{
"result-set": {
"docs": [
{
"c": [
6,
6,
8,
8,
9,
7,
6,
8,
7,
6
]
},
{
"EOF": true,
"RESPONSE_TIME": 1
}
]
}
}
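The dice simulation above can be sketched in plain Python to show what the monteCarlo step is doing. This is only an illustration: `monte_carlo` and `roll_pair` are hypothetical names, and the fixed seed is there purely so the run is repeatable.

```python
import random


def monte_carlo(fn, runs):
    """Call fn the given number of times and collect the outputs in a list."""
    return [fn() for _ in range(runs)]


rng = random.Random(42)  # fixed seed only for repeatability


def roll_pair():
    # One draw per die, then add the two samples together.
    return rng.randint(1, 6) + rng.randint(1, 6)


c = monte_carlo(roll_pair, 10)
print(c)
```

The important detail is that the function is re-evaluated on every run, so each of the 10 results comes from two fresh samples.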

**Counting the Results with a Frequency Table**

The results of the dice simulation can be analyzed using a frequency table:

let(a=uniformIntegerDistribution(1, 6),

b=uniformIntegerDistribution(1, 6),

c=monteCarlo(add(sample(a), sample(b)), 100000),

d=freqTable(c))

Now we are running the simulation 100,000 times rather than 10. We are then using the *freqTable* function to count the frequency of each value in the array.

Sunplot provides a nice table view of the frequency table. The frequency table below shows the **percent**, **count**, **cumulative frequency** and **cumulative percent** for each **value (2-12)** in the simulation array.

**Plotting the Results**

Sunplot can also be used to plot specific columns from the frequency table.

In the plot below the *value* column (2-12) from the frequency table is plotted on the *x* axis. The *pct* column (percent) from the frequency table is plotted on the *y* axis.

Below is the plotting expression:

let(a=uniformIntegerDistribution(1, 6),

b=uniformIntegerDistribution(1, 6),

c=monteCarlo(add(sample(a), sample(b)), 100000),

d=freqTable(c),

x=col(d, value),

y=col(d, pct),

plot(type=bar, x=x, y=y))

Notice that the *x* and *y* variables are set using the *col* function. The *col* function moves a field from a list of tuples into an array. In this case it's moving the *value* and *pct* fields from the frequency table tuples into arrays.

We've just completed our first Monte Carlo simulation and plotted the results. As a bonus we've learned the probabilities of a craps game!
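As a sanity check on the simulated percentages, the exact probabilities for two dice can be computed by enumerating all 36 outcomes. The Python sketch below builds a simple frequency table and prints it next to the exact values. The `value` and `pct` field names follow the plotting expression above. The `count`, `cumFreq` and `cumPct` names, and the helper functions themselves, are assumptions made for this illustration and are not Solr's API.

```python
import random
from collections import Counter
from fractions import Fraction


def freq_table(values):
    """Count each distinct value and track running totals, in sorted order."""
    counts = Counter(values)
    total = len(values)
    rows, cumulative = [], 0
    for value in sorted(counts):
        cumulative += counts[value]
        rows.append({
            "value": value,
            "count": counts[value],
            "pct": counts[value] / total,
            "cumFreq": cumulative,
            "cumPct": cumulative / total,
        })
    return rows


def exact_two_dice():
    """Exact probability of each sum, from the 36 equally likely outcomes."""
    counts = Counter(i + j for i in range(1, 7) for j in range(1, 7))
    return {total: Fraction(n, 36) for total, n in counts.items()}


rng = random.Random(7)  # fixed seed only for repeatability
rolls = [rng.randint(1, 6) + rng.randint(1, 6) for _ in range(100000)]
table = freq_table(rolls)
exact = exact_two_dice()

for row in table:
    print(row["value"], row["count"], round(row["pct"], 4),
          float(exact[row["value"]]))
```

The exact values are 1/36 for a 2 or a 12 and 1/6 for a 7, and with 100,000 runs the simulated *pct* column should land close to them.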

**Simulations with Real World Data**

The example above is using a **theoretical probability distribution**. There are many different theoretical distributions used in different fields. The first release of Solr's probability distribution framework includes some of the best known distributions including: the normal, log normal, Poisson, uniform, binomial, gamma, beta, Weibull and Zipf distributions.

Each of these distributions is designed to model a particular theoretical data set.

Solr also provides an *empirical distribution* function which builds a mathematical model based only on actual data. Empirical distributions can be sampled in exactly the same way as theoretical distributions. This means we can mix and match *empirical distributions* and *theoretical distributions* in Monte Carlo simulations.

Let's take a very brief look at a Monte Carlo simulation using empirical distributions pulled from Solr Cloud collections.

In this example we are building a new product which is made up of **steel** and **plastic**. Both steel and plastic are bought by the ton on the open market. We have historical pricing data for both steel and plastic and we want to simulate the unit costs based on the historical data.

Here is our simulation expression:

let(a=random(steel, q="*:*", fl="price", rows="2000"),

b=random(plastic, q="*:*", fl="price", rows="2000"),

c=col(a, price),

d=col(b, price),

steel=empiricalDistribution(c),

plastic=empiricalDistribution(d),

e=monteCarlo(add(mult(sample(steel), .0005),

mult(sample(plastic), .0021)),

100000),

f=hist(e))

In the example above the *let* expression is setting the following variables:

- *a* is set to the output of the *random* function. The random function is retrieving 2000 random tuples from the Solr Cloud collection containing steel prices.

- *b* is set to the output of the *random* function. The random function is retrieving 2000 random tuples from the Solr Cloud collection containing plastic prices.

- *c* is set to the output of the *col* function, which is copying the *price* field from the tuples stored in variable *a* to an array. This is an array of *steel* prices.

- *d* is set to the output of the *col* function, which is copying the *price* field from the tuples stored in variable *b* to an array. This is an array of *plastic* prices.

- The *steel* variable is set to the output of the empiricalDistribution function, which is creating an empirical distribution from the array of *steel* prices.

- The *plastic* variable is set to the output of the empiricalDistribution function, which is creating an empirical distribution from the array of *plastic* prices.

- *e* is set to the output of the *monteCarlo* function. The *monteCarlo* function runs the function with the formula for unit costs of steel and plastic. Random samples from the empirical distributions for steel and plastic are pulled for each run.

- *f* is set to the output of the *hist* function. The hist function returns the histogram of the output from the pricing simulation. A histogram is used instead of the frequency table when dealing with floating point data.
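The steps above can be sketched in plain Python to show the shape of the computation. Several things here are assumptions made for illustration: the price arrays are invented (uniform random numbers standing in for the historical data), sampling is done by simple resampling of the observed prices, which may differ from how Solr's empirical distribution model draws samples, and `simulate_unit_costs` and `histogram` are hypothetical helpers. Only the multipliers .0005 and .0021 come from the expression above.

```python
import random


def simulate_unit_costs(steel_prices, plastic_prices, runs, rng):
    """Draw one steel and one plastic price per run and apply the cost formula."""
    costs = []
    for _ in range(runs):
        steel = rng.choice(steel_prices)      # resample an observed price
        plastic = rng.choice(plastic_prices)
        costs.append(steel * 0.0005 + plastic * 0.0021)
    return costs


def histogram(values, bins):
    """Equal-width bins over the data range, as (bin start, count) pairs."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid a zero width if all values match
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # top edge goes in last bin
        counts[idx] += 1
    return [(lo + i * width, counts[i]) for i in range(bins)]


rng = random.Random(1)  # fixed seed only for repeatability
# Invented prices per ton, standing in for the 2000 tuples per collection.
steel_prices = [rng.uniform(600, 900) for _ in range(2000)]
plastic_prices = [rng.uniform(1000, 1500) for _ in range(2000)]

e = simulate_unit_costs(steel_prices, plastic_prices, 100000, rng)
f = histogram(e, 10)
for start, count in f:
    print(round(start, 3), count)
```

Because the unit costs are floating point values, almost every run produces a distinct number, which is why the results are grouped into bins rather than counted per value.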