Sunday, November 5, 2017

An Introduction to Markov Chains in Solr 7.2

This blog introduces Solr's new Markov Chain implementation coming in Solr 7.2.  We'll first cover how Markov Chains work and then show how they are supported through the Streaming Expression statistical library.

Markov Chain

A Markov Chain uses probabilities to model the state transitions of a process. A simple example taken from the Markov Chain wiki page will help illustrate this.

In this example we'll be modeling the state transitions of the stock market. In our model the stock market can have three possible states:
  • Bull
  • Bear
  • Stagnant
There are two important characteristics of a Markov Process:

1) The process can only be in one state at a time.

2) The probability of transitioning between states is based only on the current state. This is known as "memorylessness". 

Below is a state diagram of the probabilities of transferring between the different states.



In the diagram above, the lines show the probabilities of transitioning between the different states. For example there is .075 probability of transitioning from a Bull market to a Bear market.  There is .025 probability of transitioning from a Bull market to a Stagnant market. There is a .9 probability of a Bull market transitioning to another Bull market.

The state transition probabilities in this example can be captured in a 3x3 matrix called a transition matrix. The transition matrix for this example is:
                    
                   Bull   Bear    Stagnant
Bull         |     .9     .075    .025    |
Bear         |     .15    .8      .05     |
Stagnant     |     .25    .25     .5      |

Notice each state has a row in the matrix. The values in the columns hold the transition probabilities for each state. 

For example row 0, column 0 is the probability of the Bull market transitioning to another Bull market. Row 1, column 0 is the probability of the Bear market transitioning to a Bull market.

A Markov Chain uses the transition matrix to model and simulate the transitions in the process. A code example will make this more clear.

Working with Matrices

In Solr 7.2 support for matrices has been added to the Streaming Expression statistical function library. Below is the expression for creating the example transition matrix:

let(a=array(.9, .075, .025),
    b=array(.15, .8, .05),
    c=array(.25, .25, .5),
    d=matrix(a, b, c))
  
In the expression above the rows of the matrix are created as numeric arrays and set to variables a, b and c. Then the arrays are passed to the matrix function to instantiate the matrix.

If we send this expression to Solr's /stream handler it responds with:

{ "result-set": { "docs": [ { "d": [ [ 0.9, 0.075, 0.025 ], [ 0.15, 0.8, 0.05 ], [ 0.25, 0.25, 0.5 ] ] }, { "EOF": true, "RESPONSE_TIME": 0 } ] } }

Markov Chain Simulations

Once the transition matrix is created its very easy to create a Markov Chain and simulate the process. Here is a sample expression:

let(a=array(.9, .075, .025),
    b=array(.15, .8, .05),
    c=array(.25, .25, .5),
    d=matrix(a, b, c),
    e=markovChain(d),
    f=sample(e, 5))

In the expression above the transition matrix is created and then passed as a parameter to the markovChain function. The markovChain function returns a Markov Chain for the specific transition matrix.

The Markov Chain can then be sampled using the sample function. In the example above 5 samples are taken from the Markov Chain. The samples represent the state that the process is in. This transition matrix has three states so each sample will either be 0 (Bull), 1 (Bear) or 2 (Stagnant).

Each time the Markov Chain is sampled it returns the next state of the process based on the transition probabilities of its current state.

If we send this expression to the /stream handler it may respond with:

{ "result-set": { "docs": [ { "f": [ 0, 0, 0, 2, 2 ] }, { "EOF": true, "RESPONSE_TIME": 0 } ] } }

Notice that 5 samples were returned 0, 0, 0, 2, 2. This corresponds to three consecutive Bull states followed by two Stagnant states. 

Finding the Long Term Average of the States

By increasing the number of samples we can determine how much time the process will spend in each state over the long term. Here is an example expression:

let(a=array(.9,  .075, .025),
    b=array(.15, .8,   .05),
    c=array(.25, .25,  .5),
    d=matrix(a, b, c),
    e=sample(markovChain(d), 200000),
    f=freqTable(e))

Notice that now instead of 5 samples we are taking 200,000 samples. Then we are creating a frequency table from the simulation array using the freqTable function. This will tell us the percentage of time spent in each state.

If we send this expression to the /stream handler it responds with the breakdown of the frequency table:

{ "result-set": { "docs": [ { "f": [ { "pct": 0.62636, "count": 125272, "cumFreq": 125272, "cumPct": 0.62636, "value": 0 }, { "pct": 0.310705, "count": 62141, "cumFreq": 187413, "cumPct": 0.937065, "value": 1 }, { "pct": 0.062935, "count": 12587, "cumFreq": 200000, "cumPct": 1, "value": 2 } ] }, { "EOF": true, "RESPONSE_TIME": 56 } ] } }

Notice in the response above that there are three tuples returned in the frequency table, one for each state (Bull, Bear, Stagnant).

The value field in each tuple is the numeric mapping of the state (0=Bull, 1=Bear, 2=Stagnant).

The pct field in each tuple is the percentage of time the value appears in the sample set. We can see that the process is in a Bull market 62.6% of the time, Bear market 31% and Stagnant 6.3% of the time.

Feature Scaling with Solr Streaming Expressions

Before performing machine learning operations its often important to scale the feature vectors so they can be compared at the same scale. In...