Tuesday, May 30, 2017

Statistical programming with Solr Streaming Expressions

In the previous blog we explored the new timeseries function and introduced the syntax for math expressions. In this blog we'll dive deeper into math expressions and explore the statistical programming functions rolling out in the next release.

Let's first learn how the statistical expressions work and then look at how we can perform statistical analysis on retrieved result sets.

Array Math


The statistical functions create, manipulate and perform math on arrays. One of the basic things that we can do is create an array with the array function:

array(2, 3, 4, 3, 6)

The array function simply returns an array of numbers. If we send the array function above to Solr's stream handler it responds with:

{ "result-set": { "docs": [ { "return-value": [ 2, 3, 4, 3, 6 ] }, { "EOF": true, "RESPONSE_TIME": 1 } ] } }

Notice that the stream handler returns a single Tuple with the return-value field pointing to the array. This is how Solr responds when given a statistical function to evaluate.

This is a new behavior for Solr. In the past the stream handler always returned streams of Tuples. Now the stream handler can directly perform mathematical functions.

Let's explore a few more of the new array math functions. We can manipulate arrays in different ways. For example we can reverse the array like this:

rev(array(2, 3, 4, 3, 6))

Solr returns the following from this expression:

{ "result-set": { "docs": [ { "return-value": [ 6, 3, 4, 3, 2 ] }, { "EOF": true, "RESPONSE_TIME": 0 } ] } }

We can describe the array:

describe(array(2, 3, 4, 3, 6))

{ "result-set": { "docs": [ { "return-value": { "sumsq": 74, "max": 6, "var": 2.3000000000000003, "geometricMean": 3.365865436338599, "sum": 18, "kurtosis": 1.4555765595463175, "N": 5, "min": 2, "mean": 3.6, "popVar": 1.8400000000000003, "skewness": 1.1180799331493778, "stdev": 1.5165750888103102 } }, { "EOF": true, "RESPONSE_TIME": 31 } ] } }

Now we see our first bit of statistics. The describe function provides descriptive statistics for the array.

We can correlate arrays:

corr(array(2, 3, 4, 3, 6),
       array(-2, -3, -4, -3, -6))

This returns:

{ "result-set": { "docs": [ { "return-value": -1 }, { "EOF": true, "RESPONSE_TIME": 2 } ] } }


The corr function performs the Pearson Product Moment correlation on the two arrays. In this case the arrays are perfectly negatively correlated.

We can perform a simple regression on the arrays:

regress(array(2, 3, 4, 3, 6),
             array(-2, -3, -4, -3, -6))

{ "result-set": { "docs": [ { "return-value": { "significance": 0, "totalSumSquares": 9.2, "R": -1, "meanSquareError": 0, "intercept": 0, "slopeConfidenceInterval": 0, "regressionSumSquares": 9.2, "slope": -1, "interceptStdErr": 0, "N": 5 } }, { "EOF": true, "RESPONSE_TIME": 9 } ] } }


All statistical functions in the initial release are backed by Apache Commons Math. The initial release includes a core group of functions that support:

  • Rank transformations
  • Histograms
  • Percentiles
  • Simple regression and predict functions
  • One way ANOVA
  • Correlation
  • Covariance
  • Descriptive statistics
  • Convolution
  • Finding the delay in signals/time series
  • Lagged regression
  • Moving averages
  • Sequence generation
  • Calculating Euclidean distance between arrays
  • Data normalization and scaling
  • Array creation and manipulation functions
Statistical functions can be applied to:
  1.  Time series result sets
  2.  Random sampling result sets
  3.  SQL result sets (Solr's Internal Parallel SQL)
  4.  JDBC result sets (External JDBC Sources)
  5.  K-Nearest Neighbor results sets
  6.  Graph Expression result sets
  7.  Search result sets
  8.  Faceted aggregation result sets
  9.  MapReduce result sets 


Array Math on Solr Result Sets


Let's now explore how we can apply statistical functions on Solr result sets. In the example below we'll correlate arrays of moving averages for two stocks:

let(stockA = sql(stmt="select closing_price from price_data where ticker='stockA' and ..."),
     stockB = sql(stmt="select closing_price from price_data where ticker='stockB' and ..."),
     pricesA = col(stockA, closing_price),
     pricesB = col(stockB, closing_price),
     movingA = movingAvg(pricesA, 30),
     movingB = movingAvg(pricesB, 30),
     tuple(correlation=corr(movingA, movingB)))

Let's break down how this expression works:

1) The let expression is setting variables and then returning a single output tuple.
2) The first two variables stockA and stockB contain result sets from sql expressions. The sql expressions return tuples with the closing prices for stockA and stockB.
3) The next two variables pricesA and pricesB are created by the col function. The col function creates a numeric array from a list of Tuples. In this example pricesA contains the closing prices for stockA and pricesB contains the closing prices for stockB.
4) The next two variables movingA and movingB are created by the movingAvg function. In this example movingA and movingB contain arrays with the moving averages calculated from the pricesA and pricesB arrays.
5) In the final step we output a single Tuple containing the correlation of the movingA and movingB arrays. The correlation is computed using the corr function.



Monday, May 1, 2017

Exploring Solr's New Time Series and Math Expressions

In Solr 6.6 the Streaming Expression library has added support for time series and math expressions. This blog will walk through an example of how to use these exciting features.


Time Series


Time series aggregations are supported through the timeseries Streaming Expression. The timeseries expression uses the json facet api under the covers so the syntax will be familiar if you've used Solr date range syntax.

Here is the basic syntax:

timeseries(collection, 
                 field="test_dt", 
                 q="*:*",
                 start="2012-05-01T00:00:00Z",
                 end="2012-06-30T23:59:59Z",
                 gap="+1MONTH", 
                 count(*))

When sent to Solr this expression will return results that look like this:

{ "result-set": { "docs": [ { "test_dt": "2012-05-01T00:00:00Z", "count(*)": 247007 }, { "test_dt": "2012-06-01T00:00:00Z", "count(*)": 247994 }, { "EOF": true, "RESPONSE_TIME": 9 } ] } }

Solr takes care of the date math and builds the time range buckets automatically. Solr also fills in any gaps in the range with buckets automatically and adds zero aggregation values. Any Solr query can be used to select the records. 

The supported aggregations are: count(*), sum(field), avg(field), min(field), max(field).

The timeseries function is quite powerful on it's own, but it grows in power when combined with math expressions.


Math Expressions


In Solr 6.6 the Streaming Expression library also adds math expressions. This is a larger topic then one blog can cover, but I'll hit some of highlights by slowly building up a math expression.


Let and Get


The fun begins with the let and get expressions. let is used to assign tuple streams to variables and get is used to retrieve the stream later in the expression. Here is the most basic example:


let(a=timeseries(collection, field="test_dt", q="*:*",
                          start="2012-05-01T00:00:00Z",
                          end="2012-06-30T23:59:59Z",
                          gap="+1MONTH", 
                          count(*)),
      get(a))

In the example above the timeseries expression is being set to the variable a. Then the get expression is used to turn the variable a back into a stream.

The let expression allows you to set any number of variables, and assign a single Streaming Expression to run the program logic. The expression that runs the program logic has access to the variables. The basic structure of let is:

let(a=expr,
     b=expr,
     c=expr,
     expr)

The first three name/value pairs are setting variables and the final expression is the program logic that will use the variables.

If we send the let expression with the timeseries to Solr it returns with:

{ "result-set": { "docs": [ { "test_dt": "2012-05-01T00:00:00Z", "count(*)": 247007 }, { "test_dt": "2012-06-01T00:00:00Z", "count(*)": 247994 }, { "EOF": true, "RESPONSE_TIME": 9 } ] } }

This is the exact same response we would get if we sent the timeseries expression alone. Thats because all we did was assign the expression to a variable and use get to stream out the results.

Implementation Note: Under the covers the let expression sets each variable by executing the expressions and adding the tuples to a list. It then maps the variable name to the list in memory so that it can be retrieved by the variable name. So in memory Streams are converted to lists of tuples.


The Select Expression


The select expression has been around for a long time, but it now plays a central role in math expressions. The select expression wraps another expression and applies a list of Stream Evaluators to each tuple. Stream Evaluators perform operations on the tuples. 

The Streaming Expression library now includes a base set of numeric evaluators for performing math on tuples. Here is an example of select in action:

let(a=timeseries(collection, field="test_dt", q="*:*",
                          start="2012-05-01T00:00:00Z",
                          end="2012-06-30T23:59:59Z",
                          gap="+1MONTH", 
                          count(*)),
      b=select(get(a),  
                     mult(-1, count(*)) as negativeCount,
                     test_dt),
      get(b))

In the example above we've set a timeseries to variable a.

Then we are doing something really interesting with variable b. We are transforming the timeseries tuples stored in variable a with the select expression. 

The select expression is reading all the tuples from the get(a) expression and applying the mult stream evaluator to each tuple. The mult Streaming Evaluator is multiplying -1 to the value in the count(*) field of the tuples and assigning it to the field negativeCount. Select is also outputting the test_dt field from the tuples.

The transformed tuples are then assigned to variable b.

Then get(b) is used to output the transformed tuples. If you send this expression to Solr it outputs:

{ "result-set": { "docs": [ { "test_dt": "2012-05-01T00:00:00Z", "negativeCount": -247007 }, { "test_dt": "2012-06-01T00:00:00Z", "negativeCount": -247994 }, { "EOF": true, "RESPONSE_TIME": 9 } ] } }

Implementation Note: The get expression creates new tuples when it streams tuples from a variable. So you never have to worry about side effects. In the example above variable a was unchanged when the tuples were transformed and assigned to variable b.



The Tuple Expression


The basic data structure of Streaming Expressions is a Tuple. A Tuple is a set of name/value pairs. In the 6.6 release of Solr there is a Tuple expression which allows you to create your own output tuple. Here is the sample syntax:

let(a=timeseries(collection, field="test_dt", q="*:*",
                          start="2012-05-01T00:00:00Z",
                          end="2012-06-30T23:59:59Z",
                          gap="+1MONTH", 
                          count(*)),
      b=select(get(a),  
                     mult(-1, count(*)) as negativeCount,
                     test_dt),
      tuple(seriesA=a,
               seriesB=b))

The example above defines an output tuple with two fields: seriesA and seriesB, both of these fields have been assigned a variable. Remember that variables a and b are pointers to lists of tuples. This is exactly how they will be output by the tuple expression.

If you send the expression above to Solr it will respond with:

{ "result-set": { "docs": [ { "seriesA": [ { "test_dt": "2012-05-01T00:00:00Z", "count(*)": 247007 }, { "test_dt": "2012-06-01T00:00:00Z", "count(*)": 247994 } ], "seriesB": [ { "test_dt": "2012-05-01T00:00:00Z", "negativeCount": -247007 }, { "test_dt": "2012-06-01T00:00:00Z", "negativeCount": -247994 } ] }, { "EOF": true, "RESPONSE_TIME": 7 } ] } }

Now we have both the original time series and the transformed time series in the output.

The Col Evaluator


Lists of tuples are nice, but for performing many math operations what we need are columns of numbers. There is a special evaluator called col which can be used to pull out a column of numbers from a list of tuples.

Here is the basic syntax:

let(a=timeseries(collection, field="test_dt", q="*:*",
                          start="2012-05-01T00:00:00Z",
                          end="2012-06-30T23:59:59Z",
                          gap="+1MONTH", 
                          count(*)),
      b=select(get(a),  
                     mult(-1, count(*)) as negativeCount,
                     test_dt),
      c=col(a, count(*)),
      d=col(b, negativeCount),
      tuple(seriesA=a,
               seriesB=b,
               columnC=c,
               columnD=d))

Now we have two new variables c and d, both pointing to a col expression. The col expression takes two parameters. The first parameter is a variable pointing to a list of tuples. The second parameter is the field to pull the column data from.

Also notice that there are two new fields in the output tuple that output the columns. If you send this expression to Solr it responds with:

{ "result-set": { "docs": [ { "seriesA": [ { "test_dt": "2012-05-01T00:00:00Z", "count(*)": 247007 }, { "test_dt": "2012-06-01T00:00:00Z", "count(*)": 247994 } ], "seriesB": [ { "test_dt": "2012-05-01T00:00:00Z", "negativeCount": -247007 }, { "test_dt": "2012-06-01T00:00:00Z", "negativeCount": -247994 } ], "columnC": [ 247007, 247994 ], "columnD": [ -247007, -247994 ] }, { "EOF": true, "RESPONSE_TIME": 6 } ] } }

Now the columns appear in the output.

Performing Math on Columns


We've seen already that there are numeric Stream Evaluators that work on tuples in the select expression.

Some numeric evaluators also work on columns. An example of this is the corr evaluator which performs the Pearson product-moment correlation calculation on two columns of numbers.

Here is the sample syntax:

let(a=timeseries(collection, field="test_dt", q="*:*",
                          start="2012-05-01T00:00:00Z",
                          end="2012-06-30T23:59:59Z",
                          gap="+1MONTH", 
                          count(*)),
      b=select(get(a),  
                     mult(-1, count(*)) as negativeCount,
                     test_dt),
      c=col(a, count(*)),
      d=col(b, negativeCount),
      tuple(seriesA=a,
               seriesB=b,
               columnC=c,
               columnD=d,
               correlation=corr(c, d)))

Notice that the tuple now has a new field called correlation with the output of the corr function set to it. If you send this to Solr it responds with:

{ "result-set": { "docs": [ { "seriesA": [ { "test_dt": "2012-05-01T00:00:00Z", "count(*)": 247007 }, { "test_dt": "2012-06-01T00:00:00Z", "count(*)": 247994 } ], "seriesB": [ { "test_dt": "2012-05-01T00:00:00Z", "negativeCount": -247007 }, { "test_dt": "2012-06-01T00:00:00Z", "negativeCount": -247994 } ], "columnC": [ 247007, 247994 ], "columnD": [ -247007, -247994 ], "correlation": -1 }, { "EOF": true, "RESPONSE_TIME": 6 } ] } }


Opening the Door to the Wider World of Mathematics


The syntax described in this blog opens the door to more sophisticated mathematics. For example the corr function can be used as a building block for cross-correlation, auto-correlation and auto-regression functions. Apache Commons Math includes machine learning algorithms such as clustering and regression and data transformations such as Fourier transforms that work on columns of numbers.

In the near future the Streaming Expressions math library will include these functions and many more.

Wednesday, April 19, 2017

Having a chat with Solr using the new echo Streaming Expression

In the next release of Solr, there is a new and interesting Streaming Expression called echo.

echo is a very simple expression with the following syntax:

echo("Hello World")

If we send this to Solr, it responds with:

{ "result-set": { "docs": [ { "echo": "Hello World" }, { "EOF": true, "RESPONSE_TIME": 0 } ] } }

Solr simply echoes the text back, but maybe it feels a bit like Solr is talking to us. Like there might be someone there.

Well it turns out that this simple exchange is the first step towards a more meaningful conversation.

Let's take another step:

classify(echo("Customer service is just terrible!"),
             model(models, id="sentiment"),
             field="echo",
             analyzerField="message_t")

Now we are echoing text to a classifier.  The classify function is pointing to a model stored in Solr that does sentiment analysis based on the text. Notice that the classify function has an analyzer field parameter. This is a Lucene/Solr analyzer used by the classify function to pull the features from the text (See this blog for more details on the classify function).

If we send this to Solr we may get a response like this:

{ "result-set": { "docs": [ { "echo": "Customer service is just terrible!",
"probability_d":0.94888 }, { "EOF": true, "RESPONSE_TIME": 0 } ] } }

The probability_d field is the probability that the text has a negative sentiment. In this case there was a 94% probability that the text was negative.

Now Solr knows something about what's being said. We can wrap other Streaming Expressions around this to take actions or begin to formulate a response.

But we really don't yet have enough information to make a very informed response.

We can take this a bit further.

Consider this expression:

select(echo("Customer service is just terrible!"),
           analyze(echo, analyzerField) as expr_s)

The expression above uses the select expression to echo the text to the analyze Stream Evaluator. The analyze Steam Evaluator applies a Lucene/Solr analyzer to the text and returns a token stream. But in this case it returns a single token which is a Streaming Expression. 

(See this blog for more details on the analyze Stream Evaluator)

In order to make this work you would define the final step of the analyzer chain as a token filter that builds a Streaming Expression based on the natural language parsing done earlier in the analyzer chain.

Now we can wrap this construct in the new eval expression:

eval(select(echo("Customer service is just terrible!"),
                  analyze(echo, analyzerField) as expr_s))

The eval expression will compile and run the Streaming Expression created by the analyzer.  It will also emit the tuples that are emitted by the compiled expression. The tuples emitted are the response to the natural language request.

The heavy lifting is done in the analysis chain which performs the NLP and generates the Streaming Expression response.

Streaming Expressions as an AI Language

Before Streaming Expressions existed Dennis Gove shared an email with me with his initial design for the Streaming Expression syntax. The initial syntax used Lisp like S-Expressions. I took one look at the S-Expressions and realized we were building an AI language. I'll get into more detail about how this syntax ties into AI shortly, but first a little more history on Streaming Expressions.

The S-Expressions were replaced with the more familiar function syntax that Streaming Expressions has today. This decision was made by Dennis and Steven Bower. It turned out to be the right call because we now have a more familiar syntax than Lisp but we also kept many of Lisps most important qualities.

Dennis contributed the Streaming Expression parser and I began looking for something interesting to do with it. The very first thing I tried to do with Streaming Expressions was to re-write SQL queries as Streaming Expressions for the Parallel SQL interface. For this project a SQL parser was used to parse the queries and then a simple planner was built that generated Streaming Expressions to implement the physical query plan.

This was an important proving ground for Streaming Expressions for a number of reasons. It proved that Streaming Expressions could provide the functionality needed to implement the SQL query plans. It proved that Streaming Expressions could push functionality down into the search engine and also rise above the search engine using MapReduce when needed.

Most importantly from an AI standpoint it proved that we could easily generate Streaming Expressions programmatically. This was one of the key features that made Lisp a useful AI Language. The reason that Streaming Expressions are so easily generated is that the syntax is extremely regular. There are only nested functions. And because Streaming Expressions have an underlying Java object representation, we didn't have to do any String manipulation. We could work directly with the Object tree structure to build the expressions.

Why is code generation important for AI? One of the reasons is shown earlier in this blog. A core AI use case is to respond to natural language requests. One approach to doing this is to analyze the text request and then generate code to implement a response. In many ways it's similar to the problem of translating SQL to a physical query plan.

In a more general sense code generation is important in AI because you're dealing with many unknowns so it can be difficult to code everything up front. Sometimes you may need to generate logic on the fly.

Domain Specific Languages

Lisp has the capability of adapting its syntax for specific domains through it's powerful macro feature. Streaming Expressions has this capability as well, but it does it a different way.

Each Streaming Expression is implemented in Java under the covers. Each Streaming Expression is responsible for parsing it's own parameters. This means you can have Streaming Expressions that invent their own little languages. The select expression is a perfect example of this.

The basic select expression looks like this:

select(expr, fielda as outField)

This select reads tuples from a stream and outputs fielda as outField. The Streaming Expression parser has no concept of the word "as". This is specific to the select expression and the select expression handles the parsing of "as".

The reason why this works is that under the covers Streaming Expressions see all parameters as lists that it can manipulate any way it wants.

Embedded In a Search Engine

Having an AI language embedded in a search engine is a huge advantage. It allows expressions to leverage vast amounts of information in interesting ways. The inverted index already has important statistics about the text which can be used for machine learning. Search engines have strong facilities for working with text (tokenizers, filters etc..) and in recent years they've become powerful column stores for numeric calculations. They also have mature content ingestion and parallel query frameworks.

Now there is a language that ties it all together.

Thursday, March 30, 2017

Streaming NLP is coming in Solr 6.6

Solr 6.5 is out now, so it's time to start thinking about the next release. One of the interesting features coming in Solr 6.6 is Streaming NLP. This exciting new feature is already committed and waiting for release. This blog will describe how Streaming NLP works.

The analyze Stream Evaluator


One of the features added in Solr 6.5 was Stream Evaluators. Stream Evaluators perform operations on Tuples in the stream. There are already a rich set of math and boolean Stream Evaluators in Solr 6.5 and more coming in Solr 6.6. The math and boolean Stream Evaluators allow you to build complex boolean logic and mathematical formulas on Tuples in the stream.

Solr 6.6 also has a new Stream Evaluator, called analyze, that works with text. The analyze evaluator applies a Lucene/Solr analyzer to a text field in the Tuples and returns a list of tokens produced by the analyzer. The tokens can then by used to annotate Tuples or streamed out as Tuples. We'll show examples of both approaches later in the blog.

But it's useful to talk about the power behind Lucene/Solr analyzers first. Lucene/Solr has a large set of analyzers that tokenize different languages and apply filters that transform the token stream. The "analyzer chain" design allows you to chain tokenizers and filters together to perform very powerful text transformations and extractions.

The analysis chain also provides a pluggable API for adding new NLP tokenizers and filters to Solr. New tokenizers and filters can be added and then layered with existing tokenizers and filters in interesting ways. New NLP analysis chains can then be used both during indexing and with Streaming NLP.

The cartesianProduct Streaming Expression


The cartesianProduct Streaming Expression is also new in Solr 6.6. The cartesianProduct expression emits a stream of Tuples from a single Tuple by creating a cartesian product from a multi-valued field or a text field. The analyze Stream Evaluator is used with the cartesianProduct Streaming Expression to create a cartesian product from a text field.

Here is a very simple example:

For this example we have indexed a single record in Solr with an id and text field called body:

id: 1
body: "c d e f g"

The following expression will create a cartesian product from this Tuple:

cartesianProduct(search(collection, q="id:1", fl="id, body", sort="id desc"),
                              analyze(body, analyzerField) as outField)

First let's look at what this expression is doing then look at the output.

The cartesianProduct expression is wrapping a search expression and an analyze Stream Evaluator. The cartesianProduct expression reads the Tuples returned by the search expression and applies the analyze Stream Evaluator to each Tuple. (Note that the cartesianProduct expression can read Tuples from any Streaming Expression.)

The analyze Stream Evaluator is taking the text from the body field in the Tuple and is applying an analyzer found in the schema which is pointed to by the analyzerField parameter.

The cartesianProduct function emits a single Tuple for each token produced by the analyzer. For example if we have a basic white space tokenizing analyzer the Tuples emitted would be:

id: 1
outField: c

id: 1
outField: d

id: 1
outField: e

id: 1
outField: f

id: 1
outField: g

Creating Entity Graphs


The Tuples emitted by the cartesianProduct and the analyze evaluator can be saved to another Solr Cloud collection with the update stream. This allows you to build graphs from extracted entities that can then be walked with Solr Graph Expressions.

Annotating Tuples


The analyze Stream Evaluator can also be used with the select Streaming Expression to annotate Tuples with tokens extracted by an analyzer. Here is the sample syntax:

select(search(collection, q="id:1", fl="id, body", sort="id desc"),
          id,
          analyze(body, analyzerField) as outField)

This will add a field to each Tuple which will contain the list of Tuples extracted by the analyzer. The update function can be used to save the annotated Tuples to another Solr Cloud collection.

Scaling Up


Solr's parallel batch and executor framework can be used to apply a massive amount of computing power to perform NLP on extremely large data sets. You can read about the parallel batch and the executor framework in these blogs:



Sunday, March 12, 2017

Solr 6.5: Retrieve and rank with graph expressions

This blog describes how to retrieve and rank documents with graph expressions. First let's define exactly what it means to retrieve and rank with a graph expression and then we'll walk through an example.

The retrieve step is a relevance ranked search. The rank step re-ranks the top N documents based on the results of a graph expression.

Why would we want to do this? I think its easiest to explain this with an example.

Re-Ranking Based On A Users "Work Graph"


Before diving into the example, it's important to understand that this re-ranking strategy is designed to provide sub-second response times. It's also designed to adapt in real-time as users use the system and work graphs are updated.

Ok, let's dive into the example.

In this example when users perform a search the top N results are re-ranked to boost documents that are part of their work graph. To find a users work graph, a graph expression is used to mine usage logs in real time to find documents that are closely related to the users work.

This relevance strategy can be useful for systems where users are working with documents and performing searches to find documents. One example of this type of system is Alfresco, which provides an Enterprise Content Management system, that uses Solr for search. Alfresco logs when users read and edit documents. These logs can then be mined with graph expressions to discover users work graphs.


The Re-Rank Expression


The re-rank expression looks like this:

top(n=50,
      sort="rescore desc",
      select(id,
                if(eq(nodeScore, null), score, mult(score, log(nodeScore))) as rescore,
                outerHashJoin(${search}, hashed=${graph}, on="id=node")))

Notice the outerHashJoin refers to ${search} and ${graph} variables. This is using Solr's built in macro expansion capability. ${search} and ${graph} are referring to http parameters that point to the search and graph Streaming Expressions. This is a great way to break up long Streaming Expressions into manageable pieces and also create re-usable parameterized templates.

We'll first explore the re-rank expression above, then we'll look at the ${search} and ${graph} expressions.

Let's start pulling apart the re-rank expression by looking at the outerHashJoin expression. The outerHashJoin expression joins two expressions ${search} and ${graph}. The join keys are the id field from the ${search} tuples and node field from the ${graph} tuples.

The outerHashJoin emits all tuples from the ${search} expression whether there is a matching tuple from the ${graph} expression or not. If there is a match found from the ${graph} expression then it's fields are added to the matching ${search} tuple.

We'll look at the specifics of the ${search} and ${graph} expression below, but at a high level they are:

1) search: A full text search result.
2) graph: The documents that are closely related to the users work a.k.a. the users work graph.

Let's move on to the select expression that is wrapping the outerHashJoin. The select function selects specific fields from tuples and performs field level transformations on tuples. These field level operations known as Evaluators were significantly expanded by Dennis Gove in Solr 6.5.

In the example, the select function operates over each tuple emitted by the outerHashJoin. It emits the id field for every tuple and a new derived field called rescore. 

The rescore field is derived from a specific formula in red below:

if(eq(nodeScore, null), score, mult(score, log(nodeScore))) as rescore

This formula is expressed using the new Evaluators. Translated into plain english the formula is:

if the nodeScore field is null, then use the score field.
else
multiply the score field by the natural log of the nodeScore field.

The nodeScore field is assigned to documents emitted by the ${graph} expression. It describes how relevant the document is to the users work graph.

The score is the score assigned to documents by the ${search} expression. It describes how relevant the document is to the full text search.

Notice in the formula that the score is always present. But the nodeScore can be null. This is because only documents in the search result that are in the users work graph will have a nodeScore assigned during the outer join.

Also notice the tuples that contain a nodeScore are boosted by multiplying the log of the nodeScore and the score. The documents that don't have a nodeScore don't receive this boost. This boosts documents that are part of the users work graph.

In the final step the top expression emits the top 50 tuples sorted by rescore desc. This is the re-ranked result set.

We spent quite a bit of time going through the re-rank expression, so let's spend a little time on the ${search} and ${graph} expressions.


The Search Expression


In this example we'll use a very simple search expression that looks like this:

search(content,
           q="natural gas",
           fl="id, score",
           rows="100",
           sort="score desc")

This expression searches the content collection in the default field for the terms natural gas. The expression will return the id and score fields and sort by score descending. The rows parameter is set to 100 which means it will fetch 100 rows from each shard, rather the 100 rows total. So if there are 4 shards this will return up to 400 results.

The search expression is really designed to provide input to other streaming expressions, so it simply merges the results from the shards into a single stream and maintains the sort order.

The Graph Expression


The graph expression is designed to query usage logs to return documents that are part of a users work graph.

Here is the graph expression we will be using for this example:

scoreNodes(nodes(logs,
                               top(n=20,
                                       sort="count(*) desc",
                                       nodes(logs,
                                                  nodes(logs,
                                                             walk="joel->userID",
                                                             gather="contentID"),
                                                  walk="node->contentID",
                                                  gather="userID",
                                                  count(*))),
                               walk="node->userID",
                               gather="contentID",
                               count(*)))    


Working our way outwards from the innermost nodes expression (Note that nodes is an alias for the gatherNodes expression):

1) The innermost nodes expression gathers all contentID's from the logs where the userID is joel.
2) Working outwards, the next nodes expression takes all the contentID's emitted from step 1 and gathers all the userID's that have viewed these contentID's. It also counts how many contentID's each user has viewed that joel has viewed.
3) The top expression emits the top 20 users that have viewed the most overlapping content with joel.
4) The outermost nodes expression gathers all the contentID's viewed by the users emitted in step 3.
5) The scoreNodes expression scores all the contentID's emitted by step 4. This adds the nodeScore field to the tuples which describes how relevant each contentID is to the users work graph.

This graph expression will emit all the contentID's in the users work graph. The contentID in each tuple will be in the node field. This is why the outerHashJoin in the re-rank expression is joining the id field in the ${search} expression to the node field in the ${graph} expression.

Monday, February 27, 2017

Anomaly Detection in Solr 6.5

Solr 6.5 is just around the corner and along with it comes the new significantTerms Streaming Expression. The significantTerms expression queries a Solr Cloud collection but instead of returning the matching documents, it returns the significant terms in the matching documents.

To determine the significance of a term a formula is used which considers the number of times the term appears in the foreground set versus the number of times the term appears in the background set. The foreground set is the search result. The background set is all the documents in the index.

The significantTerms function assigns higher scores to terms that are more frequent in the foreground set and rarer in the background set, in relation to other terms.

For example:

Term     Foreground    Background
A           100                   103
B           101                   1000

Term A would be considered more significant then term B, because term A is much more rare in the background set.

This model for scoring terms can be very useful for spotting anomalies in the data. Specifically we can easily surface terms that are unusually aligned with specific result sets.


A Simple Example with the Enron Emails


For this example we'll start with a single Enron email address (tana.jones@enron.com) and ask the question:

Which address has the most significant relationship with tana.jones@enron.com?

We can start looking for an answer by running an aggregation. Since we're using Streaming Expressions we'll use the facet expression:

facet(enron,
         q="from:tana.jones@enron.com",
         buckets="to",
         bucketSorts="count(*) desc",
         bucketSizeLimit="100",
         count(*))

This expression queries the index for tana.jones@enron.com in the from field and gathers the facet buckets and counts from the to field. It returns the top 100 facet buckets from the to field ordered by the counts in descending order.

This expression returns the top 100 addresses that tana.jones@enron.com has emailed. The top five results look like this:

{
"result-set": {
"docs": [{
"count(*)": 789,
"to": "alan.aronowitz@enron.com"
}, {
"count(*)": 376,
"to": "frank.davis@enron.com"
}, {
"count(*)": 372,
"to": "mark.taylor@enron.com"
}, {
"count(*)": 249,
"to": "brent.hendry@enron.com"
}, {
"count(*)": 197,
"to": "bob.bowen@enron.com"
}, ...

This gives some useful information but does it answer the question?  The top address is alan.aronowitz@enron.com with a count of 789. Is this the most significant relationship?

Let's see if the significantTerms expression can surface an anomaly. Here is the expression:

significantTerms(enron, q="from:tana.jones@enron.com", field="to", limit="20")

The expression above runs the query from:tana.jones@enron.com on the enron collection. It then collects the top 20 significant terms from the to field.

The top five results look like this:

{
"result-set": {
"docs": [{
"score": 54.370163,
"term": "michael.neves@enron.com",
"foreground": 130,
"background": 132
}, {
"score": 53.911552,
"term": "lisa.lees@enron.com",
"foreground": 186,
"background": 243
}, {
"score": 53.806202,
"term": "frank.davis@enron.com",
"foreground": 376,
"background": 596
}, {
"score": 51.760098,
"term": "harry.collins@enron.com",
"foreground": 106,
"background": 150
}, {
"score": 51.471268,
"term": "edmund.cooper@enron.com",
"foreground": 132,
"background": 222
}

We have indeed surfaced an interesting anomaly. The first term is michael.neves@enron.com. This address has a foreground count of 130 and background count of 132. This means that michael.neves@enron.com has received 132 emails in the entire corpus and 130 of them have been from tana.jones@enron.com. This signals a strong connection.

alan.aronowitz@enron.com, the highest total receiver of emails from tana.jones@enron.com, isn't in the top 5 results from the significantTerms function.

alan.aronowitz@enron.com shows up at number 8 in the list:
{
"score": 49.847652,
"term": "alan.aronowitz@enron.com",
"foreground": 789,
"background": 2117
}

Notice that the foreground count is 789 and background count is 2117. This means that 37% of the emails received by alan.aronowitz@enron.com were from tana.jones@enron.com.

98% of the emails received by michael.neves@enron.com came from tana.jones@enron.com.

significantTerms VS scoreNodes


The significantTerms function works directly with the inverted index and can score terms from a single-value, multi-value and text fields.

The scoreNodes function scores tuples emitted by graph expressions. This allows for anomaly detection in distributed graphs. A prior blog covers the scoreNodes function in more detail.

In Solr 6.5 the scoreNodes scoring algorithm was changed to better surface anomalies. The significantTerms and scoreNodes functions now use the same scoring algorithm.

Use Cases


Anomaly detection has interesting use cases including:

1) Recommendations: Finding products that are unusually connected based on past shopping history.

2) Auto-Suggestion: Suggesting terms that go well together based on indexed query logs.

3) Fraud Anomalies: Finding vendors that are unusually associated with credit card fraud.

4) Text Analytics: Finding significant terms relating to documents in a full text search result set.

5) Log Anomalies: Finding IP addresses that are unusually associated with time periods of suspicious activity.

Thursday, February 16, 2017

Solr's Shiny New Apache Calcite SQL Integration


Solr's new Apache Calcite SQL integration is in Lucene/Solr's master branch now. This blog will discuss Solr's potential as a distributed SQL engine, some thoughts on Apache Calcite, what's currently supported with Solr's Apache Calcite integration and what might be coming next.


Solr as Distributed SQL Engine


The initial SQL interface, which released with Solr 6.0, uses the Presto project's SQL parser to parse the SQL and then rewrites the queries as Solr Streaming Expressions.

The goals of the initial SQL release were focused around supporting SQL aggregations using both MapReduce and Solr's native faceting capabilities. And also to support Solr's full text query language in the SQL predicate.

Even with the limited set of goals it became clear that Solr could be a special SQL engine. There are very few distributed SQL engines that can push down so much processing into the engine, and also rise up above the engine when needed to perform streaming parallel relational algebra. And then of course there is the predicate. Few existing SQL engines can compete with Solr's rich search predicates which have been developed over a period of 10+ years.

Last but not least is the performance. Solr is not a batch engine adapted for request/response. Solr is a request/response engine from the ground up. Solr's search performance, analytic performance and instant streaming capabilities make it one of the fastest distributed SQL engines available.

The Apache Calcite Integration


Kevin Risden broke ground on Solr's Apache Calcite integration in March of 2016. It wasn't clear at that time that Apache Calcite would be the right fit, but over time it became clear that it was exactly what we needed. There were really two choices for how we could implement the next phase of Solr's SQL integration:

1) Stick with the approach of using a SQL Parser only, and work directly with the parse tree to build the physical query plan in Streaming Expressions.

2) Use a broader framework and plugin rules that would push down parts of the SQL query that we wanted to control.

There were pros and cons to both approaches. The main pro for using just a SQL parser is that we would have total control over the process once the query was parsed. This means we would never run into a scenario where we couldn't implement something that leveraged Solr's capabilities.

The main con to just using the SQL parser is that we would have been responsible for implementing everything, including things like a complete JDBC driver. This is not a small undertaking.

The main pros and cons of using a broader framework were exactly the opposite: less control but the ability to leverage existing features in the framework.

Kevin was very much in favor of using a broader framework, but I was not convinced that we could take full advantage of Solr's capabilities unless we controlled everything.

But in the end Kevin broke ground embedding the Apache Calcite framework into Solr. In the open source world, working code tends to win out. Based on his initial work, I agreed that we should move forward using the wider Apache Calcite framework.

Kevin continued working on the integration. Along the way Cao Manh Dat joined in and added the aggregation support to the branch that Kevin was working on.

Eventually I joined in as well, building on top of the work that Kevin and Dat already contributed. My main focus was to ensure that the initial Apache Calcite implementation was comparable in features and performance to Solr's existing SQL integration.

As I spent more time working with Apache Calcite I came to really appreciate what the project offered. Apache Calcite allows you to selectively push down parts of the SQL implementation such as the predicate, sort, aggregation and joins. It gives you almost full control if you want it, but allows you to leverage any part of the framework that you choose to use. For example you can push down nothing if you want, or you can push down just the predicate or just the sort.

Apache Calcite also provides two very important things: a cost based query optimizer and a JDBC driver. Solr's initial JDBC driver only implemented part of the specification and the specification is large. Implementing a cost based query optimizer is a daunting task. With Apache Calcite we get these features almost for free. We still have to provide hooks into them, but we don't have to implement them.

What's Currently Supported in Lucene/Solr Master


The initial Apache Calcite integration includes:

  • limited and unlimited selects. Unlimited selects stream the entire result set regardless of the size.
  • Support for Solr search predicates including support for embedding an entire Solr query using the "_query_" field. This means all Solr query syntax is supported including complex full text, graph query, geo-spatial, fuzzy, moreLikeThis etc.
  • Support for score in the field list and order by in queries that have a limit clause.
  • Support for field aliases.
  • Support for faceted and MapReduce aggregations with the aggregationMode parameter.
  • Support for aggregations on multi-valued fields when in facet mode.
  • Support for multi-value fields in simple selects.
  • Parallel execution of MapReduce aggregations on worker nodes.
  • Support for aggregations without a group by clause. These are always pushed down into the search engine.
  • Support for select distinct queries in both faceted and MapReduce mode.
  • Support for group by aggregations in both faceted and MapReduce mode.
  • Support for sorting on fields in the index as well as sorting on aggregations.
  • Support for the having clause. In MapReduce mode the having clause is pushed down to the worker nodes now.


What's Coming Next


Now that the Apache Calcite integration is in master we are free to begin adding new features on a regular basis. Here are some features that are on the top of the list:

1) Support for * in the field list.
2) Automatic selection of aggregationMode (facet or MapReduce). Having Solr choose the right aggregation mode based on the cardinality of fields being aggregated.
3) Support for SELECT ... INTO ...
4) Support for arithmetic operations on fields and aggregations (select (a*b) as c from t). This is now supported in Streaming Expressions (SOLR-9916).
5) Expanded aggregation support.
5) Support for UNION, INTERSECT, JOIN, using Streaming Expressions parallel relational algebra capabilities and Apache Calcites query optimizer.

Tuesday, February 14, 2017

Recommendations With Solr's Graph Expressions Part 1: Finding Products That Go Well Together

Graph Expressions were introduced in Solr 6.1. Graph Expressions are part of the wider Streaming Expressions library. This means that you can combine them with other expressions to build complex and interesting graph queries.

Note: If you're not familiar with graph concepts such as nodes and edges it may useful to first review the Wiki on Graph Theory.

This blog is part one of a three part series on making recommendations with Graph Expressions.

The three parts are:

1) Finding products that go well together.
2) Using the crowd to find products that go well with a user. 
3) Combining Graph Expressions to make a personalized recommendation.

Before diving into the first part of the recommendation lets consider the data. For all three blogs we'll be using a simple SolrCloud Collection called baskets in the following format:

userID     basketID   productID 
user1        basket1      productA      
user1        basket1      productA    
user1        basket2      productL      
user2        basket3      productD
user2        basket3      productM
...

The baskets collection holds all the products that have been added to baskets. Each record has a userID, basketID and productID. We'll be able to use Graph Expressions to mine this data for recommendations.

One more quick note before we get started. One of the main expressions we'll be using is the nodes expression. The nodes expression was originally released as the gatherNodes expression. Starting with Solr 6.4 the nodes function name can be used as a shorthand for gatherNodes. You can still also use gatherNodes if you like, they are both a pointer to the same function.

Now lets get started!

Finding Products That Go Well Together


One approach to recommending products is to start with a product the user has selected and find products that go well with that product.

The Graph Expression below finds the products that go well with productA:

scoreNodes(top(n=25,
                          sort="count(*) desc",
                          nodes(baskets,
                                     random(baskets, q="productID:productA", fl="basketID", rows="250"),
                                     walk="basketID->basketID",
                                     gather="productID",
                                     fq="-productID:productA",
                                     count(*))))

Let's explore how the expression works.

Seeding the Graph Expression


The inner random expression is used to seed the Graph Expression:

random(baskets, q="productID:productA", fl="basketID", rows="250")
                                   
The random expression is not a Graph Expression. But in this scenario its used to seed a Graph Expression with a set of root nodes to begin the traversal.

The random expression returns a pseudo random set of results that match the query. In this case the random expression is returning 250 basketsIDs that contain the productID productA.

The random expression serves two important purposes in seeding the Graph Expression:

1) It limits the scope of the graph traversal to 250 basketIDs. If we seed the graph traversal with all the basketIDs that have productA, we could potentially have a very large number of baskets to work with. This could cause a slow traversal and memory problems as Graph Expressions are tracked in memory.

2) It adds an element of surprise to the recommendation by providing a different set of baskets each time. This can result in different recommendations because each recommendation is seeded with a different set of basketIDs.


Calculating Market Basket Co-Occurrence with the Nodes Expression


Now lets explore the nodes expression which wraps the random expression. The nodes expression performs a breadth first graph traversal step, gathering nodes and aggregations along the way. For a full explanation of the nodes expression you can review the online documentation.

Lets look at exactly how the example nodes expression operates:

nodes(baskets, 
           random(baskets, q="productID:productA", fl="basketID", rows="250"),
           walk="basketID->basketID",
           fq="-productID:productA",
           gather="productID",
           count(*))

Here is an explanation of the parameters:
  1. baskets: This is the collection that the nodes expression is gathering data from.
  2. random expression: Seeds the nodes expression with a set of pseudo random basketIDs that contain productA.
  3. walk: Walks a relationship in the graph. The basketID->basketID construct tells the nodes expression to take the basketID in the tuples emitted by the random expression and search them against the basketID in the index.
  4. fq: Is a filter query that filters the results of the walk parameter. In this case it filters out records with productA in the productID field. This stops productA from being a recommendation for itself.
  5. gather: Specifies what field to collect from the rows that are returned by the walk parameter. In this case it is gathering the productID field.
  6. count(*): This is a graph aggregation, that counts the occurrences of what was gathered. In this case it counts how many times each productID was gathered. 

In plain english this nodes expression is gathering the productIDs that co-occur with productA in baskets, and counting how many times the products co-occur.

Scoring the Nodes To Find the Most Significant Product Relationships


With the output of the nodes expression we already know which products co-occur most frequently with productA. But there is something we don't know yet: how often the products occur across all the baskets. If a product occurs in a large percentage of baskets, then it doesn't have any particular relevance to productA.

This is where the scoreNodes function does it's magic.


scoreNodes(top(n=25,
                          sort="count(*) desc",
                          nodes(baskets,
                                     random(baskets, q="productID:productA", fl="basketID", rows="250"),
                                     walk="basketID->basketID",
                                     gather="productID",
                                     fq="-productID:productA",
                                     count(*))))

In expression above the top function emits the top 25 products based on the co-occurrence count. The top 25 products are then scored by the scoreNodes function.

The scoreNodes function scores the products based on the raw co-occurrence counts and their frequency across the entire collection.

The scoring formula is similar to the tf*idf scoring algorithm used to score results from a full text search. In the full text context tf (term frequency) is the number of times the term appears in the document. idf (inverse document frequency) is computed based on the document frequency of the term, or how many documents the term appears in. The idf is used to provide a boost to rarer terms.

scoreNodes uses the same principal to score nodes in a graph traversal. The count(*) aggregation is used as the tf value in the formula. The idf is computed for each node, in this case productID,  based on global statistics across the entire collection. The effect of the scoreNodes algorithm is to provide a boost to nodes that are rarer in the collection.

The scoreNodes functions adds a field to each node tuple called nodeScore, which is the relevance score for the node.

Now we know which products have the most significant relationship with productA.


Can We Still Do Better?


Yes. We now know which products have the most significant relationship with productA. But we don't know if the user will have an interest in the product(s) we're recommending. In the next blog in the series we'll explore a graph expression that uses connections in the graph to personalize the recommendation.

Statistical programming with Solr Streaming Expressions

In the previous blog we explored the new timeseries function and introduced the syntax for math expressions . In this blog we'll dive d...