Thursday, March 30, 2017

Streaming NLP is coming in Solr 6.6

Solr 6.5 is out now, so it's time to start thinking about the next release. One of the interesting features coming in Solr 6.6 is Streaming NLP. This exciting new feature is already committed and waiting for release. This blog will describe how Streaming NLP works.

The analyze Stream Evaluator

One of the features added in Solr 6.5 was Stream Evaluators. Stream Evaluators perform operations on Tuples in the stream. Solr 6.5 already includes a rich set of math and boolean Stream Evaluators, with more coming in Solr 6.6. These evaluators allow you to apply complex boolean logic and mathematical formulas to Tuples in the stream.

Solr 6.6 also has a new Stream Evaluator, called analyze, that works with text. The analyze evaluator applies a Lucene/Solr analyzer to a text field in the Tuples and returns a list of tokens produced by the analyzer. The tokens can then be used to annotate Tuples or streamed out as Tuples. We'll show examples of both approaches later in the blog.

But it's useful to talk about the power behind Lucene/Solr analyzers first. Lucene/Solr has a large set of analyzers that tokenize different languages and apply filters that transform the token stream. The "analyzer chain" design allows you to chain tokenizers and filters together to perform very powerful text transformations and extractions.

The analysis chain also provides a pluggable API for adding new NLP tokenizers and filters to Solr. New tokenizers and filters can be added and then layered with existing tokenizers and filters in interesting ways. New NLP analysis chains can then be used both during indexing and with Streaming NLP.
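The analyzer-chain idea described above can be sketched in a few lines of Python: a tokenizer splits text into tokens, and each filter in the chain transforms the token stream in order. This is illustrative only, and the tokenizer and filters below are simplified stand-ins, not Lucene's actual TokenStream API.

```python
# Minimal sketch of an analyzer chain: tokenizer -> filters, applied in order.
# (Illustrative; Lucene's real TokenStream API is more involved.)

def whitespace_tokenizer(text):
    return text.split()

def lowercase_filter(tokens):
    return [t.lower() for t in tokens]

def stopword_filter(tokens, stopwords=frozenset({"the", "a", "an"})):
    return [t for t in tokens if t not in stopwords]

def analyze(text, tokenizer, filters):
    tokens = tokenizer(text)
    for f in filters:
        tokens = f(tokens)
    return tokens

print(analyze("The Quick Fox", whitespace_tokenizer,
              [lowercase_filter, stopword_filter]))
# → ['quick', 'fox']
```

Swapping in a different tokenizer or adding a filter changes the output without touching the rest of the chain, which is what makes the pluggable design useful for layering in new NLP components.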

The cartesianProduct Streaming Expression

The cartesianProduct Streaming Expression is also new in Solr 6.6. The cartesianProduct expression emits a stream of Tuples from a single Tuple by creating a cartesian product from a multi-valued field or a text field. The analyze Stream Evaluator is used with the cartesianProduct Streaming Expression to create a cartesian product from a text field.

Here is a very simple example:

For this example we have indexed a single record in Solr with an id and text field called body:

id: 1
body: "c d e f g"

The following expression will create a cartesian product from this Tuple:

cartesianProduct(search(collection, q="id:1", fl="id, body", sort="id desc"),
                 analyze(body, analyzerField) as outField)

First let's look at what this expression is doing, then look at the output.

The cartesianProduct expression is wrapping a search expression and an analyze Stream Evaluator. The cartesianProduct expression reads the Tuples returned by the search expression and applies the analyze Stream Evaluator to each Tuple. (Note that the cartesianProduct expression can read Tuples from any Streaming Expression.)

The analyze Stream Evaluator takes the text from the body field in the Tuple and applies the analyzer bound to the schema field specified by the analyzerField parameter.

The cartesianProduct function emits a single Tuple for each token produced by the analyzer. For example, with a basic whitespace-tokenizing analyzer, the Tuples emitted would be:

id: 1
outField: c

id: 1
outField: d

id: 1
outField: e

id: 1
outField: f

id: 1
outField: g
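The per-token emission above can be simulated in a few lines of Python. This is a sketch of the semantics, not Solr's implementation; the whitespace `analyze()` below stands in for whatever analyzer the schema field specifies.

```python
# Sketch of cartesianProduct(search(...), analyze(body, ...) as outField):
# for each input tuple, emit one tuple per token produced by the analyzer.

def analyze(text):
    # stand-in for the schema analyzer (basic whitespace tokenizing)
    return text.split()

def cartesian_product(tuples, field, out_field):
    for t in tuples:
        for token in analyze(t[field]):
            out = {k: v for k, v in t.items() if k != field}
            out[out_field] = token
            yield out

docs = [{"id": "1", "body": "c d e f g"}]
for t in cartesian_product(docs, "body", "outField"):
    print(t)  # {'id': '1', 'outField': 'c'} ... {'id': '1', 'outField': 'g'}
```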

Creating Entity Graphs

The Tuples emitted by the cartesianProduct expression and the analyze evaluator can be saved to another SolrCloud collection with the update stream. This allows you to build graphs from extracted entities, which can then be walked with Solr's Graph Expressions.

Annotating Tuples

The analyze Stream Evaluator can also be used with the select Streaming Expression to annotate Tuples with tokens extracted by an analyzer. Here is the sample syntax:

select(search(collection, q="id:1", fl="id, body", sort="id desc"),
       analyze(body, analyzerField) as outField)

This adds a field to each Tuple containing the list of tokens extracted by the analyzer. The update function can be used to save the annotated Tuples to another SolrCloud collection.
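The difference from cartesianProduct is worth seeing side by side: instead of one tuple per token, the whole token list is attached to the tuple as a new field. A Python sketch of that annotation behavior (illustrative only; the whitespace `analyze()` stands in for the schema analyzer):

```python
# Sketch of select(search(...), analyze(body, ...) as outField):
# annotate each tuple with the full token list instead of exploding it.

def analyze(text):
    return text.split()  # stand-in for the schema analyzer

def annotate(tuples, field, out_field):
    for t in tuples:
        yield {**t, out_field: analyze(t[field])}

docs = [{"id": "1", "body": "c d e f g"}]
print(list(annotate(docs, "body", "outField")))
# [{'id': '1', 'body': 'c d e f g', 'outField': ['c', 'd', 'e', 'f', 'g']}]
```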

Scaling Up

Solr's parallel batch and executor framework can be used to apply a massive amount of computing power to perform NLP on extremely large data sets. You can read about parallel batch and the executor framework in earlier blogs.

Sunday, March 12, 2017

Solr 6.5: Retrieve and rank with graph expressions

This blog describes how to retrieve and rank documents with graph expressions. First let's define exactly what it means to retrieve and rank with a graph expression and then we'll walk through an example.

The retrieve step is a relevance ranked search. The rank step re-ranks the top N documents based on the results of a graph expression.

Why would we want to do this? I think it's easiest to explain with an example.

Re-Ranking Based On A User's "Work Graph"

Before diving into the example, it's important to understand that this re-ranking strategy is designed to provide sub-second response times. It's also designed to adapt in real-time as users use the system and work graphs are updated.

Ok, let's dive into the example.

In this example, when users perform a search, the top N results are re-ranked to boost documents that are part of their work graph. To find a user's work graph, a graph expression mines usage logs in real time for documents that are closely related to the user's work.

This relevance strategy can be useful for systems where users are working with documents and performing searches to find documents. One example of this type of system is Alfresco, an Enterprise Content Management system that uses Solr for search. Alfresco logs when users read and edit documents. These logs can then be mined with graph expressions to discover users' work graphs.

The Re-Rank Expression

The re-rank expression looks like this:

top(n=50,
    sort="rescore desc",
    select(id,
           if(eq(nodeScore, null), score, mult(score, log(nodeScore))) as rescore,
           outerHashJoin(${search}, hashed=${graph}, on="id=node")))

Notice the outerHashJoin refers to ${search} and ${graph} variables. This is using Solr's built-in macro expansion capability. ${search} and ${graph} refer to HTTP parameters that point to the search and graph Streaming Expressions. This is a great way to break up long Streaming Expressions into manageable pieces and also create reusable parameterized templates.

We'll first explore the re-rank expression above, then we'll look at the ${search} and ${graph} expressions.

Let's start pulling apart the re-rank expression by looking at the outerHashJoin expression. The outerHashJoin expression joins two expressions, ${search} and ${graph}. The join keys are the id field from the ${search} tuples and the node field from the ${graph} tuples.

The outerHashJoin emits all tuples from the ${search} expression whether there is a matching tuple from the ${graph} expression or not. If a match is found from the ${graph} expression, its fields are added to the matching ${search} tuple.
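The join semantics can be sketched in Python: hash the ${graph} tuples by their join key, then emit every ${search} tuple, merging in the graph fields when the keys match. This is a sketch of the behavior, not Solr's implementation.

```python
# Sketch of outerHashJoin(${search}, hashed=${graph}, on="id=node"):
# every search tuple is emitted; graph fields are merged in on a match.

def outer_hash_join(search_tuples, graph_tuples, left_key, right_key):
    hashed = {g[right_key]: g for g in graph_tuples}
    for s in search_tuples:
        match = hashed.get(s[left_key])
        yield {**s, **match} if match else dict(s)

search = [{"id": "a", "score": 2.0}, {"id": "b", "score": 1.5}]
graph = [{"node": "a", "nodeScore": 3.0}]
print(list(outer_hash_join(search, graph, "id", "node")))
# [{'id': 'a', 'score': 2.0, 'node': 'a', 'nodeScore': 3.0},
#  {'id': 'b', 'score': 1.5}]
```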

We'll look at the specifics of the ${search} and ${graph} expressions below, but at a high level they are:

1) search: A full text search result.
2) graph: The documents that are closely related to the user's work, a.k.a. the user's work graph.

Let's move on to the select expression that is wrapping the outerHashJoin. The select function selects specific fields from tuples and performs field-level transformations on them. These field-level operations, known as Evaluators, were significantly expanded by Dennis Gove in Solr 6.5.

In the example, the select function operates over each tuple emitted by the outerHashJoin. It emits the id field for every tuple and a new derived field called rescore. 

The rescore field is derived from this formula:

if(eq(nodeScore, null), score, mult(score, log(nodeScore))) as rescore

This formula is expressed using the new Evaluators. Translated into plain English:

If the nodeScore field is null, use the score field.
Otherwise, multiply the score field by the natural log of the nodeScore field.

The nodeScore field is assigned to documents emitted by the ${graph} expression. It describes how relevant the document is to the user's work graph.

The score is the score assigned to documents by the ${search} expression. It describes how relevant the document is to the full text search.

Notice in the formula that the score is always present, but the nodeScore can be null. This is because only documents in the search result that are also in the user's work graph will have a nodeScore assigned during the outer join.

Also notice that tuples containing a nodeScore are boosted by multiplying the score by the log of the nodeScore. Documents without a nodeScore don't receive this boost. The net effect is to boost documents that are part of the user's work graph.
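The boost rule can be written as a one-line branch in Python. This is a sketch of the formula itself, not Solr code:

```python
import math

# The rescore rule: tuples without a nodeScore keep their text score;
# tuples in the work graph are boosted by the natural log of nodeScore.

def rescore(score, node_score=None):
    if node_score is None:
        return score
    return score * math.log(node_score)

print(rescore(2.0))        # not in the work graph -> 2.0
print(rescore(2.0, 10.0))  # boosted -> 2.0 * ln(10)
```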

In the final step the top expression emits the top 50 tuples sorted by rescore desc. This is the re-ranked result set.

We spent quite a bit of time going through the re-rank expression, so let's spend a little time on the ${search} and ${graph} expressions.

The Search Expression

In this example we'll use a very simple search expression that looks like this:

search(content,
       q="natural gas",
       fl="id, score",
       sort="score desc",
       rows="100")

This expression searches the content collection in the default field for the terms natural gas. The expression returns the id and score fields, sorted by score descending. The rows parameter is set to 100, which means it will fetch 100 rows from each shard, rather than 100 rows total. So if there are 4 shards this will return up to 400 results.

The search expression is really designed to provide input to other streaming expressions, so it simply merges the results from the shards into a single stream and maintains the sort order.
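The merge step above is a standard k-way merge of already-sorted streams. A Python sketch of the idea, using the standard library's heap-based merge over two hypothetical shard result lists:

```python
import heapq

# Each shard returns up to `rows` tuples already sorted by score desc;
# the streams are merged into one sorted stream without a global re-sort.

shard1 = [{"id": "a", "score": 9.0}, {"id": "c", "score": 4.0}]
shard2 = [{"id": "b", "score": 7.0}, {"id": "d", "score": 3.0}]

merged = heapq.merge(shard1, shard2, key=lambda t: -t["score"])
print([t["id"] for t in merged])
# → ['a', 'b', 'c', 'd']
```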

The Graph Expression

The graph expression is designed to query usage logs and return documents that are part of a user's work graph.

Here is the graph expression we will be using for this example:

scoreNodes(
    nodes(logs,
          top(n="20",
              sort="count(*) desc",
              nodes(logs,
                    nodes(logs,
                          walk="joel->userID",
                          gather="contentID"),
                    walk="node->contentID",
                    gather="userID",
                    count(*))),
          walk="node->userID",
          gather="contentID",
          count(*)))

Working our way outwards from the innermost nodes expression (Note that nodes is an alias for the gatherNodes expression):

1) The innermost nodes expression gathers all contentIDs from the logs where the userID is joel.
2) Working outwards, the next nodes expression takes all the contentIDs emitted in step 1 and gathers all the userIDs that have viewed those contentIDs. It also counts how many of joel's contentIDs each user has viewed.
3) The top expression emits the top 20 users that have viewed the most overlapping content with joel.
4) The outermost nodes expression gathers all the contentIDs viewed by the users emitted in step 3.
5) The scoreNodes expression scores all the contentIDs emitted by step 4. This adds the nodeScore field to the tuples, which describes how relevant each contentID is to the user's work graph.

This graph expression will emit all the contentIDs in the user's work graph. The contentID in each tuple will be in the node field. This is why the outerHashJoin in the re-rank expression joins the id field in the ${search} expression to the node field in the ${graph} expression.
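The five-step walk can be simulated over a toy usage log in Python. The log data and names below are illustrative, and the final scoring step is only approximated with raw counts, since scoreNodes applies its own relevance scoring to produce the nodeScore.

```python
from collections import Counter

# Toy simulation of the graph walk over (userID, contentID) view events.
logs = [("joel", "doc1"), ("joel", "doc2"),
        ("amy", "doc1"), ("amy", "doc3"),
        ("sam", "doc2"), ("sam", "doc4")]

# Step 1: contentIDs viewed by joel
joel_docs = {c for u, c in logs if u == "joel"}

# Step 2: other users who viewed those docs, with overlap counts
overlap = Counter(u for u, c in logs if c in joel_docs and u != "joel")

# Step 3: top N users by overlap (top 20 in the blog; 2 here)
top_users = [u for u, _ in overlap.most_common(2)]

# Step 4: contentIDs viewed by those users, with view counts
work_graph = Counter(c for u, c in logs if u in top_users)

# Step 5: scoreNodes would turn these counts into nodeScore values;
# here we just show the raw counts.
print(sorted(work_graph.items()))
# → [('doc1', 1), ('doc2', 1), ('doc3', 1), ('doc4', 1)]
```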
