Sunday, March 12, 2017

Solr 6.5: Retrieve and rank with graph expressions

This blog describes how to retrieve and rank documents with graph expressions. First let's define exactly what it means to retrieve and rank with a graph expression and then we'll walk through an example.

The retrieve step is a relevance ranked search. The rank step re-ranks the top N documents based on the results of a graph expression.

Why would we want to do this? I think its easiest to explain this with an example.

Re-Ranking Based On A Users "Work Graph"


Before diving into the example, it's important to understand that this re-ranking strategy is designed to provide sub-second response times. It's also designed to adapt in real-time as users use the system and work graphs are updated.

Ok, let's dive into the example.

In this example when users perform a search the top N results are re-ranked to boost documents that are part of their work graph. To find a users work graph, a graph expression is used to mine usage logs in real time to find documents that are closely related to the users work.

This relevance strategy can be useful for systems where users are working with documents and performing searches to find documents. One example of this type of system is Alfresco, which provides an Enterprise Content Management system, that uses Solr for search. Alfresco logs when users read and edit documents. These logs can then be mined with graph expressions to discover users work graphs.


The Re-Rank Expression


The re-rank expression looks like this:

top(n=50,
      sort="rescore desc",
      select(id,
                if(eq(nodeScore, null), score, mult(score, log(nodeScore))) as rescore,
                outerHashJoin(${search}, hashed=${graph}, on="id=node")))

Notice the outerHashJoin refers to ${search} and ${graph} variables. This is using Solr's built in macro expansion capability. ${search} and ${graph} are referring to http parameters that point to the search and graph Streaming Expressions. This is a great way to break up long Streaming Expressions into manageable pieces and also create re-usable parameterized templates.

We'll first explore the re-rank expression above, then we'll look at the ${search} and ${graph} expressions.

Let's start pulling apart the re-rank expression by looking at the outerHashJoin expression. The outerHashJoin expression joins two expressions ${search} and ${graph}. The join keys are the id field from the ${search} tuples and node field from the ${graph} tuples.

The outerHashJoin emits all tuples from the ${search} expression whether there is a matching tuple from the ${graph} expression or not. If there is a match found from the ${graph} expression then it's fields are added to the matching ${search} tuple.

We'll look at the specifics of the ${search} and ${graph} expression below, but at a high level they are:

1) search: A full text search result.
2) graph: The documents that are closely related to the users work a.k.a. the users work graph.

Let's move on to the select expression that is wrapping the outerHashJoin. The select function selects specific fields from tuples and performs field level transformations on tuples. These field level operations known as Evaluators were significantly expanded by Dennis Gove in Solr 6.5.

In the example, the select function operates over each tuple emitted by the outerHashJoin. It emits the id field for every tuple and a new derived field called rescore. 

The rescore field is derived from a specific formula in red below:

if(eq(nodeScore, null), score, mult(score, log(nodeScore))) as rescore

This formula is expressed using the new Evaluators. Translated into plain english the formula is:

if the nodeScore field is null, then use the score field.
else
multiply the score field by the natural log of the nodeScore field.

The nodeScore field is assigned to documents emitted by the ${graph} expression. It describes how relevant the document is to the users work graph.

The score is the score assigned to documents by the ${search} expression. It describes how relevant the document is to the full text search.

Notice in the formula that the score is always present. But the nodeScore can be null. This is because only documents in the search result that are in the users work graph will have a nodeScore assigned during the outer join.

Also notice the tuples that contain a nodeScore are boosted by multiplying the log of the nodeScore and the score. The documents that don't have a nodeScore don't receive this boost. This boosts documents that are part of the users work graph.

In the final step the top expression emits the top 50 tuples sorted by rescore desc. This is the re-ranked result set.

We spent quite a bit of time going through the re-rank expression, so let's spend a little time on the ${search} and ${graph} expressions.


The Search Expression


In this example we'll use a very simple search expression that looks like this:

search(content,
           q="natural gas",
           fl="id, score",
           rows="100",
           sort="score desc")

This expression searches the content collection in the default field for the terms natural gas. The expression will return the id and score fields and sort by score descending. The rows parameter is set to 100 which means it will fetch 100 rows from each shard, rather the 100 rows total. So if there are 4 shards this will return up to 400 results.

The search expression is really designed to provide input to other streaming expressions, so it simply merges the results from the shards into a single stream and maintains the sort order.

The Graph Expression


The graph expression is designed to query usage logs to return documents that are part of a users work graph.

Here is the graph expression we will be using for this example:

scoreNodes(nodes(logs,
                               top(n=20,
                                       sort="count(*) desc",
                                       nodes(logs,
                                                  nodes(logs,
                                                             walk="joel->userID",
                                                             gather="contentID"),
                                                  walk="node->contentID",
                                                  gather="userID",
                                                  count(*))),
                               walk="node->userID",
                               gather="contentID",
                               count(*)))    


Working our way outwards from the innermost nodes expression (Note that nodes is an alias for the gatherNodes expression):

1) The innermost nodes expression gathers all contentID's from the logs where the userID is joel.
2) Working outwards, the next nodes expression takes all the contentID's emitted from step 1 and gathers all the userID's that have viewed these contentID's. It also counts how many contentID's each user has viewed that joel has viewed.
3) The top expression emits the top 20 users that have viewed the most overlapping content with joel.
4) The outermost nodes expression gathers all the contentID's viewed by the users emitted in step 3.
5) The scoreNodes expression scores all the contentID's emitted by step 4. This adds the nodeScore field to the tuples which describes how relevant each contentID is to the users work graph.

This graph expression will emit all the contentID's in the users work graph. The contentID in each tuple will be in the node field. This is why the outerHashJoin in the re-rank expression is joining the id field in the ${search} expression to the node field in the ${graph} expression.

Solr temporal graph queries for event correlation, root cause analysis and temporal anomaly detection

Temporal graph queries will be available in the 8.9 release of Apache Solr. Temporal graph queries are designed for key log analytics use ...