Wednesday, December 16, 2015

Understanding Solr's AnalyticsQuery

Back in Solr 4.9 an AnalyticsQuery API was added that allows developers to plug custom analytic logic into Solr. The AnalyticsQuery class provides a clean and simple API that gives developers access to all the rich functionality in Lucene and is strategically placed within Solr’s distributed search architecture. Using this API you can harness all the power of Lucene and Solr to write custom analytic logic.

Let’s dive in!

The AnalyticsQuery API is a natural extension of the PostFilter API. So let’s start with PostFilters and then we’ll move quickly to the AnalyticsQuery API.

PostFilters

Lucene has the concept of a Collector. A Collector collects the results that match a search.

The PostFilter API wraps a DelegatingCollector around a ranking collector to filter results. The DelegatingCollector is passed each search result, applies filtering logic to the result and then “delegates” the result to the ranking Collector, if it chooses.

DelegatingCollectors can wrap other DelegatingCollectors to form a pipeline. Each Delegating collector looks at the search result and delegates to the next collector. The order that DelegatingCollectors will be executed is controlled by a “cost” parameter that is built into Solr’s local parameter query syntax.

This was put in so that costly PostFilters could be run after less costly PostFilters, thus reducing the number of documents seen by the more costly PostFilter.

DelegatingCollectors also have a finish() method that notifies the collector that all documents have been collected.

The AnalyticsQuery API

The PostFilter API is very strategically positioned because it sees every search result. The AnalyticsQuery API capitilizes on that strategic positioning by adding a few key features.
  • It provides easy access to the ResponseBuilder object, so analytics data can be easily output and piped between AnalyticsQueries. 
  • It provides a way to insert a custom MergeStrategy, so that analytic output from the shards can be merged. 

Extending the AnalyticsQuery class

To hook into the AnalyticsQuery API you need to extend the AnalyticsQuery class. This class has two constructors:

public AnalyticsQuery();

Use the above constructor if you’re doing analytics on a single Solr server.

public AnalyticsQuery(MergeStrategy mergeStrategy);

Use the above constructor if you’re doing distributed analytics. If you use this constructor you’ll need to also write a MergeStrategy implementation. In my next blog I’ll cover the MergeStrategy API, but for now it’s enough to know that it exists, and what it’s used for.

Then you have to implement the hook method:

public abstract DelegatingCollector getAnalyticsCollector(ResponseBuilder rb, IndexSearcher searcher); 

This is where you return your DelegatingCollector implementation that will collect the analytics. Notice that you are passed the current IndexSearcher and ResponseBuilder. You can access both the request context and the Solr response objects through the response builder.

Through the request context you can pipeline the result of one AnalyticsQuery to another AnalyticsQuery, controlling the order of execution using the “cost” parameter. You can place your analytic output directly into the Solr response through the response object in the ResponseBuilder.

Below is a very simple class that extends AnalyticsQuery

 public class MyAnalyticsQuery extends AnalyticsQuery {

    public MyAnalyticsQuery(MergeStrategy mergeStrategy)
    {
      super(mergeStrategy);
    }

    public DelegatingCollector getAnalyticsCollector(ResponseBuilder rb, IndexSearcher searcher)
    {
      return new MyAnalyticsCollector(rb);
    }
  }

And here is a very simple implementation of a DelegatingCollector (MyAnalyticsCollector):

 class MyAnalyticsCollector extends DelegatingCollector {
    ResponseBuilder rb;
    int count;

    public MyAnalyticsCollector(ResponseBuilder rb) {
      this.rb = rb;
    }

    public void collect(int doc) throws IOException {
      ++count;
      leafDelegate.collect(doc);
    }

    public void finish() throws IOException {
      NamedList analytics = new NamedList();
      rb.rsp.add("analytics", analytics);
      analytics.add("mycount", count);
      if(this.delegate instanceof DelegatingCollector) {
        ((DelegatingCollector)this.delegate).finish();
      }
    }
  }
Notice that in the collect(int doc) method, it is incrementing a counter and then calling the collect method on it’s delegate. The collect method is called for each document in the result set. In the finish() method the count is placed into the output response. See the full API for the DelegatingCollector class to understand how it works.

Plugging In Your Code

Just like PostFilters, an AnalyticsQuery is passed in through a custom QParserPlugin as a filter query. Here is an example AnalyticsQuery being passed in through the “fq” parameter:

q=*:*&fq={!myanalytic param1=a param2=b cost=101}&fq={!myanalytic2 param1=a param2=b cost=102}

In the local parameter syntax above there are two filter queries pointing to custom QParserPlugins. The local parameter syntax “!myanalytic” tells Solr to load the “myanalytic” QParserPlugin, which will return a Query that extends AnalyticsQuery.

Notice the “cost” parameters which will control the order of execution. In this case the “myanalytic” AnalyticsQuery will execute ahead of the “myanalytic2″ AnalyticsQuery.

Interaction with the CollapsingQParserPlugin

The CollapsingQParserPlugin is a PostFilter that performs field collapsing. Because it’s a PostFilter it can be layered in among AnalyticsQueries using the “cost” parameter to control order of execution. So, you can conveniently collect analytics before and after field collapsing.

Streaming Expression's Powerful New Data Structures

In the next release of Solr, the Streaming Expression library includes two powerful new data structures called list and cell.  In this blog...