Sunday, March 29, 2015

Parallel Computing with SolrCloud

Coming in Solr 5.1 is a new general-purpose parallel computing framework for SolrCloud. This blog introduces the main concepts of the parallel computing framework. Follow-up blogs will cover these concepts in greater detail.

The Framework

The new parallel computing framework has three main components:
  • Shuffling
  • Worker Collections
  • Streaming API (Solrj.io)
An overview of each of these components follows.

Shuffling

Shuffling will be a familiar concept for people who have worked with other parallel computing frameworks such as Hadoop. In general terms, shuffling involves the sorting and partitioning of records. Sorting and partitioning provide the foundation for many parallel computing activities.

Starting with Solr 5.1, SolrCloud has the ability to shuffle entire result sets: every document that matches a query can be sorted and partitioned, not just the top-N results.

The efficient sorting of entire result sets is handled by Solr's "/export" handler, which uses a stream-sorting technique to sort and stream very large result sets.

The partitioning of result sets is handled by a new Solr query, called a HashQuery, which hash-partitions search results based on arbitrary fields in the documents.
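The idea behind hash partitioning can be sketched in a few lines of plain Java. This is a minimal illustration of the concept, not Solr's actual HashQuery implementation; the field name and partition count below are invented for the example:

```java
import java.util.List;
import java.util.Map;

class HashPartitionSketch {
    // Route a document to one of N partitions by hashing a field value.
    // Every document with the same field value lands in the same
    // partition, so a worker assigned partition p sees a complete,
    // non-overlapping slice of the result set.
    static int partition(Map<String, Object> doc, String field, int numPartitions) {
        int hash = doc.get(field).hashCode();
        return (hash & Integer.MAX_VALUE) % numPartitions; // mask keeps the value non-negative
    }

    public static void main(String[] args) {
        List<Map<String, Object>> results = List.of(
                Map.of("id", "doc1", "customer_id", "a"),
                Map.of("id", "doc2", "customer_id", "b"),
                Map.of("id", "doc3", "customer_id", "a"));

        for (Map<String, Object> doc : results) {
            System.out.println(doc.get("id") + " -> partition "
                    + partition(doc, "customer_id", 4));
        }
    }
}
```

Because documents that share a partitioning-field value always hash to the same partition, each worker can operate on its slice independently.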

Worker Collections

Solr 5.1 introduces the concept of Worker Collections. A Worker Collection is a standard SolrCloud collection of any size; it can contain search indexes, or it can exist solely to perform work within the parallel computing framework. Worker Collections are created and managed using SolrCloud's standard Collections API.

Each node in a Worker Collection has a new request handler called the Stream Handler. The Stream Handlers on the worker nodes perform operations on partitioned result sets in parallel.

The operations that worker nodes perform in parallel are defined by the Streaming API.

Streaming API (Solrj.io)

The Streaming API is a new Java API located in the solrj.io package. The Streaming API abstracts Solr search results as streams of Tuples called TupleStreams. A Tuple is a thin wrapper around a Map of key/value pairs that represent a search result.

The TupleStream interface defines a simple API for opening, reading and closing streams of search results.
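The shape of that contract can be shown with a stripped-down sketch. The class and interface bodies below are simplified stand-ins for the real solrj.io types, and ListStream is an invented toy implementation backed by an in-memory list rather than a live Solr instance:

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// A Tuple is a thin wrapper around a Map of key/value pairs.
class Tuple {
    final Map<String, Object> fields;
    final boolean eof; // marker tuple signalling the end of the stream

    Tuple(Map<String, Object> fields, boolean eof) {
        this.fields = fields;
        this.eof = eof;
    }

    Object get(String key) { return fields.get(key); }
}

// The core open/read/close contract of a TupleStream.
interface TupleStream {
    void open() throws java.io.IOException;
    Tuple read() throws java.io.IOException; // returns an EOF tuple when exhausted
    void close() throws java.io.IOException;
}

// A toy stream backed by an in-memory list, standing in for a stream
// of search results coming back from Solr.
class ListStream implements TupleStream {
    private final List<Map<String, Object>> docs;
    private Iterator<Map<String, Object>> it;

    ListStream(List<Map<String, Object>> docs) { this.docs = docs; }

    public void open() { it = docs.iterator(); }

    public Tuple read() {
        return it.hasNext() ? new Tuple(it.next(), false)
                            : new Tuple(Map.of(), true);
    }

    public void close() { it = null; }
}
```

The typical consumption pattern is: open the stream, read Tuples until one arrives with the EOF marker set, then close the stream.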

Two of the core TupleStream implementations are SolrStream and CloudSolrStream. The SolrStream class abstracts a stream of search results from a single Solr instance. The CloudSolrStream class abstracts a stream of search results from an entire SolrCloud collection.
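Producing one globally sorted stream from many shards can be illustrated with a classic k-way merge: each shard returns its results already sorted, and the smallest head value is emitted repeatedly. This is a self-contained sketch of the idea using in-memory integer lists, not the real solrj.io implementation:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

class MergeSketch {
    // Pairs the next value from a shard with the iterator over the rest of it.
    record Head(int value, Iterator<Integer> rest) {}

    // Merge already-sorted per-shard result lists into one globally
    // sorted list by always emitting the smallest head value.
    static List<Integer> merge(List<List<Integer>> sortedShards) {
        PriorityQueue<Head> heads = new PriorityQueue<>(Comparator.comparingInt(Head::value));
        for (List<Integer> shard : sortedShards) {
            Iterator<Integer> it = shard.iterator();
            if (it.hasNext()) heads.add(new Head(it.next(), it));
        }
        List<Integer> merged = new ArrayList<>();
        while (!heads.isEmpty()) {
            Head h = heads.poll();
            merged.add(h.value());
            if (h.rest().hasNext()) heads.add(new Head(h.rest().next(), h.rest()));
        }
        return merged;
    }
}
```

Because only the head of each shard stream is held at a time, the merge can proceed while results are still streaming in, without materializing the full result set in memory.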

TupleStreams can be wrapped, or decorated, by other TupleStreams. Decorator streams perform operations on their underlying streams. Streaming operations typically fall into one of two categories:

  • Streaming Transformations: operations that transform the underlying stream(s). Examples include unique, group by, rollup, union, intersect, complement, and join.
  • Streaming Aggregations: operations that gather metrics and build aggregations over the underlying stream(s). Examples include sum, count, average, min, and max.

A core set of TupleStream decorators comes with the Streaming API, and developers can create their own decorator implementations that perform customized streaming operations.
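The decorator pattern can be sketched with a plain Iterator standing in for a TupleStream, to keep the example self-contained. This invented UniqueStream mirrors the idea of the unique operation: it wraps an underlying stream that is sorted on a field and emits only the first result for each distinct value of that field:

```java
import java.util.Iterator;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Objects;

// A decorator that wraps an underlying stream of result Maps and emits
// only the first result for each value of a given field. Assumes the
// underlying stream is sorted on the de-duplication field, so equal
// values arrive consecutively. (A plain Iterator stands in for a
// TupleStream to keep the sketch self-contained.)
class UniqueStream implements Iterator<Map<String, Object>> {
    private final Iterator<Map<String, Object>> inner;
    private final String field;
    private Map<String, Object> next;
    private Object lastValue;

    UniqueStream(Iterator<Map<String, Object>> inner, String field) {
        this.inner = inner;
        this.field = field;
        advance();
    }

    // Pull from the underlying stream until a new field value appears.
    private void advance() {
        next = null;
        while (inner.hasNext()) {
            Map<String, Object> candidate = inner.next();
            Object value = candidate.get(field);
            if (!Objects.equals(value, lastValue)) {
                lastValue = value;
                next = candidate;
                return;
            }
        }
    }

    public boolean hasNext() { return next != null; }

    public Map<String, Object> next() {
        if (next == null) throw new NoSuchElementException();
        Map<String, Object> result = next;
        advance();
        return result;
    }
}
```

Because the decorator only compares each result to the previous one, it needs constant memory no matter how large the underlying stream is, which is why these operations compose well over very large sorted result sets.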

The Streaming API also includes a ParallelStream implementation. A ParallelStream wraps a TupleStream and pushes it to a Worker Collection to be executed in parallel. Using partition keys, the TupleStream is partitioned evenly across the worker nodes, which allows the Streaming API to perform parallel, partitioned relational algebra on TupleStreams.

For a deeper dive into the Solrj.io package you can read the next blog in the series.


