Thursday, February 16, 2017

Solr's Shiny New Apache Calcite SQL Integration

Solr's new Apache Calcite SQL integration is in Lucene/Solr's master branch now. This blog will discuss Solr's potential as a distributed SQL engine, some thoughts on Apache Calcite, what's currently supported with Solr's Apache Calcite integration and what might be coming next.

Solr as Distributed SQL Engine

The initial SQL interface, which released with Solr 6.0, uses the Presto project's SQL parser to parse the SQL and then rewrites the queries as Solr Streaming Expressions.

The goals of the initial SQL release were focused around supporting SQL aggregations using both MapReduce and Solr's native faceting capabilities. And also to support Solr's full text query language in the SQL predicate.

Even with the limited set of goals it became clear that Solr could be a special SQL engine. There are very few distributed SQL engines that can push down so much processing into the engine, and also rise up above the engine when needed to perform streaming parallel relational algebra. And then of course there is the predicate. Few existing SQL engines can compete with Solr's rich search predicates which have been developed over a period of 10+ years.

Last but not least is the performance. Solr is not a batch engine adapted for request/response. Solr is a request/response engine from the ground up. Solr's search performance, analytic performance and instant streaming capabilities make it one of the fastest distributed SQL engines available.

The Apache Calcite Integration

Kevin Risden broke ground on Solr's Apache Calcite integration in March of 2016. It wasn't clear at that time that Apache Calcite would be the right fit, but over time it became clear that it was exactly what we needed. There were really two choices for how we could implement the next phase of Solr's SQL integration:

1) Stick with the approach of using a SQL Parser only, and work directly with the parse tree to build the physical query plan in Streaming Expressions.

2) Use a broader framework and plugin rules that would push down parts of the SQL query that we wanted to control.

There were pros and cons to both approaches. The main pro for using just a SQL parser is that we would have total control over the process once the query was parsed. This means we would never run into a scenario where we couldn't implement something that leveraged Solr's capabilities.

The main con to just using the SQL parser is that we would have been responsible for implementing everything, including things like a complete JDBC driver. This is not a small undertaking.

The main pros and cons of using a broader framework were exactly the opposite: less control but the ability to leverage existing features in the framework.

Kevin was very much in favor of using a broader framework, but I was not convinced that we could take full advantage of Solr's capabilities unless we controlled everything.

But in the end Kevin broke ground embedding the Apache Calcite framework into Solr. In the open source world, working code tends to win out. Based on his initial work, I agreed that we should move forward using the wider Apache Calcite framework.

Kevin continued working on the integration. Along the way Cao Manh Dat joined in and added the aggregation support to the branch that Kevin was working on.

Eventually I joined in as well, building on top of the work that Kevin and Dat already contributed. My main focus was to ensure that the initial Apache Calcite implementation was comparable in features and performance to Solr's existing SQL integration.

As I spent more time working with Apache Calcite I came to really appreciate what the project offered. Apache Calcite allows you to selectively push down parts of the SQL implementation such as the predicate, sort, aggregation and joins. It gives you almost full control if you want it, but allows you to leverage any part of the framework that you choose to use. For example you can push down nothing if you want, or you can push down just the predicate or just the sort.

Apache Calcite also provides two very important things: a cost based query optimizer and a JDBC driver. Solr's initial JDBC driver only implemented part of the specification and the specification is large. Implementing a cost based query optimizer is a daunting task. With Apache Calcite we get these features almost for free. We still have to provide hooks into them, but we don't have to implement them.

What's Currently Supported in Lucene/Solr Master

The initial Apache Calcite integration includes:

  • limited and unlimited selects. Unlimited selects stream the entire result set regardless of the size.
  • Support for Solr search predicates including support for embedding an entire Solr query using the "_query_" field. This means all Solr query syntax is supported including complex full text, graph query, geo-spatial, fuzzy, moreLikeThis etc.
  • Support for score in the field list and order by in queries that have a limit clause.
  • Support for field aliases.
  • Support for faceted and MapReduce aggregations with the aggregationMode parameter.
  • Support for aggregations on multi-valued fields when in facet mode.
  • Support for multi-value fields in simple selects.
  • Parallel execution of MapReduce aggregations on worker nodes.
  • Support for aggregations without a group by clause. These are always pushed down into the search engine.
  • Support for select distinct queries in both faceted and MapReduce mode.
  • Support for group by aggregations in both faceted and MapReduce mode.
  • Support for sorting on fields in the index as well as sorting on aggregations.
  • Support for the having clause. In MapReduce mode the having clause is pushed down to the worker nodes now.

What's Coming Next

Now that the Apache Calcite integration is in master we are free to begin adding new features on a regular basis. Here are some features that are on the top of the list:

1) Support for * in the field list.
2) Automatic selection of aggregationMode (facet or MapReduce). Having Solr choose the right aggregation mode based on the cardinality of fields being aggregated.
3) Support for SELECT ... INTO ...
4) Support for arithmetic operations on fields and aggregations (select (a*b) as c from t). This is now supported in Streaming Expressions (SOLR-9916).
5) Expanded aggregation support.
5) Support for UNION, INTERSECT, JOIN, using Streaming Expressions parallel relational algebra capabilities and Apache Calcites query optimizer.

New York - Coronavirus Statistics (NYTimes Data Set)

As of 2020-04-09 New York City - Cumulative Cases By Day New York City - Cumulative Deaths By Day ...