Thursday, January 21, 2021

Optimizations Coming to Solr

 

Starting in Solr 8.8, and continuing into Solr 9.0, a number of optimizations are arriving that provide breakthroughs in performance for important use cases.


Optimized Self Join (https://issues.apache.org/jira/browse/SOLR-15049)


This optimization is a breakthrough in performance for document level access control. It is by far the fastest filter join implementation in the Solr project and is likely the fastest access control join available for search engines. 


This optimization requires that the joined documents reside in the same Solr core and that the join key field be the same for both sides of the join. For access control this means the access control records must be in the same core as the main document corpus and the join from access control records to the main documents must use the same field.  
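As a sketch, an access control filter of this kind can be expressed with Solr's join query parser as a filter query. The field and query values below are hypothetical; the essential point is that `from` and `to` name the same field and both the access control records and the main documents live in the same core:

```
fq={!join from=acl_group_s to=acl_group_s}acl_user_s:u12345
```

Here the inner query selects the access control records for a user, and the join filters the main corpus down to documents whose `acl_group_s` value matches one of those records.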


The optimization in this scenario is significant. The performance of the join allows it to scale to much larger joins. For example, joins involving upwards of 500,000 join keys can be executed with sub-second performance. In an access control setting this translates to 500,000+ access control groups that can be used to filter the main document set with sub-second performance. That represents more than a 10x increase in join size over the next fastest Solr filter join, which achieves sub-second performance with up to 50,000 access control groups.


A followup blog will discuss the technical details of this optimization and how it can be implemented in a sharded Solr Cloud installation. 

Tiered Aggregations (https://issues.apache.org/jira/browse/SOLR-15036)


This optimization is a breakthrough for large scale aggregations. A typical Solr aggregation using JSON facets has one aggregator node. In this scenario aggregations from all the shards are collected in one place and merged. This technique has limits in scalability because eventually the number of threads used to contact the shards, and the time and memory it takes to perform the merge, become prohibitive.


Tiered aggregations eliminate the single-aggregator bottleneck by setting up a tier of aggregator nodes. Each aggregator node performs a JSON facet aggregation for a subset of shards, and a single top-level aggregator node then merges the aggregations from the tier of aggregator nodes. The partitioning of the middle tier of aggregator nodes happens automatically when aggregating over a Solr Cloud alias which points to multiple collections. In this scenario an aggregator node is assigned to each collection in the alias.
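As a sketch, aggregating over an alias might look like the following. The alias, collection, and field names are hypothetical; the aggregation itself is an ordinary JSON facet request, sent to the alias rather than to a single collection:

```
# Create an alias spanning multiple collections
/admin/collections?action=CREATEALIAS&name=logs_all&collections=logs_q1,logs_q2,logs_q3,logs_q4

# An ordinary JSON facet request addressed to the alias
POST /solr/logs_all/query
{
  "query": "*:*",
  "facet": {
    "top_users": { "type": "terms", "field": "user_id_s" }
  }
}
```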


Tiered aggregation allows for real-time aggregations over very large clusters. For example, 200 aggregator nodes each calling 200 shards is a realistic scenario, providing real-time aggregations across 40,000 shards.


Improved export sorting performance and efficient high cardinality aggregations (https://issues.apache.org/jira/browse/SOLR-14608)


Both Solr and Elasticsearch have traditionally not been effective high cardinality aggregation engines.


Solr’s export handler has undergone a series of performance improvements culminating with a new technique for sorting that improves the throughput of some export queries by 100%. The improved sorting performance is part of a set of changes designed to support a performant and efficient high cardinality aggregation engine.


High cardinality aggregation often occurs in data warehousing due to multi-dimensional aggregations that result in a high number of dimensional combinations. Traditional search engine approaches to aggregations do not work well in high cardinality use cases.


Traditional faceted aggregation is not well suited to high cardinality aggregation because it tracks the full aggregation in memory at once. When performing high cardinality aggregation it's often not practical to track all dimensions in memory. Solr's export handler solves this by first sorting the result set on the aggregation dimensions and then rolling up aggregations one group at a time. Using this technique, high cardinality aggregations can be accomplished using a small amount of memory.
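The sort-then-rollup technique itself is simple to illustrate. Below is a minimal Python sketch, independent of Solr, of rolling up a pre-sorted stream one group at a time; only the current group's accumulator is ever held in memory, which is why the result set can have arbitrarily high cardinality:

```python
from itertools import groupby

def rollup(sorted_rows, key_fields, value_field):
    """Roll up a stream already sorted on key_fields, one group at a time.

    Because the input is sorted, each group arrives contiguously, so only
    the current group's accumulator is held in memory at any moment.
    """
    key = lambda row: tuple(row[f] for f in key_fields)
    for k, group in groupby(sorted_rows, key=key):
        yield k, sum(row[value_field] for row in group)

# Rows pre-sorted on the aggregation dimensions, as an export would emit them
rows = [
    {"region": "east", "product": "a", "sales": 10},
    {"region": "east", "product": "a", "sales": 5},
    {"region": "east", "product": "b", "sales": 7},
    {"region": "west", "product": "a", "sales": 3},
]
print(list(rollup(rows, ["region", "product"], "sales")))
# [(('east', 'a'), 15), (('east', 'b'), 7), (('west', 'a'), 3)]
```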


The export handler also now has the capability of running a Streaming Expression in memory over the sorted result set. This means high cardinality aggregations can be done inside the export handler allowing the export handler to return aggregated results. This can greatly reduce the amount of data that needs to be sent across the network to aggregate over the sorted/exported results. 
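For example, a Streaming Expression that rolls up over the export handler's sorted stream might look like the following sketch (the collection and field names are hypothetical; note that the sort order matches the rollup dimensions):

```
rollup(
  search(products, q="*:*", fl="region_s,product_s,sales_i",
         sort="region_s asc,product_s asc", qt="/export"),
  over="region_s,product_s",
  sum(sales_i))
```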


Spark-Solr aggregations will also benefit from the improved performance of the export handler because Spark-Solr uses the export handler to return results for aggregations.

Improved collapse performance/efficiency, block collapse (https://issues.apache.org/jira/browse/SOLR-15079)


Solr’s collapse feature is often used for large e-commerce catalogs whose high product counts don’t perform well with Lucene/Solr grouping. Solr’s collapse can now take advantage of block indexing/nested documents to significantly improve query performance, cutting search times in half in some scenarios, while decreasing the memory used by collapse by 99%. In order to take advantage of this feature, catalogs will need to be indexed such that all SKUs that share the same product ID are indexed in the same block.
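Assuming the SKUs are indexed as child documents in the same block as their product, block collapse is enabled with a hint on the collapse filter query. A sketch, with a hypothetical field name and the hint syntax as described in SOLR-15079:

```
fq={!collapse field=productId_s hint=block}
```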


The improved performance and efficiency will allow for more scalability (higher QPS with less hardware) and provide faster response times for e-commerce search applications. The improved performance also leaves more time to improve relevance with advanced ranking algorithms.


Solr temporal graph queries for event correlation, root cause analysis and temporal anomaly detection

Temporal graph queries will be available in the 8.9 release of Apache Solr. Temporal graph queries are designed for key log analytics use ...