Sunday, April 23, 2017

Streaming Expression's Powerful New Data Structures

In the next release of Solr, the Streaming Expression library includes two powerful new data structures called list and cell. In this blog we'll first explore the data structures individually and then explore the exciting expressions we can build when the data structures are combined.

List


The list expression holds a list of Streaming Expressions. List has the following syntax:

list(expr, expr, expr ...)

The list expression emits the tuples from each expression in the list sequentially. In effect, it concatenates the streams.

The example below shows a list of echo expressions:

list(echo("one"),
      echo("two"),
      echo("three"))

In the expression above, each echo expression returns a single tuple that echoes its text parameter. The list expression emits the tuples as a single stream.

If you send this expression to Solr the response would look like this:

{ "result-set": { "docs": [ { "echo": "one" }, { "echo": "two" }, { "echo": "three" }, { "EOF": true, "RESPONSE_TIME": 0 } ] } }
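For readers who want to try this, here is a minimal sketch of how the request could be built in Python. It assumes a local Solr at http://localhost:8983/solr and a collection named collection1 (both are placeholders, not from this post) and posts the expression as the expr parameter of the /stream handler. The sketch only builds the request; it does not send it.

```python
from urllib.parse import urlencode

# Assumptions for illustration only: a local Solr and a hypothetical collection name.
SOLR_BASE = "http://localhost:8983/solr"
COLLECTION = "collection1"

expr = 'list(echo("one"), echo("two"), echo("three"))'

def stream_request(expr, collection=COLLECTION, base=SOLR_BASE):
    """Return the /stream URL and the form-encoded POST body for an expression."""
    url = f"{base}/{collection}/stream"
    body = urlencode({"expr": expr}).encode("utf-8")
    return url, body

url, body = stream_request(expr)
print(url)
print(body)

# With a running Solr, the request could be sent with, for example:
#   urllib.request.urlopen(url, data=body)
# and the JSON response read from the returned object.
```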


The list expression doesn't actually hold any data itself. It simply emits the tuples from the underlying streams. This means that the list expression starts streaming results as soon as the first expression begins streaming results.
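To make the concatenation behavior concrete, here is a toy model in Python. It is not how Solr implements list; the echo and list_expr functions are hypothetical stand-ins built on generators, used only to show that tuples are passed through in order without being buffered.

```python
from itertools import chain

def echo(text):
    """Toy stand-in for echo: a stream that yields a single tuple."""
    yield {"echo": text}

def list_expr(*streams):
    """Toy stand-in for list: emit each stream's tuples, one stream after another."""
    return chain.from_iterable(streams)

# Draining the whole stream gives the tuples in the order the expressions were listed.
tuples = list(list_expr(echo("one"), echo("two"), echo("three")))
print(tuples)

# Nothing is held back: the first tuple can be read without draining the rest.
first = next(list_expr(echo("one"), echo("two"), echo("three")))
print(first)
```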


Cell


The cell expression flattens a stream and emits it in a single tuple. The cell expression has the following syntax:

cell(name, expr)

The cell expression emits a single tuple with a single key/value pair. The key is the name parameter. The cell expression gathers up all the tuples from the Streaming Expression parameter and adds them to a list. The list of tuples is the value of the pair.

Here is an example:

cell(cell1,
      list(echo("one"),
            echo("two"),
            echo("three")))

Note that you could swap out the list expression in the example for any Streaming Expression (search, facet, stats, topic, nodes, etc.).

If you send this expression to Solr the response looks like this:

{ "result-set": { "docs": [ { "cell1": [ { "echo": "one" }, { "echo": "two" }, { "echo": "three" } ] }, { "EOF": true, "RESPONSE_TIME": 0 } ] } }

Notice that the output from the list of echoes has now been gathered into a JSON array, which is stored under the cell1 key.
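The gathering step can be modeled in a few lines of Python. This is a sketch of the semantics described above, not Solr's implementation; echo, concat, and cell are hypothetical helper names, with concat standing in for the list expression.

```python
def echo(text):
    """Toy stand-in for echo: a stream that yields a single tuple."""
    yield {"echo": text}

def concat(*streams):
    """Toy stand-in for list: emit the tuples of each stream in turn."""
    for stream in streams:
        yield from stream

def cell(name, stream):
    """Toy stand-in for cell: drain the stream and emit one tuple holding all of it."""
    yield {name: list(stream)}

result = list(cell("cell1", concat(echo("one"), echo("two"), echo("three"))))
print(result)
```

Unlike the list model, cell has to read its whole input before it can emit anything, which is why it should wrap expressions with bounded results.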

List of Cells


Now let's explore what we can do with a list of cells. For this example we'll move away from simple echoes to a more realistic scenario.

Consider the following syntax:

list(cell(query1, search(...)),
      cell(query2, search(...)),
      cell(graph, gatherNodes(...)),
      cell(facet1, facet(...)),
      cell(facet2, facet(...)),
      cell(stats1, list(stats(...),
                        stats(...),
                        stats(...))),
      cell(recommend, significantTerms(...)))

Wow, what is going on here? Well something pretty exciting...

The expression above is performing multiple searches, a graph expression, multiple facet and stats expressions and a significantTerms expression. The results of each of these expressions will be nicely separated into lists of tuples which can be accessed by their named attribute.
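On the client side, reading the cells by name could look something like the sketch below. The response text is an assumed shape, extrapolated from the single-cell response shown earlier for a list of two cells wrapping echo expressions; cells_by_name is a hypothetical helper, not part of any Solr client.

```python
import json

# Assumed response shape for: list(cell(cell1, echo("one")), cell(cell2, echo("two")))
response_text = """
{ "result-set": { "docs": [
    { "cell1": [ { "echo": "one" } ] },
    { "cell2": [ { "echo": "two" } ] },
    { "EOF": true, "RESPONSE_TIME": 0 } ] } }
"""

def cells_by_name(response):
    """Collect each cell tuple into one dict keyed by cell name, stopping at EOF."""
    cells = {}
    for doc in response["result-set"]["docs"]:
        if doc.get("EOF"):
            break
        cells.update(doc)
    return cells

cells = cells_by_name(json.loads(response_text))
print(cells["cell1"])
print(cells["cell2"])
```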

Each of these expressions can have different queries and access different collections. Other expressions can access databases using the jdbc expression. And custom expressions can be added that stream from other data sources.

Note that you would want to use expressions that return bounded result sets with this approach. For example a search expression can be used to return the top N search results.


Streaming Response


In the example above the list expression will move sequentially through each cell. This means that data begins streaming as soon as the first cell returns tuples.

This is very different than a standard Solr response, which gathers docs, facets, stats, etc. and only sends the response when all the data is collected.

The effect of this is that the first results arrive much sooner with the streaming approach. The total time to return everything will be longer, though, because the normal Solr query path maximizes throughput.

But if the goal is to return something to the user as fast as possible, the streaming approach works better.
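The ordering described in this section can be illustrated with a toy model in Python. This is only a sketch of sequential, lazy evaluation using generators, under the assumption that each cell's expression is not run until the list reaches it; source, cell, and concat are hypothetical stand-ins, and the events list simply records what happens in what order.

```python
events = []

def source(name, count):
    """Toy stand-in for a bounded expression; logs when it actually starts running."""
    events.append(f"start {name}")
    for i in range(count):
        yield {"src": name, "i": i}

def cell(name, stream):
    """Toy stand-in for cell: drain the stream and emit a single tuple."""
    yield {name: list(stream)}

def concat(*streams):
    """Toy stand-in for list: move through the streams one after another."""
    for stream in streams:
        yield from stream

stream = concat(cell("a", source("a", 2)), cell("b", source("b", 2)))

first = next(stream)               # the client receives cell "a" here
events.append("client got a")
rest = list(stream)                # only now does source "b" start running

print(events)
```

In this model the client already holds the first cell before the second expression has started, which is the property that makes the streaming approach feel fast to the user.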

   
