Sunday, April 23, 2017

Streaming Expression's Powerful New Data Structures

In the next release of Solr, the Streaming Expression library includes two powerful new data structures called list and cell. In this blog we'll first explore the data structures individually and then explore the exciting expressions we can build when the data structures are combined.

List


The list expression holds a list of Streaming Expressions. List has the following syntax:

list(expr, expr, expr ...)

The list expression emits the tuples from each expression in the list sequentially. In effect, it concatenates the streams.

The example below shows a list of echo expressions:

list(echo("one"),
      echo("two"),
      echo("three"))

In the expression above each echo expression returns a single tuple that echoes its text parameter. The list expression emits the tuples as a single stream.

If you send this expression to Solr the response would look like this:

{ "result-set": { "docs": [ { "echo": "one" }, { "echo": "two" }, { "echo": "three" }, { "EOF": true, "RESPONSE_TIME": 0 } ] } }
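For readers who want to try this, here is a rough sketch of how a client might build the request. It assumes the expression is passed in the expr parameter of a collection's /stream handler. The host, port and collection name (collection1) are placeholders, and the sketch only builds the URL rather than calling Solr.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def stream_url(base_url, collection, expr):
    """Build a /stream request URL with the expression URL-encoded in 'expr'."""
    return f"{base_url}/solr/{collection}/stream?" + urlencode({"expr": expr})

expr = 'list(echo("one"), echo("two"), echo("three"))'

# Host, port and collection name are placeholders for illustration only.
url = stream_url("http://localhost:8983", "collection1", expr)

# Decoding the query string again gives back the original expression.
decoded = parse_qs(urlsplit(url).query)["expr"][0]
```

The URL can then be fetched with any HTTP client.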


The list expression doesn't actually hold any data itself. It simply emits the tuples from the underlying streams. This means that the list expression starts streaming results as soon as the first expression begins streaming results.
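One way to picture this behavior is with Python generators. The sketch below is only a model of the semantics described above, not Solr's Java implementation. The name list_stream is made up here to avoid shadowing Python's built-in list.

```python
from itertools import chain

def echo(text):
    """Model of echo: a stream that yields a single tuple."""
    yield {"echo": text}

def list_stream(*streams):
    """Model of list: emit each stream's tuples in order, holding no data itself."""
    return chain.from_iterable(streams)

stream = list_stream(echo("one"), echo("two"), echo("three"))

# The first tuple is available before the later streams have been read.
first = next(stream)
rest = list(stream)
```

Because chain is lazy, nothing is buffered: each tuple is pulled from the underlying stream only when the caller asks for it.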


Cell


The cell expression flattens a stream and emits it in a single tuple. The cell expression has the following syntax:

cell(name, expr)

The cell expression emits a single tuple with a single key/value pair. The key is the name parameter. The cell expression gathers up all the tuples from the Streaming Expression parameter and adds them to a list. The list of tuples is the value of the pair.

Here is an example:

cell(cell1,
      list(echo("one"),
            echo("two"),
            echo("three")))

Note that you could swap out the list expression in the example with any Streaming Expression (search, facet, stats, topic, nodes, etc.).

If you send this expression to Solr the response looks like this:

{ "result-set": { "docs": [ { "cell1": [ { "echo": "one" }, { "echo": "two" }, { "echo": "three" } ] }, { "EOF": true, "RESPONSE_TIME": 0 } ] } }

Notice that the output from the list of echoes has now been gathered into a JSON array, which is the value of the cell1 attribute.
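The same gathering step can be modeled in a few lines of Python. Again, this is only a sketch of the semantics, with made-up helper names (concat stands in for the list expression), and not how Solr implements cell.

```python
def echo(text):
    """Model of echo: a stream that yields a single tuple."""
    yield {"echo": text}

def concat(*streams):
    """Stand-in for the list expression: emit each stream's tuples in order."""
    for stream in streams:
        yield from stream

def cell(name, stream):
    """Model of cell: read the whole stream, then emit one tuple holding all of it."""
    yield {name: list(stream)}

result = list(cell("cell1", concat(echo("one"), echo("two"), echo("three"))))
```

Unlike the list expression, cell has to read its inner stream to the end before it can emit anything, which is why bounded result sets matter.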

List of Cells


Now let's explore what we can do with a list of cells. For this example we'll move away from simple echoes into a more real-world scenario.

Consider the following syntax:

list(cell(query1, search(...)),
      cell(query2, search(...)),
      cell(graph,  gatherNodes(...)),
      cell(facet1,  facet(...)),
      cell(facet2, facet(...)),
      cell(stats1, list(stats(...),
                               stats(...),
                               stats(...))),
      cell(recommend, significantTerms(...)))

Wow, what is going on here? Well something pretty exciting...

The expression above is performing multiple searches, a graph expression, multiple facet and stats expressions and a significantTerms expression. The results of each of these expressions will be nicely separated into lists of tuples which can be accessed by their named attribute.

Each of these expressions can have different queries and access different collections. Other expressions can access databases using the jdbc expression. And custom expressions can be added that stream from other data sources.

Note that you would want to use expressions that return bounded result sets with this approach. For example a search expression can be used to return the top N search results.


Streaming Response


In the example above the list expression will move sequentially through each cell. This means that data begins streaming as soon as the first cell returns tuples.

This is very different from a standard Solr response, which gathers docs, facets, stats, etc. and only sends the response when all the data is collected.

The effect of this is that data will start flowing much sooner with the streaming approach. The total time to deliver the complete response will be longer, though, because the normal Solr query path maximizes throughput.

But if the goal is to return something to the user as fast as possible, the streaming approach works better.
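The ordering can be illustrated with a small Python model. The slow_query function below is a hypothetical stand-in for search, facet or stats, and it records when it actually runs. This is a sketch of the behavior described above, not a measurement of Solr.

```python
evaluated = []

def slow_query(name, rows):
    """Hypothetical stand-in for search/facet/stats; logs when it actually runs."""
    evaluated.append(name)
    for row in rows:
        yield row

def cell(name, stream):
    """Model of cell: gather the inner stream into one tuple."""
    yield {name: list(stream)}

def list_stream(*streams):
    """Model of list: move through the streams one after another."""
    for stream in streams:
        yield from stream

response = list_stream(
    cell("query1", slow_query("query1", [{"id": "1"}])),
    cell("facet1", slow_query("facet1", [{"cat": "a", "count": 3}])),
)

first = next(response)                 # the first cell is ready to send
ran_before_first_cell = list(evaluated)
remaining = list(response)             # only now does the second query run
```

When the first cell is handed to the caller, only query1 has run. A response that is assembled in full before sending would have to wait for facet1 as well.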

   

Wednesday, April 19, 2017

Having a chat with Solr using the new echo Streaming Expression

In the next release of Solr, there is a new and interesting Streaming Expression called echo.

echo is a very simple expression with the following syntax:

echo("Hello World")

If we send this to Solr, it responds with:

{ "result-set": { "docs": [ { "echo": "Hello World" }, { "EOF": true, "RESPONSE_TIME": 0 } ] } }

Solr simply echoes the text back, but maybe it feels a bit like Solr is talking to us. Like there might be someone there.

Well it turns out that this simple exchange is the first step towards a more meaningful conversation.

Let's take another step:

classify(echo("Customer service is just terrible!"),
             model(models, id="sentiment"),
             field="echo",
             analyzerField="message_t")

Now we are echoing text to a classifier. The classify function points to a model stored in Solr that does sentiment analysis based on the text. Notice that the classify function has an analyzerField parameter. It identifies the Lucene/Solr analyzer that the classify function uses to pull the features from the text (see this blog for more details on the classify function).

If we send this to Solr we may get a response like this:

{ "result-set": { "docs": [ { "echo": "Customer service is just terrible!",
"probability_d":0.94888 }, { "EOF": true, "RESPONSE_TIME": 0 } ] } }

The probability_d field is the probability that the text has a negative sentiment. In this case there was roughly a 95% probability that the text was negative.
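On the client side, pulling the score out of a response shaped like the one above could look like the sketch below. The helper name tuples is made up for this example, and the response text is simply the sample from this post.

```python
import json

# The sample response shown above, pasted in as a string.
response_text = """
{ "result-set": { "docs": [
    { "echo": "Customer service is just terrible!", "probability_d": 0.94888 },
    { "EOF": true, "RESPONSE_TIME": 0 } ] } }
"""

def tuples(text):
    """Return the data tuples, dropping the trailing EOF marker tuple."""
    docs = json.loads(text)["result-set"]["docs"]
    return [doc for doc in docs if not doc.get("EOF")]

scored = tuples(response_text)[0]
probability = scored["probability_d"]
```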

Now Solr knows something about what's being said. We can wrap other Streaming Expressions around this to take actions or begin to formulate a response.

But we really don't yet have enough information to make a very informed response.

We can take this a bit further.

Consider this expression:

select(echo("Customer service is just terrible!"),
           analyze(echo, analyzerField) as expr_s)

The expression above uses the select expression to echo the text to the analyze Stream Evaluator. The analyze Stream Evaluator applies a Lucene/Solr analyzer to the text and returns a token stream. But in this case it returns a single token which is a Streaming Expression.

(See this blog for more details on the analyze Stream Evaluator)

In order to make this work you would define the final step of the analyzer chain as a token filter that builds a Streaming Expression based on the natural language parsing done earlier in the analyzer chain.

Now we can wrap this construct in the new eval expression:

eval(select(echo("Customer service is just terrible!"),
                  analyze(echo, analyzerField) as expr_s))

The eval expression will compile and run the Streaming Expression created by the analyzer.  It will also emit the tuples that are emitted by the compiled expression. The tuples emitted are the response to the natural language request.

The heavy lifting is done in the analysis chain which performs the NLP and generates the Streaming Expression response.

Streaming Expressions as an AI Language

Before Streaming Expressions existed, Dennis Gove shared an email with me with his initial design for the Streaming Expression syntax. The initial syntax used Lisp-like S-Expressions. I took one look at the S-Expressions and realized we were building an AI language. I'll get into more detail about how this syntax ties into AI shortly, but first a little more history on Streaming Expressions.

The S-Expressions were replaced with the more familiar function syntax that Streaming Expressions has today. This decision was made by Dennis and Steven Bower. It turned out to be the right call because we now have a more familiar syntax than Lisp but we also kept many of Lisp's most important qualities.

Dennis contributed the Streaming Expression parser and I began looking for something interesting to do with it. The very first thing I tried to do with Streaming Expressions was to re-write SQL queries as Streaming Expressions for the Parallel SQL interface. For this project a SQL parser was used to parse the queries and then a simple planner was built that generated Streaming Expressions to implement the physical query plan.

This was an important proving ground for Streaming Expressions for a number of reasons. It proved that Streaming Expressions could provide the functionality needed to implement the SQL query plans. It proved that Streaming Expressions could push functionality down into the search engine and also rise above the search engine using MapReduce when needed.

Most importantly from an AI standpoint it proved that we could easily generate Streaming Expressions programmatically. This was one of the key features that made Lisp a useful AI Language. The reason that Streaming Expressions are so easily generated is that the syntax is extremely regular. There are only nested functions. And because Streaming Expressions have an underlying Java object representation, we didn't have to do any String manipulation. We could work directly with the Object tree structure to build the expressions.
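To make the idea concrete, here is a small Python sketch of building an expression as an object tree and rendering it to text only at the end. The Expr class and the quoted helper are invented for this illustration and are not Solr's Java classes. The sketch also assumes the quoted text contains no double quotes.

```python
from dataclasses import dataclass, field

@dataclass
class Expr:
    """Hypothetical tree node: a function name plus positional and named parameters."""
    name: str
    args: list = field(default_factory=list)
    named: dict = field(default_factory=dict)

    def render(self):
        parts = [render_param(a) for a in self.args]
        parts += [f"{key}={render_param(value)}" for key, value in self.named.items()]
        return f"{self.name}({', '.join(parts)})"

def render_param(value):
    """Nested expressions render recursively; everything else is used as-is."""
    return value.render() if isinstance(value, Expr) else str(value)

def quoted(text):
    """Wrap a literal in double quotes (assumes the text itself has none)."""
    return '"' + text + '"'

# Build the classify example from earlier in this post as a tree.
expr = Expr(
    "classify",
    args=[
        Expr("echo", [quoted("Customer service is just terrible!")]),
        Expr("model", ["models"], {"id": quoted("sentiment")}),
    ],
    named={"field": quoted("echo"), "analyzerField": quoted("message_t")},
)
text = expr.render()
```

Because every node has the same shape, a planner or an NLP step can assemble and rearrange the tree without ever parsing strings.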

Why is code generation important for AI? One of the reasons is shown earlier in this blog. A core AI use case is to respond to natural language requests. One approach to doing this is to analyze the text request and then generate code to implement a response. In many ways it's similar to the problem of translating SQL to a physical query plan.

In a more general sense code generation is important in AI because you're dealing with many unknowns so it can be difficult to code everything up front. Sometimes you may need to generate logic on the fly.

Domain Specific Languages

Lisp has the capability of adapting its syntax for specific domains through its powerful macro feature. Streaming Expressions has this capability as well, but it does it a different way.

Each Streaming Expression is implemented in Java under the covers. Each Streaming Expression is responsible for parsing its own parameters. This means you can have Streaming Expressions that invent their own little languages. The select expression is a perfect example of this.

The basic select expression looks like this:

select(expr, fielda as outField)

This select reads tuples from a stream and outputs fielda as outField. The Streaming Expression parser has no concept of the word "as". This is specific to the select expression and the select expression handles the parsing of "as".

This works because, under the covers, each Streaming Expression sees its parameters as a list that it can manipulate any way it wants.
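A toy version of this idea in Python might look like the following. It models a select that receives its raw parameters as a list and interprets "as" on its own. The function names are made up, and the real parsing in Solr's Java code may differ.

```python
def parse_select_params(params):
    """Turn raw parameters such as 'fielda as outField' into (source, output) pairs."""
    mappings = []
    for raw in params:
        words = raw.split()
        if len(words) == 3 and words[1] == "as":
            mappings.append((words[0], words[2]))   # rename the field
        elif len(words) == 1:
            mappings.append((words[0], words[0]))   # keep the field name
        else:
            raise ValueError(f"cannot parse select parameter: {raw!r}")
    return mappings

def select(stream, *params):
    """Model of select: the expression, not a central parser, gives 'as' its meaning."""
    mappings = parse_select_params(params)
    for tup in stream:
        yield {out: tup[src] for src, out in mappings if src in tup}

renamed = list(select([{"fielda": 1, "other": 2}], "fielda as outField"))
```

The outer parser only needs to hand over the parameter list. What "as" means is decided entirely inside select.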

Embedded In a Search Engine

Having an AI language embedded in a search engine is a huge advantage. It allows expressions to leverage vast amounts of information in interesting ways. The inverted index already has important statistics about the text which can be used for machine learning. Search engines have strong facilities for working with text (tokenizers, filters, etc.) and in recent years they've become powerful column stores for numeric calculations. They also have mature content ingestion and parallel query frameworks.

Now there is a language that ties it all together.
