1) A Text Classifier: Solr's Streaming Expressions allow you to train, store, and deploy a text classifier. With a text classifier you can train a model to determine the probability that a document belongs to a certain class. This provides the brains behind the alert.
2) A Messaging System: Solr's Streaming Expressions library provides the topic function, which allows clients to subscribe to a query. Once the subscription is established the topic provides one-time delivery of documents that match the topic query. This provides the nervous system of the alerting engine.
3) Actors: Solr's Streaming Expressions library provides the daemon function, which are processes that live inside Solr. Daemon's have the ability to subscribe to topics, apply a classifier, update collections and take other actions. This provides the muscle behind the alerts.
Use Case: A Threat Alerting Engine
Let's walk through the steps for building an engine that detects threats in social media posts and sends alerts.
Step 1: Train a model that classifies threats
To train the model we need to assemble a training set comprised of positive and negative examples of the class. In this case we need to gather a set of social media posts that are threats and another set of social media posts that are not threats.
Once we have our training set we can load the data into a Solr Cloud collection. Then we can use the features, train and update functions to extract the key features, train the model (based on the features) and store the model iterations in a Solr Cloud collection :
Let's explore this expression in more detail.
The features function extracts the key features (terms) from the training data. The features function scores terms using a technique known as Information Gain to determine which features are the most important in distinguishing the positive and negative set. The features function also extracts the term statistics from the training set needed to build the model. The online documentation contains more detailed information for the features function.
The train function uses the terms and statistics emitted by the features function to train the model. The train function use a parallel iterative, batch gradient descent approach to train a logistic regression text classifier. The train function emits a tuple for each iteration of the model. Each model tuple includes the terms, weights, error rate and a confusion matrix that describes the classification errors for the iteration. The online documentation contains more detailed information for the train function.
The update function is used to store each iteration of the model in a Solr Cloud collection. The online documentation contains more detailed information for the update function.
Step 2: Inspect and test the model
The next step is to validate that you have a good model. A quick inspection of the model iterations can be useful in understanding the model. Here are a few things to look for:
a) Look at the terms in the model. A quick overview of the terms can be useful in understanding what the key features are in the model.
b) Look at weights for the terms to get a feel for how the terms are weighted in the model.
c) Look at the confusion matrix for the last iteration of the model to get a feeling for error rates that occurred in the final training iteration.
d) Look at the iterations to see if the model converged, which means the error rates gradually reduced until they reach a point of stabilization.
After reviewing the model we can test the model on a test dataset. The test dataset should consist of positive and negative examples of the class and should not be the same as the training dataset.
The expression below runs the classifier on the test set:
Let's breakdown the expression above.
The classify function calls the model function to retrieve a named model stored in a Solr Cloud collection. In the example above it retrieves the threatModel which we built earlier. The model function retrieves the highest iteration of the model found in the collection and caches it in memory for a specified period of time. For more detailed information on the classify, model and search functions you can review the documentation.
Once the classify function has the model, it reads tuples from an internal stream and scores the tuples by applying the model to a text field in the document. The classify function emits all the tuples from the underlying stream and adds a field called probability_d, which is a score between 0 and 1 that describes the probability that the tuple is in the class. The higher the score, the higher the probability. You can start with the score of .5 or greater as the threshold for positive or negative class assignment.
You can then read the tuples to see how they were scored and determine how accurate the classifier is. If there are false positives or false negatives in the test results then inspect the records to see why they might have been misclassified. If there are missing features in the training set you may need to add more examples that include these missing features and rebuild the model.
You can also adjust the threshold to see if it creates a more accurate classifier. For example if changing the threshold from .5 to .6 provides fewer false positives without increasing false negatives then you can use that as the threshold for the classifier when deployed.
Step 3: Setup the Actor
Once you're satisfied with the model, its time to create an actor that can read new documents as they enter the index, classify them and index the threats in a separate collection.
To do this we'll setup a daemon to run the classify function. But instead of using a search function as the internal stream to classify, we'll use a topic.
Here is the basic syntax for this:
Let's explore this expression from inside-out.
The inner topic expression is subscribing to the messages collection. The messages collection holds the social media messages we are alerting on. The topic will provide one time delivery of documents in this collection. Each time it is called it will return a new batch of documents.
The classify function wraps the topic and scores each tuple based on the model retrieved by the internal model function. The classify function emits the tuples from the topic and adds the class score to the outgoing tuples in the probability_d field.
The having expression wraps the classify function and only emits tuples with a probability_d greater than 0.5. The having function is setting the classification threshold for the alert. The online documentation contains more detailed information for the having function.
The update function indexes the tuples emitted by the having function into a Solr Cloud collection called threats. The threats collection is where the messages classified as threats are stored.
The daemon function wraps the update function and calls it at intervals using an internal thread. Each time the daemon runs, a new batch of messages will be classified and the threats will be added to the threats collection. New records added to the messages collection will automatically be classified in the order they were added. The online documentation contains more detailed information for the daemon function.
Step 4: Adding actors to watch the threats collection
We now have a threats collection where the threats are being indexed. Any number of actors can subscribe to the threats collection using a topic function or TopicStream java class. These actors can be daemons running inside Solr or actors external to Solr.
In this scenario the threats collection is used as a message queue for other actors. These actors can then be programmed to take specific actions based on the threat.