Friday, August 19, 2016

Training and Storing Machine Learning Models with Solr 6.2

Solr's Streaming Expressions can already perform Parallel SQL, Graph Traversal, Messaging, MapReduce and Stream Processing. In Solr 6.2 Streaming Expressions adds machine learning capabilities.

Starting with Solr 6.2 you'll be able to train a Logistic Regression Text Classifier on training sets stored in Solr Cloud. This means Solr can now build machine learning models natively.

Now Solr can learn.

What is a Logistic Regression Text Classifier?

Logistic regression is a machine learning algorithm used to train a model that classifies data. It is a binary classifier, which means it predicts whether or not an item belongs to a single class.

Logistic Regression is a supervised algorithm, which means it needs a training set that provides positive and negative examples of the class to build the model.

Logistic Regression models are trained using specific features that describe the data. With text classification the features of the data are the terms in the documents.

Once the model is trained it can be used to classify other documents based on their features (terms).

The model returns a score that predicts the probability that a document belongs to the class.

Solr's Implementation Through Streaming Expressions

Solr 6.2 has two new Streaming Expression functions: features and train.

Features: Extracts the key terms from a training set stored in a Solr Cloud collection. The features function uses an algorithm called Information Gain to determine which terms are most important for the specific training set.

Train: Uses the extracted features to train the Logistic Regression model. The train function uses a parallel, iterative batch Gradient Descent approach to optimize the model. The algorithm is embedded in Solr's core, so only the model is streamed across the network with each iteration.
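As a sketch of how these two functions compose (the collection, field, and parameter names here are hypothetical, not from the original post), a features expression can be nested inside train so that the extracted terms drive the gradient descent iterations:

```
train(trainingSet,
      features(trainingSet,
               q="*:*",
               featureSet="userFeatures",
               field="body_txt",
               outcome="out_i",
               numTerms=250),
      q="*:*",
      name="user1Model",
      field="body_txt",
      outcome="out_i",
      maxIterations=100)
```

Here out_i is assumed to be the field holding the positive/negative class label, and each gradient descent iteration emits the current state of the model.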

Storing Features and Models

The output from both the features and train functions can be redirected to another SolrCloud collection using the update function.

This approach allows Solr to store millions of models that can be easily retrieved and deployed.
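To illustrate the redirection (again with hypothetical collection and parameter names), a train expression can be wrapped in update so that the model iterations it emits are indexed into a separate models collection for later retrieval:

```
update(models,
       batchSize=50,
       train(trainingSet,
             features(trainingSet, q="*:*", featureSet="userFeatures",
                      field="body_txt", outcome="out_i", numTerms=250),
             q="*:*",
             name="user1Model",
             field="body_txt",
             outcome="out_i",
             maxIterations=100))
```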

Learning What Users Like

One of the interesting use cases for this feature is to build a model for every user that encapsulates what that user likes to read.

In order to build a model for each user, we need to pull together positive and negative data sets for each user. The positive set includes documents that the user has an interest in and the negative set contains documents the user has not shown a particular interest in.

Using Graph Expressions To Build The Positive Set

Application usage logs can be queried to find what the user likes to read (positive set). Indexing the usage logs in Solr Cloud allows us to run graph queries that identify the positive set.

Graph queries can be run that return a set of documents that the user has read. Graph queries can also be used to expand the training set. For example the training set can be expanded to include the documents that are most frequently viewed in the same session with documents that the user has viewed.

Graph queries can also be used to expand the positive training set through collaborative filtering. Using this approach documents read by users that have similar reading habits to the user can be pulled into the positive set.
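A minimal sketch of such a graph query, assuming a hypothetical logs collection where each record has a userID and the articleID the user read, might use the gatherNodes expression to walk from a user to the articles they have read, and then on to other users who read the same articles (the collaborative filtering step):

```
gatherNodes(logs,
            gatherNodes(logs,
                        walk="user1->userID",
                        gather="articleID"),
            walk="node->articleID",
            gather="userID",
            scatter="leaves")
```

The inner expression gathers user1's articles; the outer expression walks those articles to find users with overlapping reading habits, whose documents could then be pulled into the positive set.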

Collecting the Negative Training Set

Logistic Regression requires a negative training set which can be gathered by down sampling the main corpus. This means taking a random sample from the main corpus that is similar in size to the positive training set.

The random Streaming Expression can be used for down sampling. The random Streaming Expression returns a random set of documents that match a query.
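A down-sampling call might look like the following sketch, where the corpus collection name, row count, and field list are hypothetical and the row count would be chosen to roughly match the size of the positive set:

```
random(corpus,
       q="*:*",
       rows="500",
       fl="id, body_txt")
```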

Low frequency terms from the positive training set can be extracted and used to filter documents out of the random negative training set. This ensures that the negative training set doesn't include key terms from the positive training set.

Using Easy Ensemble to Provide Larger Samples of Negative Data

If the negative training set is too small to represent the main corpus an approach known as Easy Ensemble can be used to expand the sample size of the negative set.

With Easy Ensemble multiple down sampled negative sets are collected and the positive training set is trained against each negative set. This will train an ensemble of models that can be used to classify documents.

With the ensemble approach the outputs from all the models are averaged to arrive at an ensemble classification.
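The averaging step itself is simple; this sketch (with hypothetical per-model scores) shows how the probabilities from an ensemble of models trained on different down sampled negative sets could be combined and thresholded:

```python
def ensemble_score(scores):
    """Average the per-model probabilities for a single document."""
    if not scores:
        raise ValueError("at least one model score is required")
    return sum(scores) / len(scores)

def classify(scores, threshold=0.5):
    """Classify the document as in-class if the averaged probability
    meets the threshold."""
    return ensemble_score(scores) >= threshold

# Hypothetical probabilities from three models, each trained against a
# different down sampled negative set
doc_scores = [0.91, 0.78, 0.85]
print(round(ensemble_score(doc_scores), 4), classify(doc_scores))
```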

Automating the Building of Training Sets With Daemons

Solr 6 introduced daemons. Daemons are Streaming Expressions that live inside Solr and run in the background at intervals.

Daemons can now be used to set up processes inside Solr that monitor the logs and build models for users as new data enters the logs.
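Such a background process might be sketched as follows, with the daemon id, interval, and inner expression names all hypothetical; the daemon reruns the wrapped training pipeline on its interval (here once a day, in milliseconds) and the update function persists each fresh model:

```
daemon(id="user1ModelBuilder",
       runInterval="86400000",
       update(models,
              batchSize=50,
              train(trainingSet,
                    features(trainingSet, q="*:*", featureSet="userFeatures",
                             field="body_txt", outcome="out_i", numTerms=250),
                    q="*:*",
                    name="user1Model",
                    field="body_txt",
                    outcome="out_i",
                    maxIterations=100)))
```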

Once the daemon processes are in place, Solr can learn on its own.

Future blogs will discuss how to set up daemons to monitor the logs and take action.

Re-Ranking and Recommending

Once the models are stored in a Solr Cloud collection they can be retrieved and used to score documents. The score reflects the probability that the user will like the document.

The scores can be applied in Solr's re-ranker to re-order the top N search results. This provides custom rankings for each user.

The scores can also be used to score recommendations created from graph expressions or More Like This queries. This provides personalized recommendations.
