Monday, February 27, 2017

Anomaly Detection in Solr 6.5

Solr 6.5 is just around the corner, and with it comes the new significantTerms Streaming Expression. The significantTerms expression queries a SolrCloud collection, but instead of returning the matching documents, it returns the significant terms in those documents.

To determine the significance of a term, a formula is used that compares the number of times the term appears in the foreground set with the number of times it appears in the background set. The foreground set is the search result. The background set is all the documents in the index.

The significantTerms function assigns higher scores to terms that are more frequent in the foreground set and rarer in the background set, in relation to other terms.

For example:

Term    Foreground    Background
A       100           103
B       101           1000

Term A would be considered more significant than term B, because term A is much rarer in the background set.

This model for scoring terms can be very useful for spotting anomalies in the data. Specifically, we can easily surface terms that are unusually aligned with specific result sets.


A Simple Example with the Enron Emails


For this example we'll start with a single Enron email address (tana.jones@enron.com) and ask the question:

Which address has the most significant relationship with tana.jones@enron.com?

We can start looking for an answer by running an aggregation. Since we're using Streaming Expressions, we'll use the facet expression:

facet(enron,
         q="from:tana.jones@enron.com",
         buckets="to",
         bucketSorts="count(*) desc",
         bucketSizeLimit="100",
         count(*))

This expression queries the index for tana.jones@enron.com in the from field and gathers the facet buckets and counts from the to field. It returns the top 100 facet buckets from the to field ordered by the counts in descending order.

In other words, it returns the top 100 addresses that tana.jones@enron.com has emailed. The top five results look like this:

{
  "result-set": {
    "docs": [{
      "count(*)": 789,
      "to": "alan.aronowitz@enron.com"
    }, {
      "count(*)": 376,
      "to": "frank.davis@enron.com"
    }, {
      "count(*)": 372,
      "to": "mark.taylor@enron.com"
    }, {
      "count(*)": 249,
      "to": "brent.hendry@enron.com"
    }, {
      "count(*)": 197,
      "to": "bob.bowen@enron.com"
    }, ...

This gives some useful information, but does it answer the question? The top address is alan.aronowitz@enron.com with a count of 789. Is this the most significant relationship?

Let's see if the significantTerms expression can surface an anomaly. Here is the expression:

significantTerms(enron, q="from:tana.jones@enron.com", field="to", limit="20")

The expression above runs the query from:tana.jones@enron.com on the enron collection. It then collects the top 20 significant terms from the to field.
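Streaming Expressions are sent to a collection's /stream handler as an expr request parameter. The sketch below only builds the request URL for the expression above. The base URL http://localhost:8983/solr and the helper name stream_url are assumptions made for illustration; adjust them to your own installation.

```python
from urllib.parse import urlencode

def stream_url(base_url, collection, expr):
    """Build a GET URL for a collection's /stream handler (hypothetical helper)."""
    # urlencode percent-encodes quotes, colons, commas, and the @ sign.
    return f"{base_url}/{collection}/stream?" + urlencode({"expr": expr})

expr = ('significantTerms(enron, q="from:tana.jones@enron.com", '
        'field="to", limit="20")')

# Assumed local Solr node; replace with your own host and port.
url = stream_url("http://localhost:8983/solr", "enron", expr)
print(url)
```

Fetching that URL with any HTTP client should return the tuples under result-set and docs, in the same shape as the results shown in this post.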

The top five results look like this:

{
  "result-set": {
    "docs": [{
      "score": 54.370163,
      "term": "michael.neves@enron.com",
      "foreground": 130,
      "background": 132
    }, {
      "score": 53.911552,
      "term": "lisa.lees@enron.com",
      "foreground": 186,
      "background": 243
    }, {
      "score": 53.806202,
      "term": "frank.davis@enron.com",
      "foreground": 376,
      "background": 596
    }, {
      "score": 51.760098,
      "term": "harry.collins@enron.com",
      "foreground": 106,
      "background": 150
    }, {
      "score": 51.471268,
      "term": "edmund.cooper@enron.com",
      "foreground": 132,
      "background": 222
    }

We have indeed surfaced an interesting anomaly. The first term is michael.neves@enron.com. This address has a foreground count of 130 and background count of 132. This means that michael.neves@enron.com has received 132 emails in the entire corpus and 130 of them have been from tana.jones@enron.com. This signals a strong connection.

alan.aronowitz@enron.com, the highest total receiver of emails from tana.jones@enron.com, isn't in the top 5 results from the significantTerms function.

alan.aronowitz@enron.com shows up at number 8 in the list:
{
"score": 49.847652,
"term": "alan.aronowitz@enron.com",
"foreground": 789,
"background": 2117
}

Notice that the foreground count is 789 and background count is 2117. This means that 37% of the emails received by alan.aronowitz@enron.com were from tana.jones@enron.com.

By contrast, 98% of the emails received by michael.neves@enron.com came from tana.jones@enron.com.
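The two percentages can be checked directly from the foreground and background counts in the results. This is a minimal sketch of that arithmetic only. It is not the scoring formula Solr uses, and foreground_share is a name made up for this example.

```python
# Foreground/background counts copied from the significantTerms results above.
results = [
    {"term": "michael.neves@enron.com", "foreground": 130, "background": 132},
    {"term": "alan.aronowitz@enron.com", "foreground": 789, "background": 2117},
]

def foreground_share(doc):
    """Fraction of all emails received by the address that came from the query sender."""
    return doc["foreground"] / doc["background"]

for doc in results:
    print(f'{doc["term"]}: {foreground_share(doc):.0%}')

# prints:
# michael.neves@enron.com: 98%
# alan.aronowitz@enron.com: 37%
```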

significantTerms vs. scoreNodes


The significantTerms function works directly with the inverted index and can score terms from single-value, multi-value, and text fields.

The scoreNodes function scores tuples emitted by graph expressions. This allows for anomaly detection in distributed graphs. A prior blog post covers the scoreNodes function in more detail.

In Solr 6.5 the scoreNodes scoring algorithm was changed to better surface anomalies. The significantTerms and scoreNodes functions now use the same scoring algorithm.

Use Cases


Anomaly detection has interesting use cases including:

1) Recommendations: Finding products that are unusually connected based on past shopping history.

2) Auto-Suggestion: Suggesting terms that go well together based on indexed query logs.

3) Fraud Anomalies: Finding vendors that are unusually associated with credit card fraud.

4) Text Analytics: Finding significant terms relating to documents in a full text search result set.

5) Log Anomalies: Finding IP addresses that are unusually associated with time periods of suspicious activity.
