Monday, February 27, 2017

Anomaly Detection in Solr 6.5

Solr 6.5 is just around the corner and along with it comes the new significantTerms Streaming Expression. The significantTerms expression queries a Solr Cloud collection but instead of returning the matching documents, it returns the significant terms in the matching documents.

To determine the significance of a term, a formula is used that compares the number of times the term appears in the foreground set with the number of times it appears in the background set. The foreground set is the search result. The background set is all the documents in the index.

The significantTerms function assigns higher scores to terms that are more frequent in the foreground set and rarer in the background set, in relation to other terms.

For example:

Term    Foreground    Background
A       100           103
B       101           1000

Term A would be considered more significant than term B, because term A is much rarer in the background set.
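Solr's exact scoring formula isn't spelled out above, but the intuition can be sketched in a few lines of Python. The significance function below is a hypothetical stand-in, not Solr's actual implementation:

```python
import math

def significance(foreground, background):
    # Hypothetical score: reward high foreground frequency, and
    # weight it by how rare the term is in the background set
    # (an IDF-style factor). This is NOT Solr's actual formula.
    return foreground * math.log(1 + 1 / background)

# The example terms from the table above:
score_a = significance(100, 103)   # frequent in foreground, rare in background
score_b = significance(101, 1000)  # similar foreground count, common in background

print(score_a > score_b)  # True: term A ranks as more significant
```

Even though term B has a slightly higher foreground count, term A wins because nearly all of its occurrences fall inside the result set.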

This model for scoring terms can be very useful for spotting anomalies in the data. Specifically we can easily surface terms that are unusually aligned with specific result sets.

A Simple Example with the Enron Emails

For this example we'll start with a single Enron email address and ask the question:

Which address has the most significant relationship with this address?

We can start looking for an answer by running an aggregation. Since we're using Streaming Expressions we'll use the facet expression:

facet(enron,
      q="from:<address>",
      buckets="to",
      bucketSorts="count(*) desc",
      bucketSizeLimit="100")

This expression queries the index for the address in the from field (<address> above stands in for the actual email address) and gathers the facet buckets and counts from the to field. It returns the top 100 facet buckets from the to field, ordered by the counts in descending order.

This expression returns the top 100 addresses that the queried address has emailed. The top five results look like this:

"result-set": {
  "docs": [{
    "count(*)": 789,
    "to": ""
  }, {
    "count(*)": 376,
    "to": ""
  }, {
    "count(*)": 372,
    "to": ""
  }, {
    "count(*)": 249,
    "to": ""
  }, {
    "count(*)": 197,
    "to": ""
  }, ...

This gives some useful information, but does it answer the question? The top address has a count of 789. Is this the most significant relationship?

Let's see if the significantTerms expression can surface an anomaly. Here is the expression:

significantTerms(enron, q="", field="to", limit="20")

The expression above runs the query on the enron collection. It then collects the top 20 significant terms from the to field.

The top five results look like this:

"result-set": {
  "docs": [{
    "score": 54.370163,
    "term": "",
    "foreground": 130,
    "background": 132
  }, {
    "score": 53.911552,
    "term": "",
    "foreground": 186,
    "background": 243
  }, {
    "score": 53.806202,
    "term": "",
    "foreground": 376,
    "background": 596
  }, {
    "score": 51.760098,
    "term": "",
    "foreground": 106,
    "background": 150
  }, {
    "score": 51.471268,
    "term": "",
    "foreground": 132,
    "background": 222
  }, ...
We have indeed surfaced an interesting anomaly. The first term has a foreground count of 130 and a background count of 132. This means that this address has received 132 emails in the entire corpus, and 130 of them came from the address we queried. This signals a strong connection. Notice also that the top address from the facet results, the highest total receiver of emails from the queried address, isn't in the top 5 results from the significantTerms function. It shows up at number 8 in the list:
"score": 49.847652,
"term": "",
"foreground": 789,
"background": 2117

Notice that its foreground count is 789 and its background count is 2117. This means that only 37% of the emails received by that address came from the queried address, while 98% of the emails received by the top-scoring address came from the queried address.
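The percentages follow directly from the foreground and background counts returned by significantTerms:

```python
# Counts reported by significantTerms for the two addresses:
top_by_count = {"foreground": 789, "background": 2117}  # highest raw email count
top_by_score = {"foreground": 130, "background": 132}   # highest significance score

# Share of each recipient's total email that came from the queried address:
pct_count = 100 * top_by_count["foreground"] / top_by_count["background"]
pct_score = 100 * top_by_score["foreground"] / top_by_score["background"]

print(round(pct_count))  # 37
print(round(pct_score))  # 98
```

Raw counts alone would have pointed at the wrong address; the ratio of foreground to background is what exposes the strongest relationship.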

significantTerms vs. scoreNodes

The significantTerms function works directly with the inverted index and can score terms in single-value, multi-value, and text fields.

The scoreNodes function scores tuples emitted by graph expressions. This allows for anomaly detection in distributed graphs. A prior blog covers the scoreNodes function in more detail.

In Solr 6.5 the scoreNodes scoring algorithm was changed to better surface anomalies. The significantTerms and scoreNodes functions now use the same scoring algorithm.

Use Cases

Anomaly detection has interesting use cases including:

1) Recommendations: Finding products that are unusually connected based on past shopping history.

2) Auto-Suggestion: Suggesting terms that go well together based on indexed query logs.

3) Fraud Anomalies: Finding vendors that are unusually associated with credit card fraud.

4) Text Analytics: Finding significant terms relating to documents in a full text search result set.

5) Log Anomalies: Finding IP addresses that are unusually associated with time periods of suspicious activity.
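To make use case 5 concrete, here is a minimal sketch of the same foreground/background idea applied to log data. The IP addresses and counts below are entirely hypothetical, and the scoring is a simple stand-in for Solr's term scoring:

```python
from collections import Counter

# Hypothetical log data: requests per IP across all logs (background)
# and during a suspicious time window (foreground).
background = Counter({"10.0.0.1": 5000, "10.0.0.2": 4800, "10.0.0.9": 40})
foreground = Counter({"10.0.0.1": 50, "10.0.0.2": 45, "10.0.0.9": 38})

def significance(ip):
    # Fraction of an IP's total activity that falls inside the
    # suspicious window; high values flag unusual alignment.
    return foreground[ip] / background[ip]

ranked = sorted(foreground, key=significance, reverse=True)
print(ranked[0])  # 10.0.0.9 -- nearly all of its traffic is in the window
```

The busiest IPs overall rank low here, while the low-volume address whose traffic is concentrated in the suspicious window surfaces first, just as the quiet Enron address surfaced ahead of the highest-volume recipient.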
