To determine the significance of a term, a formula compares the number of times the term appears in the foreground set with the number of times it appears in the background set. The foreground set is the search result. The background set is all the documents in the index.
The significantTerms function assigns higher scores to terms that are more frequent in the foreground set and rarer in the background set, in relation to other terms.
Term    Foreground    Background
A       100           103
B       101           1000
Term A would be considered more significant than term B, because term A is much rarer in the background set.
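The comparison in the table can be sketched in a few lines of Python. This only illustrates the intuition: the ratio used here (foreground count divided by background count) is an assumption chosen for simplicity, not the formula that significantTerms actually applies, and the names are hypothetical.

```python
# Illustrative only: a plain foreground/background ratio, NOT Solr's scoring formula.
counts = {
    "A": {"foreground": 100, "background": 103},
    "B": {"foreground": 101, "background": 1000},
}

def foreground_share(term_counts):
    """Fraction of a term's occurrences in the whole index that fall inside the search result."""
    return term_counts["foreground"] / term_counts["background"]

# Rank terms so that those concentrated in the foreground set come first.
ranked = sorted(counts, key=lambda term: foreground_share(counts[term]), reverse=True)

for term in ranked:
    print(term, round(foreground_share(counts[term]), 3))
```

Term A comes out first because nearly all of its occurrences fall inside the result set, while term B's occurrences are spread across the index.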
This model for scoring terms is useful for spotting anomalies in the data. Specifically, it surfaces terms that are unusually aligned with specific result sets.
A Simple Example with the Enron Emails
For this example we'll start with a single Enron email address (firstname.lastname@example.org) and ask the question:
Which address has the most significant relationship with email@example.com?
We can start looking for an answer by running an aggregation. Since we're using Streaming Expressions, we'll use the facet expression:
This expression queries the index for firstname.lastname@example.org in the from field and gathers the facet buckets and counts from the to field. It returns the top 100 facet buckets from the to field ordered by the counts in descending order.
This expression returns the top 100 addresses that email@example.com has emailed. The top five results look like this:
This gives some useful information, but does it answer the question? The top address is firstname.lastname@example.org with a count of 789. Is this the most significant relationship?
Let's see if the significantTerms expression can surface an anomaly. Here is the expression:
significantTerms(enron, q="from:email@example.com", field="to", limit="20")
The expression above runs the query from:firstname.lastname@example.org on the enron collection. It then collects the top 20 significant terms from the to field.
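Streaming Expressions are sent to a collection's /stream handler in the expr request parameter. A minimal sketch of building that request URL in Python follows. The host and port (localhost:8983) and the helper name build_stream_url are assumptions for illustration, and the request itself is not sent here.

```python
from urllib.parse import urlencode

def build_stream_url(base_url, collection, expr):
    """Return the URL that submits a streaming expression to a collection's /stream handler."""
    return f"{base_url}/{collection}/stream?{urlencode({'expr': expr})}"

expr = 'significantTerms(enron, q="from:email@example.com", field="to", limit="20")'

# Assumed local Solr instance; adjust the base URL for a real deployment.
url = build_stream_url("http://localhost:8983/solr", "enron", expr)
print(url)
```

URL-encoding the expression matters because it contains quotes, colons and spaces that are not safe in a query string.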
The top five results look like this:
We have indeed surfaced an interesting anomaly. The first term is email@example.com. This address has a foreground count of 130 and background count of 132. This means that firstname.lastname@example.org has received 132 emails in the entire corpus and 130 of them have been from email@example.com. This signals a strong connection.
firstname.lastname@example.org, the highest total receiver of emails from email@example.com, isn't in the top 5 results from the significantTerms function.
firstname.lastname@example.org shows up at number 8 in the list:
Notice that the foreground count is 789 and background count is 2117. This means that 37% of the emails received by email@example.com were from firstname.lastname@example.org.
98% of the emails received by email@example.com came from firstname.lastname@example.org.
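The two percentages above follow directly from the foreground and background counts. A quick check in Python, using only the counts quoted in the text (the variable and function names are made up for illustration):

```python
# (foreground, background) counts quoted above
top_by_count = (789, 2117)        # the address that received the most emails
top_by_significance = (130, 132)  # the address ranked first by significantTerms

def percent_from_sender(foreground, background):
    """Percentage of all emails an address received that came from the queried sender."""
    return 100 * foreground / background

print(round(percent_from_sender(*top_by_count)))         # 37
print(round(percent_from_sender(*top_by_significance)))  # 98
```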
significantTerms vs. scoreNodes
The significantTerms function works directly with the inverted index and can score terms from single-value, multi-value and text fields.
The scoreNodes function scores tuples emitted by graph expressions. This allows for anomaly detection in distributed graphs. A prior blog covers the scoreNodes function in more detail.
In Solr 6.5 the scoreNodes scoring algorithm was changed to better surface anomalies. The significantTerms and scoreNodes functions now use the same scoring algorithm.
Anomaly detection has interesting use cases including:
1) Recommendations: Finding products that are unusually connected based on past shopping history.
2) Auto-Suggestion: Suggesting terms that go well together based on indexed query logs.
3) Fraud Anomalies: Finding vendors that are unusually associated with credit card fraud.
4) Text Analytics: Finding significant terms relating to documents in a full text search result set.
5) Log Anomalies: Finding IP addresses that are unusually associated with time periods of suspicious activity.