tag:blogger.com,1999:blog-4473140810384931922024-03-12T18:19:12.284-07:00Solr AnalyticsJoel Bernsteinhttp://www.blogger.com/profile/15670652646728608504noreply@blogger.comBlogger33125tag:blogger.com,1999:blog-447314081038493192.post-84299178014184709762021-02-17T13:05:00.065-08:002021-02-23T15:56:40.723-08:00Solr temporal graph queries for event correlation, root cause analysis and temporal anomaly detection <p><b>Temporal</b> <b>graph queries</b> will be available in the 8.9 release of Apache Solr. Temporal graph queries are designed for key log analytics use cases such as <b>event correlation</b> and <b>root cause analysis</b>. This blog provides a first look at this important new feature.</p><p><b><span style="font-size: medium;">Graph Expressions</span></b></p><p>Graph expressions were first introduced in Solr 6.1 as a general purpose breadth first graph walk with aggregations. In this blog we'll review the graph theory behind Solr's graph expressions and learn how the new temporal graph queries can be applied to <b>event correlation</b> and <b>root cause analysis</b> use cases.</p><p><b>Graphs</b></p><p>Log records and other data indexed in Solr have connections between them that can be seen as a distributed graph. Solr graph expressions provide a mechanism for identifying root nodes in the graph and <b>walking</b> their connections. The general goal of the graph walk is to materialize a specific <b>subgraph</b> and perform <b>link analysis</b>.</p><p>In the next few sections below we'll review the graph theory behind Solr's graph expressions.</p><p><b>Subgraphs</b></p><p>A subgraph is a smaller subset of the nodes and connections of the larger graph. Graph expressions allow you to flexibly define and materialize a subgraph from the larger graph stored in the distributed index. </p><p>Subgraphs play two important roles:</p><p>1) They provide a specific context for link analysis. The design of the subgraph defines the meaning of the link analysis.</p><p>2) They provide a foreground graph that can be compared to the background index for anomaly detection purposes.</p><p><b>Bipartite Subgraphs</b></p><p>Graph expressions can be used to materialize bipartite subgraphs. A bipartite graph is a graph where the nodes are split into two distinct categories. The links between those two categories can then be analyzed to study how they relate. Bipartite graphs are often discussed in the context of collaborative filter recommender systems. </p><p>A bipartite graph between <b>shopping baskets</b> and <b>products </b>is a useful example.<b> </b>Through link analysis between the <b>shopping baskets</b> and <b>products</b> we can determine which products are most often purchased within the same shopping baskets.</p><p>In the example below there is a Solr collection called baskets with three fields:</p><p><b>id</b>: Unique ID</p><p><b>basket_s</b>: Shopping basket ID</p><p><b>product_s</b>: Product</p><p>Each record in the collection represents a product in a shopping basket. All products in the same basket share the same basket ID.</p><p>Let's consider a simple example where we want to find a product that is often sold with <b>butter</b>. In order to do this we could create a bipartite subgraph of shopping baskets that contain <b>butter</b>. We won't include butter itself in the graph as it doesn't help with finding a complementary product for butter. 
</p><p>Below is an example of this <b>bipartite subgraph</b> represented as a matrix:</p><div class="separator" style="clear: both; text-align: center;"><br /></div><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-42Vhg9TxQV8/YCwKVjU4E8I/AAAAAAAAB2U/FnYHNy2TNW4N0dsbaYz3adXsCvvrhe72gCLcBGAsYHQ/s922/Screen%2BShot%2B2021-02-16%2Bat%2B1.09.06%2BPM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="528" data-original-width="922" height="293" src="https://1.bp.blogspot.com/-42Vhg9TxQV8/YCwKVjU4E8I/AAAAAAAAB2U/FnYHNy2TNW4N0dsbaYz3adXsCvvrhe72gCLcBGAsYHQ/w513-h293/Screen%2BShot%2B2021-02-16%2Bat%2B1.09.06%2BPM.png" width="513" /></a></div><p><br /></p><p>In this example there are three shopping baskets shown by the rows: basket1, basket2, basket3.</p><p>There are also three <b>products</b> shown by the columns: cheese, eggs, milk.</p><p>Each cell has a 1 or 0 signifying if the product is in the basket.</p><p>Let's look at how Solr graph expressions materializes this bipartite subgraph:</p><p>The <b>nodes</b> function is used to materialize a <b>subgraph</b> from the larger graph. Below is an example nodes function which materializes the bipartite graph shown in the matrix above.</p>
<pre>nodes(baskets,
      random(baskets, q="product_s:butter", fl="basket_s", rows="3"),
      walk="basket_s->basket_s",
      fq="-product_s:butter",
      gather="product_s",
      trackTraversal="true")
</pre>
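<p>To make the example concrete, here is a hypothetical set of records for the <b>baskets</b> collection that is consistent with the matrix above and with the results shown later in this post (each basket also contains <b>butter</b>, since the baskets are selected with the query <b>product_s:butter</b>). Assuming a default local Solr instance, they could be indexed with the JSON update handler:</p><pre># hypothetical sample data; adjust the host, port and collection name to your setup
curl -X POST -H 'Content-Type: application/json' \
     'http://localhost:8983/solr/baskets/update?commit=true' -d '
[
  {"id": "1", "basket_s": "basket1", "product_s": "butter"},
  {"id": "2", "basket_s": "basket1", "product_s": "eggs"},
  {"id": "3", "basket_s": "basket1", "product_s": "milk"},
  {"id": "4", "basket_s": "basket2", "product_s": "butter"},
  {"id": "5", "basket_s": "basket2", "product_s": "cheese"},
  {"id": "6", "basket_s": "basket2", "product_s": "milk"},
  {"id": "7", "basket_s": "basket3", "product_s": "butter"},
  {"id": "8", "basket_s": "basket3", "product_s": "eggs"}
]'
</pre>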
<p><span style="font-family: inherit;">Let's break down this example starting with the <b>random</b> function:</span></p><p><span style="font-family: courier;">random(baskets, q="product_s:butter", fl="basket_s", rows="3")</span></p><p><span style="font-family: inherit;">The <b>random </b>function is searching the <b>baskets</b> collection with the query <b>product_s:butter,</b> and returning 3 random samples. Each sample contains the <b>basket_s</b> field which is the basket id. The three basket id's that are returned by the random sample are the <b>root nodes</b> of the graph query.</span></p><p><span style="font-family: inherit;">The <b>nodes</b> function is the graph query. The nodes function is operating over the three root nodes returned by the random function. It "walks" the graph by searching the basket_s field of the root nodes against the basket_s field in the index. </span>This finds all the product records for the root baskets. It will then "gather" the <b>product_s </b>field from the records it finds in the walk. <span style="font-family: inherit;">A filter is applied so that records with </span><b style="font-family: inherit;">butter</b><span style="font-family: inherit;"> in the </span><b style="font-family: inherit;">product_s</b><span style="font-family: inherit;"> field will not be returned.</span></p><p><span style="font-family: inherit;">The <b>trackTraversa</b>l flag tells the <b>nodes</b> expression to track the links between the root baskets and products.</span></p><p><b style="font-family: inherit;">Node Sets</b></p><p><span style="font-family: inherit;">The output of the<b> nodes </b>function<b> </b>is a <b>node set </b>that represents the <b>subgraph</b> specified by the nodes function. The node set contains a unique set of nodes that are gathered during the graph walk. The "node" property in the result is the value of the gathered node. In the shopping basket example the <b>product_s</b> field is in the "node" property because that was what was specified to be gathered in the nodes expression.</span></p><p><span style="font-family: inherit;">The output of the shopping basket graph expression is as follows:</span></p><p><span style="font-family: inherit;"><span style="font-family: monospace; font-size: 12px; white-space: pre;">{
"result-set": {
"docs": [
{
"node": "eggs",
"collection": "baskets",
"field": "product_s",
"ancestors": [
"basket1",
"basket3"
],
"level": 1
},
{
"node": "cheese",
"collection": "baskets",
"field": "product_s",
"ancestors": [
"basket2"
],
"level": 1
},
{
"node": "milk",
"collection": "baskets",
"field": "product_s",
"ancestors": [
"basket1",
"basket2"
],
"level": 1
},
{
"EOF": true,
"RESPONSE_TIME": 12
}
]
}
}</span></span></p><p>The <b>ancestors </b>property in the result contains a unique, alphabetically sorted list of all the <b>incoming links</b> to the node in the <b>subgraph</b>. In this case it shows the basket IDs that are linked to each product. The ancestor links will only be tracked when the <b>trackTraversal </b>flag<b> </b>is<b> </b>turned on in the nodes expression.</p><p><b>Link Analysis and Degree Centrality</b></p><p>Link analysis is often performed to determine node <b>centrality</b>. When analyzing for centrality the goal is to assign a weight to each node based on how connected it is in the subgraph. There are different types of node centrality. Graph expressions very efficiently calculates <b>inbound</b> <b>degree centrality (indegree)</b>. </p><p>Inbound degree centrality is calculated by counting the number of inbound links to each node. For brevity this article will refer to inbound degree simply as degree.</p><p>Back to the shopping basket example:</p><p><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-XLeVKp_RDOc/YCwWcgQX9KI/AAAAAAAAB2w/bfPQHTLpsKEFc9B9RtWbZNVElsw10gmRACLcBGAsYHQ/s922/Screen%2BShot%2B2021-02-16%2Bat%2B1.09.06%2BPM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="528" data-original-width="922" height="285" src="https://1.bp.blogspot.com/-XLeVKp_RDOc/YCwWcgQX9KI/AAAAAAAAB2w/bfPQHTLpsKEFc9B9RtWbZNVElsw10gmRACLcBGAsYHQ/w498-h285/Screen%2BShot%2B2021-02-16%2Bat%2B1.09.06%2BPM.png" width="498" /></a></div><div><br /></div><div>We can calculate the <b>degree</b> of the products in the graph by summing the columns:</div><div>cheese: 1</div><div>eggs: 2</div><div>milk: 2<br /><div><p>From the degree calculation we know that <b>eggs </b>and <b>milk</b> appear more frequently in shopping baskets with <b>butter</b> than <b>cheese</b> does.</p><p>The <b>nodes </b>function can calculate degree centrality by adding the <b>count(*) </b>aggregation as shown below:</p><pre>nodes(baskets,
      random(baskets, q="product_s:butter", fl="basket_s", rows="3"),
      walk="basket_s->basket_s",
      fq="-product_s:butter",
      gather="product_s",
      trackTraversal="true",
      count(*)) </pre><p>The output of this graph expression is as follows:</p><p><span style="font-family: monospace; font-size: 12px; white-space: pre;">{
"result-set": {
"docs": [
{
"node": "eggs",
"count(*)": 2,
"collection": "baskets",
"field": "product_s",
"ancestors": [
"basket1",
"basket3"
],
"level": 1
},
{
"node": "cheese",
"count(*)": 1,
"collection": "baskets",
"field": "product_s",
"ancestors": [
"basket2"
],
"level": 1
},
{
"node": "milk",
"count(*)": 2,
"collection": "baskets",
"field": "product_s",
"ancestors": [
"basket1",
"basket2"
],
"level": 1
},
{
"EOF": true,
"RESPONSE_TIME": 17
}
]
}
}</span></p><p>The <b>count(*)</b> aggregation counts the "gathered" nodes, in this case the values in the <b>product_s</b> field. Notice that the <b>count(*)</b> result is the same as the number of <b>ancestors</b>. This will always be the case because the nodes function first deduplicates the edges before counting the gathered nodes. Because of this the count(*) aggregation always calculates the <b>degree centrality</b> for the gathered nodes.</p><p><b>Dot Product</b></p><p>There is a direct relationship between the <b>inbound degree</b> with bipartite graph recommenders and the <b>dot product</b>. This relationship can be clearly seen in our working example once you include a column for butter:</p><p><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-I6mkX95iXg4/YDUZMuyrGfI/AAAAAAAAB5Q/W4W7TEzfCbQ0j4zXY-ii2dzXhv4lqPG3wCLcBGAsYHQ/s936/Screen%2BShot%2B2021-02-23%2Bat%2B10.02.26%2BAM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="520" data-original-width="936" height="275" src="https://1.bp.blogspot.com/-I6mkX95iXg4/YDUZMuyrGfI/AAAAAAAAB5Q/W4W7TEzfCbQ0j4zXY-ii2dzXhv4lqPG3wCLcBGAsYHQ/w495-h275/Screen%2BShot%2B2021-02-23%2Bat%2B10.02.26%2BAM.png" width="495" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><p>If we compute the dot product between the <b>butter </b>column and the other product columns you will find that the <b>dot product</b> equals the <b>inbound degree</b> in each case. This tells us that a <b>nearest neighbor search</b>, using a maximum inner product similarity, would select the column with the highest inbound degree. </p><p><b>Node Scoring</b></p><p>The degree of the node describes how many nodes in the subgraph link to it. But this does not tell us if the node is particularly central to this subgraph or if it is just a very frequent node in the entire graph. Nodes that appear frequently in the subgraph but infrequently in the entire graph can be considered more <b>relevant </b>to the subgraph. </p><p>The search index contains information about how frequently each node appears in the entire index. Using a technique similar to <b>tf-idf </b>document scoring, graph expressions can combine the degree of the node with its <b>inverse document frequency</b> in the index to determine a relevancy score. </p><p>The <b>scoreNodes </b>function scores the nodes. Below is an example of the scoreNodes function applied to the shopping basket node set.</p><pre>scoreNodes(nodes(baskets,
                 random(baskets, q="product_s:butter", fl="basket_s", rows="3"),
                 walk="basket_s->basket_s",
                 fq="-product_s:butter",
                 gather="product_s",
                 trackTraversal="true",
                 count(*)))</pre>
The output now includes a <b>nodeScore </b>property. In the output below, notice that <b>eggs</b> has a higher nodeScore than <b>milk</b> even though they have the same count(*). This is because milk appears more frequently in the entire index than eggs does. Because of this, eggs is considered more relevant to this subgraph and a better recommendation to be paired with butter.
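<p>The exact formula behind <b>scoreNodes</b> isn't covered in this post, but the scores in the output below are consistent with a tf-idf style calculation of the following form (treat this as a sketch of the intuition rather than a specification of the internals). Here <b>numDocs</b> is the total number of documents in the collection and <b>docFreq</b> is the number of documents that contain the node's term; both values are returned with each node:</p><pre>(illustrative formula, reverse-engineered from this example)

nodeScore = (1 + ln(count(*))) * (1 + ln((numDocs + 1) / (docFreq + 1)))

eggs:   (1 + ln(2)) * (1 + ln(11/3)) = 1.6931 * 2.2993 = 3.8930
milk:   (1 + ln(2)) * (1 + ln(11/5)) = 1.6931 * 1.7885 = 3.0281
cheese: (1 + ln(1)) * (1 + ln(11/2)) = 1.0000 * 2.7047 = 2.7047
</pre>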
<p>
<span style="font-family: monospace; font-size: 12px; white-space: pre;">{
"result-set": {
"docs": [
{
"node": "eggs",
"nodeScore": 3.8930247,
"field": "product_s",
"numDocs": 10,
"level": 1,
"count(*)": 2,
"collection": "baskets",
"ancestors": [
"basket1",
"basket3"
],
"docFreq": 2
},
{
"node": "milk",
"nodeScore": 3.0281217,
"field": "product_s",
"numDocs": 10,
"level": 1,
"count(*)": 2,
"collection": "baskets",
"ancestors": [
"basket1",
"basket2"
],
"docFreq": 4
},
{
"node": "cheese",
"nodeScore": 2.7047482,
"field": "product_s",
"numDocs": 10,
"level": 1,
"count(*)": 1,
"collection": "baskets",
"ancestors": [
"basket2"
],
"docFreq": 1
},
{
"EOF": true,
"RESPONSE_TIME": 26
}
]
}
}</span></p><p><b><span style="font-size: medium;">Temporal Graph Expressions</span></b></p><p>The examples above lay the groundwork for Solr's new temporal graph queries. Temporal graph queries allow Solr to walk the <b>graph using windows of time</b>. The initial release supports graph walks using ten second increments which is useful for <b>event correlation</b> and <b>root cause analysis</b> use cases in log analytics. </p><p>In order to support temporal graph queries a ten second truncated timestamp in <b>ISO 8601</b> format must be added to the log records as a string field at indexing time. Here is a sample ten second truncated timestamp: <span style="color: green; font-family: monospace; font-size: 12px; white-space: pre;">2021-02-10T20:51:30Z</span>. This small data change enables some very important use cases so it's well worth the effort.</p><p>Solr's indexing tool for Solr logs, described <a href="https://lucene.apache.org/solr/guide/8_8/logs.html#logs">here</a>, already adds the ten second truncated timestamps. So those using Solr to analyze Solr logs get temporal graph expressions for free. </p><p><b>Root Events</b></p><p>Once the ten second windows have been indexed with the log records we can devise a query that creates a set of root events. We can demonstrate this with an example using Solr log records. </p><p>In this example we'll perform a Streaming Expression <b>facet </b>aggregation that finds the top 25, ten second windows with the highest average query time. These time windows can be used to represent <b>slow query events</b> in a temporal graph query. </p><p>Here is the <b>facet</b> function:</p><p><span style="font-family: inherit;">facet(solr_logs, q="+type_s:query +distrib_s:false", buckets="time_ten_second_s", avg(qtime_i))</span></p><p>Below is a snippet of the results with the 25 windows with the highest average query times:</p><p><span style="font-family: monospace; font-size: 12px; white-space: pre;">{
"result-set": {
"docs": [
{
"avg(qtime_i)": 105961.38461538461,
"time_ten_second_s": "2020-08-25T21:05:00Z"
},
{
"avg(qtime_i)": 93150.16666666667,
"time_ten_second_s": "2020-08-25T21:04:50Z"
},
{
"avg(qtime_i)": 87742,
"time_ten_second_s": "2020-08-25T21:04:40Z"
},
{
"avg(qtime_i)": 72081.71929824562,
"time_ten_second_s": "2020-08-25T21:05:20Z"
},
{
"avg(qtime_i)": 62741.666666666664,
"time_ten_second_s": "2020-08-25T12:30:20Z"
},
{
"avg(qtime_i)": 56526,
"time_ten_second_s": "2020-08-25T12:41:20Z"
},
...</span></p><p><span style="font-family: monospace; font-size: 12px; white-space: pre;"> {
"avg(qtime_i)": 12893,
"time_ten_second_s": "2020-08-25T17:28:10Z"
},
{
"EOF": true,
"RESPONSE_TIME": 34
}
]
}
}</span></p><p><b>Temporal Bipartite Subgraphs</b></p><p>Once we've identified a set of root event windows it's easy to perform a graph query that creates a bipartite graph of the log events that occurred within the same ten second windows. With Solr logs there is a field called <b>type_s</b> which is the type of log event. </p><p>In order to see what log events happened in the same ten second window of our root events we can "walk" the ten second windows and gather the <b>type_s </b>field.</p>
<pre>nodes(solr_logs,
      facet(solr_logs,
            q="+type_s:query +distrib_s:false",
            buckets="time_ten_second_s",
            avg(qtime_i)),
      walk="time_ten_second_s->time_ten_second_s",
      gather="type_s",
      count(*))
</pre>
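<p>Graph expressions like the one above are executed by sending them to a collection's <b>/stream</b> handler in the <b>expr</b> parameter. For example, assuming Solr is running locally on the default port, the query above could be submitted with curl:</p><pre># assumes a local Solr instance; the expression is passed in the expr parameter
curl --data-urlencode 'expr=nodes(solr_logs,
      facet(solr_logs,
            q="+type_s:query +distrib_s:false",
            buckets="time_ten_second_s",
            avg(qtime_i)),
      walk="time_ten_second_s->time_ten_second_s",
      gather="type_s",
      count(*))' http://localhost:8983/solr/solr_logs/stream
</pre>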
<p>Below is the resulting node set:</p><p><span style="font-family: monospace; font-size: 12px; white-space: pre;">{</span></p><p><span style="font-family: monospace; font-size: 12px; white-space: pre;"> "result-set": {
"docs": [
{
"node": "query",
"count(*)": 10,
"collection": "solr_logs",
"field": "type_s",
"level": 1
},
{
"node": "admin",
"count(*)": 2,
"collection": "solr_logs",
"field": "type_s",
"level": 1
},
{
"node": "other",
"count(*)": 3,
"collection": "solr_logs",
"field": "type_s",
"level": 1
},
{
"node": "update",
"count(*)": 2,
"collection": "solr_logs",
"field": "type_s",
"level": 1
},
{
"node": "error",
"count(*)": 1,
"collection": "solr_logs",
"field": "type_s",
"level": 1
},
{
"EOF": true,
"RESPONSE_TIME": 50
}
]
}
}</span></p><p>In this result set the <b>node</b> field holds the type of log events that occurred within the same ten second windows as the root events. Notice that the event types include: query, admin, update and error. The count(*) shows the degree centrality of the different log event types.</p><p>In particular, there is 1 <b>error </b>event within the same ten second windows as the slow query events. </p><p><b>Window Parameter</b></p><p>For event correlation and root cause analysis it's not enough to find events that occur within the same ten second root event windows. What's needed is to find events that occur within a <b>window of time prior</b> to each root event window. The <b>window </b>parameter allows you to specify this prior <b>window of time</b> as part of the query. The window parameter is an integer that specifies the number of ten second time windows, prior to each root event window, to include in the graph walk. </p>
<pre>nodes(solr_logs,
      facet(solr_logs,
            q="+type_s:query +distrib_s:false",
            buckets="time_ten_second_s",
            avg(qtime_i)),
      walk="time_ten_second_s->time_ten_second_s",
      gather="type_s",
      window="3",
      count(*))
</pre><p>Below is the node set returned when the window parameter is added. Notice that there are 29 error events within the 3 ten second windows prior to the slow query events. </p><p><span style="font-family: monospace; font-size: 12px; white-space: pre;">{
"result-set": {
"docs": [
{
"node": "query",
"count(*)": 62,
"collection": "solr_logs",
"field": "type_s",
"level": 1
},
{
"node": "admin",
"count(*)": 41,
"collection": "solr_logs",
"field": "type_s",
"level": 1
},
{
"node": "other",
"count(*)": 48,
"collection": "solr_logs",
"field": "type_s",
"level": 1
},
{
"node": "update",
"count(*)": 11,
"collection": "solr_logs",
"field": "type_s",
"level": 1
},
{
"node": "error",
"count(*)": 29,
"collection": "solr_logs",
"field": "type_s",
"level": 1
},
{
"EOF": true,
"RESPONSE_TIME": 117
}
]
}
}</span></p><p><b>Degree as a Representation of Correlation</b></p><p>By performing link analysis on the temporal bipartite graph we can calculate the <b>degree</b> of each <b>event type</b> that occurs in the specified time windows. We established in the bipartite graph recommender example the direct relationship between <b>inbound degree</b> and the <b>dot product</b>. In the field of digital signal processing the dot product is used to represent correlation. In our temporal graph queries we can then view the <b>inbound degree</b> as a representation of <b>correlation</b> between the root events and the events that occur within the specified time windows.</p><p><b>Lag Parameter</b></p><p>Understanding the lag in the correlation is important for certain use cases. In a lagged correlation an event occurs and, following a <b>delay</b>, another event occurs. The window parameter doesn't capture the delay as we only know that an event occurred somewhere within a prior window. </p><p>The <b>lag</b> parameter can be used to <b>start the window calculation</b> a number of ten second windows in the past. For example we could walk the graph in 20 second windows starting from 30 seconds prior to a set of root events. By adjusting the lag and re-running the query we can determine which lagged window has the highest degree. From this we can determine the delay.</p><p><b>Node Scoring and Temporal Anomaly Detection</b></p><p>The concept of node scoring can be applied to temporal graph queries to find events that are both correlated with a set of root events and <b>anomalous</b> to the root events. The degree calculation establishes the correlation between events but it does not establish if the event is a very common occurrence in the entire graph or specific to the subgraph.</p><p>The <b>scoreNodes</b> function can be applied to score the nodes based on the <b>degree</b> and the commonality of the node's term in the index. This will establish whether the event is anomalous to the root events. </p><pre>scoreNodes(nodes(solr_logs,
                 facet(solr_logs,
                       q="+type_s:query +distrib_s:false",
                       buckets="time_ten_second_s",
                       avg(qtime_i)),
                 walk="time_ten_second_s->time_ten_second_s",
                 gather="type_s",
                 window="3",
                 count(*)))</pre>Below is the node set once the scoreNodes function is applied. Now we see that the highest scoring node is the error event. This score gives us a good indication of where to begin our <b>root cause analysis</b>.
<pre><span style="font-size: 12px;">{
"result-set": {
"docs": [
{
"node": "other",
"nodeScore": 23.441727,
"field": "type_s",
"numDocs": 4513625,
"level": 1,
"count(*)": 48,
"collection": "solr_logs",
"docFreq": 99737
},
{
"node": "query",
"nodeScore": 16.957537,
"field": "type_s",
"numDocs": 4513625,
"level": 1,
"count(*)": 62,
"collection": "solr_logs",
"docFreq": 449189
},
{
"node": "admin",
"nodeScore": 22.829023,
"field": "type_s",
"numDocs": 4513625,
"level": 1,
"count(*)": 41,
"collection": "solr_logs",
"docFreq": 96698
},
{
"node": "update",
"nodeScore": 3.9480786,
"field": "type_s",
"numDocs": 4513625,
"level": 1,
"count(*)": 11,
"collection": "solr_logs",
"docFreq": 3838884
},
{
"node": "error",
"nodeScore": 26.62394,
"field": "type_s",
"numDocs": 4513625,
"level": 1,
"count(*)": 29,
"collection": "solr_logs",
"docFreq": 27622
},
{
"EOF": true,
"RESPONSE_TIME": 124
}
]
}
}</span></pre></div></div>Joel Bernsteinhttp://www.blogger.com/profile/15670652646728608504noreply@blogger.comtag:blogger.com,1999:blog-447314081038493192.post-69276162766743240502021-02-09T07:02:00.016-08:002021-02-09T10:42:17.885-08:00Driving down cloud storage costs with Apache Solr's hybrid indexed and raw log analytics engineSearch engines are powerful tools for log analytics. They excel at slicing and dicing data over clusters and running distributed aggregations and statistical analysis. But maintaining a log analytics search index for terabytes or petabytes of log data involves running huge search clusters and incurs large cloud storage expenses to store the indexes. <b>Often what's actually needed is a grep like capability that includes aggregations and visualization rather than the full power of the search index for historical data. </b><div><br /></div><div>The next release of Apache Solr provides a hybrid approach to log analytics that supports both log analytics queries over a search cluster and the ability to grep, aggregate and visualize compressed log files. </div><div><br /></div><div>Solr's Streaming Expressions and Math Expressions are a powerful query language for analytics and visualization. You can read about Streaming Expressions and Math Expressions in Solr's Visual Guide (<a href="https://lucene.apache.org/solr/guide/8_8/math-expressions.html">https://lucene.apache.org/solr/guide/8_8/math-expressions.html</a>). If you haven't seen this guide it's useful to quickly review the TOC to see the power here and compare to what ElasticSearch offers.</div><div><br /></div><div>In the next release of Solr a subset of Streaming Expressions and Math Expressions can be applied to raw compressed log files using the <b>cat </b>function. The cat function reads files from a specific place in the filesystem and returns a stream of lines from the files. The cat function can then be wrapped by other functions to parse, filter, aggregate and visualize.</div><div><br /></div><div>Below is a simple example of the <b>cat</b> function wrapped by the <b>parseCSV </b>function to parse a comma separated file into tuples which can be immediately visualized by Zeppelin using the Zepplin-Solr interpreter.</div><div><br /></div><div><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-ZISWlIE5JwI/YCKV4lLuKsI/AAAAAAAAB1E/XrZ3PHtYPNI6tBuAqYjHMfjR54coGQexwCLcBGAsYHQ/s1628/csv.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1026" data-original-width="1628" height="354" src="https://1.bp.blogspot.com/-ZISWlIE5JwI/YCKV4lLuKsI/AAAAAAAAB1E/XrZ3PHtYPNI6tBuAqYjHMfjR54coGQexwCLcBGAsYHQ/w562-h354/csv.png" width="562" /></a></div><div><br /></div><div><br /></div><b>Reading GZipped Files</b></div><div><b><br /></b></div><div>In the next release of Solr the <b>cat</b> function will be able read gzipped files in place without expanding on disk. The cat function will automatically read gzipped files based on the .gz file extension. <b>Log files that are gzip compressed often have an 80% reduction in size</b>.</div><div><b><br /></b></div><div><b>On Demand Indexing</b></div><div><b><br /></b></div><div>One of the capabilities provided is on-demand indexing of historical data. There will be times when the grep and aggregate functions won't be enough to support the analytics requirement. In this scenario Streaming Expressions supports a rich set of functions for on-demand indexing from raw compressed log data. 
The example below shows the <b>cat</b> function wrapped by the <b>select </b>function which is renaming fields in the tuples. The <b>update</b> function then indexes the tuples to a Solr Cloud collection.</div><div><br /></div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-CJxu5Pt-rnE/YCKZgkc-IAI/AAAAAAAAB1g/Z_tZjQucgvQoOWy_g6yM6l4KwFPHYorJwCLcBGAsYHQ/s1626/update.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1236" data-original-width="1626" height="425" src="https://1.bp.blogspot.com/-CJxu5Pt-rnE/YCKZgkc-IAI/AAAAAAAAB1g/Z_tZjQucgvQoOWy_g6yM6l4KwFPHYorJwCLcBGAsYHQ/w560-h425/update.png" width="560" /></a></div><br /><div><br /></div><div>Once the data is indexed the full power of Streaming Expressions and Math Expressions can be applied to the data.</div><div><br /></div><div><b>Aggregations Over Raw Compressed Log Files</b></div><div><br /></div><div>The <b>cat </b>function can also be wrapped by the <b>having</b> function to perform regex filtering, the <b>select </b>function to transform tuples and the <b>hashRollup</b> function to perform aggregations directly over compressed log files. </div><div><br /></div><div>Let's build a time series aggregation one step at a time:</div><div><br /></div><div><span style="font-family: courier;"><b>cat("2021/01")</b></span></div><div><br /></div><div>The cat function reads all the files inside the <b>2021/01</b> directory. These are log files from January 2021. These log files are in CSV format and are gzipped individually. </div><div><br /></div><div><span style="font-family: courier;"><b>parseCSV(cat("2021/01"))</b></span></div><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: inherit;">The parseCSV function wraps the cat function and parses each CSV formatted log line into a tuple of name value pairs.</span></div><div><br /></div><div><div><span style="font-family: courier;"><b>select(parseCSV(cat("2021/01")),</b></span></div><div><span style="font-family: courier;"><b> trunc(timestamp, 10) as day,</b></span></div><div><span style="font-family: courier;"><b> long(qtime) as query_time)</b></span></div></div><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: inherit;">The <b>select</b> function transforms each tuple by truncating the timestamp field on the 10th character to return the yyyy-MM-dd part of the timestamp and mapping it to the "day" field. 
It also casts the qtime field to a long and maps it to the "query_time" field.</span></div><div><br /></div><div><div><span style="font-family: courier;"><b>hashRollup(</b></span></div><div><span style="font-family: courier;"><b> select(parseCSV(cat("2021/01")),</b></span></div><div><span style="font-family: courier;"><b> trunc(timestamp, 10) as day,</b></span></div><div><span style="font-family: courier;"><b> long(qtime) as query_time),</b></span></div><div><span style="font-family: courier;"><b> over="day",</b></span></div><div><span style="font-family: courier;"><b> avg(query_time))</b></span></div></div><div><br /></div><div><span style="font-family: inherit;">Finally the <b>hashRollup</b> function performs an aggregation over the truncated time stamp (day field) averaging the query_time.</span></div><div><span style="font-family: inherit;"><br /></span></div><div><span style="font-family: inherit;">The output of this expression is a time series which can be immediately <b>visualized</b> and shared in Apache-Zeppelin using the Zeppelin-Solr interpreter.</span></div><div><span style="font-family: inherit;"><br /></span></div><div><br /></div>Joel Bernsteinhttp://www.blogger.com/profile/15670652646728608504noreply@blogger.comtag:blogger.com,1999:blog-447314081038493192.post-57175394109993026282021-01-21T18:33:00.002-08:002021-01-21T18:33:23.595-08:00Optimizations Coming to Solr<p> </p><span id="docs-internal-guid-70504ac6-7fff-1716-683c-58502f09a546"><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Starting in Solr 8.8 and continuing into Solr 9.0 there are a number of optimizations to be aware of that provide breakthroughs in performance for important use cases.</span></p><br /><h2 dir="ltr" style="line-height: 1.38; margin-bottom: 6pt; margin-top: 18pt;"><span style="font-family: Arial; font-size: 16pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;">Optimized Self Join (</span><a href="https://issues.apache.org/jira/browse/SOLR-15049" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: Arial; font-size: 16pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 400; text-decoration-line: underline; text-decoration-skip-ink: none; vertical-align: baseline; white-space: pre-wrap;">https://issues.apache.org/jira/browse/</span><span style="color: #1155cc; font-family: Arial; font-size: 16.5pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 400; text-decoration-line: underline; text-decoration-skip-ink: none; vertical-align: baseline; white-space: pre-wrap;">SOLR-15049</span></a><span style="font-family: Arial; font-size: 16pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;">)</span></h2><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">This optimization is a breakthrough in performance for </span><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 700; vertical-align: baseline; white-space: 
pre-wrap;">document level access control</span><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">. It is by far the fastest filter join implementation in the Solr project and is likely the fastest access control join available for search engines. </span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">This optimization requires that the joined documents reside in the same Solr core and that the join key field be the same for both sides of the join. For access control this means the access control records must be in the same core as the main document corpus and the join from access control records to the main documents must use the same field. </span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">The optimization in this scenario is significant. The performance of the join allows it to scale to much larger joins. For example joins that involve upwards of 500,000 join keys can be executed with sub-second performance. In an access control setting this translates to 500,000+ access control groups which can be used to filter the main document set, with sub-second performance. That represents more than a 10x increase in join size over the next fastest Solr filter join which can perform joins with up to 50,000 access control groups and achieve sub-second performance. </span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">A followup blog will discuss the technical details of this optimization and how it can be implemented in a sharded Solr Cloud installation. </span></p><h2 dir="ltr" style="line-height: 1.38; margin-bottom: 6pt; margin-top: 18pt;"><span style="font-family: Arial; font-size: 16pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;">Tiered Aggregations (</span><a href="https://issues.apache.org/jira/browse/SOLR-15036" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: Arial; font-size: 16pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 400; text-decoration-line: underline; text-decoration-skip-ink: none; vertical-align: baseline; white-space: pre-wrap;">https://issues.apache.org/jira/browse/SOLR-15036</span></a><span style="font-family: Arial; font-size: 16pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;">)</span></h2><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">This optimization is a breakthrough for large scale aggregations. A typical Solr aggregation using JSON facets has one aggregator node. 
In this scenario aggregations from all the shards are collected in one place and merged. This technique has limits in scalability because eventually the number of threads used to contact the shards and amount of time and memory it takes to perform the merge is prohibitive.</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Tiered aggregations eliminates the single aggregator bottleneck by setting up a tier of aggregator nodes. Each aggregator node performs a JSON facet aggregation for a subset of shards. Then a single top level aggregator node merges the aggregations from the tier of aggregator nodes. The partitioning of the middle tier of aggregator nodes happens automatically when aggregating over a Solr Cloud alias which points to multiple collections. In this scenario an aggregator node is assigned to each collection in the alias. </span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Tiered aggregation allows for real-time aggregations over very large clusters. For example: 200 aggregator nodes each calling 200 shards is a realistic scenario, providing real time aggregations across 40,000 shards. </span></p><br /><h2 dir="ltr" style="line-height: 1.38; margin-bottom: 6pt; margin-top: 18pt;"><span style="font-family: Arial; font-size: 16pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;">Improved export sorting performance and efficient high cardinality aggregations (</span><a href="https://issues.apache.org/jira/browse/SOLR-14608" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: Arial; font-size: 16pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 400; text-decoration-line: underline; text-decoration-skip-ink: none; vertical-align: baseline; white-space: pre-wrap;">https://issues.apache.org/jira/browse/SOLR-14608</span></a><span style="font-family: Arial; font-size: 16pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;">)</span></h2><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Both Solr and Elasticsearch have traditionally not been effective </span><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">high cardinality aggregation engines</span><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">.</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Solr’s export handler has undergone a series of performance improvements culminating with 
a new technique for sorting that improves the throughput of some export queries by 100%. The improved sorting performance is part of a set of changes designed to support a performant and efficient high cardinality aggregation engine.</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">High cardinality aggregation often occurs in data warehousing due to multi-dimensional aggregations that result in a high number of dimensional combinations. Traditional search engine approaches to aggregations do not work well in high cardinality use cases.</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Traditional faceted aggregation is not well suited for high cardinality aggregation because it tracks the full aggregation in memory at once. When performing high cardinality aggregation it’s often not practical to track all dimensions in memory. Solr's export handler solves this by first sorting the result set on the aggregation dimensions and then rolling up aggregations one group at a time. Using this technique high cardinality aggregations can be accomplished using a small amount of memory.</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">The export handler also now has the capability of running a Streaming Expression in memory over the sorted result set. This means high cardinality aggregations can be done inside the export handler allowing the export handler to return aggregated results. This can greatly reduce the amount of data that needs to be sent across the network to aggregate over the sorted/exported results. 
</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">Spark-Solr</span><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> aggregations will also benefit from the improved performance of the export handler because Spark-Solr uses the export handler to return results for aggregations.</span></p><h2 dir="ltr" style="line-height: 1.38; margin-bottom: 6pt; margin-top: 18pt;"><span style="font-family: Arial; font-size: 16pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;">Improved collapse performance/efficiency, block collapse (</span><a href="https://issues.apache.org/jira/browse/SOLR-15079" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: Arial; font-size: 16pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 400; text-decoration-line: underline; text-decoration-skip-ink: none; vertical-align: baseline; white-space: pre-wrap;">https://issues.apache.org/jira/browse/SOLR-15079</span></a><span style="font-family: Arial; font-size: 16pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;">)</span></h2><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Solr’s collapse feature is often used for larger e-commerce catalogs with a high number of products that don’t perform well with Lucene/Solr grouping. Solr’s collapse can now take advantage of block indexing/nested documents to significantly improve query performance, cutting search times in half in some scenarios, while decreasing the memory used by collapse by 99%. In order to take advantage of this feature catalogs will need to be indexed such that all SKU’s that share the same product ID are indexed in the same block.</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">The improved performance and efficiency will allow for more scalability (higher QPS with less hardware) and provide faster response times for ecommerce search applications. The improved performance also leaves more time to improve relevance with advanced ranking algorithms. </span></p><br /></span>Joel Bernsteinhttp://www.blogger.com/profile/15670652646728608504noreply@blogger.comtag:blogger.com,1999:blog-447314081038493192.post-34463479104859370822020-03-30T15:01:00.000-07:002020-04-10T08:02:24.251-07:00New York - Coronavirus Statistics (NYTimes Data Set)As of 2020-04-09<br />
<br />
New York City - Cumulative Cases By Day<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-NF_Me4UyAnE/XpCDK5KPG5I/AAAAAAAABh8/evAJD7HI7vkZrAyiFZqfqz7RRaUzJ8ceACLcBGAsYHQ/s1600/Screen%2BShot%2B2020-04-10%2Bat%2B10.30.55%2BAM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="505" data-original-width="1600" height="126" src="https://1.bp.blogspot.com/-NF_Me4UyAnE/XpCDK5KPG5I/AAAAAAAABh8/evAJD7HI7vkZrAyiFZqfqz7RRaUzJ8ceACLcBGAsYHQ/s400/Screen%2BShot%2B2020-04-10%2Bat%2B10.30.55%2BAM.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
New York City - Cumulative Deaths By Day</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-vnsgNJNZZy4/XpCDYtQmmSI/AAAAAAAABiA/XS8kyj3U1uAPQMnIYy_e3vF2EOgCaYNVACLcBGAsYHQ/s1600/Screen%2BShot%2B2020-04-10%2Bat%2B10.31.53%2BAM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="506" data-original-width="1600" height="126" src="https://1.bp.blogspot.com/-vnsgNJNZZy4/XpCDYtQmmSI/AAAAAAAABiA/XS8kyj3U1uAPQMnIYy_e3vF2EOgCaYNVACLcBGAsYHQ/s400/Screen%2BShot%2B2020-04-10%2Bat%2B10.31.53%2BAM.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
New York City - Cumulative Cases and Deaths</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-ALFjUkb4BDE/XpCDjIpJNHI/AAAAAAAABiI/3B32kBI6Q4oD57-1hIabYXN3D_j9nScSwCLcBGAsYHQ/s1600/Screen%2BShot%2B2020-04-10%2Bat%2B10.32.34%2BAM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="490" data-original-width="1600" height="122" src="https://1.bp.blogspot.com/-ALFjUkb4BDE/XpCDjIpJNHI/AAAAAAAABiI/3B32kBI6Q4oD57-1hIabYXN3D_j9nScSwCLcBGAsYHQ/s400/Screen%2BShot%2B2020-04-10%2Bat%2B10.32.34%2BAM.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
New York City - New Cases By Day</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-O7CCPeYtHM4/XpCD3TmYjhI/AAAAAAAABiU/LrUHLXxoko4xv9XrAJiQQjuYOVE6Cl9eQCLcBGAsYHQ/s1600/Screen%2BShot%2B2020-04-10%2Bat%2B10.33.49%2BAM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="582" data-original-width="1600" height="145" src="https://1.bp.blogspot.com/-O7CCPeYtHM4/XpCD3TmYjhI/AAAAAAAABiU/LrUHLXxoko4xv9XrAJiQQjuYOVE6Cl9eQCLcBGAsYHQ/s400/Screen%2BShot%2B2020-04-10%2Bat%2B10.33.49%2BAM.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
New York City - Growth Rate of Cases</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-EA_9Zr9RXhI/XpCEEEEsHII/AAAAAAAABiY/2SVFQFUCYWQEHNIBtYj9nNuxdzqbL4-LQCLcBGAsYHQ/s1600/Screen%2BShot%2B2020-04-10%2Bat%2B10.34.48%2BAM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="574" data-original-width="1600" height="142" src="https://1.bp.blogspot.com/-EA_9Zr9RXhI/XpCEEEEsHII/AAAAAAAABiY/2SVFQFUCYWQEHNIBtYj9nNuxdzqbL4-LQCLcBGAsYHQ/s400/Screen%2BShot%2B2020-04-10%2Bat%2B10.34.48%2BAM.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-F4zJi7brxoY/XpCEMgkcebI/AAAAAAAABig/bFifGkw1EL0xzqSCMl5cplFAMAvcxw9dwCLcBGAsYHQ/s1600/Screen%2BShot%2B2020-04-10%2Bat%2B10.35.17%2BAM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="571" data-original-width="1600" height="142" src="https://1.bp.blogspot.com/-F4zJi7brxoY/XpCEMgkcebI/AAAAAAAABig/bFifGkw1EL0xzqSCMl5cplFAMAvcxw9dwCLcBGAsYHQ/s400/Screen%2BShot%2B2020-04-10%2Bat%2B10.35.17%2BAM.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
New York City - Mortality Rate</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-rubzjGiWUPY/XpCEi7Hub0I/AAAAAAAABis/d2gWMLcGqmIsP5PsBGD8EitV6-Qt-jPNwCLcBGAsYHQ/s1600/Screen%2BShot%2B2020-04-10%2Bat%2B10.36.45%2BAM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="535" data-original-width="1600" height="132" src="https://1.bp.blogspot.com/-rubzjGiWUPY/XpCEi7Hub0I/AAAAAAAABis/d2gWMLcGqmIsP5PsBGD8EitV6-Qt-jPNwCLcBGAsYHQ/s400/Screen%2BShot%2B2020-04-10%2Bat%2B10.36.45%2BAM.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
New York State - Cumulative Cases By Day</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-YchCdMY3hxU/XpCFEQZ110I/AAAAAAAABi0/CJYKUQTGQZcyXBVB1dsXRzUJugNUJskLwCLcBGAsYHQ/s1600/Screen%2BShot%2B2020-04-10%2Bat%2B10.38.48%2BAM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="478" data-original-width="1600" height="118" src="https://1.bp.blogspot.com/-YchCdMY3hxU/XpCFEQZ110I/AAAAAAAABi0/CJYKUQTGQZcyXBVB1dsXRzUJugNUJskLwCLcBGAsYHQ/s400/Screen%2BShot%2B2020-04-10%2Bat%2B10.38.48%2BAM.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
New York State - New Cases By Day</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-6c0wNEs6zrY/XpCFZVB02RI/AAAAAAAABi8/sb5bTrnn4WAg2hZqNx1FeX6oWBLIGCd8QCLcBGAsYHQ/s1600/Screen%2BShot%2B2020-04-10%2Bat%2B10.40.22%2BAM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="582" data-original-width="1600" height="145" src="https://1.bp.blogspot.com/-6c0wNEs6zrY/XpCFZVB02RI/AAAAAAAABi8/sb5bTrnn4WAg2hZqNx1FeX6oWBLIGCd8QCLcBGAsYHQ/s400/Screen%2BShot%2B2020-04-10%2Bat%2B10.40.22%2BAM.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: left;">
New York State - Growth Rate of Cases</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-7R-SGAHaQc4/XpCF3V_dKTI/AAAAAAAABjI/OIkn-QrEY5YTOlBHG6KJ0gRlw2eAuik1ACLcBGAsYHQ/s1600/Screen%2BShot%2B2020-04-10%2Bat%2B10.42.26%2BAM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="585" data-original-width="1600" height="145" src="https://1.bp.blogspot.com/-7R-SGAHaQc4/XpCF3V_dKTI/AAAAAAAABjI/OIkn-QrEY5YTOlBHG6KJ0gRlw2eAuik1ACLcBGAsYHQ/s400/Screen%2BShot%2B2020-04-10%2Bat%2B10.42.26%2BAM.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-5_Sf8LUTbxE/XpCFt-zeHXI/AAAAAAAABjE/agIBThc684oVvSeVR1pQZELl5QRVPfv-ACLcBGAsYHQ/s1600/Screen%2BShot%2B2020-04-10%2Bat%2B10.41.34%2BAM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="581" data-original-width="1600" height="145" src="https://1.bp.blogspot.com/-5_Sf8LUTbxE/XpCFt-zeHXI/AAAAAAAABjE/agIBThc684oVvSeVR1pQZELl5QRVPfv-ACLcBGAsYHQ/s400/Screen%2BShot%2B2020-04-10%2Bat%2B10.41.34%2BAM.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
New York State - Mortality Rate</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-ikky9W5fmPw/XpCGLQKSSDI/AAAAAAAABjU/gQgno5-xMwUPwEXCkAGDMLj1O7xkSYargCLcBGAsYHQ/s1600/Screen%2BShot%2B2020-04-10%2Bat%2B10.43.44%2BAM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="530" data-original-width="1600" height="131" src="https://1.bp.blogspot.com/-ikky9W5fmPw/XpCGLQKSSDI/AAAAAAAABjU/gQgno5-xMwUPwEXCkAGDMLj1O7xkSYargCLcBGAsYHQ/s400/Screen%2BShot%2B2020-04-10%2Bat%2B10.43.44%2BAM.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
New York State - Percent of Cases By County</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-sIq89v1X87A/XpCGavdzWYI/AAAAAAAABjY/qOOevlKaQBQUDvpxmSMIiix1pewRvWNKACLcBGAsYHQ/s1600/Screen%2BShot%2B2020-04-10%2Bat%2B10.44.47%2BAM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="651" data-original-width="1600" height="162" src="https://1.bp.blogspot.com/-sIq89v1X87A/XpCGavdzWYI/AAAAAAAABjY/qOOevlKaQBQUDvpxmSMIiix1pewRvWNKACLcBGAsYHQ/s400/Screen%2BShot%2B2020-04-10%2Bat%2B10.44.47%2BAM.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
Joel Bernsteinhttp://www.blogger.com/profile/15670652646728608504noreply@blogger.comtag:blogger.com,1999:blog-447314081038493192.post-38982653784671401772017-11-27T19:20:00.001-08:002017-12-02T10:48:00.268-08:00Feature Scaling with Solr Streaming ExpressionsBefore performing machine learning operations, it's often important to scale the feature vectors so they can be compared at the same scale. In Solr 7.2 the Streaming Expression statistical function library provides a rich set of feature scaling functions that work on both <b>vectors</b> and <b>matrices</b>.<br />
<br />
This blog will describe the different feature scaling functions and provide examples to show how they differ from each other.<br />
<br />
<b>Min/Max Scaling</b><br />
<br />
The <i>minMaxScale</i> function scales a vector or matrix between a min and max value. By default it will scale between 0 and 1 if min/max values are not provided. Min/Max scaling is useful when comparing time series of different scales in machine learning algorithms such as k-means clustering.<br />
<br />
Below is the sample code for scaling a matrix:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">let(a=array(20, 30, 40, 50),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> b=array(200, 300, 400, 500),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> c=matrix(a, b),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> d=minMaxScale(c))</span><br />
<br />
The expression above creates two arrays at different scales. The arrays are then added to a matrix and scaled with the minMaxScale function.<br />
<br />
Solr responds with the vectors of the scaled matrix:<br />
<br />
<span style="font-family: monospace; font-size: 12px; white-space: pre;">{
"result-set": {
"docs": [
{
"d": [
[
0,
0.3333333333333333,
0.6666666666666666,
1
],
[
0,
0.3333333333333333,
0.6666666666666666,
1
]
]
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}</span><br />
<br />
Notice that once brought into the same scale the vectors are the same.<br />
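<br />
The target range can also be passed in explicitly. The sketch below assumes the optional min and max parameters of the <i>minMaxScale</i> function are supported in your release, and scales the same matrix between 0 and 100:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">let(a=array(20, 30, 40, 50),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">    b=array(200, 300, 400, 500),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">    c=matrix(a, b),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">    d=minMaxScale(c, 0, 100))</span><br />
<br />
If the optional parameters are available, each row is scaled so its smallest value maps to 0 and its largest value maps to 100.<br />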
<br />
<b>Standardizing</b><br />
<br />
<span style="font-family: inherit;">Standardizing scales a vector so that it has a mean of 0 and a standard deviation of 1. Standardization can be used with machine learning algorithms, such as SVM, that perform better when the data has a normal distribution. </span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">Standardization does lose information about the data if the underlying vectors don't fit a normal distribution. So use standardization with care.</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">Here is an example of how to standardize a matrix:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">let(a=array(20, 30, 40, 50),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> b=array(200, 300, 400, 500),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> c=matrix(a, b),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> d=standardize(c))</span><br />
<br />
Solr responds with the vectors of the standardized matrix:<br />
<br />
<span style="font-family: monospace; font-size: 12px; white-space: pre;">{
"result-set": {
"docs": [
{
"d": [
[
-1.161895003862225,
-0.3872983346207417,
0.3872983346207417,
1.161895003862225
],
[
-1.1618950038622249,
-0.38729833462074165,
0.38729833462074165,
1.1618950038622249
]
]
},
{
"EOF": true,
"RESPONSE_TIME": 17
}
]
}
}</span><br />
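<br />
Standardization is the familiar z-score transformation: each value has the vector's mean subtracted and is then divided by the standard deviation. A quick sanity check is to look at the descriptive statistics of a standardized vector, where the mean should be approximately 0 and the standard deviation approximately 1. The sketch below assumes the <i>describe</i> function from the statistical library is available:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">let(a=array(20, 30, 40, 50),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">    b=standardize(a),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">    c=describe(b))</span><br />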
<br />
<b>Unitizing</b><br />
<br />
<span style="font-family: inherit;">Unitizing scales vectors to a magnitude of 1. A vector with a magnitude of 1 is known as a unit vector. Unit vectors are </span>preferred<span style="font-family: inherit;"> when the vector math deals with vector direction rather than magnitude.</span><br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">let(a=array(20, 30, 40, 50),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> b=array(200, 300, 400, 500),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> c=matrix(a, b),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> d=unitize(c))</span><br />
<br />
Solr responds with the vectors of the unitized matrix:<br />
<br />
<span style="font-family: monospace; font-size: 12px; white-space: pre;">{
"result-set": {
"docs": [
{
"d": [
[
0.2721655269759087,
0.40824829046386296,
0.5443310539518174,
0.6804138174397716
],
[
0.2721655269759087,
0.4082482904638631,
0.5443310539518174,
0.6804138174397717
]
]
},
{
"EOF": true,
"RESPONSE_TIME": 6
}
]
}
}</span><br />
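<br />
One way to confirm that a unitized vector really has a magnitude of 1 is to take its dot product with itself, which equals its squared magnitude. This is only a sketch and assumes the <i>dotProduct</i> function is available in your release:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">let(a=array(20, 30, 40, 50),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">    b=unitize(a),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">    c=dotProduct(b, b))</span><br />
<br />
The value of c should be 1, within floating point rounding.<br />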
<br />
<b>Normalized Sum</b><br />
<br />
<span style="font-family: inherit;">The final feature scaling function is the </span><i style="font-family: inherit;">normalizeSum</i><span style="font-family: inherit;"> function which scales a vector so that it sums to a specific value. By default its scales the vector so that it sums to 1. This technique is useful when you want to convert vectors of raw counts to vectors of </span>probabilities.<span style="font-family: inherit;"> </span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;"></span><br />
Below is the sample code for applying the normalizeSum function:<br />
<div>
<br /></div>
<span style="font-family: "courier new" , "courier" , monospace;">let(a=array(20, 30, 40, 50),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> b=array(200, 300, 400, 500),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> c=matrix(a, b),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> d=normalizeSum(c))</span><br />
<br />
Solr responds with the vectors scaled to a sum of 1:<br />
<div>
<br /></div>
<span style="font-family: monospace; font-size: 12px; white-space: pre;">{
"result-set": {
"docs": [
{
"d": [
[
0.14285714285714285,
0.21428571428571427,
0.2857142857142857,
0.35714285714285715
],
[
0.14285714285714285,
0.21428571428571427,
0.2857142857142857,
0.35714285714285715
]
]
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}</span>Joel Bernsteinhttp://www.blogger.com/profile/15670652646728608504noreply@blogger.comtag:blogger.com,1999:blog-447314081038493192.post-60797997705498274102017-11-05T19:42:00.001-08:002017-11-07T19:27:54.773-08:00An Introduction to Markov Chains in Solr 7.2 This blog introduces Solr's new <b>Markov Chain</b> implementation coming in Solr 7.2. We'll first cover how Markov Chains work and then show how they are supported through the Streaming Expression statistical library.<br />
<br />
<b>Markov Chain</b><br />
<br />
A Markov Chain uses probabilities to model the <b>state transitions</b> of a process. A simple example taken from the Markov Chain <a href="https://en.wikipedia.org/wiki/Markov_chain">wiki</a> page will help illustrate this.<br />
<br />
In this example we'll be modeling the state transitions of the stock market. In our model the stock market can have three possible states:<br />
<ul>
<li><b>Bull</b></li>
<li><b>Bear</b></li>
<li><b>Stagnant</b></li>
</ul>
<div>
There are two important characteristics of a Markov Process:</div>
<div>
<br /></div>
<div>
1) The process can only be in <b>one state</b> at a time.<br />
<br /></div>
<div>
2) The probability of transitioning between states is based only on the current state. This is known as "<a href="https://en.wikipedia.org/wiki/Memorylessness">memorylessness</a>". </div>
<div>
<br /></div>
<div>
Below is a state diagram of the probabilities of transferring between the different states.</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://2.bp.blogspot.com/-FHkAQzq6hcI/Wf_COO4mb5I/AAAAAAAAAvE/x7NeGH3ScjEk5ORhiU3EKFm7VBA_ShjYwCLcBGAs/s1600/800px-Finance_Markov_chain_example_state_space.svg.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="593" data-original-width="800" height="237" src="https://2.bp.blogspot.com/-FHkAQzq6hcI/Wf_COO4mb5I/AAAAAAAAAvE/x7NeGH3ScjEk5ORhiU3EKFm7VBA_ShjYwCLcBGAs/s320/800px-Finance_Markov_chain_example_state_space.svg.png" width="320" /></a></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
In the diagram above, the lines show the probabilities of transitioning between the different states. For example there is a .075 probability of transitioning from a <b>Bull</b> market to a <b>Bear</b> market. There is a .025 probability of transitioning from a <b>Bull</b> market to a <b>Stagnant</b> market. There is a .9 probability of a <b>Bull</b> market transitioning to another <b>Bull</b> market.</div>
<div>
<br /></div>
<div>
The state transition probabilities in this example can be captured in a 3x3 matrix called a <b>transition matrix</b>. The transition matrix for this example is:</div>
<div>
</div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><b> Bull Bear Stagnant</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><b>Bull</b> | .9 .075 .025 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><b>Bear</b> | .15 .8 .05 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><b>Stagnant</b> | .25 .25 .5 |</span></div>
<div>
<br /></div>
<div>
Notice each state has a row in the matrix. The values in the columns hold the transition probabilities for each state. </div>
<div>
<br /></div>
<div>
For example row 0, column 0 is the probability of the <b>Bull </b>market transitioning to another <b>Bull</b> market. Row 1, column 0 is the probability of the <b>Bear </b>market transitioning to a <b>Bull</b> market.</div>
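<div>
<br /></div>
<div>
Also notice that each row sums to 1. For example the <b>Bull</b> row is .9 + .075 + .025 = 1. This is a requirement for a transition matrix, since from any state the process must transition to exactly one of the possible states.</div>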
<div>
<br /></div>
<div>
A Markov Chain uses the transition matrix to model and simulate the transitions in the process. A code example will make this more clear.</div>
<div>
<br /></div>
<div>
<b>Working with Matrices</b></div>
<div>
<br /></div>
<div>
In Solr 7.2 support for matrices has been added to the Streaming Expression statistical function library. Below is the expression for creating the example transition matrix:</div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">let(a=array(.9, .075, .025),</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> b=array(.15, .8, .05),</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> c=array(.25, .25, .5),</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> d=matrix(a, b, c))</span></div>
<div>
</div>
<div>
In the expression above the rows of the matrix are created as numeric arrays and set to variables <b><i>a</i></b>, <b><i>b</i></b> and <b><i>c</i></b>. Then the arrays are passed to the <b style="font-style: italic;">matrix </b>function to instantiate the matrix.</div>
<div>
</div>
<br />
<div>
If we send this expression to Solr's /stream handler it responds with:</div>
<div>
<br /></div>
<div>
<span style="font-family: monospace; font-size: 12px; white-space: pre;">{
"result-set": {
"docs": [
{
"d": [
[
0.9,
0.075,
0.025
],
[
0.15,
0.8,
0.05
],
[
0.25,
0.25,
0.5
]
]
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}</span></div>
<div>
<br /></div>
<div>
<b>Markov Chain Simulations</b></div>
<div>
<br /></div>
<div>
Once the transition matrix is created it's very easy to create a Markov Chain and simulate the process. Here is a sample expression:</div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">let(a=array(.9, .075, .025),</span></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> b=array(.15, .8, .05),</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> c=array(.25, .25, .5),</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> d=matrix(a, b, c),</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> e=markovChain(d),</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> f=sample(e, 5))</span></div>
</div>
<br />
<div>
In the expression above the transition matrix is created and then passed as a parameter to<b style="font-style: italic;"> </b>the<b style="font-style: italic;"> markovChain</b> function. The markovChain function returns a Markov Chain for the specific transition matrix.</div>
<div>
<br /></div>
<div>
The Markov Chain can then be sampled using the <b><i>sample</i></b> function. In the example above 5 samples are taken from the Markov Chain. The samples represent the state that the process is in. This transition matrix has three states so each sample will either be <b><i>0</i></b> (Bull), <b><i>1 </i></b>(Bear) or <b><i>2</i></b> (Stagnant).</div>
<div>
<br /></div>
<div>
Each time the Markov Chain is sampled it returns the next state of the process based on the transition probabilities of its current state.</div>
<div>
<br /></div>
<div>
If we send this expression to the /stream handler it may respond with:</div>
<div>
<br /></div>
<div>
<span style="font-family: monospace; font-size: 12px; white-space: pre;">{
"result-set": {
"docs": [
{
"f": [
0,
0,
0,
2,
2
]
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}</span></div>
<br />
<div>
Notice that 5 samples were returned 0, 0, 0, 2, 2. This corresponds to three consecutive <b>Bull</b> states followed by two <b>Stagnant</b> states. </div>
<div>
<br /></div>
<div>
<b>Finding the Long Term Average of the States</b></div>
<div>
<br /></div>
<div>
By increasing the number of samples we can determine how much time the process will spend in each state over the long term. Here is an example expression:</div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">let(a=array(.9, .075, .025),</span></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> b=array(.15, .8, .05),</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> c=array(.25, .25, .5),</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> d=matrix(a, b, c),</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> e=sample(markovChain(d), 200000),</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> f=freqTable(e)</span><span style="font-family: "courier new" , "courier" , monospace;">)</span></div>
</div>
<div>
<br /></div>
<div>
Notice that now instead of 5 samples we are taking 200,000 samples. Then we are creating a frequency table from the simulation array using the <b><i>freqTable</i></b> function. This will tell us the percentage of time spent in each state.</div>
<div>
<br /></div>
<div>
If we send this expression to the /stream handler it responds with the breakdown of the frequency table:</div>
<div>
<br /></div>
<div>
<span style="font-family: monospace; font-size: 12px; white-space: pre;">{
"result-set": {
"docs": [
{
"f": [
{
"pct": 0.62636,
"count": 125272,
"cumFreq": 125272,
"cumPct": 0.62636,
"value": 0
},
{
"pct": 0.310705,
"count": 62141,
"cumFreq": 187413,
"cumPct": 0.937065,
"value": 1
},
{
"pct": 0.062935,
"count": 12587,
"cumFreq": 200000,
"cumPct": 1,
"value": 2
}
]
},
{
"EOF": true,
"RESPONSE_TIME": 56
}
]
}
}</span></div>
<div>
<br /></div>
<div>
Notice in the response above that there are three tuples returned in the frequency table, one for each state (Bull, Bear, Stagnant).</div>
<div>
<br /></div>
<div>
The <i style="font-weight: bold;">value </i>field in each tuple is the numeric mapping of the state (0=Bull, 1=Bear, 2=Stagnant).</div>
<div>
<br /></div>
<div>
The <b><i>pct</i></b> field in each tuple is the percentage of time the value appears in the sample set. We can see that the process is in a Bull market 62.6% of the time, Bear market 31% and Stagnant 6.3% of the time.</div>
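<div>
<br /></div>
<div>
These long run percentages are an estimate of the Markov Chain's <b>stationary distribution</b>. For this particular transition matrix the stationary distribution can also be worked out exactly by solving the balance equations (finding the row of state probabilities that is left unchanged by the transition matrix and sums to 1), which gives .625 (Bull), .3125 (Bear) and .0625 (Stagnant). The simulated frequencies of 62.6%, 31.1% and 6.3% are very close to these theoretical values.</div>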
Joel Bernsteinhttp://www.blogger.com/profile/15670652646728608504noreply@blogger.comtag:blogger.com,1999:blog-447314081038493192.post-85691939649260713252017-10-10T19:32:00.000-07:002017-10-13T16:24:53.039-07:00A Gentle Introduction to Monte Carlo Simulations in Solr 7.1Monte Carlo simulations have been added to Streaming Expressions in Solr 7.1. This blog provides a gentle introduction to the topic of Monte Carlo simulations and shows how they are supported with the Streaming Expressions statistical function library.<br />
<br />
<b>Probability Distributions</b><br />
<b><br /></b>
Before diving into Monte Carlo simulations I'll briefly introduce Solr's probability distribution framework. We'll start slowly and cover just enough about probability distributions to support the Monte Carlo examples. Future blogs will go into more detail about Solr's probability distribution framework.<br />
<br />
First let's start with a definition of what a probability distribution is. A probability distribution is a function that describes the likelihood of the different values a random variable can take.<br />
<br />
A simple example will help clarify the concept.<br />
<br />
<b>Uniform Integer Distribution</b><br />
<br />
One commonly used probability distribution is the <i>uniform integer distribution</i>.<br />
<br />
The uniform integer distribution is a function that describes a theoretical data set that is uniformly distributed over a range of integers.<br />
<br />
With the Streaming Expression statistical function library you can create a uniform integer distribution with the following function call:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">uniformIntegerDistribution(1,6) </span><br />
<br />
The function above returns a uniform integer distribution with a range of 1 to 6. <br />
<br />
<b>Sampling the Distribution</b><br />
<br />
The uniformIntegerDistribution function returns the mathematical model of the distribution. We can draw a random sample from the model using the <b><i>sample</i></b> function.<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">let(a=uniformIntegerDistribution(1, 6),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> b=sample(a))</span><br />
<br />
In the example above the <b><i>let</i></b> expression is setting two variables:<br />
<ul>
<li><i style="font-weight: bold;">a </i>is set to output of the uniformIntegerDistribtion function, which is returning the uniform integer distribution model.</li>
<li><i style="font-weight: bold;">b </i>is set to the output of the <b><i>sample</i></b> function which is returning a single random sample from the distribution.</li>
</ul>
<div>
Solr returns the following result from the expression above:</div>
<div>
<span style="font-family: monospace; font-size: 12px; white-space: pre;"><br /></span></div>
<div>
<span style="font-family: monospace; font-size: 12px; white-space: pre;">{
"result-set": {
"docs": [
{
"b": 4
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}</span></div>
<div>
<br /></div>
<div>
Notice in the output above the variable b = 4. 4 is the random sample taken from the uniform integer distribution. </div>
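<div>
<br /></div>
<div>
The <b><i>sample</i></b> function can also take a size parameter and return an array of random samples in a single call. For example the expression below draws 5 samples from the distribution:</div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">let(a=uniformIntegerDistribution(1, 6),</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">    b=sample(a, 5))</span></div>
<div>
<br /></div>
<div>
In this case <b><i>b</i></b> will hold an array of 5 integers, each between 1 and 6.</div>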
<div>
<br /></div>
<div>
<b>The Monte Carlo Simulation</b></div>
<div>
<br /></div>
<div>
We now know enough about probability distributions to run our first Monte Carlo simulation.</div>
<div>
<br /></div>
<div>
For our first simulation we are going to simulate the rolling of a pair of six sided dice.<br />
<br /></div>
<div>
Here is the code:</div>
<div>
<br /></div>
<span style="font-family: "courier new" , "courier" , monospace;">let(a=uniformIntegerDistribution(1, 6),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> b</span><span style="font-family: "courier new" , "courier" , monospace;">=uniformIntegerDistribution(1, 6),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> c=monteCarlo(add(sample(a), </span><span style="font-family: "courier new" , "courier" , monospace;">sample(b)), </span><span style="font-family: "courier new" , "courier" , monospace;">10))</span><br />
<br />
<div>
The expression above is setting three variables:</div>
<div>
<ul>
<li><i style="font-weight: bold;">a </i>is set to a uniform integer distribution with a range of 1 to 6.</li>
<li><b><i>b</i></b> is also set to a uniform integer distribution with a range of 1 to 6.</li>
<li><b><i>c</i></b> is set to the outcome of the monteCarlo function.</li>
</ul>
</div>
<div>
The monteCarlo function runs a function a specified number of times, collects the outputs into an array and then returns the array.</div>
<div>
<br /></div>
<div>
In the example above the function <span style="font-family: "courier new" , "courier" , monospace;">add(sample(a), </span><span style="font-family: "courier new" , "courier" , monospace;">sample(b)) </span>is run 10 times.</div>
<div>
<br /></div>
<div>
Each time the function is called, a sample is drawn from the distribution models stored in the variables <b><i>a</i></b> and <b><i>b.</i></b> The two random samples are then added together.<br />
<br />
Each run simulates rolling a pair of dice. The results of the 10 rolls are gathered into an array and returned.</div>
<div>
<br /></div>
<div>
The output from the expression above looks like this:</div>
<div>
<br /></div>
<div>
<span style="font-family: monospace; font-size: 12px; white-space: pre;">{
"result-set": {
"docs": [
{
"c": [
6,
6,
8,
8,
9,
7,
6,
8,
7,
6
]
},
{
"EOF": true,
"RESPONSE_TIME": 1
}
]
}
}</span></div>
<br />
<b>Counting the Results with a Frequency Table</b><br />
<br />
The results of the dice simulation can be analyzed using a frequency table:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">let(a=uniformIntegerDistribution(1, 6),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> b=uniformIntegerDistribution(1, 6),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> c=monteCarlo(add(sample(a), sample(b)), 100000), </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> d=freqTable(c))</span><br />
<div style="-webkit-text-stroke-width: 0px; color: black; font-size: medium; font-variant-caps: normal; font-variant-ligatures: normal; letter-spacing: normal; orphans: 2; text-align: start; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;">
<div style="font-style: normal; font-weight: normal; margin: 0px;">
<br /></div>
<div style="font-family: Times;">
Now we are running the simulation 100,000 times rather than 10. We are then using the <b><i>freqTable</i></b> function to count the frequency of each value in the array. </div>
<div style="font-family: Times; font-style: normal; font-weight: normal;">
<br /></div>
<div style="font-family: Times; font-style: normal; font-weight: normal;">
<a href="http://joelsolr.blogspot.com/2017/08/a-first-look-at-sunplot-statistical.html">Sunplot</a> provides a nice table view of the frequency table. <span style="font-family: "times";">The frequency table below shows the</span><span style="font-family: "times";"> </span><b style="font-family: Times;">percent</b><span style="font-family: "times";">,</span><span style="font-family: "times";"> </span><b style="font-family: Times;">count</b><span style="font-family: "times";">,</span><span style="font-family: "times";"> </span><b style="font-family: Times;">cumulative frequency</b><span style="font-family: "times";"> </span><span style="font-family: "times";">and</span><span style="font-family: "times";"> </span><b style="font-family: Times;">cumulative percent</b><span style="font-family: "times";"> </span><span style="font-family: "times";">for each</span><span style="font-family: "times";"> </span><b style="font-family: Times;">value (2-12)</b><span style="font-family: "times";"> </span><span style="font-family: "times";">in the simulation array.</span><br />
<div style="font-family: Times;">
<br /></div>
</div>
</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://2.bp.blogspot.com/-_CJy2lO8Kmk/Wd0YxcsowNI/AAAAAAAAAso/csHfHMf3KGcpeRkv1mmObR4avWg-UFWiACLcBGAs/s1600/Screen%2BShot%2B2017-10-10%2Bat%2B2.59.52%2BPM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="697" data-original-width="1600" height="139" src="https://2.bp.blogspot.com/-_CJy2lO8Kmk/Wd0YxcsowNI/AAAAAAAAAso/csHfHMf3KGcpeRkv1mmObR4avWg-UFWiACLcBGAs/s320/Screen%2BShot%2B2017-10-10%2Bat%2B2.59.52%2BPM.png" width="320" /></a></div>
<br />
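As a quick sanity check, when rolling two dice there are 36 equally likely outcomes and six of them sum to 7, so the <b><i>pct</i></b> value for 7 in the frequency table should approach 6/36, or roughly .1667, as the number of simulations grows.<br />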
<br />
<b>Plotting the Results</b><br />
<br />
Sunplot can also be used to plot specific columns from the frequency table.<br />
<br />
In the plot below the <b style="font-style: italic;">value </b><span style="font-style: italic;">column</span><b style="font-style: italic;"> </b>(2-12) from the frequency table is plotted on the <b><i>x</i></b> axis. The <b><i>pct</i></b> column (percent) from the frequency table is plotted on the <b><i>y</i></b> axis.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-GHrYoeEK3cY/Wd0bZlSfD1I/AAAAAAAAAs0/060OwcbWgdwuGsDtJ20jta0RZ-xo7EufACLcBGAs/s1600/Screen%2BShot%2B2017-10-10%2Bat%2B3.11.08%2BPM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="871" data-original-width="1600" height="174" src="https://3.bp.blogspot.com/-GHrYoeEK3cY/Wd0bZlSfD1I/AAAAAAAAAs0/060OwcbWgdwuGsDtJ20jta0RZ-xo7EufACLcBGAs/s320/Screen%2BShot%2B2017-10-10%2Bat%2B3.11.08%2BPM.png" width="320" /></a></div>
<br />
<br />
Below is the plotting expression:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">let(a=uniformIntegerDistribution(1, 6),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> b=uniformIntegerDistribution(1, 6),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> c=monteCarlo(add(sample(a), sample(b)), 100000),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> d=freqTable(c),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> </span><span style="font-family: "courier new" , "courier" , monospace;">x=col(d, value),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> </span><span style="font-family: "courier new" , "courier" , monospace;">y=col(d, pct),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> plot(type=bar, x=x, y=y)) </span><br />
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
Notice that the <b><i>x</i></b> and <i style="font-weight: bold;">y </i>variables are set using the <b style="font-style: italic;">col </b>function. The <b><i>col </i></b>function moves a field from a list of tuples into an array. In this case it's moving the <b><i>value</i></b> and <b><i>pct</i></b> fields from the frequency table tuples into arrays.</div>
<div>
<br /></div>
<div>
We've just completed our first Monte Carlo simulation and plotted the results. As a bonus we've learned the probabilities of a craps game!</div>
<br />
<b>Simulations with Real World Data</b><br />
<br />
The example above is using a <b>theoretical probability distribution</b>. There are many different theoretical distributions used in different fields. The first release of Solr's probability distribution framework includes some of the best known distributions, including the normal, log normal, Poisson, uniform, binomial, gamma, beta, Weibull and Zipf distributions.<br />
<br />
Each of these distributions are designed to model a particular theoretical data set.<br />
<br />
Solr also provides an <b><i>empirical distribution</i></b> function which builds a mathematical model based only on actual data. Empirical distributions can be sampled in exactly the same way as theoretical distributions. This means we can mix and match <b><i>empirical distributions</i></b> and<b><i> theoretical distributions</i></b> in Monte Carlo simulations.<br />
<br />
Let's take a very brief look at a Monte Carlo simulation using empirical distributions pulled from Solr Cloud collections.<br />
<br />
In this example we are building a new product which is made up of <b>steel</b> and <b>plastic</b>. Both steel and plastic are bought by the ton on the open market. We have historical pricing data for both steel and plastic and we want to simulate the unit costs based on the historical data.<br />
<br />
Here is our simulation expression:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">let(a=random(steel, q="*:*", fl="price", rows="2000"),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> b=random(plastic, </span><span style="font-family: "courier new" , "courier" , monospace;">q="*:*", fl="price", rows="2000"</span><span style="font-family: "courier new" , "courier" , monospace;">)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> c=col(a, price),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> d=col(b, price),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> steel=empiricalDistribtion(c),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> plastic=empiricalDistribtion(d),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> e=monteCarlo(add(mult(sample(steel), .0005), </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> mult(sample(plastic), .0021)), </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 100000),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> f=hist(e)</span><span style="font-family: "courier new" , "courier" , monospace;">) </span><br />
<br />
In the example above the <b><i>let</i></b> expression is setting the following variables:<br />
<br />
<ul>
<li><i style="font-weight: bold;">a </i>is set to the output of the <b><i>random</i></b> function. The random function is retrieving 2000 random tuples from the Solr Cloud collection containing steel prices.</li>
</ul>
<ul>
<li><i style="font-weight: bold;">b </i>is set to the output of the <b><i>random</i></b> function. The random function is retrieving 2000 random tuples from the Solr Cloud collection containing plastic prices.</li>
</ul>
<ul>
<li><i style="font-weight: bold;">c </i>is set to the output of the <b><i>col</i></b> function, which is copying the <b><i>price</i></b> field from the tuples stored in variable <b><i>a</i></b> to an array. This is an array of <b><i>steel</i></b> prices.</li>
</ul>
<ul>
<li><b><i>d </i></b>is set to the output of the <b><i>col</i></b> function, which is copying the <b><i>price</i></b> field from the tuples stored in variable <b><i>b</i></b> to an array. This is an array of <b><i>plastic</i></b> prices.</li>
</ul>
<ul>
<li>The <b><i>steel</i></b> variable is set to the output of the empiricalDistribution function, which is creating an empirical distribution from the array of <b><i>steel</i></b> prices.</li>
</ul>
<ul>
<li>The <b><i>plastic</i></b> variable is set to the output of the empiricalDistribution function, which is creating an empirical distribution from the array of <b><i>plastic </i></b>prices.</li>
</ul>
<ul>
<li><b><i>e</i></b> is set to the output of the<b><i> monteCarlo</i></b> function. The <b><i>monteCarlo</i></b> function runs the function with the formula for unit costs of steel and plastic. Random samples from the empirical distributions for steel and plastic are pulled for each run.</li>
</ul>
<ul>
<li><b>f </b>is set to the output of the <b>hist</b> function. The hist function returns the histogram of the output from the pricing simulation. A histogram is used instead of the frequency table when dealing with floating point data.</li>
</ul>
Joel Bernsteinhttp://www.blogger.com/profile/15670652646728608504noreply@blogger.comtag:blogger.com,1999:blog-447314081038493192.post-80852258821900873812017-10-05T12:17:00.001-07:002017-11-05T13:22:36.761-08:00How to Model and Remove Time Series Seasonality With Solr 7.1Often when working with time series data there is a cycle that repeats periodically. This periodic cycle is referred to as seasonality. Seasonality may have a large enough effect on the data that it makes it difficult to study other features of the time series. In Solr 7.1 there are new Streaming Expression statistical functions that allow us to model and remove time series seasonality.<br />
<br />
If you aren't familiar with Streaming Expressions' new statistical programming functions, you may find it useful to read a few of the earlier blogs which introduce the topic.<br />
<br />
<ul>
<li><a href="http://joelsolr.blogspot.com/2017/05/statistical-programming-with-solr.html">Statistical programming with Solr Streaming Expressions</a></li>
<li><a href="http://joelsolr.blogspot.com/2017/07/one-way-anova-and-rank-transformation.html">One-way ANOVA and Rank Transformation with Solr's Streaming Expressions</a></li>
<li><a href="http://joelsolr.blogspot.com/2017/08/a-first-look-at-sunplot-statistical.html">A first look at Sunplot, a statistical plotting engine for Solr Streaming Expressions</a></li>
<li><a href="http://joelsolr.blogspot.com/2017/07/detrending-time-series-data-with-linear.html">Detrending Time Series Data With Linear Regression in Solr 7</a></li>
<li><a href="http://joelsolr.blogspot.com/2017/08/time-series-cross-correlation-and.html">Time Series Cross-correlation and Lagged Regression With Solr Streaming Expressions</a></li>
</ul>
<br />
<b>Seasonality</b><br />
<br />
Often seasonality appears in the data as periodic bumps or waves. These waves can be expressed as sine-waves. For this example we'll start off by generating some smooth sine-waves to represent seasonality. We'll be using Solr's statistical functions to generate the data and Sunplot to plot the sine-waves.<br />
<br />
Here is a sample plot using Sunplot:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-30vPnMVS43Y/WdZuIPZGA7I/AAAAAAAAArY/uFic3ZEQ_TAn5MNFnEZl7VdMIWasqO9gQCLcBGAs/s1600/Screen%2BShot%2B2017-10-05%2Bat%2B1.38.04%2BPM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="859" data-original-width="1600" height="171" src="https://1.bp.blogspot.com/-30vPnMVS43Y/WdZuIPZGA7I/AAAAAAAAArY/uFic3ZEQ_TAn5MNFnEZl7VdMIWasqO9gQCLcBGAs/s320/Screen%2BShot%2B2017-10-05%2Bat%2B1.38.04%2BPM.png" width="320" /></a></div>
<br />
<br />
In the plot you'll notice there are waves in the data occurring at regular intervals. These waves represent the seasonality.<br />
<br />
The expression used to generate the sine-waves is:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">let(smooth=sin(sequence(100, 0, 6)),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> plot(type=line, y=smooth)) </span> <br />
<br />
In the function above the <b><i>let</i></b> expression is setting a single variable called <b><i>smooth</i></b>. The value set to smooth is an array of numbers generated by the <b><i>sequence</i></b> function that is wrapped and transformed by the <b><i>sin</i></b> function. <br />
<br />
Then the <b><i>let</i></b> function runs the<b><i> plot</i></b> function with the <b><i>smooth</i></b> variable as the y axis. Sunplot then plots the data.<br />
<br />
This sine-wave is perfectly smooth so the entire time series consists only of seasonality. To make things more interesting we can add some noise to the sine-waves to represent another component of the time series.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://2.bp.blogspot.com/-dOn1PPC8r58/WdZxrM7h2iI/AAAAAAAAArk/1HEmsAWhBAwx8BP55KL2YI6TT4hmP0ZKACLcBGAs/s1600/Screen%2BShot%2B2017-10-05%2Bat%2B1.53.09%2BPM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="868" data-original-width="1600" height="173" src="https://2.bp.blogspot.com/-dOn1PPC8r58/WdZxrM7h2iI/AAAAAAAAArk/1HEmsAWhBAwx8BP55KL2YI6TT4hmP0ZKACLcBGAs/s320/Screen%2BShot%2B2017-10-05%2Bat%2B1.53.09%2BPM.png" width="320" /></a></div>
<br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "times" , "times new roman" , serif;">Now the expression looks like this:</span><br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">let(smooth=sin(sequence(100, 0, 6)),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> noise=sample(uniformDistribution(-.25,.25),100),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> noisy=ebeAdd(smooth, noise), </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> plot(type=line, y=noisy)) </span><br />
<br />
<br />
In the expression above we first generate the smooth sine-wave and set it to the variable <b><i>smooth</i></b>. Then we generate some random noise by taking a sample from a uniform distribution. The random samples will be uniformly distributed between -.25 and .25. The variable<b style="font-style: italic;"> noise </b>holds the array of random noise data.<br />
<br />
Then the <b><i>smooth</i></b> and <b><i>noise</i></b> arrays are added together using the <b><i>ebeAdd</i></b> function. The <b><i>ebeAdd</i></b> function does an element-by-element addition of the two arrays and outputs an array with the results. This will add the noise to the sine-wave. The variable <b><i>noisy</i></b> holds this new noisy array of data.<br />
<br />
The <b><i>noisy</i></b> array is then set to the y axis of the plot.<br />
<br />
Now we have a time series that has both a seasonality component and noisy signal. Let's see how we can model and remove the seasonality so we can study the noisy component.<br />
<br />
<b>Modeling Seasonality </b><br />
<br />
We can model the seasonality using the new <b><i>polyfit</i></b> function to fit a curve to the data. The <b><i>polyfit </i></b>function is a <b>polynomial curve fitter </b>which builds a function that models non-linear data.<br />
<br />
Below is a screenshot of the polyfit function:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-9hkGcLzdsm8/WdZ4B_3N9bI/AAAAAAAAAr0/WREYYzp5I9gKTgdF7qFwK263_SrMIBPwgCLcBGAs/s1600/Screen%2BShot%2B2017-10-05%2Bat%2B2.19.58%2BPM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="867" data-original-width="1600" height="173" src="https://1.bp.blogspot.com/-9hkGcLzdsm8/WdZ4B_3N9bI/AAAAAAAAAr0/WREYYzp5I9gKTgdF7qFwK263_SrMIBPwgCLcBGAs/s320/Screen%2BShot%2B2017-10-05%2Bat%2B2.19.58%2BPM.png" width="320" /></a></div>
<br />
<br />
Notice that now there is a smooth red curve which models the noisy time series. This is the smooth curve that the polyfit function fit to the noisy time series.<br />
<br />
Here is the expression:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">let(smooth=sin(sequence(100, 0, 6)),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> noise=sample(uniformDistribution(-.25,.25),100),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> noisy=ebeAdd(smooth,noise), </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> fit=polyfit(noisy, 16),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> x=sequence(100,0,1), </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> list(tuple(plot=line, x=x, y=noisy),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> tuple(plot=line, x=x, y=fit))) </span><br />
<br />
In the expression above we first build the noisy time series. Then the <b><i>polyfit</i></b> function is called on the noisy array with a polynomial degree. The degree is the highest exponent of the polynomial used by the curve fitting function. As the degree rises the function gains more flexibility in the curves that it can model. For example, a degree of 1 produces a linear model. You can try different degrees until you find the one that best fits your data set. In this example a 16 degree polynomial is used to fit the sine-wave.<br />
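<br />
One rough way to compare degrees is to look at the residuals after subtracting the fitted curve, which the next section covers in more detail. A minimal sketch, reusing the <b><i>ebeSubtract</i></b> function and assuming the <i>describe</i> function is available, would be:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">let(smooth=sin(sequence(100, 0, 6)),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">    noise=sample(uniformDistribution(-.25,.25),100),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">    noisy=ebeAdd(smooth, noise),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">    fit=polyfit(noisy, 16),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">    resid=ebeSubtract(noisy, fit),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">    stats=describe(resid))</span><br />
<br />
If the degree captures the seasonality well, the residuals should look like the original noise, centered near 0 and bounded by roughly -.25 and .25.<br />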
<br />
Notice that when plotting two lines we use a slightly different plotting syntax. In the syntax above a list of output tuples is used to define the plot for Sunplot. When plotting two plots an <i><b>x</b></i> axis must be provided. The <b><i>sequence</i></b> function is used to generate an <b><i>x</i></b> axis.<br />
<br />
<b>Removing the Seasonality</b><br />
<br />
Once we've fit a curve to the time series we can subtract it away to remove the seasonality. After the subtraction what's left is the noisy signal that we want to study.<br />
<br />
Below is a screenshot showing the subtraction of the fitted curve:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-dPo71RvqP8Q/WdZ9TY87UOI/AAAAAAAAAsE/mDdMa2-IE8AOP22UiZctBfqs0vMp1XwVwCLcBGAs/s1600/Screen%2BShot%2B2017-10-05%2Bat%2B2.42.46%2BPM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="862" data-original-width="1600" height="172" src="https://1.bp.blogspot.com/-dPo71RvqP8Q/WdZ9TY87UOI/AAAAAAAAAsE/mDdMa2-IE8AOP22UiZctBfqs0vMp1XwVwCLcBGAs/s320/Screen%2BShot%2B2017-10-05%2Bat%2B2.42.46%2BPM.png" width="320" /></a></div>
<br />
Notice that the plot now shows the data that remains after the seasonality has been removed. This time series is now ready to be studied without the effects of the seasonality.<br />
<br />
Here is the expression:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">let(smooth=sin(sequence(100, 0, 6)),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> noise=sample(uniformDistribution(-.25,.25),100),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> noisy=ebeAdd(smooth,noise), </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> fit=polyfit(noisy, 16),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> stationary=ebeSubtract(noisy, fit), </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> plot(type=line, y=stationary)) </span><br />
<br />
In the expression above the <i style="font-weight: bold;">fit </i>array, which holds the fitted curve, is subtracted from the noisy array. The <b><i>ebeSubtrac</i></b>t function performs the element-by-element subtraction. The new time series with the seasonality removed is stored in the<b><i> stationary</i></b> variable and plotted on the <b><i>y</i></b> axis.<br />
<br />Joel Bernsteinhttp://www.blogger.com/profile/15670652646728608504noreply@blogger.comtag:blogger.com,1999:blog-447314081038493192.post-29563746962903653502017-08-06T19:55:00.000-07:002017-08-07T06:19:41.218-07:00Time Series Cross-correlation and Lagged Regression With Solr Streaming Expressions<div class="separator" style="clear: both; text-align: left;">
One of the more interesting capabilities in Solr's new statistical library is <b>cross-correlation</b>. But before diving into cross-correlation, let's start by describing <b>correlation</b>. Correlation measures the extent that two variables fluctuate together. For example if the rise of <b>stock A</b> typically coincides with a rise in <b>stock B </b>they are positively correlated. If a rise in <b>stock A</b> typically coincides with a fall in <b>stock B</b> they are negatively correlated.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
When two variables are highly correlated it may be possible to <b>predict</b> the value of one variable based on the value of the other variable. A technique called <b>simple regression</b> can be used to describe the linear relationship between two variables and provide a prediction formula.</div>
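<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
For example, once two numeric arrays are known to be correlated, a simple regression model can be built and then used for prediction. The sketch below is illustrative only; it assumes the <b><i>regress</i></b> and <b><i>predict</i></b> functions from the statistical library and uses made up numbers:</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<span style="font-family: "courier new" , "courier" , monospace;">let(x=array(1, 2, 3, 4, 5),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">    y=array(2.1, 3.9, 6.2, 8.1, 9.8),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">    model=regress(x, y),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">    p=predict(model, 6))</span></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
In this sketch <b><i>model</i></b> holds the simple regression model describing the linear relationship between <b><i>x</i></b> and <b><i>y</i></b>, and <b><i>p</i></b> holds the predicted <b><i>y</i></b> value when <b><i>x</i></b> is 6.</div>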
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Sometimes there is a time lag in the correlation. For example, if <b>stock A</b> rises and three days later <b>stock B </b>rises then there is a 3 day lag time in the correlation. </div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
We need to account for this lag time before we can perform a regression analysis.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<b>Cross-correlation</b> is a tool for discovering the lag time in correlation between two time series. Once we know the lag time we can account for it in our regression analysis using a technique known as <b>lagged regression</b>. </div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<b><span style="font-size: large;">Working With Sine Waves</span></b></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
This blog will demonstrate cross-correlation using simple sine waves. The same approach can be used on time series waveforms generated from data stored in Solr collections. </div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
The screenshot below shows how to generate and plot a sine wave:</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://4.bp.blogspot.com/-lSeHDTXEnBw/WYXyxaz4YLI/AAAAAAAAAmE/dlB70QU0RG4ffUqMZqpiA1Y7uSnurEEiQCLcBGAs/s1600/Screen%2BShot%2B2017-08-05%2Bat%2B12.30.27%2BPM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="910" data-original-width="1600" height="181" src="https://4.bp.blogspot.com/-lSeHDTXEnBw/WYXyxaz4YLI/AAAAAAAAAmE/dlB70QU0RG4ffUqMZqpiA1Y7uSnurEEiQCLcBGAs/s320/Screen%2BShot%2B2017-08-05%2Bat%2B12.30.27%2BPM.png" width="320" /></a></div>
<b><br /></b>
Let's break down what the expression is doing.<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">let(a=sin(sequence(100, 1, 6)),</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">    plot(type=line, y=a))</span><br />
<br />
<ol>
<li>The <b><i>let </i></b>expression is setting the variable <b><i>a</i></b> and then calling the <b><i>plot</i></b> function.</li>
<li>Variable<b><i> a</i></b> holds the output of the <b><i>sin</i></b> function which is wrapped around a <b><i>sequence</i></b> function. The sequence function creates a sequence of 100 numbers, starting from 1 with a stride of 6. The <b><i>sin</i></b> function wraps the sequence array and converts it to a sine wave by calling the trigonometric sine function on each element in the array.</li>
<li>The <b><i>plot</i></b> function plots a line using the array in variable <b><i>a</i></b> as the y axis.</li>
</ol>
<b><br /></b>
<b><span style="font-size: large;">Adding a Second Sine Wave</span></b><br />
<br />
To demonstrate cross-correlation we'll need to plot a second sine wave and create a lag between the two waveforms.<br />
<br />
The screenshot below shows the statistical functions for adding and plotting the second sine wave.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://4.bp.blogspot.com/-45PbKPMmc7Q/WYfXzwguIPI/AAAAAAAAAnU/Isv6VMnngpIuhU6gFqGiJlQULJ8-ZpC2gCLcBGAs/s1600/Screen%2BShot%2B2017-08-06%2Bat%2B10.59.53%2BPM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="853" data-original-width="1600" height="170" src="https://4.bp.blogspot.com/-45PbKPMmc7Q/WYfXzwguIPI/AAAAAAAAAnU/Isv6VMnngpIuhU6gFqGiJlQULJ8-ZpC2gCLcBGAs/s320/Screen%2BShot%2B2017-08-06%2Bat%2B10.59.53%2BPM.png" width="320" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
Let's explore the statistical expression:<br />
<br />
let(a=sin(sequence(100, 1, 6)),<br />
b=copyOfRange(a, 5, 100),<br />
x=sequence(100, 0, 1),<br />
list(tuple(plot=line, x=x, y=a),<br />
tuple(plot=line, x=x, y=b)))<br />
<br />
<ol>
<li>The <b>let</b> expression is setting the variables <b>a, b, x</b> and returning a list of tuples with plotting data.</li>
<li>Variable <b><i>a</i></b> holds the data for the first sine wave.</li>
<li>Variable <b><i>b</i></b> has a copy of the array stored in variable <b><i>a</i></b> starting from index 5. Starting the second sine wave from the 5th index creates the lag time between the two sine waves. </li>
<li>Variable <b><i>x</i></b> holds a sequence from 0 to 99 which will be used for plotting the <b><i>x</i></b> axis.</li>
<li>The <i style="font-weight: bold;">list </i>contains two output tuples which provide the plotting data. You'll notice that the syntax for plotting two lines does not involve the <b><i>plot</i></b> function, but instead requires a list of tuples containing plotting data. As Sunplot matures the syntax for plotting a single line and multiple lines will likely converge. </li>
</ol>
<br />
<b><span style="font-size: large;">Convolution and Cross-correlation</span></b><br />
<br />
We're going to be using the math behind convolution to cross-correlate the two waveforms. So before delving into cross-correlation it's worth having a discussion about convolution.<br />
<br />
Convolution is a mathematical operation that has a wide range of uses. In the field of Digital Signal Processing (DSP) convolution is considered the most important function. Convolution is also a key function in deep learning where it's used in <b>convolutional neural networks</b>.<br />
<br />
So what is convolution? Convolution takes two waveforms and produces a third waveform through a mathematical operation. The gist of the operation is to <b>reverse</b> one of the waveforms and slide it across the other waveform. As the waveform is slid across the other, the overlapping values are multiplied point by point at each position. The sum of these products at each position is stored in a new array, which is the "convolution" of the two waveforms.<br />
<br />
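As a quick illustration, the <b><i>conv</i></b> function can be applied directly to two small hard-coded arrays (a toy sketch that can be sent straight to the stream handler):<br />
<br />
conv(array(1, 2, 3, 4),<br />
array(1, 1, 1))<br />
<br />
For these inputs the result should be the array [1, 3, 6, 9, 7, 4], where each value is the sum of the overlapping products at one slide position.<br />
<br />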
That's all very interesting, but what does it have to do with cross-correlation? Well, as it turns out, convolution and cross-correlation are very closely related. The only difference is that in cross-correlation the sliding waveform is <b>not reversed</b> before it is slid across the other.<br />
<b><br /></b>
In the example below the convolve function (<b><i>conv</i></b>) is called on two waveforms. Notice that the second waveform is <b>reversed</b> with the <b><i>rev</i></b> function before the convolution. This is done because the convolution operation will reverse the second waveform. Since it's already been reversed the convolution function will reverse it again and work with the original waveform.<br />
<br />
This will result in a cross-correlation operation rather than convolution.<br />
<br />
The screenshot below shows the cross-correlation operation and its plot.<br />
<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-MTdVDOLBbzY/WYX0Qqy-sgI/AAAAAAAAAmU/NdM9MswVbkUV-IqoXDeQJSs2cECtXgIIQCLcBGAs/s1600/Screen%2BShot%2B2017-08-05%2Bat%2B12.36.53%2BPM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="971" data-original-width="1600" height="194" src="https://3.bp.blogspot.com/-MTdVDOLBbzY/WYX0Qqy-sgI/AAAAAAAAAmU/NdM9MswVbkUV-IqoXDeQJSs2cECtXgIIQCLcBGAs/s320/Screen%2BShot%2B2017-08-05%2Bat%2B12.36.53%2BPM.png" width="320" /></a></div>
<br />
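Transcribed as text, the expression in the screenshot is roughly the following (a sketch that assumes the same two sine waves defined earlier):<br />
<br />
let(a=sin(sequence(100, 1, 6)),<br />
b=copyOfRange(a, 5, 100),<br />
c=conv(a, rev(b)),<br />
plot(type=line, y=c))<br />
<br />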
The <b>highest</b> peak in the cross-correlation plot is the point where the two waveforms have the highest correlation.<br />
<br />
<b><span style="font-size: large;">Finding the Delay Between Two Time Series</span></b><br />
<br />
We've visualized the cross-correlation, but how do we use the cross-correlation array to find the delay? We actually have a function called <b>finddelay </b>which will calculate the delay for us. The finddelay function uses convolution math to calculate the cross-correlation array. But instead of returning the cross-correlation array it goes a step further and calculates the delay.<br />
<br />
The screenshot below shows how the finddelay function is called.<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-3zVCPQEkFSw/WYX0-ymYyZI/AAAAAAAAAmc/2zAKjlhM4LkKfw9iot9i1KZjkU2L5O-bgCLcBGAs/s1600/Screen%2BShot%2B2017-08-05%2Bat%2B12.40.05%2BPM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1000" data-original-width="1600" height="199" src="https://1.bp.blogspot.com/-3zVCPQEkFSw/WYX0-ymYyZI/AAAAAAAAAmc/2zAKjlhM4LkKfw9iot9i1KZjkU2L5O-bgCLcBGAs/s320/Screen%2BShot%2B2017-08-05%2Bat%2B12.40.05%2BPM.png" width="320" /></a></div>
<br />
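In text form the expression is roughly as follows (a sketch; because <b><i>b</i></b> starts at index 5 of <b><i>a</i></b>, the returned delay should be 5):<br />
<br />
let(a=sin(sequence(100, 1, 6)),<br />
b=copyOfRange(a, 5, 100),<br />
tuple(delay=finddelay(a, b)))<br />
<br />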
<b><span style="font-size: large;"><br /></span></b>
<b><span style="font-size: large;">Lagged Regression</span></b><br />
<br />
Once we know the delay between the two sine waves it's very easy to perform the lagged regression. The screenshot below shows the statistical expression and regression result.<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://4.bp.blogspot.com/-hD89lKJn9k8/WYfL6Xcb5JI/AAAAAAAAAnE/N0d0dGk2cJoxGPXE2K5Axd0zkC8DJmTXgCLcBGAs/s1600/Screen%2BShot%2B2017-08-06%2Bat%2B10.08.57%2BPM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="884" data-original-width="1600" height="176" src="https://4.bp.blogspot.com/-hD89lKJn9k8/WYfL6Xcb5JI/AAAAAAAAAnE/N0d0dGk2cJoxGPXE2K5Axd0zkC8DJmTXgCLcBGAs/s320/Screen%2BShot%2B2017-08-06%2Bat%2B10.08.57%2BPM.png" width="320" /></a></div>
<br />
<br />
Let's quickly review the expression and interpret the regression results:<br />
<br />
let(a=sin(sequence(100, 1, 6)),<br />
b=copyOfRange(a, 5, 100),<br />
c=finddelay(a, b), <br />
d=copyOfRange(a, c, 100),<br />
r=regress(b, d), <br />
tuple(reg=r))<br />
<br />
<ol>
<li>Variables <b><i>a</i></b> and <b><i>b</i></b> hold the two sine waves with the 5-step lag between them.</li>
<li>Variable <b><i>c</i></b> holds the delay between the two signals.</li>
<li>Variable <i style="font-weight: bold;">d </i>is a copy of the first sine wave starting from the <b>delay index</b> specified in variable <b><i>c</i></b>. </li>
</ol>
<br />
The sine waves in variables <b><i>b</i></b> and <b><i>d</i></b> are now in sync and ready to regress.<br />
<br />
The regression result is as follows:<br />
<br />
<span style="font-family: monospace; font-size: 12px; white-space: pre;"> {
"reg": {
"significance": 0,
"totalSumSquares": 48.42686366058407,
"R": 1,
"meanSquareError": 0,
"intercept": 0,
"slopeConfidenceInterval": 0,
"regressionSumSquares": 48.42686366058407,
"slope": 1,
"interceptStdErr": 0,
"N": 95,
"RSquare": 1
}
}</span><br />
<br />
<span style="font-family: "times" , "times new roman" , serif; white-space: pre;">The RSquare value of 1 indicates that the regression equation perfectly describes the linear relationship</span><br />
<span style="font-family: "times" , "times new roman" , serif; white-space: pre;">between the two arrays. </span>Joel Bernsteinhttp://www.blogger.com/profile/15670652646728608504noreply@blogger.comtag:blogger.com,1999:blog-447314081038493192.post-31297975117297289622017-08-01T19:40:00.000-07:002017-08-02T18:09:50.102-07:00A first look at Sunplot, a statistical plotting engine for Solr Streaming Expressions<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<h3 style="clear: both; text-align: left;">
<span style="font-size: large;">
Sunplot</span></h3>
<div>
<br /></div>
<div>
The last several blogs have discussed the new statistical programming syntax for Streaming Expressions. What was missing in those blogs was <b>plotting</b>. Plotting plays a central role in statistical analysis. Plotting allows you to quickly understand the shape of your data in a way that the numbers alone cannot.</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Sunplot is a new statistical plotting engine written by <a href="https://twitter.com/suzukimichael" target="_blank">Michael Suzuki</a> to work specifically with Solr's statistical programming syntax. This blog explores some of the features of Sunplot. </div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div style="clear: both; text-align: left;">
<b><span style="font-size: large;">
SQL and Statistical Expressions</span></b></div>
<div>
<br /></div>
<div>
<br />
Sunplot supports both SQL and Streaming Expressions. The SQL queries are sent to Solr's parallel SQL interface which evaluates the query across Solr Cloud collections. Streaming Expressions and statistical functions are evaluated by Solr's stream handler. </div>
<div>
<br /></div>
<div>
<br /></div>
<div>
Sunplot has a <b>json view</b>, <b>table view</b> and <b>charting view. </b>The image below shows a SQL query with results in the table view. </div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://4.bp.blogspot.com/-aaqsg_iB410/WYDAHU8UNRI/AAAAAAAAAkk/Y4rWKbZj7bw0lFSS01bbPjvHZX9Occ1NwCLcBGAs/s1600/Screen%2BShot%2B2017-08-01%2Bat%2B1.28.37%2BPM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="994" data-original-width="1600" height="198" src="https://4.bp.blogspot.com/-aaqsg_iB410/WYDAHU8UNRI/AAAAAAAAAkk/Y4rWKbZj7bw0lFSS01bbPjvHZX9Occ1NwCLcBGAs/s320/Screen%2BShot%2B2017-08-01%2Bat%2B1.28.37%2BPM.png" width="320" /></a></div>
<br />
<br />
<br />
<br />
The main code window handles both SQL and Streaming Expressions.<br />
<br />
<br />
<b><span style="font-size: large;">
The Plot Function</span></b><br />
<br />
<br />
Plotting of statistical functions is handled by the new<b> plot </b>function. The plot function allows you to specify arrays for the <b>x</b> and <b>y</b> axis and set the plot <b>type</b>. Supported plot types are scatter, line, bar and pie.<br />
<br />
Below is a screenshot of a very simple plot command:<br />
<div>
<br /></div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://4.bp.blogspot.com/-yg1LzT-FZLs/WYDUcj-6xlI/AAAAAAAAAlM/3pfxQcTCJj4Rjg64MrF2KeCoK9jRDSWjACLcBGAs/s1600/Screen%2BShot%2B2017-08-01%2Bat%2B3.19.38%2BPM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1000" data-original-width="1600" height="199" src="https://4.bp.blogspot.com/-yg1LzT-FZLs/WYDUcj-6xlI/AAAAAAAAAlM/3pfxQcTCJj4Rjg64MrF2KeCoK9jRDSWjACLcBGAs/s320/Screen%2BShot%2B2017-08-01%2Bat%2B3.19.38%2BPM.png" width="320" /></a></div>
<div>
<br /></div>
<div>
Notice that the plot function is plotting hard-coded arrays. Using this approach you can use Sunplot as a general purpose plotting tool.</div>
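<div>
<br /></div>
<div>
A plot command of that form looks something like this (a sketch with made-up values, not the exact arrays in the screenshot):</div>
<div>
<br /></div>
<div>
plot(type=line,<br />
x=array(1, 2, 3, 4, 5),<br />
y=array(10, 20, 15, 25, 30))</div>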
<div>
<br /></div>
<div>
The plot function also plots arrays generated by Streaming Expressions and statistical functions.</div>
<b><span style="font-size: large;"><br /></span></b>
<b><span style="font-size: large;">
Scatter Plots</span></b><br />
<br />
<br />
One of the core statistical plot types is the scatter plot. A scatter plot can be used to quickly understand how individual samples are distributed. It is also very helpful in visualizing the outliers in a sample set.<br />
<br />
The screenshot below shows a statistical expression and scatter plot of the result set.<br />
<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-mPilji9GP7U/WYDAUkbADBI/AAAAAAAAAko/_HCK81XfC_E0Vde-UCu_WqhepzOesoaEACLcBGAs/s1600/Screen%2BShot%2B2017-08-01%2Bat%2B1.32.58%2BPM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1000" data-original-width="1600" height="200" src="https://1.bp.blogspot.com/-mPilji9GP7U/WYDAUkbADBI/AAAAAAAAAko/_HCK81XfC_E0Vde-UCu_WqhepzOesoaEACLcBGAs/s320/Screen%2BShot%2B2017-08-01%2Bat%2B1.32.58%2BPM.png" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Let's explore the statistical syntax shown in the screen shot and interpret the scatter plot.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both;">
let(a=random(collection1, q="*:*", rows="500", fl="test_d"),</div>
<div class="separator" style="clear: both;">
b=col(a, test_d),</div>
<div class="separator" style="clear: both;">
plot(type=scatter, y=b))</div>
<div class="separator" style="clear: both; text-align: left;">
</div>
<ol>
<li>The<b><i> let</i></b> function is setting variables <b><i>a, b</i></b> and then executing the <b><i>plot</i></b> function.</li>
<li>Variable <b><i>a</i></b> is holding the output of the <b><i>random</i></b> function. The random function is returning 500 random result tuples from collection1. Each tuple has a single field called <b><i>test_d.</i></b></li>
<li>Variable <b><i>b</i></b> is holding the output of the <b><i>col</i></b> function. The col function returns a numeric array containing the values in the test_d field from the tuples stored in variable <b><i>a</i></b>. </li>
<li>The <b><i>plot</i></b> function returns the <b><i>x,y</i></b> coordinates and the plot <b><i>type </i></b>used by Sunplot to draw the plot. In the example the <b><i>y</i></b> axis is set to the numeric array stored in variable <b><i>b</i></b>. If no <b><i>x</i></b> axis is provided the plot function will generate a sequence for the <b><i>x</i></b> axis. </li>
</ol>
<br />
<div class="separator" style="clear: both; text-align: left;">
<b>Reading the Scatter Plot</b></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
The scatter plot moves across the <b><i>x axis</i></b> from the left to right and plots the <b><i>y axis</i></b> for each point. This allows you to immediately see how the y axis points are spread. </div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
In the example you can tell a few things very quickly:</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
1) The points seem to fall fairly evenly above and below 500.</div>
<div class="separator" style="clear: both; text-align: left;">
2) The bulk of the points fall between 480 and 520.</div>
<div class="separator" style="clear: both; text-align: left;">
3) Virtually all of the points fall between 460 and 540.</div>
<div class="separator" style="clear: both; text-align: left;">
4) There are a few outliers below 460 and above 540.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
This data set seems to have many of the characteristics of a <b>normal distribution</b>. In a normal distribution most of the points will be clustered above and below the <b>mean</b>. As you continue to move farther away from the mean the number of points taper off until there are just a few outliers.</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<b>Sorting the Points</b></div>
<div class="separator" style="clear: both; text-align: left;">
<b><br /></b></div>
<div class="separator" style="clear: both; text-align: left;">
We can learn more about the data set by sorting the <b>y axis</b> points before plotting. In the example below note how the <i style="font-weight: bold;">asc </i>function is applied to first sort the <b>y axis</b> points before plotting.</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://2.bp.blogspot.com/-7h8Faa0VPUk/WYDAod3yqRI/AAAAAAAAAks/DX7Q6usUeogXFCxUSIEi_jBS6oh11vzrQCLcBGAs/s1600/Screen%2BShot%2B2017-08-01%2Bat%2B1.34.33%2BPM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="999" data-original-width="1600" height="199" src="https://2.bp.blogspot.com/-7h8Faa0VPUk/WYDAod3yqRI/AAAAAAAAAks/DX7Q6usUeogXFCxUSIEi_jBS6oh11vzrQCLcBGAs/s320/Screen%2BShot%2B2017-08-01%2Bat%2B1.34.33%2BPM.png" width="320" /></a></div>
<br />
<br />
Once sorted you can see how the lower and upper outliers form curves with steeper slopes, while the bulk of the points form a gently sloping line passing through the mean.<br />
<br />
<b><span style="font-size: large;"><br /></span></b>
<b><span style="font-size: large;">
Histograms</span></b><br />
<br />
<br />
Now that we've seen the scatter plot of the individual points we can continue to visualize the data by plotting a histogram of the points.<br />
<br />
Before plotting, let's look at how to create a histogram and what a histogram output looks like:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://4.bp.blogspot.com/-TZRrVHHwTxI/WYEcrF-w_wI/AAAAAAAAAls/EA1ePWDruvo_VnJ72XXbTqO11z0ofq_TQCLcBGAs/s1600/Screen%2BShot%2B2017-08-01%2Bat%2B8.20.19%2BPM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="904" data-original-width="1600" height="180" src="https://4.bp.blogspot.com/-TZRrVHHwTxI/WYEcrF-w_wI/AAAAAAAAAls/EA1ePWDruvo_VnJ72XXbTqO11z0ofq_TQCLcBGAs/s320/Screen%2BShot%2B2017-08-01%2Bat%2B8.20.19%2BPM.png" width="320" /></a></div>
<br />
<div class="separator" style="clear: both;">
Let's explore the statistical expression that builds and outputs a histogram:</div>
<div class="separator" style="clear: both;">
<br /></div>
<div class="separator" style="clear: both;">
let(a=random(collection1, q="*:*", rows="500", fl="test_d"),</div>
<div class="separator" style="clear: both;">
b=col(a, test_d),</div>
<div class="separator" style="clear: both;">
c=hist(b, 7),</div>
<div class="separator" style="clear: both;">
get(c))</div>
<ol style="-webkit-text-stroke-width: 0px; color: black; font-family: Times; font-size: medium; font-variant-caps: normal; font-variant-ligatures: normal; letter-spacing: normal; orphans: 2; text-align: start; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;">
<li>The<b style="font-style: normal;"><i><span style="font-weight: normal;"> </span>let</i></b><span style="font-style: normal; font-weight: normal;"> </span>function is setting variables <b><i>a, b, c</i></b> and then executes the <b><i>get</i></b><span style="font-style: normal; font-weight: normal;"> </span>function.</li>
<li style="font-style: normal; font-weight: normal;">Variable <b><i>a</i></b> is holding the output of the <b><i>random</i></b> function. The random function is returning 500 random result tuples from collection1. Each tuple has a single field called <b><i>test_d.</i></b></li>
<li style="font-style: normal; font-weight: normal;">Variable <b><i>b</i></b> is holding the output of the <b><i>col</i></b> function. The col function returns a numeric array containing the values in the test_d field from the tuples stored in variable <b><i>a</i></b>. </li>
<li><span style="font-style: normal; font-weight: normal;">Variable </span><b><i>c</i></b> is holding the output of the <b><i>hist</i></b> function. The hist function creates a histogram with 7 bins from the numeric array stored in variable <b><i>b</i></b>. The histogram returns one tuple for each bin with a statistical summary of the bin.</li>
<li>The<span style="font-style: normal; font-weight: normal;"> </span><b><i>get</i></b><span style="font-style: normal; font-weight: normal;"> </span>function returns the list of histogram tuples held in variable c.</li>
</ol>
The screenshot above shows the histogram results listed in table view. Each row in the table represents a bin in the histogram. The <b><i>N</i></b> field is the number of observations that fall within the bin. The <b><i>mean</i></b> is the mean value of observations within the bin.<br />
<br />
To plot the histogram we'll need to extract the <b><i>N</i></b> and <b><i>mean</i></b> columns into arrays. We will then use the <b><i>mean</i></b> array as the x axis and the <b><i>N</i></b> array as the y axis. We will use 11 bins for the plot.<br />
<br />
The screenshot below shows the statistical expression and plot of the histogram:<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://2.bp.blogspot.com/-WY4NSeGcfSs/WYDAzPXZ4zI/AAAAAAAAAkw/so_A85lYr7AApcqZfbESDFAvSgO2yvj7wCLcBGAs/s1600/Screen%2BShot%2B2017-08-01%2Bat%2B1.37.40%2BPM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="998" data-original-width="1600" height="199" src="https://2.bp.blogspot.com/-WY4NSeGcfSs/WYDAzPXZ4zI/AAAAAAAAAkw/so_A85lYr7AApcqZfbESDFAvSgO2yvj7wCLcBGAs/s320/Screen%2BShot%2B2017-08-01%2Bat%2B1.37.40%2BPM.png" width="320" /></a></div>
<br />
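The expression behind the plot follows this pattern (a sketch assuming the same collection1 and test_d field; the <b><i>col</i></b> function pulls the <b><i>mean</i></b> and <b><i>N</i></b> columns out of the histogram tuples):<br />
<br />
let(a=random(collection1, q="*:*", rows="500", fl="test_d"),<br />
b=col(a, test_d),<br />
c=hist(b, 11),<br />
x=col(c, mean),<br />
y=col(c, N),<br />
plot(type=line, x=x, y=y))<br />
<br />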
The histogram plot has the bell curve you would expect to see with a normal distribution. Both the scatter plot and histogram plot are pointing to a normal distribution.<br />
<br />
Now we'll take a quick look at a statistical test to confirm that this data is a normal distribution.<br />
<b><span style="font-size: large;"><br /></span></b>
<b><span style="font-size: large;">
Descriptive Statistics</span></b><br />
<br />
<br />
First let's compute the descriptive statistics for the sample set with the describe function:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://2.bp.blogspot.com/-gcbkbNhb3XU/WYDBFwF3j-I/AAAAAAAAAk4/UNnIGjpIbKAzO7Ed5O5_rFhRjCEHgRqNwCLcBGAs/s1600/Screen%2BShot%2B2017-08-01%2Bat%2B1.40.20%2BPM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1000" data-original-width="1600" height="200" src="https://2.bp.blogspot.com/-gcbkbNhb3XU/WYDBFwF3j-I/AAAAAAAAAk4/UNnIGjpIbKAzO7Ed5O5_rFhRjCEHgRqNwCLcBGAs/s320/Screen%2BShot%2B2017-08-01%2Bat%2B1.40.20%2BPM.png" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
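The expression in the screenshot follows this pattern (a sketch using the same random sample of the test_d field):<br />
<br />
let(a=random(collection1, q="*:*", rows="500", fl="test_d"),<br />
b=col(a, test_d),<br />
describe(b))<br />
<br />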
<div class="separator" style="clear: both; text-align: left;">
The statistical expression above outputs a single tuple with the descriptive statistics for the sample set. Notice that the sample has a mean of 500 and a standard deviation of 20. Both the scatter and histogram plots provide visual confirmation of these statistics.</div>
<h3 style="clear: both; text-align: left;">
</h3>
<h3 style="clear: both; text-align: left;">
</h3>
<h3 style="clear: both; text-align: left;">
</h3>
<div style="clear: both; text-align: left;">
<b><br /></b>
<b><span style="font-size: large;">
Normal Distribution Testing With Kolmogorov–Smirnov Test</span></b></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Now that we know the mean and standard deviation we have enough information to run a one sample Kolmogorov–Smirnov (k-s) Test. A one sample k-s test is used to determine if a sample data set fits a reference distribution. </div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
The screenshot below shows the syntax and output for the k-s test: </div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-rNk35R7O6S4/WYDBPzD0EnI/AAAAAAAAAk8/AKP7wHtgEhYKfl9ZxR-sSjWXbQ9jcXfqwCLcBGAs/s1600/Screen%2BShot%2B2017-08-01%2Bat%2B1.48.46%2BPM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="996" data-original-width="1600" height="199" src="https://3.bp.blogspot.com/-rNk35R7O6S4/WYDBPzD0EnI/AAAAAAAAAk8/AKP7wHtgEhYKfl9ZxR-sSjWXbQ9jcXfqwCLcBGAs/s320/Screen%2BShot%2B2017-08-01%2Bat%2B1.48.46%2BPM.png" width="320" /></a></div>
<br />
The expression in the example calls the <b><i>normalDistribution</i></b> function which returns a reference distribution for the ks function. The normalDistribution function is created with a mean of 500 and standard deviation of 20 which is the same as the sample set.<br />
<br />
The <b><i>ks</i></b> function is then run using the reference distribution and the sample set.<br />
<br />
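Putting those pieces together, the test expression looks roughly like this (a sketch; the screenshot may differ in the details):<br />
<br />
let(a=random(collection1, q="*:*", rows="500", fl="test_d"),<br />
b=col(a, test_d),<br />
c=normalDistribution(500, 20),<br />
ks(c, b))<br />
<br />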
<span style="background-color: white; color: #222222; font-family: "arial" , "tahoma" , "helvetica" , "freesans" , sans-serif; font-size: 13.2px;">The <b><i>p-value</i></b> returned from the ks test is 0.38. This means that there is a 38% chance you would be wrong if you rejected the hypothesis that the sample set could have been taken from the reference distribution. Typically a p-value of .05 or lower is taken as evidence that we can reject the test hypothesis. </span><br />
<span style="background-color: white; color: #222222; font-family: "arial" , "tahoma" , "helvetica" , "freesans" , sans-serif; font-size: 13.2px;"><br /></span>
<span style="background-color: white; color: #222222; font-family: "arial" , "tahoma" , "helvetica" , "freesans" , sans-serif; font-size: 13.2px;">Based on the p-value the ks test confirms that the sample set fits a normal distribution.</span><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />Joel Bernsteinhttp://www.blogger.com/profile/15670652646728608504noreply@blogger.comtag:blogger.com,1999:blog-447314081038493192.post-84997600508340483012017-07-12T13:44:00.000-07:002017-07-12T14:02:27.956-07:00Detrending Time Series Data With Linear Regression in Solr 7Often when working with time series data there is a linear trend present in the data. For example if a stock price has been gradually rising over a period of months you'll see a positive slope in the time series data. This slope over time is the trend. Before performing statistical analysis on the time series data it's often necessary to remove the trend.<br />
<div>
<br /></div>
<div>
Why is a trend problematic? Consider an example where you want to correlate two time series that are trending on a similar slope. Because they both have a similar slope they will appear to be correlated. But in reality they may be trending for entirely different reasons. To tell if the two time series are actually correlated you would need to first remove the trends and then perform the correlation on the detrended data. </div>
<div>
<br /></div>
<div>
<h3>
Linear Regression </h3>
<div>
<div>
<br /></div>
<div>
Linear regression is a statistical tool used to measure the linear relationship between two variables. For example you could use linear regression to determine if there is a linear relationship between <b>age</b> and <b>medical costs</b>. If a linear relationship is found you can use linear regression to predict the value of a dependent variable based on the value of an independent variable.</div>
<div>
<br /></div>
<div>
Linear regression can also be used to remove a linear trend from a time series.</div>
<div>
<br /></div>
<h3>
Removing a Linear Trend from a Time Series </h3>
<div>
<br /></div>
<div>
We can remove a linear trend from a time series using the following technique:</div>
<div>
<br /></div>
<div>
<ol>
<li>Regress the <b>dependent </b>variable over a <b>time sequence</b>. For example if we have 12 months of time series observations the time sequence would be expressed as 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12.</li>
<li>Use the regression analysis to predict a dependent value at each time interval. Then subtract the <b>prediction</b> from the <b>actual </b>value. The difference between actual and predicted value is known as the <b>residual</b>. The residuals array is the time series with the trend removed. You can now perform statistical analysis on the residuals.</li>
</ol>
</div>
<div>
Sounds complicated, but an example will make this more clear and Solr makes this all very easy to do.</div>
<div>
<br /></div>
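<div>
Before walking through the full example, here is the technique in miniature (a toy sketch with hard-coded monthly counts; the <b><i>regress</i></b> and <b><i>residuals</i></b> functions are used the same way in the full expression below):</div>
<div>
<br /></div>
<div>
let(c=array(100, 110, 125, 120, 140, 150, 165, 160, 180, 190, 205, 200),<br />
e=sequence(length(c), 1, 1),<br />
f=regress(e, c),<br />
g=residuals(f, e, c),<br />
tuple(detrended=g))</div>
<div>
<br /></div>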
<div>
<h3>
Example: Exploring the linear relationship between<b> marketing spend</b> and <b>site usage</b>.</h3>
<div>
<br /></div>
<div>
In this example we want to explore the linear relationship between <b>marketing spend</b> and <b>website usage.</b> The motivation for this is to determine if higher marketing spend causes higher website usage. </div>
<div>
<br /></div>
<div>
Website usage has been trending upwards for over a year. We have been varying the marketing spend throughout the year to experiment with how different levels of marketing spend impact website usage. </div>
<div>
<br /></div>
<div>
Now we want to regress the marketing spend and the website usage to build a simple model of how usage is impacted by marketing spend. But before we can build this model we must remove the trend from the website usage or the cumulative effect of the trend will mask the relationship between marketing spend and website usage.</div>
<div>
<br /></div>
<div>
Here is the streaming expression:</div>
<div>
<br /></div>
<div>
let(a=timeseries(logs, </div>
<div>
q="rec_type:page_view", </div>
<div>
field="rec_time", </div>
<div>
start="<span style="background-color: white; color: #222222; font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif; font-size: 13.2px;">2016-01-01T00:00:00Z</span>", </div>
<div>
end="<span style="background-color: white; color: #222222; font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif; font-size: 13.2px;">2016-12-31T00:00:00Z</span>", </div>
<div>
gap="+1MONTH", </div>
<div>
count(*)),</div>
<div>
b=jdbc(connection="jdbc:mysql://...", </div>
<div>
sql="select marketing_expense from monthly_expenses where ..."),</div>
<div>
c=col(a, count(*)),</div>
<div>
d=col(b, marketing_expense),</div>
<div>
e=sequence(length(c), 1, 1),</div>
<div>
f=regress(e, c),</div>
<div>
g=residuals(f, e, c),</div>
<div>
h=regress(d, g),</div>
<div>
tuple(regression=h)) </div>
<div>
<br /></div>
<div>
Let's break down what this expression is doing:</div>
<div>
<br /></div>
<div>
<ol>
<li>The <b><i>let</i></b> expression is setting the variables<b><i> a, b, c, d, e, f, g, h</i></b> and returning a single result tuple.</li>
<li>Variable <b><i>a</i></b> is holding the result tuples from a <b><i>timeseries</i></b> function that is querying the logs for monthly usage counts. </li>
<li>Variable <b><i>b</i></b> is holding the result tuples from a <b><i>jdbc </i></b>function which is querying an external database for monthly marketing expenses.</li>
<li>Variable <b><i>c</i></b> is holding the output from a <b><i>col</i></b> function which returns the values in the <b><i>count(*)</i></b> field from the tuples stored in variable <i style="font-weight: bold;">a. </i>This is an array containing the monthly usage counts.</li>
<li>Variable <b><i>d</i></b> is holding the output from a <b><i>col</i></b> function which returns the values in the <b><i>marketing_expense</i></b> field from the tuples stored in variable <b><i>b</i></b><i style="font-weight: bold;">. </i>This is an array containing the monthly marketing expenses.</li>
<li>Variable <i style="font-weight: bold;">e </i>holds the output of the <b><i>sequence</i></b> function which returns an array of numbers the same length as the array in variable <b style="font-style: italic;">c</b>. The sequence starts from 1 and has a stride of 1. </li>
<li>Variable <i style="font-weight: bold;">f </i>holds the output of the <b><i>regress</i></b> function which returns a regression result. The regression is performed with the sequence in variable <b><i>e</i></b> as the independent variable and monthly usage counts in variable <b><i>c</i></b> as the dependent variable.</li>
<li>Variable <b><i>g</i></b> holds the output of the <b><i>residuals</i></b> function which returns the residuals from applying the regression result to the data sets in variables <b><i>e</i></b> and <b><i>c</i></b>. <b>The residuals are the monthly usage counts with the trend removed</b>.</li>
<li>Variable <b><i>h</i></b> holds the output of the <b><i>regress</i></b> function which returns a regression result. The regression is being performed with the <b><i>marketing expenses</i></b> (variable <b><i>d)</i></b> as the independent variable. The<b><i> residuals</i></b> from the monthly usage regression (variable <b><i>g</i></b>) are the dependent variable. This regression result will describe the linear relationship between marketing expenses and site usage.</li>
<li>The output tuple is returning the regression result.</li>
</ol>
</div>
<div>
</div>
</div>
</div>
</div>
Joel Bernsteinhttp://www.blogger.com/profile/15670652646728608504noreply@blogger.comtag:blogger.com,1999:blog-447314081038493192.post-7884699692082619102017-07-09T19:59:00.001-07:002017-07-09T20:22:38.416-07:00One-way ANOVA and Rank Transformation with Solr's Streaming ExpressionsIn the previous <a href="http://joelsolr.blogspot.com/2017/06/random-sampling-histograms-and-point.html" target="_blank">blog</a> we explored the use of <b>random</b> <b>sampling</b> and <b>histograms </b>to pick a threshold for <b>point-wise</b> anomaly detection. Point-wise anomaly detection is a good place to start, but alerting based on a single anomalous point may lead to false alarms. What we need is a statistical technique that can help confirm that the problem goes beyond a single point.<br />
<br />
<h3>
Spotting Differences In Sets of Data</h3>
<br />
The specific example in the last blog dealt with finding <b>individual log records </b>with unusually high response times. In this blog we'll be looking for <b>sets of log records</b> with unusually high response times.<br />
<br />
One approach to doing this is to compare the <b><span style="color: #660000;">means </span></b>of response times between different sets of data. For this we'll use a statistical approach called One-way ANOVA.<br />
<br />
<h3>
One-way ANOVA (Analysis of Variance)</h3>
<br />
The Streaming Expression statistical library includes the <b>anova</b> function. The anova function is used to determine if the <b>difference in means</b> between two or more sample sets is statistically significant.<br />
<br />
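In its simplest form the anova function takes two or more numeric arrays (a toy sketch with hard-coded values):<br />
<br />
anova(array(2, 4, 3, 5, 4),<br />
array(9, 8, 10, 9, 11))<br />
<br />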
In the example below we'll use ANOVA to compare two samples of data:<br />
<br />
<ol>
<li>A sample taken from a known period of normal response times.</li>
<li>A sample taken before and after the point-wise anomaly.</li>
</ol>
<div>
If the difference in means between the two sets is statistically significant we have evidence that the data around the anomalous data point is also unusual.</div>
<br />
<br />
<h3>
Accounting For Outliers</h3>
<br />
We already know that sample #2 has at least one outlier point. A few large outliers could skew the mean of sample #2 and bias the ANOVA calculation. <br />
<br />
In order to determine if sample set #2 as a whole has a higher mean than sample #1 we need a way to decrease the effect of outliers on the ANOVA calculation.<br />
<br />
<h3>
Rank Transformation</h3>
<br />
One approach for smoothing outliers is to first rank transform the data sets before running the ANOVA. Rank transformation transforms each value in the data to an ordinal ranking.<br />
<br />
The Streaming Expression function library includes the <b>rank</b> function which performs the rank transformation.<br />
<br />
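For example, a toy rank transform of a small hard-coded array (for these values the result should be the ordinal positions [4, 1, 3, 5, 2]):<br />
<br />
rank(array(43, 6, 22, 81, 15))<br />
<br />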
In order to compare the data sets following the rank transform, we'll need to perform the rank transformation on both sets of data as if they were one contiguous data set. Streaming Expressions provides array manipulation functions that will allow us to do this.<br />
<br />
<br />
<h3>
The Streaming Expression</h3>
<br />
In the expression below we'll perform the ANOVA:<br />
<br />
let(a=random(logs,<br />
q="rec_time:[2017-05 TO 2017-06]",<br />
fq="file_name:index.html",<br />
fl="response_time",<br />
rows="7000"),<br />
b=random(logs,<br />
q="rec_time:[NOW-10MINUTES TO NOW]", <br />
fq="file_name:index.html",<br />
fl="response_time",<br />
rows="7000"),<br />
c=col(a, response_time),<br />
d=col(b, response_time),<br />
e=addAll(c, d),<br />
f=rank(e),<br />
g=copyOfRange(f, 0, length(c)),<br />
h=copyOfRange(f, length(c), length(f)),<br />
i=anova(g, h),<br />
tuple(results=i))<br />
<br />
Let's break down what this expression is doing:<br />
<br />
<ol>
<li>The let expression is setting the variables <b><i>a, b, c, d, e, f, g, h, i</i></b> and returning a single response tuple.</li>
<li>The variable <i style="font-weight: bold;">a </i>holds the tuples from a random sample of response times from a period of normal response times (sample set #1).</li>
<li>The variable<b><i> b</i></b> holds the tuples from a random sample of response times before and after the anomalous data point (sample set #2).</li>
<li>Variables <b><i>c</i></b> and <i style="font-weight: bold;">d </i>hold results of the <b>col</b> function which returns a column of numbers from a list of tuples. Sample set #1 is in variable <b><i>c</i></b>. Sample set #2 is in variable <b><i>d</i></b>.</li>
<li>Variable <i style="font-weight: bold;">e </i>holds the result of the addAll function which is returning a single array containing the contents of variables <b><i>c</i></b> and <b><i>d</i></b>.</li>
<li>Variable <b><i>f</i></b> holds the results of the rank function which performs the rank transformation on variable <i><b>e</b></i>. </li>
<li>Variables <b><i>g</i></b> and <b><i>h </i></b>hold the values of copyOfRange functions. The copyOfRange function is used to separate the single rank transformed array back into two data sets. Variable <b><i>g</i></b> holds the rank transformed values of sample set #1. Variable <b><i>h</i></b> holds the rank transformed values of sample set #2.</li>
<li>Variable<b><i> i </i></b>holds the result of the anova function which is performing the ANOVA on variable <b><i>g</i></b> and <b style="font-style: italic;">h</b>.</li>
<li>The response tuple has a single field called <i style="font-weight: bold;">results </i>that contains the results of the ANOVA on the rank transformed data sets.</li>
</ol>
<br />
<h3>
Interpreting the ANOVA p-value</h3>
<br />
The response from the Streaming Expression above looks like this:<br />
<br />
<span style="font-family: monospace; font-size: 12px; white-space: pre;">{
"result-set": {
"docs": [
{
"results": {
"p-value": 0.0008137581457111631,
"f-ratio": 38.4
}
},
{
"EOF": true,
"RESPONSE_TIME": 789
}
]
}
}</span><br />
<span style="font-family: monospace; font-size: 12px; white-space: pre;"><br /></span>
<br />
The p-value of 0.0008 is the probability of seeing a difference in means at least this large if there were actually no difference between the two sample sets.<br />
<br />
Based on this p-value we can say with a very high level of confidence that there is a statistically significant difference in the means between the two sample sets.<br />
<br />Joel Bernsteinhttp://www.blogger.com/profile/15670652646728608504noreply@blogger.comtag:blogger.com,1999:blog-447314081038493192.post-32177320272109685832017-06-28T18:59:00.001-07:002017-06-28T19:33:12.760-07:00Random Sampling, Histograms and Point-wise Anomaly Detection In SolrIn the <a href="http://joelsolr.blogspot.com/2017/05/statistical-programming-with-solr.html" target="_blank">last blog</a> we started to explore Streaming Expression's new statistical programming functions. The last blog described a statistical expression that retrieved two data sets with <b>SQL expressions,</b> computed the <b>moving averages</b> for the data sets and <b>correlated </b>the moving averages.<br />
<br />
In this blog we'll explore <b>random sampling</b>, <b>histograms</b> and rule based <b>point-wise</b> anomaly detection.<br />
<br />
<h3>
Turning Mountains into Mole Hills with Random Sampling</h3>
<br />
Random sampling is one of the most powerful concepts in statistics. Random sampling involves taking a smaller random sample from a larger data set, which can be used to infer statistics about the larger data set.<br />
<br />
Random sampling has been used for decades to deal with the problem of not having access to the entire data set. For example taking a poll of everyone in a large population may not be feasible. Taking a random sample of the population is likely much more feasible.<br />
<br />
In the big data age we are often presented with a different problem: too much data. It turns out that random sampling helps solve this problem as well. Instead of having to process the entire massive data set we can select a random sample of the data set and infer statistics about the larger data set.<br />
<br />
<span style="color: #660000; font-weight: bold;">Note: </span>It's important to understand that working with random samples does introduce potential statistical error. There are formulas for determining the margin of error given specific sample sizes. This <a href="http://www.research-advisors.com/tools/SampleSize.htm" target="_blank">link</a> also provides a sample size table which shows margin of errors for specific sample sizes.<br />
<h3>
Solr is a Powerful Random Sampling Engine</h3>
<br />
Slicing, dicing and creating random samples from large data sets are some of the primary capabilities needed to tackle big data statistical problems. Solr happens to be one of the best engines in the world for doing this type of work.<br />
<br />
Solr has had the ability to select random samples from search results for a long time. The new statistical syntax in Streaming Expressions makes this capability much more powerful. Now Solr has the power to select random samples from large distributed data sets and perform statistical analysis on the random samples.<br />
<h3>
<br />The Random Streaming Expression</h3>
<br />
The random Streaming Expression retrieves a pseudo random set of documents that match a query. Each time the random expression is run it will return a different set of pseudo random records.<br />
<br />
The syntax for the random expression is:<br />
<br />
random(collection1, q="soly query", fl="fielda, fieldb", rows="17000")<br />
<br />
This simple but powerful expression selects 17,000 pseudo random records from a Solr Cloud collection that matches the query.<br />
<br />
<h3>
Understanding Data Distributions with Histograms</h3>
<br />
Another important statistical tool is the <a href="https://en.wikipedia.org/wiki/Histogram" target="_blank">histogram</a>. Histograms are used to understand the distribution of a data set. Histograms divide a data set into bins and provides statistics about each bin. By inspecting the statistics of each bin you can understand the distribution of the data set.<br />
<br />
<h3>
The hist Function</h3>
<br />
Solr's Streaming Expression library has a <b>hist</b> function which returns a histogram for an array of numbers.<br />
<br />
The hist function has a very simple syntax:<br />
<br />
hist(col, 10)<br />
<br />
The function above takes two parameters:<br />
<br />
<ol>
<li>An array of numbers</li>
<li>The number of bins in the histogram</li>
</ol>
<br />
<h3>
Creating a Histogram from a Random Sample</h3>
<br />
Using the Streaming Expression statistical syntax we can combine random sampling and histograms to understand the distribution of large data sets.<br />
<br />
In this example we'll work with a sample data set of log records. Our goal is to create a histogram of the response times for the home page.<br />
<br />
Here is the basic syntax:<br />
<br />
let(a=random(logs, q="file_name:index.html", fl="response_time", rows="17000"),<br />
b=col(a, response_time),<br />
c=hist(b, 10),<br />
tuple(hist=c))<br />
<br />
Let's break down what this expression is doing:<br />
<br />
1) The <b>let</b> expression is setting variables <b><i>a, b </i></b>and <b><i>c</i></b> and then returning a single response <b>tuple</b>.<br />
<br />
2) Variable <i><b>a</b></i> stores the result tuples from the random streaming expression. The random streaming expression is returning 17000 pseudo random records from the <b>logs</b> collection that match the query file_name:index.html.<br />
<br />
3) Variable <b><i>b</i></b> stores the output of the <b>col</b> function. The col function returns a column of numbers from a list of tuples. In this case the list of tuples is held in the variable <b><i>a</i></b>. The field name is response_time.<br />
<br />
4) Variable <b><i>c</i></b> stores the output of the <b>hist </b>function. The hist function returns a histogram from a column of numbers. In this case the column of numbers is stored in variable <i style="font-weight: bold;">b. </i>The number of bins in the histogram is 10.<br />
<br />
5) The tuple expression returns a single output tuple with the <b>hist</b> field set to variable <i style="font-weight: bold;">c, </i>which contains the histogram.<br />
<br />
The output from this expression is a histogram with 10 bins describing the random sample of home page response times. Descriptive statistics are provided for each bin.<br />
<br />
By looking at the histogram we can gain a full understanding of the distribution of the data. Below is a sample histogram. Note that <b><i>N</i></b> is the number of observations that are in the bin.<br />
<br />
<span style="font-family: monospace; font-size: 12px; white-space: pre;">{
"result-set": {
"docs": [
{
"hist": [
{
"min": 105.80360488681794,
"max": 184.11423669457605,
"mean": 158.07101244548903,
"var": 676.6416949523991,
"sum": 1106.4970871184232,
"stdev": 26.012337360421864,
"N": 7
},
{
"min": 187.1450299482844,
"max": 262.86798264568415,
"mean": 235.8519937762809,
"var": 400.7486779625581,
"sum": 31368.315172245355,
"stdev": 20.01870819914607,
"N": 133
},
{
"min": 263.6907639320808,
"max": 341.7723630856346,
"mean": 312.0580142849335,
"var": 428.02686585995957,
"sum": 259944.32589934967,
"stdev": 20.688810160566497,
"N": 833
},
{
"min": 342.0007054044787,
"max": 420.508689773685,
"mean": 387.10102356966337,
"var": 497.5116682425222,
"sum": 1008398.166398972,
"stdev": 22.30496958622724,
"N": 2605
},
{
"min": 420.5348042867488,
"max": 499.173632576587,
"mean": 461.5725595026505,
"var": 505.85122370654324,
"sum": 2267244.4122770214,
"stdev": 22.491136558798964,
"N": 4912
},
{
"min": 499.23963590242806,
"max": 577.8765472307315,
"mean": 535.9950922008038,
"var": 500.5743269892825,
"sum": 2589928.2855142825,
"stdev": 22.373518431156118,
"N": 4832
},
{
"min": 577.9106064943256,
"max": 656.5613165857329,
"mean": 611.5787667510084,
"var": 481.60546877783116,
"sum": 1647593.1976272168,
"stdev": 21.945511358312686,
"N": 2694
},
{
"min": 656.5932936523765,
"max": 734.7738394881361,
"mean": 685.4426886363782,
"var": 451.02322430952523,
"sum": 573715.5303886493,
"stdev": 21.237307369568423,
"N": 837
},
{
"min": 735.9448445737111,
"max": 812.751632738434,
"mean": 762.5240648996678,
"var": 398.4721757713377,
"sum": 102178.22469655548,
"stdev": 19.961767851854646,
"N": 134
},
{
"min": 816.2895922221702,
"max": 892.6066799061479,
"mean": 832.5779161364087,
"var": 481.68131277525964,
"sum": 10823.512909773315,
"stdev": 21.94723929735263,
"N": 13
}
]
},
{
"EOF": true,
"RESPONSE_TIME": 986
}
]
}
}</span><br />
<br />
<h3>
Point-wise Anomaly Detection</h3>
<br />
Point-wise anomaly detection deals with finding a single anomalous data point.<br />
<br />
Based on the histogram we can devise a rule for detecting when an anomaly response time appears in the logs. For this example let's set a rule that any response time that falls within the last two bins is an anomaly. The specific rule would be:<br />
<br />
response_time > 735<br />
<br />
<h3>
Creating an Alert With the Topic Streaming Expression</h3>
<br />
Now that we have a rule for detecting anomaly response times we can use the <b>topic</b> expression to return all new records in the logs collection that match the anomaly rule. The topic expression would look like this:<br />
<br />
topic(checkpoints,<br />
logs,<br />
q="file_name:index.html AND response_time:[735 TO *]", <br />
fl="id, response_time",<br />
id="response_anomalies")<br />
<br />
The expression above provides one-time delivery of all records that match the anomaly rule. Notice that the anomaly rule is the query for the topic expression. This is a very efficient approach for retrieving just the anomaly records.<br />
<br />
We can wrap the topic in an <b>update</b> and <b>daemon</b> expression to run the topic at intervals and store anomaly records in another collection. The collection of anomalies can then be used for alerting.Joel Bernsteinhttp://www.blogger.com/profile/15670652646728608504noreply@blogger.comtag:blogger.com,1999:blog-447314081038493192.post-75550700782160876822017-05-30T19:27:00.000-07:002017-07-07T19:00:46.587-07:00Statistical programming with Solr Streaming ExpressionsIn the previous blog we explored the new <b>timeseries</b> function and introduced the syntax for <b>math expressions</b>. In this blog we'll dive deeper into math expressions and explore the statistical programming functions rolling out in the next release.<br />
<br />
Let's first learn how the statistical expressions work and then look at how we can perform statistical analysis on retrieved result sets.<br />
<br />
<h3>
Array Math</h3>
<br />
The statistical functions create, manipulate and perform math on<b> arrays</b>. One of the basic things that we can do is create an array with the array function:<br />
<br />
array(2, 3, 4, 3, 6)<br />
<br />
The array function simply returns an array of numbers. If we send the array function above to Solr's stream handler it responds with:<br />
<br />
<span style="font-family: monospace; font-size: 12px; white-space: pre;">{
"result-set": {
"docs": [
{
"return-value": [
2,
3,
4,
3,
6
]
},
{
"EOF": true,
"RESPONSE_TIME": 1
}
]
}
}</span><br />
<br />
Notice that the stream handler returns a single Tuple with the <b>return-value</b> field pointing to the array. This is how Solr responds when given a statistical function to evaluate.<br />
<br />
This is a new behavior for Solr. In the past the stream handler always returned streams of Tuples. Now the stream handler can directly perform mathematical functions.<br />
<br />
Let's explore a few more of the new array math functions. We can manipulate arrays in different ways. For example we can reverse the array like this:<br />
<br />
rev(array(2, 3, 4, 3, 6))<br />
<br />
Solr returns the following from this expression:<br />
<br />
<span style="font-family: monospace; font-size: 12px; white-space: pre;">{
"result-set": {
"docs": [
{
"return-value": [
6,
3,
4,
3,
2
]
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}</span><br />
<div>
<br /></div>
We can describe the array:<br />
<br />
describe(array(2, 3, 4, 3, 6))<br />
<br />
<span style="font-family: monospace; font-size: 12px; white-space: pre;">{
"result-set": {
"docs": [
{
"return-value": {
"sumsq": 74,
"max": 6,
"var": 2.3000000000000003,
"geometricMean": 3.365865436338599,
"sum": 18,
"kurtosis": 1.4555765595463175,
"N": 5,
"min": 2,
"mean": 3.6,
"popVar": 1.8400000000000003,
"skewness": 1.1180799331493778,
"stdev": 1.5165750888103102
}
},
{
"EOF": true,
"RESPONSE_TIME": 31
}
]
}
}</span><br />
<br />
Now we see our first bit of statistics. The describe function provides <b>descriptive statistics</b> for the array.<br />
<br />
We can correlate arrays:<br />
<br />
corr(array(2, 3, 4, 3, 6),<br />
array(-2, -3, -4, -3, -6))<br />
<br />
This returns:<br />
<br />
<span style="font-family: monospace; font-size: 12px; white-space: pre;">{
"result-set": {
"docs": [
{
"return-value": -1
},
{
"EOF": true,
"RESPONSE_TIME": 2
}
]
}
}</span><br />
<div>
<br /></div>
<br />
The <b>corr</b> function performs the <b>Pearson Product Moment </b>correlation on the two arrays. In this case the arrays are perfectly negatively correlated.<br />
<br />
We can perform a simple regression on the arrays:<br />
<br />
regress(array(2, 3, 4, 3, 6),<br />
array(-2, -3, -4, -3, -6))<br />
<br />
<span style="font-family: monospace; font-size: 12px; white-space: pre;">{
"result-set": {
"docs": [
{
"return-value": {
"significance": 0,
"totalSumSquares": 9.2,
"R": -1,
"meanSquareError": 0,
"intercept": 0,
"slopeConfidenceInterval": 0,
"regressionSumSquares": 9.2,
"slope": -1,
"interceptStdErr": 0,
"N": 5
}
},
{
"EOF": true,
"RESPONSE_TIME": 9
}
]
}
}</span><br />
<span style="font-family: monospace; font-size: 12px; white-space: pre;"><br /></span>
<br />
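A regression result can also be passed to the <b>predict</b> function to estimate a dependent value (a sketch continuing the example above; with a slope of -1 and an intercept of 0 the prediction for 5 should be -5):<br />
<br />
let(r=regress(array(2, 3, 4, 3, 6),<br />
array(-2, -3, -4, -3, -6)),<br />
p=predict(r, 5),<br />
tuple(prediction=p))<br />
<br />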
All statistical functions in the initial release are backed by <b>Apache Commons Math</b>. The initial release includes a core group of functions that support:<br />
<br />
<ul>
<li>Rank transformations</li>
<li>Histograms</li>
<li>Percentiles</li>
<li>Simple regression and predict functions</li>
<li>One way ANOVA</li>
<li>Correlation</li>
<li>Covariance</li>
<li>Descriptive statistics</li>
<li>Convolution</li>
<li>Finding the delay in signals/time series</li>
<li>Lagged regression</li>
<li>Moving averages</li>
<li>Sequence generation</li>
<li>Calculating Euclidean distance between arrays</li>
<li>Data normalization and scaling</li>
<li>Array creation and manipulation functions</li>
</ul>
<b>Statistical functions can be applied to:</b><br />
<ol>
<li> Time series result sets</li>
<li> Random sampling result sets</li>
<li> SQL result sets (Solr's Internal Parallel SQL)</li>
<li> JDBC result sets (External JDBC Sources)</li>
<li> K-Nearest Neighbor results sets</li>
<li> Graph Expression result sets</li>
<li> Search result sets</li>
<li> Faceted aggregation result sets</li>
<li> MapReduce result sets </li>
</ol>
<br />
<h3>
<br />Array Math on Solr Result Sets</h3>
<br />
Let's now explore how we can apply statistical functions on Solr result sets. In the example below we'll correlate arrays of moving averages for two stocks:<br />
<br />
let(stockA = sql(stocks, stmt="select closing_price from price_data where ticker='aaa' and ..."),<br />
stockB = sql(stocks, stmt="select closing_price from price_data where ticker='bbb' and ..."),<br />
pricesA = col(stockA, closing_price),<br />
pricesB = col(stockB, closing_price),<br />
movingA = movingAvg(pricesA, 30),<br />
movingB = movingAvg(pricesB, 30),<br />
tuple(correlation=corr(movingA, movingB)))<br />
<br />
Let's break down how this expression works:<br />
<br />
1) The <b>let</b> expression is <b>setting variables</b> and then returning a single <b>output tuple</b>.<br />
<br />
2) The first two variables <b>stockA</b> and <b>stockB </b>contain result sets from sql expressions. The sql expressions return tuples with the closing prices for stock tickers aaa and bbb.<br />
<br />
3) The next two variables <b>pricesA</b> and<b> pricesB </b>are created by the <b>col</b> function. The col function creates a numeric array from a list of Tuples. In this example pricesA contains the closing prices for stockA and pricesB contains the closing prices for stockB.<br />
<br />
4) The next two variables <b>movingA</b> and <b>movingB </b>are created by the movingAvg function. In this example movingA and movingB contain arrays with the moving averages calculated from the pricesA and pricesB arrays.<br />
<br />
5) In the final step we output a single Tuple containing the correlation of the movingA and movingB arrays. The correlation is computed using the <b>corr</b> function.<br />
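The function list above also includes simple regression and finding the delay between signals, and the same let block can emit more than one statistic in its output tuple. Below is a hedged sketch that reuses the placeholder sql queries from the example; the finddelay function name is taken from the reference guide and may differ by release:<br />
<br />
let(stockA = sql(stocks, stmt="select closing_price from price_data where ticker='aaa' and ..."),<br />
    stockB = sql(stocks, stmt="select closing_price from price_data where ticker='bbb' and ..."),<br />
    pricesA = col(stockA, closing_price),<br />
    pricesB = col(stockB, closing_price),<br />
    movingA = movingAvg(pricesA, 30),<br />
    movingB = movingAvg(pricesB, 30),<br />
    tuple(correlation=corr(movingA, movingB),<br />
          regression=regress(movingA, movingB),<br />
          delay=finddelay(pricesA, pricesB)))<br />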
<br />Joel Bernsteinhttp://www.blogger.com/profile/15670652646728608504noreply@blogger.comtag:blogger.com,1999:blog-447314081038493192.post-51702492537672915052017-05-01T18:54:00.001-07:002017-05-03T19:35:31.926-07:00Exploring Solr's New Time Series and Math Expressions<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">In Solr 6.6 the Streaming Expression library has added support for <b>time series</b> and <b>math expressions</b>. This blog will walk through an example of how to use these exciting features.</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span>
<br />
<h3>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif; font-size: small;">Time Series</span></h3>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">Time series aggregations are supported through the<b> timeseries</b> Streaming Expression. The timeseries expression uses the json facet api under the covers so the syntax will be familiar if you've used Solr date range syntax.</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">Here is the basic syntax:</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">timeseries(collection, </span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> field="test_dt", </span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> q="*:*",</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> start="2012-05-01T00:00:00Z",</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> end="2012-06-30T23:59:59Z",</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> gap="+1MONTH", </span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> count(*))</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">When sent to Solr this expression will return results that look like this:</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="white-space: pre;"><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">{
"result-set": {
"docs": [
{
"test_dt": "2012-05-01T00:00:00Z",
"count(*)": 247007
},
{
"test_dt": "2012-06-01T00:00:00Z",
"count(*)": 247994
},
{
"EOF": true,
"RESPONSE_TIME": 9
}
]
}
}</span></span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">Solr takes care of the date math and builds the time range buckets automatically. Solr also fills in any gaps in the range with buckets automatically and adds zero aggregation values. Any Solr query can be used to select the records. </span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">The supported aggregations are: count(*), sum(field), avg(field), min(field), max(field).</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">The timeseries function is quite powerful on it's own, but it grows in power when combined with math expressions.</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span>
<br />
<h3>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif; font-size: small;">Math Expressions</span></h3>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">In Solr 6.6 the Streaming Expression library also adds math expressions. This is a larger topic then one blog can cover, but I'll hit some of highlights by slowly building up a math expression.</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span>
<br />
<h3>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif; font-size: small;">Let and Get</span></h3>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">The fun begins with the <b>let</b> and <b>get </b>expressions. <b>let</b> is used to assign tuple streams to <b>variables</b> and <b>get</b> is used to retrieve the stream later in the expression. Here is the most basic example:</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">let(a=timeseries(collection, field="test_dt", q="*:*",</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> start="2012-05-01T00:00:00Z",</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> end="2012-06-30T23:59:59Z",</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> gap="+1MONTH", </span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> count(*)),</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> get(a))</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">In the example above the timeseries expression is being set to the variable <i><b>a</b></i>. Then the <b>get</b> expression is used to turn the variable <b><i>a</i></b> back into a stream.</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">The let expression allows you to set any number of variables, and assign a single Streaming Expression to run the program logic. The expression that runs the program logic has access to the variables. The basic structure of let is:</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">let(a=expr,</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> b=expr,</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> c=expr,</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> expr)</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">The first three name/value pairs are setting variables and the final expression is the program logic that will use the variables.</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">If we send the let expression with the timeseries to Solr it returns with:</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="white-space: pre;"><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">{
"result-set": {
"docs": [
{
"test_dt": "2012-05-01T00:00:00Z",
"count(*)": 247007
},
{
"test_dt": "2012-06-01T00:00:00Z",
"count(*)": 247994
},
{
"EOF": true,
"RESPONSE_TIME": 9
}
]
}
}</span></span><br />
<span style="white-space: pre;"><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span></span>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">This is the exact same response we would get if we sent the timeseries expression alone. Thats because all we did was assign the expression to a variable and use <i><b>get</b></i> to stream out the results.</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><b>Implementation Note: </b>Under the covers the let expression sets each variable by executing the expressions and adding the tuples to a list. It then maps the variable name to the list in memory so that it can be retrieved by the variable name. So in memory Streams are converted to <b>lists of tuples</b>.</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span>
<br />
<h3>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">The S<b>elect</b> Expression</span></h3>
<div>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span></div>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">The select expression has been around for a long time, but it now plays a central role in math expressions. The select expression wraps another expression and applies a list of Stream Evaluators to each tuple. Stream Evaluators perform operations on the tuples. </span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">The Streaming Expression library now includes a base set of <b>numeric evaluators</b> for performing math on tuples. Here is an example of select in action:</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">let(a=timeseries(collection, field="test_dt", q="*:*",</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> start="2012-05-01T00:00:00Z",</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> end="2012-06-30T23:59:59Z",</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> gap="+1MONTH", </span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> count(*)),</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> b=<b>select</b>(get(a), </span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> mult(-1, count(*)) as negativeCount,</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> test_dt),</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> get(b))</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">In the example above we've set a timeseries to variable <b><i>a</i>.</b></span><br />
<b><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span></b>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">Then we are doing something really interesting with variable<b> <i>b</i></b>. We are transforming the timeseries tuples stored in variable <b style="font-style: italic;">a </b>with the select expression. </span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">The select expression is reading all the tuples from the get(a) expression and applying the <b>mult</b> stream evaluator to each tuple. The mult Streaming Evaluator is multiplying -1 to the value in the count(*) field of the tuples and assigning it to the field negativeCount. Select is also outputting the test_dt field from the tuples.</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">The transformed tuples are then assigned to variable <b><i>b.</i></b></span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">Then get(b) is used to output the transformed tuples. If you send this expression to Solr it outputs:</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="white-space: pre;"><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">{
"result-set": {
"docs": [
{
"test_dt": "2012-05-01T00:00:00Z",
"negativeCount": -247007
},
{
"test_dt": "2012-06-01T00:00:00Z",
"negativeCount": -247994
},
{
"EOF": true,
"RESPONSE_TIME": 9
}
]
}
}</span></span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><b>Implementation Note: </b>The <i style="font-weight: bold;">get </i>expression<i style="font-weight: bold;"> </i>creates new tuples when it streams tuples from a variable. So you never have to worry about <b>side effects</b>. In the example above variable<b><i> a</i></b> was unchanged when the tuples were transformed and assigned to variable <b><i>b</i></b>.</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span>
<br />
<h3>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif; font-size: small;">The Tuple Expression</span></h3>
<div>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">The basic data structure of Streaming Expressions is a Tuple. A Tuple is a set of name/value pairs. In the 6.6 release of Solr there is a Tuple expression which allows you to create your own output tuple. Here is the sample syntax:</span></div>
<div>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">let(a=timeseries(collection, field="test_dt", q="*:*",</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> start="2012-05-01T00:00:00Z",</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> end="2012-06-30T23:59:59Z",</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> gap="+1MONTH", </span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> count(*)),</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> b=<b>select</b>(get(a), </span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> mult(-1, count(*)) as negativeCount,</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> test_dt),</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> tuple(seriesA=a,</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> seriesB=b))</span></div>
<div>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">The example above defines an output tuple with two fields: <b>seriesA</b> and <b>seriesB</b>, both of these fields have been assigned a variable. Remember that variables <b><i>a</i></b> and <i style="font-weight: bold;">b </i>are pointers to lists of tuples. This is exactly how they will be output by the tuple expression.</span></div>
<div>
<b><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span></b></div>
<div>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">If you send the expression above to Solr it will respond with:</span></div>
<div>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div>
<span style="white-space: pre;"><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">{
"result-set": {
"docs": [
{
"seriesA": [
{
"test_dt": "2012-05-01T00:00:00Z",
"count(*)": 247007
},
{
"test_dt": "2012-06-01T00:00:00Z",
"count(*)": 247994
}
],
"seriesB": [
{
"test_dt": "2012-05-01T00:00:00Z",
"negativeCount": -247007
},
{
"test_dt": "2012-06-01T00:00:00Z",
"negativeCount": -247994
}
]
},
{
"EOF": true,
"RESPONSE_TIME": 7
}
]
}
}</span></span></div>
<div>
</div>
<div>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">Now we have both the original time series and the transformed time series in the output.</span></div>
<div>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span></div>
<h3>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif; font-size: small;">The Col Evaluator</span></h3>
<div>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">Lists of tuples are nice, but for performing many math operations what we need are columns of numbers. There is a special evaluator called <i style="font-weight: bold;">col </i>which can be used to pull out a column of numbers from a list of tuples.</span></div>
<div>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">Here is the basic syntax:</span></div>
<div>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">let(a=timeseries(collection, field="test_dt", q="*:*",</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> start="2012-05-01T00:00:00Z",</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> end="2012-06-30T23:59:59Z",</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> gap="+1MONTH", </span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> count(*)),</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> b=<b>select</b>(get(a), </span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> mult(-1, count(*)) as negativeCount,</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> test_dt),</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> c=col(a, count(*)),</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> d=col(b, negativeCount),</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> tuple(seriesA=a,</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> seriesB=b,</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> columnC=c,</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> columnD=d))</span></div>
<div>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">Now we have two new variables <b><i>c</i></b> and <b style="font-style: italic;">d, </b>both pointing to a col expression. The col expression takes two parameters. The first parameter is a <b>variable</b> pointing to a list of tuples. The second parameter is the field to pull the column data from.</span></div>
<div>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">Also notice that there are two new fields in the output tuple that output the columns. If you send this expression to Solr it responds with:</span></div>
<div>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div>
<span style="white-space: pre;"><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">{
"result-set": {
"docs": [
{
"seriesA": [
{
"test_dt": "2012-05-01T00:00:00Z",
"count(*)": 247007
},
{
"test_dt": "2012-06-01T00:00:00Z",
"count(*)": 247994
}
],
"seriesB": [
{
"test_dt": "2012-05-01T00:00:00Z",
"negativeCount": -247007
},
{
"test_dt": "2012-06-01T00:00:00Z",
"negativeCount": -247994
}
],
"columnC": [
247007,
247994
],
"columnD": [
-247007,
-247994
]
},
{
"EOF": true,
"RESPONSE_TIME": 6
}
]
}
}</span></span></div>
<div>
<span style="white-space: pre;"><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span></span></div>
<div>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><span style="white-space: pre;">Now</span><span style="white-space: pre;"> the columns appear in the output.</span></span></div>
<div>
<span style="white-space: pre;"><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span></span></div>
<h3>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif; font-size: small;"><span style="white-space: pre;">Performing Math on Columns</span></span></h3>
<div>
<span style="white-space: pre;"><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span></span></div>
<div>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">We've seen already that there are numeric Stream Evaluators that work on tuples in the<b> select</b> expression.</span></div>
<div>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">Some numeric evaluators also work on columns. An example of this is the <b>corr</b> evaluator which performs the <b>Pearson product-moment correlation</b> calculation on two columns of numbers.</span></div>
<div>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">Here is the sample syntax:</span></div>
<div>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">let(a=timeseries(collection, field="test_dt", q="*:*",</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> start="2012-05-01T00:00:00Z",</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> end="2012-06-30T23:59:59Z",</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> gap="+1MONTH", </span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> count(*)),</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> b=<b>select</b>(get(a), </span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> mult(-1, count(*)) as negativeCount,</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> test_dt),</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> c=col(a, count(*)),</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> d=col(b, negativeCount),</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> tuple(seriesA=a,</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> seriesB=b,</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> columnC=c,</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> columnD=d,</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"> correlation=corr(c, d)))</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">Notice that the tuple now has a new field called <b>correlation</b> with the output of the corr function set to it. If you send this to Solr it responds with:</span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="white-space: pre;"><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">{
"result-set": {
"docs": [
{
"seriesA": [
{
"test_dt": "2012-05-01T00:00:00Z",
"count(*)": 247007
},
{
"test_dt": "2012-06-01T00:00:00Z",
"count(*)": 247994
}
],
"seriesB": [
{
"test_dt": "2012-05-01T00:00:00Z",
"negativeCount": -247007
},
{
"test_dt": "2012-06-01T00:00:00Z",
"negativeCount": -247994
}
],
"columnC": [
247007,
247994
],
"columnD": [
-247007,
-247994
],
"correlation": -1
},
{
"EOF": true,
"RESPONSE_TIME": 6
}
]
}
}</span></span><br />
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span>
<br />
<h3>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif; font-size: small;">Opening the Door to the Wider World of Mathematics</span></h3>
</div>
<div>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">The syntax described in this blog opens the door to more sophisticated mathematics. For example the <b>corr</b> function can be used as a building block for <b>cross-correlation</b>, <b>auto-correlation</b> and <b>auto-regression</b> functions. Apache Commons Math includes machine learning algorithms such as clustering and regression and data transformations such as Fourier transforms that work on columns of numbers.</span></div>
<div>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">In the near future the Streaming Expressions math library will include these functions and many more.</span></div>
Joel Bernsteinhttp://www.blogger.com/profile/15670652646728608504noreply@blogger.comtag:blogger.com,1999:blog-447314081038493192.post-52452956468335125772017-04-19T10:27:00.000-07:002017-04-19T19:39:50.853-07:00Having a chat with Solr using the new echo Streaming ExpressionIn the next release of Solr, there is a new and interesting Streaming Expression called <b>echo</b>.<br />
<br />
echo is a very simple expression with the following syntax:<br />
<br />
echo("Hello World")<br />
<br />
If we send this to Solr, it responds with:<br />
<br />
<span style="font-family: monospace; font-size: 12px; white-space: pre;">{
"result-set": {
"docs": [
{
"echo": "Hello World"
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}</span><br />
<br />
Solr simply echoes the text back, but maybe it feels a bit like Solr is talking to us. Like there might be someone there.<br />
<br />
Well it turns out that this simple exchange is the first step towards a more meaningful conversation.<br />
<br />
Let's take another step:<br />
<br />
classify(echo("Customer service is just terrible!"),<br />
model(models, id="sentiment"),<br />
field="echo",<br />
analyzerField="message_t")<br />
<br />
Now we are echoing text to a classifier. The classify function points to a <b>model</b> stored in Solr that does sentiment analysis based on the text. Notice that the classify function has an <b>analyzerField</b> parameter. This points to a Lucene/Solr analyzer used by the classify function to pull the features from the text (see this <a href="http://joelsolr.blogspot.com/2017/01/deploying-ai-alerting-system-with-solrs.html" target="_blank">blog</a> for more details on the classify function).<br />
<br />
If we send this to Solr we may get a response like this:<br />
<br />
<span style="font-family: monospace; font-size: 12px; white-space: pre;">{
"result-set": {
"docs": [
{
"echo": "</span>Customer service is just terrible!<span style="font-family: monospace; font-size: 12px; white-space: pre;">",</span><br />
<span style="font-family: monospace; font-size: 12px; white-space: pre;"> "probability_d":0.94888
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}</span><br />
<span style="font-family: monospace; font-size: 12px; white-space: pre;"><br /></span>
The probability_d field is the probability that the text has a negative sentiment. In this case there was a 94% probability that the text was negative.<br />
<br />
Now Solr knows something about what's being said. We can wrap other Streaming Expressions around this to take actions or begin to formulate a response.<br />
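As a minimal sketch of that kind of wrapping, and assuming the same model and field names as above, a having expression could keep only strongly negative messages. The 0.9 threshold and the gt boolean evaluator usage are illustrative; check the reference guide for your release:<br />
<br />
having(classify(echo("Customer service is just terrible!"),<br />
                model(models, id="sentiment"),<br />
                field="echo",<br />
                analyzerField="message_t"),<br />
       gt(probability_d, 0.9))<br />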
<br />
But we really don't yet have enough information to make a very informed response. <br />
<br />
We can take this a bit further.<br />
<br />
Consider this expression:<br />
<br />
select(echo("Customer service is just terrible!"),<br />
analyze(echo, analyzerField) as expr_s)<br />
<br />
The expression above uses the <b>select</b> expression to <b>echo</b> the text to the <b>analyze</b> Stream Evaluator. The <b>analyze</b> Stream Evaluator applies a Lucene/Solr analyzer to the text and returns a token stream. But in this case it returns a single token, which is a <b>Streaming Expression. </b><br />
<b><br /></b>
(See this <a href="http://joelsolr.blogspot.com/2017/03/streaming-nlp-is-coming-in-solr-66.html" target="_blank">blog</a> for more details on the analyze Stream Evaluator)<br />
<b><br /></b>
In order to make this work you would define the final step of the analyzer chain as a<b> token filter</b> that builds a Streaming Expression based on the natural language parsing done earlier in the analyzer chain.<br />
<br />
Now we can wrap this construct in the new <b>eval</b> expression:<br />
<br />
eval(select(echo("Customer service is just terrible!"),<br />
analyze(echo, analyzerField) as expr_s))<br />
<br />
The <b>eval</b> expression will <b>compile and run</b> the Streaming Expression created by the analyzer. It will also emit the tuples that are emitted by the compiled expression. The tuples emitted are the response to the natural language request.<br />
<br />
The heavy lifting is done in the analysis chain which performs the NLP and generates the Streaming Expression response.<br />
<b><br /></b>
<b>Streaming Expressions as an AI Language</b><br />
<br />
Before Streaming Expressions existed, Dennis Gove shared an email with me with his initial design for the Streaming Expression syntax. The initial syntax used Lisp-like S-Expressions. I took one look at the S-Expressions and realized we were building an AI language. I'll get into more detail about how this syntax ties into AI shortly, but first a little more history on Streaming Expressions.<br />
<br />
The S-Expressions were replaced with the more familiar function syntax that Streaming Expressions has today. This decision was made by Dennis and Steven Bower. It turned out to be the right call because we now have a more familiar syntax than Lisp, but we also kept many of Lisp's most important qualities.<br />
<br />
Dennis contributed the Streaming Expression parser and I began looking for something interesting to do with it. The very first thing I tried to do with Streaming Expressions was to re-write SQL queries as Streaming Expressions for the Parallel SQL interface. For this project a SQL parser was used to parse the queries and then a simple planner was built that generated Streaming Expressions to implement the physical query plan.<br />
<br />
This was an important proving ground for Streaming Expressions for a number of reasons. It proved that Streaming Expressions could provide the functionality needed to implement the SQL query plans. It proved that Streaming Expressions could push functionality down into the search engine and also rise above the search engine using MapReduce when needed.<br />
<br />
Most importantly from an AI standpoint it proved that we could easily generate Streaming Expressions<b> programmatically</b>. This was one of the key features that made Lisp a useful AI Language. The reason that Streaming Expressions are so easily generated is that the syntax is extremely regular. There are only nested functions. And because Streaming Expressions have an underlying Java object representation, we didn't have to do any String manipulation. We could work directly with the Object tree structure to build the expressions.<br />
<br />
Why is code generation important for AI? One of the reasons is shown earlier in this blog. A core AI use case is to respond to natural language requests. One approach to doing this is to analyze the text request and then generate code to implement a response. In many ways it's similar to the problem of translating SQL to a physical query plan.<br />
<br />
In a more general sense code generation is important in AI because you're dealing with many unknowns so it can be difficult to code everything up front. Sometimes you may need to generate logic on the fly.<br />
<br />
<b>Domain Specific Languages</b><br />
<br />
Lisp has the capability of adapting its syntax for specific domains through its powerful macro feature. Streaming Expressions has this capability as well, but it does it in a different way.<br />
<br />
Each Streaming Expression is implemented in Java under the covers. Each Streaming Expression is responsible for parsing its own parameters. This means you can have Streaming Expressions that invent their own little languages. The <b>select</b> expression is a perfect example of this.<br />
<br />
The basic select expression looks like this:<br />
<br />
select(expr, fielda as outField)<br />
<br />
This <b>select</b> reads tuples from a stream and outputs fielda as outField. The Streaming Expression parser has no concept of the word "as". This is specific to the select expression and the select expression handles the parsing of "as".<br />
<br />
The reason why this works is that under the covers each Streaming Expression sees its parameters as a <b>list</b> that it can manipulate any way it wants.<br />
<b><br /></b>
<b>Embedded In a Search Engine</b><br />
<br />
Having an AI language embedded in a search engine is a huge advantage. It allows expressions to leverage vast amounts of information in interesting ways. The inverted index already has important statistics about the text which can be used for machine learning. Search engines have strong facilities for working with text (tokenizers, filters etc..) and in recent years they've become powerful column stores for numeric calculations. They also have mature content ingestion and parallel query frameworks.<br />
<br />
Now there is a language that ties it all together.Joel Bernsteinhttp://www.blogger.com/profile/15670652646728608504noreply@blogger.comtag:blogger.com,1999:blog-447314081038493192.post-12466968575276410072017-03-30T14:11:00.004-07:002017-03-31T00:53:38.557-07:00Streaming NLP is coming in Solr 6.6Solr 6.5 is out now, so it's time to start thinking about the next release. One of the interesting features coming in Solr 6.6 is <b>Streaming NLP</b>. This exciting new feature is already committed and waiting for release. This blog will describe how Streaming NLP works.<br />
<br />
<h3>
The <b><span style="color: #660000;">analyze</span></b> Stream Evaluator</h3>
<br />
One of the features added in Solr 6.5 was Stream Evaluators. Stream Evaluators perform operations on Tuples in the stream. There are already a rich set of <b>math</b> and <b>boolean </b>Stream Evaluators in Solr 6.5 and more coming in Solr 6.6. The math and boolean Stream Evaluators allow you to build complex boolean logic and mathematical formulas on Tuples in the stream.<br />
<br />
Solr 6.6 also has a new Stream Evaluator, called <b>analyze</b>, that works with text. The analyze evaluator applies a Lucene/Solr analyzer to a text field in the Tuples and returns a list of tokens produced by the analyzer. The tokens can then be used to annotate Tuples or streamed out as Tuples. We'll show examples of both approaches later in the blog.<br />
<br />
But it's useful to talk about the power behind Lucene/Solr analyzers first. Lucene/Solr has a large set of analyzers that tokenize different languages and apply filters that transform the token stream. The "analyzer chain" design allows you to chain tokenizers and filters together to perform very powerful text transformations and extractions.<br />
<br />
The analysis chain also provides a pluggable API for adding new NLP tokenizers and filters to Solr. New tokenizers and filters can be added and then layered with existing tokenizers and filters in interesting ways. New NLP analysis chains can then be used both during indexing and with Streaming NLP.<br />
<br />
<h3>
The <span style="color: #660000;">cartesianProduct</span> Streaming Expression</h3>
<br />
The cartesianProduct Streaming Expression is also new in Solr 6.6. The cartesianProduct expression emits a <b>stream of Tuples</b> from a single Tuple by creating a cartesian product from a multi-valued field or a <b>text</b> field. The <b>analyze</b> Stream Evaluator is used with the cartesianProduct Streaming Expression to create a cartesian product from a text field.<br />
<br />
Here is a very simple example:<br />
<br />
For this example we have indexed a single record in Solr with an id and text field called body:<br />
<br />
id: 1<br />
body: "c d e f g"<br />
<br />
The following expression will create a cartesian product from this Tuple:<br />
<br />
<b>cartesianProduct</b>(<b>search</b>(collection, q="id:1", fl="id, body", sort="id desc"),<br />
<b>analyze</b>(body, analyzerField) as outField)<br />
<br />
First let's look at what this expression is doing then look at the output.<br />
<br />
The cartesianProduct expression is wrapping a <b>search</b> expression and an <b>analyze</b> Stream Evaluator. The cartesianProduct expression reads the Tuples returned by the search expression and applies the analyze Stream Evaluator to each Tuple. (Note that the cartesianProduct expression can read Tuples from any Streaming Expression.)<br />
<br />
The analyze Stream Evaluator is taking the text from the body field in the Tuple and is applying an analyzer found in the schema which is pointed to by the <b>analyzerField</b> parameter.<br />
<br />
The cartesianProduct function emits a single Tuple for each token produced by the analyzer. For example if we have a basic white space tokenizing analyzer the Tuples emitted would be:<br />
<br />
id: 1<br />
outField: c<br />
<br />
id: 1<br />
outField: d<br />
<br />
id: 1<br />
outField: e<br />
<br />
id: 1<br />
outField: f<br />
<br />
id: 1<br />
outField: g<br />
<br />
<h3>
Creating <span style="color: #660000;">Entity Graphs</span></h3>
<br />
The Tuples emitted by the cartesianProduct and the analyze evaluator can be saved to another Solr Cloud collection with the <b>update</b> stream. This allows you to build graphs from extracted entities that can then be walked with Solr <b>Graph Expressions</b>.<br />
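Here is a hedged sketch of that indexing step, assuming a hypothetical destination collection named entities; in practice each emitted Tuple would also need its own unique id before being indexed:<br />
<br />
commit(entities,<br />
       update(entities, batchSize=500,<br />
              cartesianProduct(search(collection, q="id:1", fl="id, body", sort="id desc"),<br />
                               analyze(body, analyzerField) as entity_s)))<br />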
<br />
<h3>
Annotating Tuples</h3>
<br />
The analyze Stream Evaluator can also be used with the <b>select</b> Streaming Expression to annotate Tuples with tokens extracted by an analyzer. Here is the sample syntax:<br />
<br />
select(search(collection, q="id:1", fl="id, body", sort="id desc"),<br />
id,<br />
analyze(body, analyzerField) as outField)<br />
<br />
This will add a field to each Tuple which will contain the list of tokens extracted by the analyzer. The update function can be used to save the annotated Tuples to another Solr Cloud collection.<br />
<br />
<h3>
Scaling Up</h3>
<div>
<br /></div>
<div>
Solr's <b>parallel batch</b> and <b>executor</b> framework can be used to apply a massive amount of computing power to perform NLP on extremely large data sets. You can read about the parallel batch and the executor framework in these blogs:</div>
<div>
<br /></div>
<div>
<a href="http://joelsolr.blogspot.co.uk/2016/10/solr-63-batch-jobs-parallel-etl-and.html">http://joelsolr.blogspot.co.uk/2016/10/solr-63-batch-jobs-parallel-etl-and.html</a></div>
<div>
<a href="http://joelsolr.blogspot.co.uk/2017/01/deploying-solrs-new-parallel-executor.html">http://joelsolr.blogspot.co.uk/2017/01/deploying-solrs-new-parallel-executor.html</a></div>
<br />
<br />Joel Bernsteinhttp://www.blogger.com/profile/15670652646728608504noreply@blogger.comtag:blogger.com,1999:blog-447314081038493192.post-62992700949618111192017-03-12T19:16:00.000-07:002017-06-04T16:34:45.260-07:00Solr 6.5: Retrieve and rank with graph expressionsThis blog describes how to retrieve and rank documents with graph expressions. First let's define exactly what it means to retrieve and rank with a graph expression and then we'll walk through an example.<br />
<br />
The <b>retrieve</b> step is a relevance ranked search. The <b>rank </b>step re-ranks the top N documents based on the results of a graph expression.<br />
<br />
Why would we want to do this? I think it's easiest to explain this with an example.<br />
<br />
<h3>
Re-Ranking Based On A Users "Work Graph"</h3>
<br />
Before diving into the example, it's important to understand that this re-ranking strategy is designed to provide sub-second response times. It's also designed to adapt in real-time as users use the system and work graphs are updated.<br />
<br />
Ok, let's dive into the example.<br />
<br />
In this example, when users perform a search, the top N results are re-ranked to boost documents that are part of their work graph. To find a user's work graph, a graph expression is used to mine usage logs in real time to find documents that are closely related to the user's work.<br />
<br />
This relevance strategy can be useful for systems where users are working with documents and performing searches to find documents. One example of this type of system is <b>Alfresco</b>, an Enterprise Content Management system that uses Solr for search. Alfresco logs when users read and edit documents. These logs can then be mined with graph expressions to discover users' work graphs.<br />
<br />
<br />
<h3>
The Re-Rank Expression</h3>
<br />
The re-rank expression looks like this:<br />
<br />
top(n=50,<br />
sort="rescore desc",<br />
select(id,<br />
if(eq(nodeScore, null), score, mult(score, log(nodeScore))) as rescore,<br />
outerHashJoin(${search}, hashed=${graph}, on="id=node"))) <br />
<br />
Notice the <b>outerHashJoin </b>refers to ${search} and ${graph} variables. This is using Solr's built in <a href="https://cwiki.apache.org/confluence/display/solr/Parameter+Substitution" target="_blank">macro expansion</a> capability. ${search} and ${graph} are referring to http parameters that point to the search and graph Streaming Expressions. This is a great way to break up long Streaming Expressions into manageable pieces and also create re-usable parameterized templates.<br />
<br />
We'll first explore the re-rank expression above, then we'll look at the ${search} and ${graph} expressions.<br />
<br />
Let's start pulling apart the re-rank expression by looking at the <b>outerHashJoin </b>expression. The <a href="https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions#StreamingExpressions-outerHashJoin" target="_blank">outerHashJoin</a> expression joins two expressions ${search} and ${graph}. The join keys are the <b>id</b> field from the ${search} tuples and <b>node </b>field from the ${graph} tuples.<br />
<br />
The outerHashJoin emits all tuples from the ${search} expression whether there is a matching tuple from the ${graph} expression or not. If there is a match found from the ${graph} expression then its fields are added to the matching ${search} tuple.<br />
<br />
We'll look at the specifics of the ${search} and ${graph} expression below, but at a high level they are:<br />
<br />
1) search: A full text search result.<br />
2) graph: The documents that are closely related to the users work a.k.a. the users work graph.<br />
<br />
Let's move on to the <b><a href="https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions#StreamingExpressions-select" target="_blank">select</a> </b>expression that is wrapping the outerHashJoin. The select function selects specific fields from tuples and performs field level transformations on tuples. These field level operations known as <a href="https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions#StreamingExpressions-StreamEvaluators" target="_blank">Evaluators</a> were significantly expanded by <a href="https://github.com/dennisgove" target="_blank">Dennis Gove</a> in Solr 6.5.<br />
<br />
In the example, the select function operates over each tuple emitted by the outerHashJoin. It emits the <b>id </b>field for every tuple and a new derived field called <b>rescore. </b><br />
<b><br /></b>
The <b>rescore</b> field is derived from a specific formula in red below:<br />
<span style="color: #990000;"><br /></span>
<span style="color: #990000;">if(eq(nodeScore, null), score, mult(score, log(nodeScore))) </span>as rescore<br />
<br />
This formula is expressed using the new Evaluators. Translated into plain English the formula is:<br />
<br />
if the <b>nodeScore</b> field is null, then use the <b>score</b> field.<br />
else<br />
multiply the <b>score</b> field by the natural log of the <b>nodeScore</b> field.<br />
<br />
The <b>nodeScore</b> field is assigned to documents emitted by the ${graph} expression. It describes how relevant the document is to the users <b>work graph</b>.<br />
<br />
The <b>score</b> is the score assigned to documents by the ${search} expression. It describes how relevant the document is to the full text search.<br />
<br />
Notice in the formula that the <b>score</b> is always present. But the <b>nodeScore</b> can be null. This is because only documents in the search result that are in the users work graph will have a nodeScore assigned during the outer join.<br />
<br />
Also notice the tuples that contain a nodeScore are boosted by multiplying the score by the log of the nodeScore. The documents that don't have a nodeScore don't receive this boost. This boosts documents that are part of the user's work graph.<br />
<br />
In the final step the <b>top</b> expression emits the top 50 tuples sorted by rescore desc. This is the re-ranked result set.<br />
<br />
We spent quite a bit of time going through the re-rank expression, so let's spend a little time on the ${search} and ${graph} expressions.<br />
<br />
<h3>
The Search Expression</h3>
<br />
In this example we'll use a very simple search expression that looks like this:<br />
<br />
search(content,<br />
q="natural gas",<br />
fl="id, score",<br />
rows="100",<br />
sort="score desc")<br />
<br />
This expression searches the content collection in the default field for the terms <b>natural gas</b>. The expression will return the <b>id</b> and <b>score</b> fields and sort by <b>score descending</b>. The rows parameter is set to 100, which means it will fetch 100 rows from each shard, rather than 100 rows total. So if there are 4 shards this will return up to 400 results.<br />
<br />
The search expression is really designed to provide input to other streaming expressions, so it simply merges the results from the shards into a single stream and maintains the sort order.<br />
<br />
<h3>
The Graph Expression</h3>
<br />
The graph expression is designed to query usage logs to return documents that are part of a user's <b>work graph</b>.<br />
<br />
Here is the graph expression we will be using for this example:<br />
<br />
scoreNodes<b>(</b><span style="color: #bf9000;">nodes</span>(logs,<br />
<span style="color: lime;">top</span>(n=20,<br />
sort="count(*) desc",<br />
<span style="color: blue;">nodes</span>(logs,<br />
<span style="color: #990000;">nodes</span>(logs,<br />
walk="joel->userID",<br />
gather="contentID"),<br />
walk="node->contentID",<br />
gather="userID",<br />
count(*))),<br />
walk="node->userID",<br />
gather="contentID",<br />
count(*))) <br />
<br />
<br />
Working our way outwards from the innermost nodes expression (Note that <b>nodes</b> is an alias for the <a href="https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions#StreamingExpressions-gatherNodes" target="_blank">gatherNodes</a> expression):<br />
<br />
1) The innermost<span style="color: #990000;"> <b>nodes</b> </span>expression gathers all contentID's from the logs where the userID is joel.<br />
2) Working outwards, the next <b><span style="color: blue;">nodes</span></b> expression takes all the contentID's emitted from step 1 and gathers all the userID's that have viewed these contentID's. It also counts how many of the contentID's viewed by joel each user has also viewed.<br />
3) The <b><span style="color: lime;">top</span> </b>expression emits the top 20 users that have viewed the most overlapping content with joel.<br />
4) The outermost <span style="color: #bf9000;">nodes</span> expression gathers all the contentID's viewed by the users emitted in step 3.<br />
5) The <a href="https://cwiki.apache.org/confluence/display/solr/Graph+Traversal#GraphTraversal-UsingthescoreNodesFunctiontoMakeaRecommendation" target="_blank">scoreNodes</a> expression scores all the contentID's emitted by step 4. This adds the <b>nodeScore</b> field to the tuples which describes how relevant each contentID is to the users work graph.<br />
<br />
This graph expression will emit all the contentID's in the user's <b>work graph</b>. The contentID in each tuple will be in the <b>node</b> field. This is why the outerHashJoin in the re-rank expression is joining the <b>id</b> field in the ${search} expression to the <b>node</b> field in the ${graph} expression.<br />
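<br />
Putting it all together, here is a sketch of what Solr sees after macro expansion substitutes the ${search} and ${graph} parameters. It is simply the three expressions above inlined into one:<br />
<br />
top(n=50,<br />
    sort="rescore desc",<br />
    select(id,<br />
           if(eq(nodeScore, null), score, mult(score, log(nodeScore))) as rescore,<br />
           outerHashJoin(search(content,<br />
                                q="natural gas",<br />
                                fl="id, score",<br />
                                rows="100",<br />
                                sort="score desc"),<br />
                         hashed=scoreNodes(nodes(logs,<br />
                                                 top(n=20,<br />
                                                     sort="count(*) desc",<br />
                                                     nodes(logs,<br />
                                                           nodes(logs,<br />
                                                                 walk="joel->userID",<br />
                                                                 gather="contentID"),<br />
                                                           walk="node->contentID",<br />
                                                           gather="userID",<br />
                                                           count(*))),<br />
                                                 walk="node->userID",<br />
                                                 gather="contentID",<br />
                                                 count(*))),<br />
                         on="id=node")))<br />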
<br />Joel Bernsteinhttp://www.blogger.com/profile/15670652646728608504noreply@blogger.comtag:blogger.com,1999:blog-447314081038493192.post-81026105972282017132017-02-27T19:24:00.002-08:002017-03-22T12:05:55.589-07:00Anomaly Detection in Solr 6.5Solr 6.5 is just around the corner and along with it comes the new <b>significantTerms</b> Streaming Expression. The significantTerms expression queries a Solr Cloud collection but instead of returning the matching documents, it returns the significant terms in the matching documents.<br />
<br />
To determine the significance of a term a formula is used which considers the number of times the term appears in the <b>foreground</b> set versus the number of times the term appears in the <b>background</b> set. The foreground set is the <b>search result</b>. The background set is <b>all the documents</b> in the index.<br />
<br />
The significantTerms function assigns higher scores to terms that are more <b>frequent</b> in the foreground set and <b>rarer</b> in the background set, in relation to other terms.<br />
<br />
<b>For example:</b><br />
<br />
<b>Term Foreground Background</b><br />
A 100 103<br />
B 101 1000<br />
<br />
Term <b>A</b> would be considered more significant than term <b>B</b>, because term <b>A</b> is much rarer in the background set.<br />
<br />
This model for scoring terms can be very useful for spotting <b>anomalies</b> in the data. Specifically we can easily surface terms that are <b>unusually aligned</b> with specific result sets.<br />
<h3>
<br />A Simple Example with the Enron Emails</h3>
<br />
For this example we'll start with a single Enron email address (tana.jones@enron.com) and ask the question:<br />
<b><br /></b>
<b>Which address has the most significant relationship with tana.jones@enron.com</b>?<br />
<br />
We can start looking for an answer by running an aggregation. Since we're using Streaming Expressions we'll use the facet expression:<br />
<br />
facet(enron,<br />
q="from:tana.jones@enron.com",<br />
buckets="to",<br />
bucketSorts="count(*) desc",<br />
bucketSizeLimit="100",<br />
count(*))<br />
<br />
This expression queries the index for <b>tana.jones@enron.com </b>in the <b><i>from</i></b> field and gathers the facet <b>buckets</b> and <b>counts</b> from the <b><i>to</i></b> field. It returns the top 100 facet buckets from the <b><i>to</i></b> field ordered by the counts in descending order.<br />
<br />
This expression returns the top 100 addresses that tana.jones@enron.com has emailed. The top five results look like this:<br />
<br />
<span style="font-family: monospace; font-size: 12px; white-space: pre;">{
  "result-set": {
    "docs": [{
      "count(*)": 789,
      "to": "alan.aronowitz@enron.com"
    }, {
      "count(*)": 376,
      "to": "frank.davis@enron.com"
    }, {
      "count(*)": 372,
      "to": "mark.taylor@enron.com"
    }, {
      "count(*)": 249,
      "to": "brent.hendry@enron.com"
    }, {
      "count(*)": 197,
      "to": "bob.bowen@enron.com"
    }, ...</span><br />
<br />
This gives some useful information but does it answer the question? The top address is alan.aronowitz@enron.com with a count of 789. Is this the most significant relationship?<br />
<br />
Let's see if the <b>significantTerms</b> expression can surface an anomaly. Here is the expression:<br />
<br />
significantTerms(enron, q="from:tana.jones@enron.com", field="to", limit="20")<br />
<br />
The expression above runs the query <b>from:tana.jones@enron.com </b>on the enron collection. It then collects the top <b>20</b> significant terms from the <b><i>to</i></b> field.<br />
<br />
The top five results look like this:<br />
<br />
{<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>"result-set": {<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>"docs": [{<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>"score": 54.370163,<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>"term": "michael.neves@enron.com",<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>"foreground": 130,<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>"background": 132<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>}, {<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>"score": 53.911552,<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>"term": "lisa.lees@enron.com",<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>"foreground": 186,<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>"background": 243<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>}, {<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>"score": 53.806202,<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>"term": "frank.davis@enron.com",<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>"foreground": 376,<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>"background": 596<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>}, {<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>"score": 51.760098,<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>"term": "harry.collins@enron.com",<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>"foreground": 106,<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>"background": 150<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>}, {<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>"score": 51.471268,<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>"term": "edmund.cooper@enron.com",<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>"foreground": 132,<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>"background": 222<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>}<br />
<br />
We have indeed surfaced an interesting anomaly. The first term is <b>michael.neves@enron.com.</b> This address has a foreground count of 130 and background count of 132. This means that michael.neves@enron.com has received 132 emails in the entire corpus and 130 of them have been from <b>tana.jones@enron.com</b>. This signals a strong connection.<br />
<b><br /></b>
<b>alan.aronowitz@enron.com</b>, the highest total receiver of emails from <b>tana.jones@enron.com, </b>isn't in the top 5 results from the significantTerms function.<br />
<br />
alan.aronowitz@enron.com shows up at number 8 in the list:<br />
{<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>"score": 49.847652,<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>"term": "alan.aronowitz@enron.com",<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>"foreground": 789,<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>"background": 2117<br />
}<br />
<br />
Notice that the foreground count is 789 and background count is 2117. This means that 37% of the emails received by <b>alan.aronowitz@enron.com</b> were from<b> tana.jones@enron.com.</b><br />
<b><br /></b>
98% of the emails received by <b>michael.neves@enron.com</b> came from <b>tana.jones@enron.com</b>.<br />
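<br />
If you'd like to try this yourself, the expression can be posted directly to Solr's /stream handler. The sketch below assumes a local Solr node on port 8983 and the enron collection used in this post; adjust the host and collection name for your environment.<br />
<br />
curl --data-urlencode 'expr=significantTerms(enron,<br />
                                             q="from:tana.jones@enron.com",<br />
                                             field="to",<br />
                                             limit="20")' \<br />
     "http://localhost:8983/solr/enron/stream"<br />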
<br />
<h3>
significantTerms VS scoreNodes</h3>
<br />
The significantTerms function works directly with the inverted index and can score terms from single-value, multi-value and text fields.<br />
<br />
The scoreNodes function scores tuples emitted by graph expressions. This allows for anomaly detection in distributed graphs. A prior <a href="http://joelsolr.blogspot.com/2017/02/recommendations-with-solrs-graph.html" target="_blank">blog</a> covers the scoreNodes function in more detail.<br />
<br />
In Solr 6.5 the scoreNodes scoring algorithm was changed to better surface anomalies. The significantTerms and scoreNodes functions now use the same scoring algorithm.<br />
<br />
<h3>
Use Cases</h3>
<br />
Anomaly detection has interesting use cases including:<br />
<br />
1) <b>Recommendations:</b> Finding products that are unusually connected based on past shopping history. <br />
<br />
2) <b>Auto-Suggestion:</b> Suggesting terms that go well together based on indexed query logs.<br />
<br />
3) <b>Fraud Anomalies: </b>Finding vendors that are unusually associated with credit card fraud.<br />
<br />
4) <b>Text Analytics:</b> Finding significant terms relating to documents in a full text search result set.<br />
<br />
5) <b>Log Anomalies:</b> Finding IP addresses that are unusually associated with time periods of suspicious activity.Joel Bernsteinhttp://www.blogger.com/profile/15670652646728608504noreply@blogger.comtag:blogger.com,1999:blog-447314081038493192.post-35231943325079373122017-02-16T09:07:00.001-08:002017-03-22T12:03:25.732-07:00Solr's Shiny New Apache Calcite SQL Integration<br />
Solr's new Apache Calcite SQL integration is in Lucene/Solr's master branch now. This blog will discuss Solr's potential as a distributed SQL engine, some thoughts on Apache Calcite, what's currently supported with Solr's Apache Calcite integration and what might be coming next.<br />
<h3>
Solr as Distributed SQL Engine</h3>
<div>
<br /></div>
The initial SQL interface, which was released with Solr 6.0, uses the Presto project's SQL parser to parse the SQL and then rewrites the queries as Solr Streaming Expressions.<br />
<br />
The goals of the initial SQL release were focused on supporting SQL aggregations, using both <b>MapReduce</b> and Solr's native <b>faceting</b> capabilities, and on supporting Solr's full text query language in the SQL predicate.<br />
<br />
Even with the limited set of goals it became clear that Solr could be a special SQL engine. There are very few distributed SQL engines that can push down so much processing into the engine, and also rise up above the engine when needed to perform streaming parallel relational algebra. And then of course there is the <b>predicate</b>. Few existing SQL engines can compete with Solr's rich search predicates which have been developed over a period of 10+ years.<br />
<br />
Last but not least is the performance. Solr is not a batch engine adapted for request/response. Solr is a request/response engine from the ground up. Solr's search performance, analytic performance and instant streaming capabilities make it one of the fastest distributed SQL engines available.<br />
<br />
<h3>
The Apache Calcite Integration</h3>
<br />
<a href="https://github.com/risdenk" target="_blank">Kevin Risden</a> broke ground on Solr's Apache Calcite integration in March of 2016. It wasn't clear at that time that Apache Calcite would be the right fit, but over time it became clear that it was exactly what we needed. There were really two choices for how we could implement the next phase of Solr's SQL integration:<br />
<br />
1) Stick with the approach of using a SQL Parser only, and work directly with the parse tree to build the physical query plan in Streaming Expressions.<br />
<br />
2) Use a broader framework and plugin rules that would push down parts of the SQL query that we wanted to control.<br />
<br />
There were pros and cons to both approaches. The main pro for using just a SQL parser is that we would have total control over the process once the query was parsed. This means we would never run into a scenario where we couldn't implement something that leveraged Solr's capabilities.<br />
<br />
The main con to just using the SQL parser is that we would have been responsible for implementing everything, including things like a complete <b>JDBC</b> driver. This is not a small undertaking.<br />
<br />
The main pros and cons of using a broader framework were exactly the opposite: less control but the ability to leverage existing features in the framework.<br />
<br />
Kevin was very much in favor of using a broader framework, but I was not convinced that we could take full advantage of Solr's capabilities unless we controlled everything.<br />
<br />
But in the end Kevin broke ground embedding the Apache Calcite framework into Solr. In the open source world, working code tends to win out. Based on his initial work, I agreed that we should move forward using the wider Apache Calcite framework.<br />
<br />
Kevin continued working on the integration. Along the way <a href="https://github.com/CaoManhDat" target="_blank">Cao Manh Dat</a> joined in and added the aggregation support to the branch that Kevin was working on.<br />
<br />
Eventually I joined in as well, building on top of the work that Kevin and Dat already contributed. My main focus was to ensure that the initial Apache Calcite implementation was comparable in features and performance to Solr's existing SQL integration.<br />
<br />
As I spent more time working with Apache Calcite I came to really appreciate what the project offered. Apache Calcite allows you to selectively push down parts of the SQL implementation such as the predicate, sort, aggregation and joins. It gives you almost full control if you want it, but allows you to leverage any part of the framework that you choose to use. For example you can push down nothing if you want, or you can push down just the predicate or just the sort.<br />
<br />
Apache Calcite also provides two very important things: a cost based query optimizer and a JDBC driver. Solr's initial JDBC driver only implemented part of the specification and the specification is large. Implementing a cost based query optimizer is a daunting task. With Apache Calcite we get these features almost for free. We still have to provide hooks into them, but we don't have to implement them.<br />
<br />
<h3>
What's Currently Supported in Lucene/Solr Master</h3>
<br />
The initial Apache Calcite integration includes the following (a short example follows the list):<br />
<br />
<ul>
<li><b>limited</b> and <b>unlimited</b> selects. Unlimited selects stream the entire result set regardless of the size.</li>
<li>Support for Solr <b>search predicates</b> including support for embedding an entire Solr query using the "_query_" field. This means all Solr query syntax is supported including complex full text, graph query, geo-spatial, fuzzy, moreLikeThis etc.</li>
<li>Support for <b>score</b> in the field list and order by in queries that have a limit clause.</li>
<li>Support for field aliases.</li>
<li>Support for <b>faceted</b> and <b>MapReduce</b> aggregations with the aggregationMode parameter.</li>
<li>Support for aggregations on multi-valued fields when in facet mode.</li>
<li>Support for multi-value fields in simple selects.</li>
<li>Parallel execution of MapReduce aggregations on worker nodes.</li>
<li>Support for aggregations without a group by clause. These are always pushed down into the search engine.</li>
<li>Support for select distinct queries in both faceted and MapReduce mode.</li>
<li>Support for group by aggregations in both faceted and MapReduce mode.</li>
<li>Support for sorting on fields in the index as well as sorting on aggregations.</li>
<li>Support for the having clause. In MapReduce mode the having clause is pushed down to the worker nodes now.</li>
</ul>
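<br />
As a brief, hedged illustration of a few of these features, the statement below runs a faceted aggregation with a Solr search predicate embedded through the _query_ field. It assumes a local Solr node and a collection named logs with a level_s field and a body_t text field; substitute your own collection and field names.<br />
<br />
curl --data-urlencode "stmt=SELECT level_s, count(*) FROM logs<br />
                            WHERE _query_='body_t:(error OR exception)'<br />
                            GROUP BY level_s<br />
                            ORDER BY count(*) desc<br />
                            LIMIT 10" \<br />
     "http://localhost:8983/solr/logs/sql?aggregationMode=facet"<br />
<br />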
<h3>
What's Coming Next</h3>
<br />
Now that the Apache Calcite integration is in master we are free to begin adding new features on a regular basis. Here are some features that are on the top of the list:<br />
<br />
1) Support for * in the field list.<br />
2) Automatic selection of aggregationMode (facet or MapReduce). Having Solr choose the right aggregation mode based on the cardinality of fields being aggregated.<br />
3) Support for SELECT ... INTO ...<br />
4) Support for arithmetic operations on fields and aggregations (select (a*b) as c from t). This is now supported in Streaming Expressions (<a href="https://issues.apache.org/jira/browse/SOLR-9916" target="_blank">SOLR-9916</a>).<br />
5) Expanded aggregation support.<br />
6) Support for UNION, INTERSECT, JOIN, using Streaming Expressions' parallel relational algebra capabilities and Apache Calcite's query optimizer.Joel Bernsteinhttp://www.blogger.com/profile/15670652646728608504noreply@blogger.comtag:blogger.com,1999:blog-447314081038493192.post-54662979745311820142017-02-14T20:19:00.000-08:002017-02-15T16:49:05.906-08:00Recommendations With Solr's Graph Expressions Part 1: Finding Products That Go Well Together<a href="https://cwiki.apache.org/confluence/display/solr/Graph+Traversal" target="_blank">Graph Expressions</a> were introduced in Solr 6.1. Graph Expressions are part of the wider Streaming Expressions library. This means that you can combine them with other expressions to build complex and interesting graph queries.<br />
<br />
<b>Note:</b> If you're not familiar with graph concepts such as <b>nodes</b> and <b>edges</b>, it may be useful to first review the Wikipedia article on <a href="https://en.wikipedia.org/wiki/Graph_theory" target="_blank">Graph Theory</a>.<br />
<br />
This blog is part one of a <b>three part </b>series on making recommendations with Graph Expressions.<br />
<br />
The three parts are:<br />
<br />
1) <span style="color: blue;">Finding products that go well together.</span><br />
2) <span style="background-color: white;"><span style="color: blue;">Using the crowd to find products that go well with a user.</span> </span><br />
3) <span style="color: blue;">Combining Graph Expressions to make a personalized recommendation.</span><br />
<br />
Before diving into the first part of the recommendation, let's consider the data. For all three blogs we'll be using a simple SolrCloud collection called <b>baskets </b>in the following format:<br />
<br />
<i><b>userID basketID</b> <b>productID </b></i><br />
user1 basket1 productA <br />
user1 basket1 productB <br />
user1 basket2 productL <br />
user2 basket3 productD<br />
user2 basket3 productM<br />
...<br />
<br />
The baskets collection holds all the products that have been added to baskets. Each record has a userID, basketID and productID. We'll be able to use Graph Expressions to mine this data for recommendations.<br />
<br />
One more quick note before we get started. One of the main expressions we'll be using is the <b>nodes </b>expression. The nodes expression was originally released as the<b> gatherNodes</b> expression. Starting with Solr 6.4 the nodes function name can be used as a shorthand for gatherNodes. You can still use gatherNodes if you like; they both point to the same function.<br />
<br />
Now let's get started!<br />
<br />
<h3>
Finding Products That Go Well Together</h3>
<br />
One approach to recommending products is to start with a product the user has selected and find products that go well with that product.<br />
<br />
The Graph Expression below finds the products that go well with <b>productA</b>:<br />
<br />
scoreNodes(top(n=25,<br />
sort="count(*) desc",<br />
nodes(baskets,<br />
random(baskets, q="productID:productA", fl="basketID", rows="250"),<br />
walk="basketID->basketID",<br />
gather="productID",<br />
fq="-productID:productA",<br />
count(*))))<br />
<div>
<br /></div>
Let's explore how the expression works.<br />
<br />
<h3>
Seeding the Graph Expression</h3>
<br />
The inner random expression is used to seed the Graph Expression:<br />
<br />
random(baskets, q="productID:productA", fl="basketID", rows="250")<br />
<br />
The random expression is not a Graph Expression, but in this scenario it's used to seed a Graph Expression with a set of root nodes to begin the traversal.<br />
<br />
The random expression returns a pseudo random set of results that match the query. In this case the random expression is returning 250 <b>basketID</b>s from records that contain productA in the productID field.<br />
<br />
The random expression serves two important purposes in seeding the Graph Expression:<br />
<br />
<b>1)</b> It limits the scope of the graph traversal to 250 basketIDs. If we seed the graph traversal with all the basketIDs that have productA, we could potentially have a very large number of baskets to work with. This could cause a slow traversal and memory problems as Graph Expressions are tracked in memory.<br />
<br />
<b>2)</b> It adds an element of surprise to the recommendation by providing a different set of baskets each time. This can result in different recommendations because each recommendation is seeded with a different set of basketIDs.<br />
<h3>
<br />Calculating Market Basket Co-Occurrence with the Nodes Expression</h3>
<br />
Now let's explore the nodes expression, which wraps the random expression. The nodes expression performs a breadth-first graph traversal step, gathering nodes and aggregations along the way. For a full explanation of the nodes expression you can review the <a href="https://cwiki.apache.org/confluence/display/solr/Graph+Traversal" target="_blank">online documentation</a>.<br />
<br />
Let's look at exactly how the example nodes expression operates:<br />
<br />
<span style="background-color: white;">nodes(baskets, </span><br />
random(baskets, q="productID:productA", fl="basketID", rows="250"),<br />
walk="basketID->basketID",<br />
fq="-productID:productA",<br />
gather="productID",<br />
count(*))<br />
<br />
Here is an explanation of the parameters:<br />
<ol>
<li><b>baskets</b>: This is the collection that the nodes expression is gathering data from.</li>
<li><b>random expression</b>: Seeds the nodes expression with a set of pseudo random basketIDs that contain productA.</li>
<li><b>walk</b>: Walks a relationship in the graph. The<b> basketID->basketID </b>construct tells the nodes expression to take the basketID in the tuples emitted by the random expression and search them against the basketID in the index.</li>
<li><b>fq</b>: Is a filter query that filters the results of the <b>walk</b> parameter. In this case it filters out records with productA in the productID field. This stops productA from being a recommendation for itself.</li>
<li><b>gather</b>: Specifies what field to collect from the rows that are returned by the <b>walk</b> parameter. In this case it is gathering the <b>productID</b> field.</li>
<li><b>count(*)</b>: This is a graph aggregation, that counts the occurrences of what was gathered. In this case it counts how many times each productID was gathered. </li>
</ol>
<div>
In plain English, this <b>nodes</b> expression is gathering the productIDs that co-occur with productA in baskets, and counting how many times the products co-occur.</div>
<div>
<br /></div>
<h3>
Scoring the Nodes To Find the Most Significant Product Relationships</h3>
<br />
With the output of the nodes expression we already know which products co-occur most frequently with productA. But there is something we don't know yet: <b>how often the products occur across all the baskets</b>. If a product occurs in a large percentage of baskets, then it doesn't have any particular relevance to productA.<br />
<br />
This is where the <b>scoreNodes</b> function does its magic.<br />
<br />
<br />
scoreNodes(top(n=25,<br />
sort="count(*) desc",<br />
nodes(baskets,<br />
random(baskets, q="productID:productA", fl="basketID", rows="250"),<br />
walk="basketID->basketID",<br />
gather="productID",<br />
fq="-productID:productA",<br />
count(*))))<br />
<br />
In the expression above, the <b>top</b> function emits the top 25 products based on the co-occurrence count. The top 25 products are then scored by the <b>scoreNodes</b> function.<br />
<br />
The scoreNodes function scores the products based on the raw co-occurrence counts and their frequency across the entire collection.<br />
<br />
The scoring formula is similar to the tf*idf scoring algorithm used to score results from a full text search. In the full text context <b>tf</b> (term frequency) is the number of times the term appears in the document. <b>idf</b> (inverse document frequency) is computed based on the document frequency of the term, or how many documents the term appears in. The idf is used to provide a boost to rarer terms.<br />
<br />
scoreNodes uses the same principle to score nodes in a graph traversal. The <b>count(*) aggregation</b> is used as the <b>tf</b> value in the formula. The <b>idf</b> is computed for each node, in this case productID, based on global statistics across the entire collection. The effect of the scoreNodes algorithm is to provide a boost to nodes that are rarer in the collection.<br />
<br />
<div>
The scoreNodes function adds a field to each node tuple called <b>nodeScore</b>, which is the relevance score for the node.</div>
<div>
<br /></div>
<div>
Now we know which products have the most significant relationship with productA.</div>
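<br />
As a rough sketch of putting this to work, the whole expression can be wrapped in one more top function that sorts on the nodeScore field, and then posted to the /stream handler. This assumes a local Solr node and the baskets collection described above.<br />
<br />
curl --data-urlencode 'expr=top(n=10,<br />
        sort="nodeScore desc",<br />
        scoreNodes(top(n=25,<br />
                       sort="count(*) desc",<br />
                       nodes(baskets,<br />
                             random(baskets, q="productID:productA", fl="basketID", rows="250"),<br />
                             walk="basketID->basketID",<br />
                             gather="productID",<br />
                             fq="-productID:productA",<br />
                             count(*)))))' \<br />
     "http://localhost:8983/solr/baskets/stream"<br />
<br />
The tuples that come back are the productIDs with the strongest relationship to productA, ranked by their nodeScore.<br />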
<h3>
<br />Can We Still Do Better?</h3>
<div>
<br /></div>
<div>
Yes. We now know which products have the most significant relationship with productA. But we don't know if the user will have an interest in the product(s) we're recommending. In the next blog in the series we'll explore a graph expression that uses connections in the graph to personalize the recommendation.</div>
<br />
Joel Bernsteinhttp://www.blogger.com/profile/15670652646728608504noreply@blogger.comtag:blogger.com,1999:blog-447314081038493192.post-79086587227460503052017-01-23T17:27:00.000-08:002017-03-16T13:14:58.275-07:00Deploying an AI alerting system with Solr's Streaming ExpressionsAs of Solr 6.4, Streaming Expressions provide all the tools you need to deploy an <b>Artificial Intelligence alerting system.</b> Let's take a look at some of the tools involved and then walk through a simple AI alerting use case.<br />
<h2>
<br /><span style="color: #660000;">The Tools</span></h2>
<br />
1) <b>A Text Classifier</b>: Solr's Streaming Expressions allow you to <b>train</b>, <b>store</b>, and <b>deploy</b> a text classifier. With a text classifier you can train a model to determine the probability that a document belongs to a certain class. This provides the brains behind the alert.<br />
<br />
2)<b> A Messaging System:</b> Solr's Streaming Expressions library provides the <b>topic</b> function, which allows clients to subscribe to a query. Once the subscription is established the topic provides one-time delivery of documents that match the topic query. This provides the nervous system of the alerting engine.<br />
<br />
3) <b>Actors:</b> Solr's Streaming Expressions library provides the <b>daemon</b> function, which runs processes that live inside Solr. Daemons have the ability to subscribe to topics, apply a classifier, update collections and take other actions. This provides the muscle behind the alerts.<br />
<br />
<h2>
<span style="color: #660000;">Use Case: A Threat Alerting Engine</span></h2>
<br />
Let's walk through the steps for building an engine that detects threats in social media posts and sends alerts.<br />
<br />
<h3>
<span style="color: #660000;">Step 1: Train a model that classifies threats</span></h3>
<br />
To train the model we need to assemble a training set comprised of positive and negative examples of the class. In this case we need to gather a set of social media posts that are threats and another set of social media posts that are not threats.<br />
<br />
Once we have our training set we can load the data into a Solr Cloud collection. Then we can use the <b>features</b>, <b>train </b>and<b> update </b>functions to extract the key features, train the model (based on the features) and store the model iterations in a Solr Cloud collection:<br />
<br />
<b>update</b>(models,<br />
batchSize="50",<br />
<b>train</b>(trainingSet,<br />
<b>features</b>(trainingSet,<br />
q="*:*",<br />
featureSet="threatFeatures",<br />
field="body",<br />
outcome="out_i",<br />
numTerms=250),<br />
q="*:*",<br />
name="threatModel",<br />
field="body",<br />
outcome="out_i",<br />
maxIterations="100"))<br />
<br />
<br />
<br />
Let's explore this expression in more detail.<br />
<br />
The <b>features</b> function extracts the key features (terms) from the training data. The features function scores terms using a technique known as <b>Information Gain</b> to determine which features are the most important in distinguishing the positive and negative set. The features function also extracts the term statistics from the training set needed to build the model. <i>The online documentation contains more detailed information for the <a href="https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions#StreamingExpressions-features" target="_blank">features</a> function.</i><br />
<br />
The <b>train</b> function uses the terms and statistics emitted by the <b>features </b>function to train the model. The train function uses a parallel, iterative batch gradient descent approach to train a logistic regression text classifier. The train function emits a tuple for each iteration of the model. Each model tuple includes the terms, weights, error rate and a <a href="https://en.wikipedia.org/wiki/Confusion_matrix" target="_blank">confusion matrix</a> that describes the classification errors for the iteration. <i>The online documentation contains more detailed information for the <a href="https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions#StreamingExpressions-train" target="_blank">train</a> function.</i><br />
<br />
The <b>update </b>function is used to store each iteration of the model in a Solr Cloud collection. <i>The online documentation contains more detailed information for the <a href="https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions#StreamingExpressions-update" target="_blank">update</a> function. </i><br />
<br />
<h3>
<span style="color: #660000;">Step 2: Inspect and test the model</span></h3>
<br />
The next step is to validate that you have a good model. A quick inspection of the model iterations can be useful in understanding the model. Here are a few things to look for:<br />
<br />
a) Look at the terms in the model. A quick overview of the terms can be useful in understanding what the key features are in the model.<br />
<br />
b) Look at weights for the terms to get a feel for how the terms are weighted in the model.<br />
<br />
c) Look at the confusion matrix for the last iteration of the model to get a feeling for error rates that occurred in the final training iteration.<br />
<br />
d) Look at the iterations to see if the model converged, which means the error rates gradually decreased until they reached a point of stabilization.<br />
<br />
After reviewing the model we can test the model on a test dataset. The test dataset should consist of positive and negative examples of the class and should not be the same as the training dataset.<br />
<br />
The expression below runs the classifier on the test set:<br />
<br />
<b>classify</b>(<b>model</b>(models,<br />
id="threatModel",<br />
cacheMillis=5000),<br />
<b>search</b>(testData,<br />
q="*:*",<br />
fl="id, body",<br />
sort="id desc"),<br />
field="body")<br />
<br />
Let's break down the expression above.<br />
<br />
The <b>classify</b> function calls the <b>model</b> function to retrieve a named model stored in a Solr Cloud collection. In the example above it retrieves the threatModel which we built earlier. The model function retrieves the highest iteration of the model found in the collection and caches it in memory for a specified period of time. <i>For more detailed information on the <a href="https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions#StreamingExpressions-classify" target="_blank">classify</a>, <a href="https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions#StreamingExpressions-model" target="_blank">model</a> and <a href="https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions#StreamingExpressions-search" target="_blank">search</a> functions you can review the documentation.</i><br />
<br />
Once the classify function has the model, it reads tuples from an internal stream and scores the tuples by applying the model to a text field in the document. The classify function emits all the tuples from the underlying stream and adds a field called<b> probability_d</b>, which is a score between 0 and 1 that describes the probability that the tuple is in the class. The higher the score, the higher the probability. You can start with the score of .5 or greater as the threshold for positive or negative class assignment.<br />
<br />
You can then read the tuples to see how they were scored and determine how accurate the classifier is. If there are false positives or false negatives in the test results then inspect the records to see why they might have been misclassified. If there are missing features in the training set you may need to add more examples that include these missing features and rebuild the model.<br />
<br />
You can also adjust the threshold to see if it creates a more accurate classifier. For example if changing the threshold from .5 to .6 provides fewer false positives without increasing false negatives then you can use that as the threshold for the classifier when deployed.<br />
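<br />
As a hedged sketch, this threshold experiment can be run by wrapping the classify expression from above in the having function (covered in the next step), so that only tuples scored above the new threshold are emitted. It reuses the models and testData collections assumed earlier and a local Solr node.<br />
<br />
curl --data-urlencode 'expr=having(classify(model(models,<br />
                                                  id="threatModel",<br />
                                                  cacheMillis=5000),<br />
                                            search(testData,<br />
                                                   q="*:*",<br />
                                                   fl="id, body",<br />
                                                   sort="id desc"),<br />
                                            field="body"),<br />
                                   gt(probability_d, 0.6))' \<br />
     "http://localhost:8983/solr/testData/stream"<br />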
<br />
<h3>
<span style="color: #660000;">Step 3: Setup the Actor</span></h3>
<br />
Once you're satisfied with the model, it's time to create an <b>actor</b> that can read new documents as they enter the index, classify them and index the threats in a separate collection.<br />
<br />
To do this we'll setup a <b>daemon</b> to run the classify function. But instead of using a <b>search</b> function as the internal stream to classify, we'll use a <b>topic</b>.<br />
<br />
Here is the basic syntax for this:<br />
<br />
<b>daemon</b>(id="alertDaemon",<br />
<b> update</b>(threats,<br />
batchSize="10",<br />
<b>having</b>(<b>classify</b>(<b>model</b>(models,<br />
id="threatModel",<br />
cacheMillis="5000"),<br />
<b>topic</b>(messages,<br />
checkpoints,<br />
id="messageTopic",<br />
q="*:*",<br />
fl="id, body"),<br />
field="body"),<br />
gt(probability_d, 0.5)<br />
)<br />
)<br />
)<br />
<br />
<br />
Let's explore this expression from the inside out.<br />
<br />
The inner <b>topic </b>expression is subscribing to the messages collection. The messages collection holds the social media messages we are alerting on. The topic will provide one time delivery of documents in this collection. Each time it is called it will return a new batch of documents.<br />
<br />
The <b>classify</b> function wraps the <b>topic</b> and scores each tuple based on the <b>model </b>retrieved by the internal <b>model</b> function. The classify function emits the tuples from the topic and adds the class score to the outgoing tuples in the <b>probability_d</b> field.<br />
<br />
The <b>having</b> expression wraps the <b>classify</b> function and only emits tuples with a <b>probability_d </b>greater than 0.5. The having function is setting the classification threshold for the alert. <i>The online documentation contains more detailed information for the <a href="https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions#StreamingExpressions-having" target="_blank">having</a> function.</i><br />
<br />
The <b>update</b> function indexes the tuples emitted by the <b>having</b> function into a Solr Cloud collection called <b>threats</b>. The <b>threats</b> collection is where the messages classified as threats are stored.<br />
<br />
The <b>daemon</b> function wraps the <b>update </b>function and calls it at intervals using an internal thread. Each time the daemon runs, a new batch of messages will be classified and the threats will be added to the threats collection. New records added to the messages collection will automatically be classified in the order they were added. <i>The online documentation contains more detailed information for the <a href="https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions#StreamingExpressions-daemon" target="_blank">daemon</a> function.</i><br />
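<br />
The daemon itself is started by posting the expression above to the /stream handler. As a rough sketch, assuming the daemon was submitted to the messages collection on a local node, the /stream handler's daemon actions can then be used to check on it or shut it down:<br />
<br />
# List the daemons running on this node's /stream handler<br />
curl "http://localhost:8983/solr/messages/stream?action=list"<br />
<br />
# Stop the alert daemon<br />
curl "http://localhost:8983/solr/messages/stream?action=stop&id=alertDaemon"<br />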
<b><br /></b>
<br />
<h3>
<b><span style="color: #660000;">Step 4: Adding actors to watch the threats collection</span></b></h3>
<br />
We now have a <b>threats</b> collection where the threats are being indexed. Any number of <b>actors</b> can subscribe to the threats collection using a topic function or the TopicStream Java class. These actors can be daemons running inside Solr or actors external to Solr.<br />
<br />
In this scenario the threats collection is used as a message queue for other actors. These actors can then be programmed to take specific actions based on the threat.<br />
<br />
<br />
Joel Bernsteinhttp://www.blogger.com/profile/15670652646728608504noreply@blogger.comtag:blogger.com,1999:blog-447314081038493192.post-24485080753681192342017-01-18T19:07:00.001-08:002017-05-17T19:13:14.822-07:00Deploying Solr's New Parallel ExecutorIn a recent blog we explored Solr's new <a href="http://joelsolr.blogspot.com/2016/10/solr-63-batch-jobs-parallel-etl-and.html" target="_blank">parallel batch</a> capabilities. This blog expands on parallel batch and introduces Solr's new <b>parallel executor</b>. Solr's parallel executor allows Streaming Expressions to be stored in a Solr Cloud collection where they can be streamed to worker nodes and executed in parallel.<br />
<h2>
The <span style="color: #660000;">executor </span>Function</h2>
<br />
The new <b>executor </b>function performs the <i>compilation and execution</i> of Streaming Expressions on worker nodes. The executor function has an internal thread pool that executes Streaming Expressions in parallel within a single worker node. The queue of streaming expressions can also be partitioned across a cluster of worker nodes providing a second level of parallelism.<br />
<br />
<br />
<h2>
Deploying a General Purpose <span style="color: #660000;">Work Queue</span></h2>
<br />
The <b>executor</b> function can be used with the <b>daemon</b> and <b>topic</b> functions to deploy a general purpose work queue. An example of this expression construct is below:<br />
<br />
<b>daemon</b>(id="daemon1",<br />
<b>executor</b>(threads=5,<br />
<b>topic</b>(checkpointCollection,<br />
storedExpressions,<br />
id="topic1",<br />
initialCheckpoint=0,<br />
q="*:*",<br />
fl="id, expr_s")))<br />
<br />
Let's break down the expression above starting with the<b> topic </b>function.<br />
<br />
The <b>topic</b> function subscribes to a query and provides one-time delivery of documents that match the query. In the example, the topic function is subscribed to a collection of stored Streaming Expressions.<br />
<br />
The <b>executor</b> function wraps the topic and for each tuple it compiles and runs the expression in the <b>expr_s </b>field. The executor has an internal thread pool and each expression is compiled and run in its own thread. The threads parameter controls the size of the thread pool.<br />
<br />
The<b> daemon</b> function wraps the executor and calls it at intervals using an internal thread. This will cause the executor to iterate over the topic and execute all the Streaming Expressions in the work queue in batches.<br />
<br />
The daemon function will continue to run at intervals when the queue is empty. As new tasks are indexed into the queue they will automatically be read by the topic and executed.<br />
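<br />
Queueing new work is simply a matter of indexing documents whose expr_s field holds the expression to run. Below is a minimal sketch, assuming the storedExpressions collection exists on a local node; the archive and logs collections and their fields are made-up placeholders for a real task:<br />
<br />
curl -X POST -H 'Content-Type: application/json' \<br />
     'http://localhost:8983/solr/storedExpressions/update?commit=true' \<br />
     -d '[{"id": "task1",<br />
           "expr_s": "update(archive, batchSize=100, search(logs, q=\"level_s:error\", fl=\"id, body_t\", sort=\"id asc\"))"}]'<br />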
<br />
<h2>
Prioritizing Tasks with the <span style="color: #660000;">priority</span> Function </h2>
<br />
In the example above, the executor will run tasks in the order that they are emitted by the topic. Topics emit tuples ordered by Solr's internal _version_ number. This behaves similarly to a FIFO queue (but without strict FIFO enforcement). But the topic alone doesn't have any concept of task prioritization.<br />
<br />
The <b>priority </b>function can be used to allow higher priority tasks to be scheduled ahead of lower priority tasks. The priority function wraps two topics. The first topic is the higher priority queue and the second topic is the lower priority queue. The priority function will only emit a lower priority task when there are no higher priority tasks in the queue.<br />
<br />
<b>daemon</b>(id="daemon1",<br />
<b>executor</b>(threads=5,<br />
<b>priority</b>(<b>topic</b>(checkpointCollection,<br />
highPriorityTasks,<br />
id="high",<br />
initialCheckpoint=0,<br />
q="*:*",<br />
fl="id, expr_s"),<br />
<b> topic</b>(checkpointCollection,<br />
lowPriorityTasks,<br />
id="low",<br />
initialCheckpoint=0,<br />
q="*:*",<br />
fl="id, expr_s"))))<br />
<br />
<h2>
<b><br /></b></h2>
<h2>
<b>Deploying a <span style="color: #660000;">Parallel Work Queue</span></b></h2>
<br />
The <b>parallel</b> function can be used to partition tasks across a worker collection. This provides parallel execution within a single worker and across a cluster of workers. The syntax for deploying a parallel work queue is below:<br />
<br />
<b>parallel(</b>workerCollection<b>,</b><br />
<b> </b>workers=6,<br />
<b> </b> sort="DaemonOp asc",<br />
<b> daemon</b>(id="daemon1",<br />
<b>executor</b>(threads=5,<br />
<b>topic</b>(checkpointCollection,<br />
storedExpressions,<br />
id="topic1",<br />
initialCheckpoint=0,<br />
q="*:*",<br />
fl="id, expr_s",<br />
<span style="color: #660000;">partitionKeys="id"</span>))))<br />
<br />
In the example above the parallel function sends daemons to 6 workers. Each worker executes a partition of the work queue.<br />
<br />
<h2>
Deploying <span style="color: #660000;">Replicas</span> to Increase Cluster Capacity</h2>
<br />
Expressions run by the executor search Solr Cloud collections to retrieve data. These searches will be spread across all of the replicas in the Solr Cloud collections. As the number of workers executing expressions increases, more replicas can be added to the Solr Cloud collections to increase the capacity of the entire system.<br />
<br />Joel Bernsteinhttp://www.blogger.com/profile/15670652646728608504noreply@blogger.comtag:blogger.com,1999:blog-447314081038493192.post-18375656089956029672017-01-12T12:16:00.000-08:002017-01-12T12:16:53.745-08:00Solr 6.3: Finding the most relevant facets with the scoreNodes Streaming ExpressionStarting with Solr 6.3 you can use the <b>scoreNodes</b> Streaming Expression to find the most relevant facets and significant relationships in a distributed graph. This blog describes how the scoreNodes function can be used with facets. A future blog will cover using scoreNodes with graph expressions.<br />
<br />
<b>Why Score Facets?</b><br />
<b><br /></b>
One typical use case for scoring facets would be for lightning fast recommendations based on market basket co-occurrence. We'll explore this scenario below:<br />
<b><br /></b>
First let's look at the syntax for scoring facets:<br />
<br />
scoreNodes(facet(baskets,<br />
q="products:A",<br />
buckets="products",<br />
bucketSorts="count(*) desc",<br />
bucketSizeLimit=50,<br />
count(*)))<br />
<br />
Let's break down what the expression is doing.<br />
<br />
The <b>facet</b> expression calls Solr's JSON facet API and emits tuples which contain the facet results. In this case it is searching the <b>baskets</b> collection. The query is looking for all records in the baskets collection that have product A in the products field.<br />
<br />
The baskets collection contains a multi-valued field called products which holds all the products in the basket. For example:<br />
<br />
id products<br />
basket1 [A, B, C]<br />
basket2 [A, C, E]<br />
basket3 [B, C, D]<br />
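<br />
If you want to reproduce this example, the three baskets above can be indexed with a standard JSON update. This is a sketch that assumes a baskets collection on a local node whose products field is multi-valued:<br />
<br />
curl -X POST -H 'Content-Type: application/json' \<br />
     'http://localhost:8983/solr/baskets/update?commit=true' \<br />
     -d '[{"id": "basket1", "products": ["A", "B", "C"]},<br />
          {"id": "basket2", "products": ["A", "C", "E"]},<br />
          {"id": "basket3", "products": ["B", "C", "D"]}]'<br />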
<br />
The sample facet expression will return the following tuples:<br />
<br />
products: C<br />
count(*): 2<br />
<br />
products: B<br />
count(*): 1<br />
<br />
products: E<br />
count(*): 1<br />
<br />
Product <b>C</b> is in two baskets with product <b>A</b>. Products <b>B</b> and <b>E</b> are both in one basket with product <b>A</b>.<br />
<br />
So it would seem that the most relevant facet/product for product <b>A</b> would be product <b>C</b>.<br />
<br />
But there is something we don't know yet: how often product <b>C</b> occurs in all the baskets. If product <b>C</b> occurs in a large percentage of baskets, then it doesn't have any particular relevance to product <b>A</b>.<br />
<br />
This is where the scoreNodes function does its magic. The scoreNodes function scores the facets based on the raw facet counts and their frequency across the entire collection.<br />
<br />
The scoring formula is similar to the tf*idf scoring algorithm used to score results from a full text search. In the full text context <b>tf</b> (term frequency) is the number of times the term appears in the document. <b>idf</b> (inverse document frequency) is computed based on the document frequency of the term, or how many documents the term appears in. The idf is used to provide a boost to terms that are more rare in the index.<br />
<br />
scoreNodes uses the same principle to score facets, but the <b>facet count</b> is used as the <b>tf</b> value in the formula. The <b>idf</b> is computed for each facet term based on global statistics across the entire collection. The effect of the scoreNodes algorithm is to provide a boost to facet terms that are rarer in the collection.<br />
<br />
The scoreNodes function adds a field to each facet tuple called <b>nodeScore</b>, which is the relevance score for the facet. You can use the <b>top</b> expression to find the most relevant facets:<br />
<br />
top(n=2, sort="nodeScore desc",<br />
scoreNodes(facet(baskets,<br />
q="products:A",<br />
buckets="products",<br />
bucketSorts="count(*) desc",<br />
bucketSizeLimit=50,<br />
count(*))))<br />
<div>
<br /></div>
<div>
The expression above emits the two highest scoring facets based on the <b>nodeScore</b>.</div>
<div>
<br /></div>
Joel Bernsteinhttp://www.blogger.com/profile/15670652646728608504noreply@blogger.comtag:blogger.com,1999:blog-447314081038493192.post-3915759839901935492016-10-31T09:56:00.000-07:002016-10-31T13:42:26.824-07:00Solr 6.3: Batch jobs, Parallel ETL and Streaming Text TransformationSolr 6.3 is on its way and along with it comes a new execution paradigm for Solr: <b>Parallel batch</b>. Solr's new batch capabilities open up a new world of use cases for Solr. This blog will cover the basics of how the parallel batch framework operates and describe how it can be used to perform parallel ETL and parallel text transformations.<br />
<br />
<h3>
<b>Not built on MapReduce</b></h3>
<b><br /></b>
Solr's Streaming Expressions have had MapReduce capabilities for quite a while now. But Solr's MapReduce implementation is designed to support interactive queries over large data sets and to power the <b>Parallel SQL interface </b>when run in MapReduce mode.<br />
<br />
And notably Solr's MapReduce implementation does not support streaming of text fields, which makes it unsuitable for performing ETL and text transformations in Solr.<br />
<br />
<h3>
<b>Parallel batch is built on message queues</b></h3>
<b><br /></b>
Solr also has messaging capabilities that allow an entire SolrCloud collection to be treated like a message queue, similar in nature to Apache Kafka. While Apache Kafka is more scalable, Solr's messaging queue is more flexible in that it allows you to subscribe to a query.<br />
<br />
The Streaming Expression that allows you to subscribe to a query is the <b>topic</b> expression. Here is the basic syntax:<br />
<br />
<b>topic</b>(checkpointCollection,<br />
dataCollection,<br />
q="*:*",<br />
fl="from, to, body",<br />
id="myTopic",<br />
rows="300",<br />
initialCheckpoint="0")<br />
<br />
Here is an explanation of the parameters:<br />
<br />
<ol>
<li><b>checkpointCollection</b>: The topic function tracks the checkpoints for a specific topic id. It stores the checkpoints in this collection.</li>
<li><b>dataCollection:</b> This is the collection that the topic results are pulled from.</li>
<li><b>q</b>: The query to use to pull records for this topic.</li>
<li><b>id</b>: The unique id of the topic. A different set of checkpoints will be maintained for each unique topic id.</li>
<li><b>rows</b>: The number of rows to fetch from each shard, each time the topic function is called.</li>
<li><b>initialCheckpoint</b>: Where in the queue to start fetching results from. Setting to 0 will cause the topic to match all the records that match the topic query in the collection. Not setting the initialCheckpoint will cause the topic to begin fetching records that have been added after the topic has been initiated. </li>
</ol>
<div>
When the topic function is sent to the /stream handler it will retrieve a batch of rows from the topic and update the checkpoints for the topic. The next time it's called it will retrieve the next batch of rows.</div>
<div>
<br /></div>
<div>
The topic function has no restriction on the data that can be retrieved. So it can return any stored fields including stored text fields.</div>
<div>
<br /></div>
<br />
<h3>
<b>Iterating the topic with a daemon</b></h3>
<br />
In order to process all the records in a topic, we will need to call the topic repeatedly until it stops returning results. The <b>daemon</b> function can do this for us.<br />
<br />
The daemon function wraps another function and calls it at intervals using an internal thread. When a daemon is passed to the /stream handler, the /stream handler recognizes it and keeps it in memory so that it can run until it has completed its job.<br />
<br />
Here is the basic syntax:<br />
<br />
<b>daemon</b>(id="myDaemon",<br />
terminate="true",<br />
<b> topic</b>(checkpointCollection,<br />
dataCollection,<br />
q="*:*",<br />
fl="from, to, body",<br />
id="myTopic",<br />
rows="300",<br />
initialCheckpoint="0"))<br />
<br />
Notice the <b>terminate</b> parameter, which is new in Solr 6.3. This tells the daemon to terminate after the topic has no more records to process.<br />
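<br />
A quick sketch of kicking this off: post the daemon expression to the /stream handler on any node (a local node is assumed here). Because terminate is set to true, the daemon shuts itself down once the topic stops returning results.<br />
<br />
curl --data-urlencode 'expr=daemon(id="myDaemon",<br />
                                   terminate="true",<br />
                                   topic(checkpointCollection,<br />
                                         dataCollection,<br />
                                         q="*:*",<br />
                                         fl="from, to, body",<br />
                                         id="myTopic",<br />
                                         rows="300",<br />
                                         initialCheckpoint="0"))' \<br />
     "http://localhost:8983/solr/dataCollection/stream"<br />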
<br />
<h3>
<b>Sending Tuples to another SolrCloud collection</b></h3>
<br />
As the daemon iterates a topic it can send the results to another SolrCloud collection using the <b>update</b> function.<br />
<br />
Here is the basic syntax:<br />
<br />
<b>daemon</b>(id="myDaemon",<br />
terminate="true",<br />
<b>update</b>(destinationCollection,<br />
batchSize=300,<br />
<b> topic</b>(checkpointCollection,<br />
dataCollection,<br />
q="*:*",<br />
fl="from, to, body",<br />
id="myTopic",<br />
rows="300",<br />
initialCheckpoint="0")))<br />
<br />
The example above is sending all the Tuples that are emitted by the topic to another SolrCloud collection.<br />
<h3>
<br /><b>Performing transformations on the Tuples</b></h3>
<br />
The data in the Tuples emitted by the topic can be adjusted/transformed before they are sent to the destinationCollection. Here is a very simple example:<br />
<br />
<b>daemon</b>(id="myDaemon",<br />
terminate="true",<br />
<b>update</b>(destinationCollection,<br />
batchSize=300,<br />
<b> select</b>(from as from_s,<br />
to as to_s,<br />
body as body_t,<br />
<b> topic</b>(checkpointCollection,<br />
dataCollection,<br />
q="*:*",<br />
fl="from, to, body",<br />
id="myTopic",<br />
rows="300",<br />
initialCheckpoint="0"))))<br />
<br />
In the example above the<b> select </b>function is changing the field names in the Tuples before the tuples are processed by the <b>update</b> function.<br />
<b><br /></b>
<br />
<h3>
<b>Text analysis and transformation</b></h3>
<br />
Starting with Solr 6.3 Streaming Expressions have access to Lucene/Solr's text analyzers. This allows functions to be written that process text using the full power of these analyzers. The first example of this is the <b>classify function </b>in Solr 6.3, which analyzes text in a text field and extracts the features needed to perform classification using a linear classifier. This deserves an entire blog in its own right so it's only mentioned here as an example of how Streaming Expressions can use analyzers.<br />
<br />
Also in Solr 6.3 you can add your own Streaming Expressions and register them in solrconfig.xml. So you can write and plug in your own text analysis functions.<br />
<b><br /></b>
<br />
<h3>
<b>Fetching data from other collections with the fetch function</b></h3>
<b><br /></b>
In Solr 6.3 the <b>fetch</b> function has been added that allows fields from other collections to be fetched and added to documents. Here is the sample syntax:<br />
<br />
<b>daemon</b>(id="myDaemon",<br />
terminate="true",<br />
<b>update</b>(destinationCollection,<br />
batchSize=300,<br />
<b> fetch</b>(userAddresses,<br />
on="from=userID",<br />
fl="address",<br />
<b> topic</b>(checkpointCollection,<br />
dataCollection,<br />
q="*:*",<br />
fl="from, to, body",<br />
id="myTopic",<br />
rows="300",<br />
initialCheckpoint="0"))))<br />
<br />
The example above fetches the <b>address</b> from the <b>userAddresses </b>collection by mapping the <b>from</b> field in the Tuples to the <b>userID</b> field in the userAddresses collection.<br />
<br />
<h3>
<b>Parallel batch processing</b></h3>
<br />
The <b>parallel </b>function can be used to partition a batch job across a cluster of worker nodes. Here is the basic syntax:<br />
<br />
<b>parallel(</b>workerCollection,<br />
workers=10,<br />
sort="DaemonOp desc",<br />
<b> daemon</b>(id="myDaemon",<br />
terminate="true",<br />
<b>update</b>(destinationCollection,<br />
batchSize=300,<br />
<b> select</b>(from as from_s,<br />
to as to_s,<br />
body as body_t,<br />
<b> topic</b>(checkpointCollection,<br />
dataCollection,<br />
q="*:*",<br />
fl="id, from, to, body",<br />
id="myTopic",<br />
rows="300",<br />
initialCheckpoint="0",<br />
<span style="color: #660000;">partitionKeys</span>="id")))))<br />
<br />
In this example the parallel function sends the daemon to 10 worker nodes. Each worker will process a partition of the topic. Notice that the <b>partitionKeys </b>field has been added to the topic. This tells the topic to hash partition on the id field in the dataCollection.<br />
<br />
A quick note about the DaemonOp sort parameter: the parallel function sends the daemon to 10 worker nodes. Each worker node returns a Tuple that confirms a daemon operation was started. The DaemonOp is simply this confirmation Tuple. The parallel function never sees the tuples generated by the topic function, as they are sent to another SolrCloud collection.<br />
<h3>
<b><br /></b><b>Parallel batch throughput</b></h3>
<br />
When performing parallel batch operations, each worker will be iterating over the topic in parallel. The topic function randomly selects a replica from each shard to query for its data. So when performing parallel batch operations, all of the replicas in the cluster will be streaming content at once. Joel Bernsteinhttp://www.blogger.com/profile/15670652646728608504noreply@blogger.com