### Sunplot

The last several blogs have discussed the new statistical programming syntax for Streaming Expressions. What was missing in those blogs was **plotting**. Plotting plays a central role in statistical analysis. Plotting allows you to quickly understand the shape of your data in a way that the numbers alone cannot.

Sunplot is a new statistical plotting engine written by Michael Suzuki to work specifically with Solr's statistical programming syntax. This blog explores some of the features of Sunplot.

**SQL and Statistical Expressions**

Sunplot supports both SQL and Streaming Expressions. The SQL queries are sent to Solr's parallel SQL interface which evaluates the query across Solr Cloud collections. Streaming Expressions and statistical functions are evaluated by Solr's stream handler.
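
To make the two routes concrete, here is a minimal Python sketch that builds (but does not send) a request for each handler. It assumes a local Solr node at `localhost:8983` and uses Solr's `/sql` handler with the `stmt` parameter and `/stream` handler with the `expr` parameter. The `build_request` helper and its "starts with SELECT" routing rule are hypothetical illustrations, not a description of how Sunplot itself decides where to send a query.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: a local Solr node on the default port; adjust for your cluster.
SOLR_BASE = "http://localhost:8983/solr"


def build_request(collection, text):
    """Build (but do not send) a POST request for a SQL query or a Streaming Expression."""
    # Illustrative routing rule only: anything starting with SELECT is treated as SQL.
    is_sql = text.lstrip().lower().startswith("select")
    handler, param = ("sql", "stmt") if is_sql else ("stream", "expr")
    url = f"{SOLR_BASE}/{collection}/{handler}"
    body = urlencode({param: text}).encode("utf-8")
    return Request(url, data=body, method="POST")


sql_req = build_request("collection1", "SELECT test_d FROM collection1 LIMIT 10")
expr_req = build_request(
    "collection1", 'random(collection1, q="*:*", rows="500", fl="test_d")'
)
print(sql_req.full_url)
print(expr_req.full_url)
```

Passing either request to `urllib.request.urlopen` would send it to a running Solr instance, which returns the result tuples as JSON.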

Sunplot has a **json view**, **table view** and **charting view**. The image below shows a SQL query with results in the table view.

The main code window handles both SQL and Streaming Expressions.

**The Plot Function**

Plotting of statistical functions is handled by the new **plot** function. The plot function allows you to specify arrays for the **x** and **y** axis and set the plot **type**. Supported plot types are scatter, line, bar and pie.

Below is a screenshot of a very simple plot command:

Notice that the plot function is plotting hard-coded arrays. Using this approach you can use Sunplot as a general purpose plotting tool.

The plot function also plots arrays generated by Streaming Expressions and statistical functions.

**Scatter Plots**

One of the core statistical plot types is the scatter plot. A scatter plot can be used to quickly understand how individual samples are distributed. It is also very helpful in visualizing the outliers in a sample set.

The screenshot below shows a statistical expression and scatter plot of the result set.

Let's explore the statistical syntax shown in the screenshot and interpret the scatter plot.

    let(a=random(collection1, q="*:*", rows="500", fl="test_d"),
        b=col(a, test_d),
        plot(type=scatter, y=b))

- The *let* function is setting variables *a, b* and then executing the *plot* function.
- Variable *a* is holding the output of the *random* function. The random function is returning 500 random result tuples from collection1. Each tuple has a single field called *test_d*.
- Variable *b* is holding the output of the *col* function. The col function returns a numeric array containing the values in the test_d field from the tuples stored in variable *a*.
- The *plot* function returns the *x,y* coordinates and the plot *type* used by Sunplot to draw the plot. In the example the *y* axis is set to the numeric array stored in variable *b*. If no *x* axis is provided the plot function will generate a sequence for the *x* axis.
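
The behavior described in the last bullet can be sketched in Python. This is only an illustration of the idea, not Sunplot's or Solr's implementation: `plot_data` is a hypothetical name, and starting the generated sequence at 0 is an assumption, since the post does not say where the sequence begins.

```python
SUPPORTED_TYPES = {"scatter", "line", "bar", "pie"}


def plot_data(plot_type, y, x=None):
    """Return the plot type and the x, y arrays, generating x when it is omitted."""
    if plot_type not in SUPPORTED_TYPES:
        raise ValueError(f"unsupported plot type: {plot_type}")
    if x is None:
        # Assumption: the generated sequence starts at 0 and has one entry per y value.
        x = list(range(len(y)))
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return {"type": plot_type, "x": list(x), "y": list(y)}


# Only y is supplied, so x is filled in with a sequence.
result = plot_data("scatter", [497.2, 512.8, 488.1])
print(result)
```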

**Reading the Scatter Plot**

The scatter plot moves across the **x axis** from left to right and plots the **y axis** for each point. This allows you to immediately see how the y axis points are spread.

In the example you can tell a few things very quickly:

1) The points seem to fall fairly evenly above and below 500.

2) The bulk of the points fall between 480 and 520.

3) Virtually all of the points fall between 460 and 540.

4) There are a few outliers below 460 and above 540.
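
The same four observations can be checked by counting points instead of reading them off the plot. The Python sketch below is a rough analogue rather than Solr syntax. Because the original test_d data is not available, it assumes a synthetic sample drawn from a normal distribution with mean 500 and standard deviation 20 as a stand-in.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable
# Assumption: synthetic stand-in for the 500 test_d values.
sample = [rng.gauss(500, 20) for _ in range(500)]


def fraction_between(values, low, high):
    """Fraction of values that fall inside the closed interval [low, high]."""
    return sum(low <= v <= high for v in values) / len(values)


above = sum(v > 500 for v in sample) / len(sample)
bulk = fraction_between(sample, 480, 520)
nearly_all = fraction_between(sample, 460, 540)
outliers = [v for v in sample if v < 460 or v > 540]

print(f"share above 500:    {above:.2f}")
print(f"share in [480,520]: {bulk:.2f}")
print(f"share in [460,540]: {nearly_all:.2f}")
print(f"outlier count:      {len(outliers)}")
```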

This data set seems to have many of the characteristics of a **normal distribution**. In a normal distribution most of the points will be clustered above and below the **mean**. As you continue to move farther away from the mean the number of points tapers off until there are just a few outliers.

**Sorting the Points**

We can learn more about the data set by sorting the **y axis** points before plotting. In the example below note how the *asc* function is applied to first sort the **y axis** points before plotting.

Once sorted you can see how the lower outliers and upper outliers form curves with steeper slopes while the bulk of the points form a gently sloping line passing through the mean.
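
The difference in slope can be measured directly. In this Python sketch, `sorted` plays the role of the asc function, and the sample is a synthetic stand-in (an assumption, since the original data is not available). The slope of each region is taken as the average rise between neighbouring sorted points.

```python
import random

rng = random.Random(7)  # fixed seed so the run is repeatable
# Assumption: synthetic stand-in for the 500 test_d values.
sample = [rng.gauss(500, 20) for _ in range(500)]
ordered = sorted(sample)  # ascending sort, as asc does in the expression


def mean_step(values):
    """Average rise between neighbouring points of an already sorted slice."""
    return (values[-1] - values[0]) / (len(values) - 1)


lower_tail = mean_step(ordered[:25])     # the 25 smallest points
middle = mean_step(ordered[225:275])     # 50 points around the median
upper_tail = mean_step(ordered[-25:])    # the 25 largest points

print(f"lower tail slope: {lower_tail:.3f}")
print(f"middle slope:     {middle:.3f}")
print(f"upper tail slope: {upper_tail:.3f}")
```

For normally distributed data the two tail slopes should come out several times larger than the middle slope, which is the curve shape described above.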

**Histograms**

Now that we've seen the scatter plot of the individual points we can continue to visualize the data by plotting a histogram with the points.

Before plotting let's look at how to create a histogram and what a histogram output looks like:

Let's explore the statistical expression that builds and outputs a histogram:

    let(a=random(collection1, q="*:*", rows="500", fl="test_d"),
        b=col(a, test_d),
        c=hist(b, 7),
        get(c))

- The *let* function is setting variables *a, b, c* and then executes the *get* function.
- Variable *a* is holding the output of the *random* function. The random function is returning 500 random result tuples from collection1. Each tuple has a single field called *test_d*.
- Variable *b* is holding the output of the *col* function. The col function returns a numeric array containing the values in the test_d field from the tuples stored in variable *a*.
- Variable *c* is holding the output of the *hist* function. The hist function creates a histogram with 7 bins from the numeric array stored in variable *b*. The histogram returns one tuple for each bin with a statistical summary of the bin.
- The *get* function returns the list of histogram tuples held in variable c.
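
To show the shape of the output, here is a minimal Python sketch of a histogram that returns one summary per bin. It assumes equal-width bins between the sample minimum and maximum and reports only a few fields, so Solr's hist function may bin differently and return more detail. The sample is again a synthetic stand-in.

```python
import random
from statistics import fmean


def hist(values, bins):
    """Equal-width histogram returning one summary dict per bin."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("all values are identical; cannot form bins")
    width = (hi - lo) / bins
    buckets = [[] for _ in range(bins)]
    for v in values:
        # The maximum value would index one past the end, so clamp it into the last bin.
        i = min(int((v - lo) / width), bins - 1)
        buckets[i].append(v)
    return [
        {
            "N": len(bucket),                           # observations in the bin
            "mean": fmean(bucket) if bucket else None,  # mean of those observations
            "min": lo + i * width,                      # lower edge of the bin
            "max": lo + (i + 1) * width,                # upper edge of the bin
        }
        for i, bucket in enumerate(buckets)
    ]


rng = random.Random(42)
# Assumption: synthetic stand-in for the 500 test_d values.
b = [rng.gauss(500, 20) for _ in range(500)]
c = hist(b, 7)
for bin_summary in c:
    print(bin_summary)
```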

The *N* field is the number of observations that fall within the bin. The *mean* is the mean value of observations within the bin.

To plot the histogram we will need to extract the *N* and *mean* columns into arrays. We will then use the *mean* array as the x axis and the *N* array as the y axis. We will use 11 bins for the plot.

The screenshot below shows the statistical expression and plot of the histogram:

The histogram plot has the bell curve you would expect to see with a normal distribution. Both the scatter plot and histogram plot are pointing to a normal distribution.
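
The column extraction behind the histogram plot can be sketched in Python as well. This is a compact illustration under the same assumptions as before (equal-width bins, synthetic sample), not the expression from the screenshot. It groups the points into 11 bins, builds the x array from the bin means and the y array from the bin counts, and prints a rough text version of the bar chart.

```python
import random
from collections import defaultdict

rng = random.Random(42)
# Assumption: synthetic stand-in for the 500 test_d values.
sample = [rng.gauss(500, 20) for _ in range(500)]

bins = 11
lo, hi = min(sample), max(sample)
width = (hi - lo) / bins

# Group the points by bin index; the maximum value is clamped into the last bin.
groups = defaultdict(list)
for v in sample:
    groups[min(int((v - lo) / width), bins - 1)].append(v)

# Column extraction: bin means become the x axis, bin counts the y axis.
# Empty bins have no mean, so they are simply left out.
x = [sum(groups[i]) / len(groups[i]) for i in sorted(groups)]
y = [len(groups[i]) for i in sorted(groups)]

for bin_mean, count in zip(x, y):
    print(f"{bin_mean:7.1f} {'#' * count}")
```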

Now we'll take a quick look at a statistical test to confirm that this data is a normal distribution.

**Descriptive Statistics**

First let's compute the descriptive statistics for the sample set with the describe function:

The statistical expression above outputs a single tuple with the descriptive statistics for the sample set. Notice that the sample has a mean of 500 and a standard deviation of 20. Both the scatter and histogram plots provide visual confirmation of these statistics.
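
For readers without the screenshot, a comparable summary can be produced with Python's standard `statistics` module. The `describe` helper below is a hypothetical analogue that returns a subset of common descriptive statistics; the exact fields returned by Solr's describe function may differ. The sample is a synthetic stand-in, so its mean and standard deviation will be close to, but not exactly, 500 and 20.

```python
import random
import statistics


def describe(values):
    """Return a small set of descriptive statistics as a single dict."""
    return {
        "N": len(values),
        "min": min(values),
        "max": max(values),
        "sum": sum(values),
        "mean": statistics.fmean(values),
        "var": statistics.variance(values),  # sample variance
        "stdev": statistics.stdev(values),   # sample standard deviation
    }


rng = random.Random(42)
# Assumption: synthetic stand-in for the 500 test_d values.
sample = [rng.gauss(500, 20) for _ in range(500)]
stats = describe(sample)
for name, value in stats.items():
    print(f"{name}: {value}")
```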

**Normal Distribution Testing With Kolmogorov–Smirnov Test**

Now that we know the mean and standard deviation we have enough information to run a one sample Kolmogorov–Smirnov (k-s) Test. A one sample k-s test is used to determine if a sample data set fits a reference distribution.

The screenshot below shows the syntax and output for the k-s test:

The expression in the example calls the *normalDistribution* function, which returns a reference distribution for the ks function. The normalDistribution function is created with a mean of 500 and a standard deviation of 20, the same as the sample set.

The *ks* function is then run using the reference distribution and the sample set.

The *p-value* returned from the ks test is 0.38. This means that if the sample set really had been taken from the reference distribution, a deviation at least as large as the one observed would be expected about 38% of the time. Typically a p-value of .05 or lower is taken as evidence that we can reject the test hypothesis.

Based on the p-value, the ks test does not reject the hypothesis that the sample set fits a normal distribution, which is consistent with what the plots showed.
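
The mechanics of a one sample k-s test can be sketched in a few lines of Python. This is a minimal illustration, not Solr's implementation: `ks_one_sample` is a hypothetical helper, the p-value uses the asymptotic Kolmogorov series (so it may differ somewhat from what the ks function reports), and both samples are synthetic. A uniform sample is included to show what a clear rejection looks like.

```python
import math
import random
from statistics import NormalDist


def ks_one_sample(values, dist):
    """Return (D, p): the k-s statistic and an asymptotic p-value against dist."""
    n = len(values)
    d = 0.0
    for i, v in enumerate(sorted(values)):
        cdf = dist.cdf(v)
        # The empirical CDF jumps from i/n to (i+1)/n at v; check both sides of the jump.
        d = max(d, cdf - i / n, (i + 1) / n - cdf)
    lam = math.sqrt(n) * d
    if lam < 0.2:
        # The series below converges poorly here, and the p-value is effectively 1.
        return d, 1.0
    # Asymptotic Kolmogorov distribution: p = 2 * sum((-1)^(k-1) * exp(-2 k^2 lam^2)).
    p = 2 * sum((-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam) for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)


rng = random.Random(42)
reference = NormalDist(mu=500, sigma=20)

# Assumption: synthetic samples, one normal and one clearly not normal.
normal_sample = [rng.gauss(500, 20) for _ in range(500)]
uniform_sample = [rng.uniform(400, 600) for _ in range(500)]

d_normal, p_normal = ks_one_sample(normal_sample, reference)
d_uniform, p_uniform = ks_one_sample(uniform_sample, reference)
print(f"normal sample:  D={d_normal:.4f}  p={p_normal:.4f}")
print(f"uniform sample: D={d_uniform:.4f}  p={p_uniform:.4g}")
```

The uniform sample has a far wider spread than the reference distribution, so its p-value falls well below the .05 threshold and the hypothesis is rejected for it.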