One of the most powerful uses of Splunk is its ability to take large amounts of data and pick out the outliers. For some events this can be done simply: the most and least common values can be picked out with commands like top and rare. However, more subtle anomalies, or anomalies occurring over a span of time, require a more advanced approach.
This article explains the standard score (also known as the z-score) in statistics, how to implement it in Splunk’s Search Processing Language (SPL), and some caveats associated with the technique. By the end of this article you should be more familiar with these statistical concepts and have some intuition for when such techniques are appropriate.
Commands and subcommands
There are several commands and subcommands that this technique uses. Below is a brief overview of these; feel free to skip this section if you’re already familiar with them.
The bin/bucket commands (bucket is an alias for bin, so they can be used interchangeably) group timestamps into fixed-size chunks, such as 15-minute spans, that we can then process with the stats command. The stats functions this technique uses are:
- Average (avg in SPL): calculates the average (the sum of all values divided by the number of events) of a particular numerical field.
- Stdev (stdev): calculates the standard deviation of a numerical field. Standard deviation is a measure of how variable the data is. If the standard deviation is low, you can expect most data to be very close to the average. If it is high, the data is more spread out.
- Count (count): counts events, for example the number of events for each value of a field when used with a by clause. You’ll want to use this if you’re dealing with text data.
- Sum (sum): adds up all the values of a given field. You’ll want to use this for numerical data (e.g. if the field contains the number of bytes transferred in the event).
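To make these pieces concrete outside of Splunk, here is a minimal Python sketch of what bin followed by stats does. The event tuples, the 15-minute span, and the helper name bin_time are made up for illustration; the sketch mimics the behaviour and is not how Splunk implements it.

```python
from collections import defaultdict
from statistics import mean, stdev

SPAN = 15 * 60  # 15 minutes, in seconds

def bin_time(epoch_seconds, span=SPAN):
    """Snap a timestamp down to the start of its span, in the spirit of bin with a 15m span."""
    return epoch_seconds - (epoch_seconds % span)

# Hypothetical events: (epoch timestamp, bytes transferred)
events = [(0, 100), (60, 300), (899, 200), (900, 50), (1500, 150), (1800, 400)]

# Group the byte values by the span each event falls into
per_bin = defaultdict(list)
for ts, nbytes in events:
    per_bin[bin_time(ts)].append(nbytes)

# count and sum per span
counts = {t: len(v) for t, v in per_bin.items()}
sums = {t: sum(v) for t, v in per_bin.items()}

# Average and (sample) standard deviation across the per-span totals
totals = list(sums.values())
avg_total = mean(totals)
stdev_total = stdev(totals)
```

The average and standard deviation are taken over the per-span totals rather than over individual events, because the question being asked is how much one span normally differs from another.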
How many events do we need?
When calculating the statistics mentioned above, we need to make sure the sample we choose accurately represents the data. If we choose too small a timeframe, we might not get a representative sample, and our calculations could produce a lot of false positives or miss some anomalous events as a result.
Luckily, statistics offers us some insight into how many events we need for a good sample. The law of large numbers states that as sample size increases, the mean (average) of the sample tends to get closer to the mean of the overall population. The Central Limit Theorem adds that the means of sufficiently large samples are approximately normally distributed, whatever the shape of the underlying data. Since getting an average for all your data is likely impractical computationally, we can use these results to our advantage. A common rule of thumb is that around 30 data points is enough, so if we can create a search that has around 30 data points per time span, we’ll likely have enough data for a reasonably accurate sample.
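The effect of sample size can be checked with a small simulation. The sketch below is an illustration only: it assumes a skewed (exponential) population with mean 1, a fixed random seed, and 2000 trials per sample size, none of which come from real Splunk data.

```python
import random
from statistics import mean, stdev

rng = random.Random(0)  # fixed seed so the run is repeatable

def sample_mean(n):
    """Mean of n draws from a skewed (exponential) population whose true mean is 1."""
    return mean(rng.expovariate(1.0) for _ in range(n))

def spread_of_sample_means(n, trials=2000):
    """Standard deviation of many sample means, each computed from a sample of size n."""
    return stdev(sample_mean(n) for _ in range(trials))

# How much the sample mean wobbles for small, medium, and large samples
spread_5 = spread_of_sample_means(5)
spread_30 = spread_of_sample_means(30)
spread_120 = spread_of_sample_means(120)
```

The spread of the sample means shrinks roughly in proportion to one over the square root of the sample size, so a sample of 30 gives a noticeably steadier average than a sample of 5, while going far beyond that yields diminishing returns.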
Applying what we learned
Given this information, we can do something like the following to calculate some statistics about the normal indexing of data, which we save into a lookup for future reference:
The above produces a lookup containing the amount of data indexed for an index in a 15m period.
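For readers who want to see the shape of such a baseline outside of Splunk, here is a rough Python analogue. It is a sketch under stated assumptions: the index names, byte totals, and column names (index, avg_bytes, stdev_bytes) are hypothetical, and the CSV text only plays the role that a lookup file plays in Splunk.

```python
import csv
import io
from statistics import mean, stdev

# Hypothetical history: bytes indexed per 15-minute span, keyed by index name
history = {
    "web": [1000, 1200, 1100, 900, 800],
    "auth": [200, 220, 180, 210, 190],
}

def build_baseline(history):
    """Return one row per index with the average and sample standard deviation."""
    rows = []
    for index_name, totals in sorted(history.items()):
        rows.append({
            "index": index_name,
            "avg_bytes": mean(totals),
            "stdev_bytes": stdev(totals),
        })
    return rows

def to_csv(rows):
    """Serialise the baseline rows as CSV text, similar in role to a lookup file."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["index", "avg_bytes", "stdev_bytes"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

baseline = build_baseline(history)
lookup_csv = to_csv(baseline)
```

Storing only the average and standard deviation per index keeps the baseline small, which is what makes it cheap to consult from a frequently scheduled detection search.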
From this we can begin to work on our detection search. We’ll join the historical statistical data we saved to the lookup with a new search that will look for drops. After we do so, we can calculate the z-score, which tells us the number of standard deviations a particular value is from the average.
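Written out, the z-score is z = (value - average) / standard deviation. The following Python sketch shows the calculation and a drop check built on it. The baseline numbers, the field names, and the default threshold of 3 are assumptions chosen for illustration, not values from a real environment.

```python
def z_score(value, avg, stdev):
    """Number of standard deviations that value lies from avg (negative means below)."""
    if stdev == 0:
        raise ValueError("z-score is undefined when the standard deviation is 0")
    return (value - avg) / stdev

# Hypothetical baseline, as it might be read back from the saved statistics
baseline = {"web": {"avg_bytes": 1000.0, "stdev_bytes": 150.0}}

def is_drop(index_name, current_bytes, threshold=3.0):
    """Flag the span when volume is more than threshold deviations BELOW the average."""
    stats = baseline[index_name]
    z = z_score(current_bytes, stats["avg_bytes"], stats["stdev_bytes"])
    return z < -threshold
```

Because the check looks only at negative z-scores, an unusually large spike is not flagged. To catch both drops and spikes, compare the absolute value of the z-score against the threshold instead.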
More about z-score
How do we determine what value of z-score to set for our threshold? The answer is a bit complicated. There are, however, a few rules that we can take into consideration to help us decide:
1. 68–95–99.7 rule
This rule applies to normal distributions (where the data looks like a standard bell curve; this chart illustrates it well: https://en.wikipedia.org/wiki/File:Standard_deviation_diagram.svg). About 68% of values fall within one standard deviation of the mean, 95% within two, and 99.7% within three. The quick takeaway is that if the distribution is normal, we can expect 99.7% of values to have a z-score between -3 and 3.
2. Chebyshev’s inequality
This is a more general rule stating that for a wide class of probability distributions (any with a finite mean and variance), no more than 1/k² of values can lie k or more standard deviations from the mean. https://en.wikipedia.org/wiki/Chebyshev%27s_inequality
The quick takeaway is that for almost any distribution, we can expect at least 99% of values to have a z-score between -10 and 10.
In the above example, we’re assuming that the data follows a normal distribution, but your data may be different. In that case, you can fall back on Chebyshev’s inequality to choose a more conservative threshold.
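The two rules can be compared side by side with a few lines of Python. This is a minimal sketch using only the standard library; the function names are made up for this illustration.

```python
from statistics import NormalDist

def normal_coverage(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return std_normal.cdf(k) - std_normal.cdf(-k)

def chebyshev_lower_bound(k):
    """Chebyshev: at least 1 - 1/k^2 of any finite-variance distribution lies within k deviations."""
    if k <= 0:
        raise ValueError("k must be positive")
    return max(0.0, 1.0 - 1.0 / (k * k))

# Compare the two rules at a few candidate thresholds
for k in (1, 2, 3, 10):
    print(f"k={k}: normal {normal_coverage(k):.4f}, Chebyshev >= {chebyshev_lower_bound(k):.4f}")
```

At a threshold of 3, the normal assumption covers about 99.7% of values, while Chebyshev guarantees only about 88.9%. Chebyshev needs a threshold of 10 before it guarantees 99%. The gap between the two is the price of not knowing the shape of your data.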
Hopefully this article provided some insight into how to perform basic anomaly detection using some of Splunk’s built-in SPL commands. It should also give you an idea of what thresholds to use to determine what constitutes an anomaly. Happy Splunking!