Anomaly detection in Analysis Workspace uses a series of advanced statistical techniques to determine whether an observation should be considered anomalous.
Depending on the date granularity used in the report, one of three statistical techniques is applied: one for hourly, one for daily, and one for weekly/monthly anomaly detection. Each technique is outlined below.
For daily granularity reports, the algorithm considers several important factors to deliver the most accurate results possible. First, it determines which type of model to apply based on the available data, choosing between two classes: a time-series-based model or an outlier-detection model (called functional filtering).
Time series model selection is based on combinations of the type of error, trend, and seasonality (ETS), as described by Hyndman et al. (2008). Specifically, the algorithm tries the following combinations:
The algorithm tests the suitability of each combination and selects the one with the best (lowest) mean absolute percentage error (MAPE). However, if the MAPE of the best time series model is greater than 15%, functional filtering is applied instead. Typically, data with a high degree of repetition (e.g., week over week or month over month) is best fit by a time series model.
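The selection rule described above can be sketched as follows. This is a minimal illustration, not the product's implementation: the candidate names (`model_a`, `model_b`), the `choose_model` helper, and the choice to skip zero-valued observations when computing MAPE are all assumptions made for the example. Only the lowest-MAPE rule and the 15% cutoff come from the text.

```python
MAPE_THRESHOLD = 15.0  # percent; the cutoff stated in the text


def mape(actual, fitted):
    """Mean absolute percentage error, in percent.

    Zero-valued observations are skipped (an assumption of this sketch),
    because the percentage error is undefined for them.
    """
    pairs = [(a, f) for a, f in zip(actual, fitted) if a != 0]
    if not pairs:
        raise ValueError("MAPE is undefined when every actual value is zero")
    return 100.0 * sum(abs((a - f) / a) for a, f in pairs) / len(pairs)


def choose_model(actual, candidate_fits):
    """Pick the candidate with the lowest MAPE, or fall back.

    candidate_fits maps a (hypothetical) model name to its fitted values.
    Returns (chosen_name, scores); chosen_name is "functional_filtering"
    when even the best candidate exceeds the threshold.
    """
    scores = {name: mape(actual, fit) for name, fit in candidate_fits.items()}
    best = min(scores, key=scores.get)
    if scores[best] > MAPE_THRESHOLD:
        return "functional_filtering", scores
    return best, scores


actual = [100, 110, 120, 130]
candidates = {
    "model_a": [98, 112, 118, 133],   # close fit
    "model_b": [80, 140, 90, 170],    # poor fit
}
chosen, scores = choose_model(actual, candidates)
print(chosen, scores)
```

With these illustrative numbers the close-fitting candidate is selected. If only the poorly fitting candidate were supplied, its MAPE would exceed 15% and the sketch would fall back to functional filtering.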
After model selection, the algorithm adjusts the results based on holidays and year-over-year seasonality. For holidays, the algorithm checks whether any of the following holidays are present in the reporting date range:
These holidays were selected based on extensive statistical analysis across many customer data points, to identify the holidays that mattered most to the trends of the largest number of customers. While the list is certainly not exhaustive for all customers or business cycles, we found that applying these holidays significantly improved the overall performance of the algorithm for nearly all customers’ datasets.
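The holiday check amounts to testing whether any configured holiday falls inside the reporting date range. The sketch below is a hedged illustration only: the `HOLIDAYS` table and the `holidays_in_range` helper are hypothetical, and the table contains just the two fixed-date holidays named elsewhere in this document (Christmas Day and New Year’s Day), not the full list the algorithm uses.

```python
from datetime import date, timedelta

# Hypothetical, deliberately incomplete calendar: (month, day) per holiday.
HOLIDAYS = {
    "Christmas Day": (12, 25),
    "New Year's Day": (1, 1),
}


def holidays_in_range(start, end):
    """Return (date, name) for each configured holiday in [start, end]."""
    found = []
    day = start
    while day <= end:
        for name, (month, dom) in HOLIDAYS.items():
            if (day.month, day.day) == (month, dom):
                found.append((day, name))
        day += timedelta(days=1)
    return found


for when, name in holidays_in_range(date(2023, 12, 20), date(2024, 1, 5)):
    print(when.isoformat(), name)
```

Holidays that move from year to year would need a rule-based lookup rather than a fixed month and day, which this sketch does not attempt.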
Once the model has been selected and holidays have been identified in the reporting date range, the algorithm proceeds in the following manner:
Notice the dramatic improvement of performance on Christmas Day and New Year’s Day in the following example:
Hourly data uses the same time series approach as the daily granularity algorithm. However, it relies heavily on two seasonal patterns: the 24-hour cycle and the weekend/weekday cycle. To capture both effects, the hourly algorithm constructs two separate models, one for weekends and one for weekdays, using the same approach outlined above.
The training window for hourly trends is a 336-hour (14-day) lookback.
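As a rough sketch of how an hourly series might be prepared for the two models, the snippet below takes the 336 hours before a cutoff and splits them into weekday and weekend observations. The function name, the `(timestamp, value)` input shape, and the definition of "weekend" as Saturday and Sunday are assumptions for illustration. The source does not specify how the window is partitioned.

```python
from datetime import datetime, timedelta

LOOKBACK_HOURS = 336  # 14 days, as stated in the text


def split_training_window(series, end):
    """Split the lookback window into weekday and weekend observations.

    series: iterable of (datetime, value) pairs at hourly granularity.
    end:    exclusive end of the training window.
    """
    start = end - timedelta(hours=LOOKBACK_HOURS)
    weekday, weekend = [], []
    for ts, value in series:
        if start <= ts < end:
            # datetime.weekday(): Monday is 0, Saturday is 5, Sunday is 6
            (weekend if ts.weekday() >= 5 else weekday).append((ts, value))
    return weekday, weekend


origin = datetime(2024, 1, 1)  # a Monday
hourly = [(origin + timedelta(hours=h), float(h % 24)) for h in range(500)]
weekday, weekend = split_training_window(hourly, origin + timedelta(hours=400))
print(len(weekday), len(weekend))
```

Because the window spans exactly two weeks, it always contains 240 weekday hours and 96 weekend hours when the series has no gaps, so the weekend model is trained on far less data than the weekday model.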
Weekly and monthly data do not exhibit the weekly or daily patterns found at daily or hourly granularities, so a separate algorithm is used. For weekly and monthly granularities, a two-step outlier detection approach known as the Generalized Extreme Studentized Deviate (GESD) test is used. This test considers the maximum number of expected anomalies, combined with the adjusted box-plot approach (a non-parametric method for outlier discovery) to determine the maximum number of outliers. The two steps are:
The holiday and year-over-year (YoY) seasonality step then subtracts last year’s data from this year’s data and runs the two-step process above again on the result to verify that anomalies are seasonally appropriate. Each of these date granularities uses a 15-period lookback inclusive of the selected reporting date range (either 15 months or 15 weeks) and a corresponding date range one year earlier for training.
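The year-over-year step can be illustrated with a small sketch. It subtracts last year’s values from this year’s over the 15-period lookback and runs an outlier detector on the differences. The detector used here is a simple median/MAD modified z-score, chosen as a stand-in so that the example is self-contained. It is not the GESD test, and the function names and the 3.5 cutoff are assumptions for the example.

```python
from statistics import median

LOOKBACK_PERIODS = 15  # weeks or months, as stated in the text


def yoy_difference(current, prior_year):
    """Subtract last year's value from this year's, period by period."""
    if len(current) != len(prior_year):
        raise ValueError("both series must cover the same number of periods")
    return [c - p for c, p in zip(current, prior_year)]


def robust_outliers(values, cutoff=3.5):
    """Stand-in detector (NOT GESD): modified z-score based on the MAD."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to measure against
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > cutoff]


def seasonally_adjusted_anomalies(current, prior_year, detector=robust_outliers):
    """Run the detector on year-over-year differences for the lookback."""
    current = current[-LOOKBACK_PERIODS:]
    prior_year = prior_year[-LOOKBACK_PERIODS:]
    return detector(yoy_difference(current, prior_year))


prior = [100, 102, 101, 99, 100, 103, 98, 100, 101, 100, 180, 102, 99, 101, 100]
current = [110, 113, 110, 111, 160, 112, 109, 112, 110, 111, 191, 113, 108, 112, 111]
print(seasonally_adjusted_anomalies(current, prior))
```

In this illustrative data the peak at index 10 occurs in both years, so it disappears after differencing. Only the spike at index 4, which has no counterpart last year, is flagged.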