Beware The Normal Distribution When Modeling Cybersecurity Data
The normal distribution is a core assumption baked into many statistical models, and it can poison the results of our statistical and ML cybersecurity models if we are not careful.
I am working through the statistics portion, the third section of my mathematics for machine learning course, and I was struck by something:
Many of the common statistical models we use build upon assumptions of a normal distribution.
Tools like the Central Limit Theorem and the Law of Large Numbers may make things seem clean, but reality is much messier when our data and timeline are limited.
Normal Distribution or the Gaussian
Anyone who is even vaguely familiar with statistics knows the normal, or Gaussian, distribution pictured below. The reason it is so important is that many natural phenomena we experience (e.g. height, weight, body temperature) eventually converge to a normal distribution.
One of the main reasons this shows up so often is the central limit theorem (CLT). It is the basis for many of the statistical models that we find so useful today. The central limit theorem states that:
the sum or average of an infinite sequence of independent and identically distributed random variables, when suitably rescaled, tends toward a normal distribution.
We can couple the CLT with the law of large numbers (LLN) which states that:
as the number of identically distributed, randomly generated variables increases, their sample mean (average) approaches their theoretical mean.
This boils down to: given enough data points, we will eventually get a normal distribution. The CLT and LLN are the foundation of basic statistics, which makes them incredible tools for making inferences about large datasets.
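To make the CLT and LLN concrete, here is a minimal simulation sketch. It uses NumPy and SciPy with an exponential distribution standing in for some skewed metric; the numbers are purely illustrative, not real security data.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)

# Law of large numbers: the running mean of heavily skewed (exponential)
# draws approaches the theoretical mean of 1.0 as the sample size grows.
draws = rng.exponential(scale=1.0, size=100_000)
for n in (10, 1_000, 100_000):
    print(f"mean of first {n:>6} draws: {draws[:n].mean():.3f}")

# Central limit theorem: the means of many independent 50-draw batches
# cluster symmetrically around 1.0 even though each raw draw is skewed.
batch_means = rng.exponential(scale=1.0, size=(10_000, 50)).mean(axis=1)
print(f"skewness of raw draws:   {skew(draws):.2f}")
print(f"skewness of batch means: {skew(batch_means):.2f}")
```

With enough draws the running mean settles near 1.0 and the batch means lose almost all of their skew, which is exactly the behavior the two theorems promise. Keep the batch size and number of draws in mind; both matter later.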
Popular Models That Require a Normal Distribution
To give you an idea of the extent to which popular models use the normal distribution as an assumption, I have included some below with brief explanations.
Linear Regression: Both simple and multiple linear regression models assume that the residuals (the differences between observed and predicted values) are normally distributed (see the sketch after this list for a quick check of this assumption).
t-tests: T-tests, including independent samples t-tests and paired samples t-tests, assume that the populations from which the samples are drawn follow a normal distribution.
ANOVA (Analysis of Variance): ANOVA assumes that the residuals from the model are normally distributed. This assumption is crucial for the validity of ANOVA results.
Generalized Linear Models (GLMs): GLMs extend the linear regression model to other types of response variables, such as count data or binary data. The classic linear model with normally distributed errors is the special case they generalize from.
Linear Discriminant Analysis (LDA): LDA assumes that the predictors are normally distributed within each group, and it assumes equal covariance matrices across groups.
Logistic Regression: Although logistic regression is used for binary outcomes, normality still plays a role in inference: the usual significance tests on its coefficients rely on the estimates being approximately normally distributed, which only holds well in large samples.
Probit Regression: Similar to logistic regression, probit regression assumes that the errors follow a normal distribution.
Mixed-effects Models (Linear Mixed Models): These models assume normality of the residuals as well as of the random effects.
Factor Analysis: Factor analysis assumes multivariate normality of the observed variables.
Canonical Correlation Analysis (CCA): CCA assumes multivariate normality and linearity between the variables.
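Since the residual-normality assumption from the first item is the one we lean on later, here is a minimal sketch of how one might sanity-check it. The data is synthetic and the feature names are hypothetical; real log-derived features often fail this test.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(1)

# Synthetic example: fit a simple linear regression and test whether the
# residuals look normally distributed. A large Shapiro-Wilk p-value means
# there is no evidence against normality.
x = rng.uniform(0, 100, size=200)                 # e.g. requests per minute
y = 3.0 * x + 10 + rng.normal(0, 5, size=200)     # well-behaved noise

slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)
stat, p_value = shapiro(residuals)
print(f"Shapiro-Wilk p-value: {p_value:.3f}")     # small p-value: be suspicious
```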
All of these models are useful tools in our arsenal in their own right, but the question is: how does this affect us when we apply them to the cybersecurity domain?
The Problem
We immediately run into two issues when trying to make statistical inferences from cybersecurity data.
We Don’t Have Enough Data for the Law of Large Numbers to Apply
For some of the things we care about, we simply do not have adequate data for the law of large numbers to take effect. Unfortunately, there is no magic number either, as it varies from dataset to dataset.
We may be dealing with a metric where the signal is infrequent enough that it could take years or decades to generate enough data. In addition, we are usually dealing with the time constraints of a log retention policy.
Further confounding our efforts, the data may take the shape of another distribution before the effects of the CLT and LLN show themselves. In the long term it may be a normal distribution, but in the medium term it may more closely resemble a Pareto distribution. This is because distributions can differ between time periods and even change over time as conditions within the environment change.
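As a rough illustration of how unstable small samples from a heavy-tailed distribution can be, here is a sketch that treats per-incident losses as classically Pareto-distributed. The tail index, minimum loss, and one-incident-per-month rate are all made-up assumptions for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical "loss per incident" following a classical Pareto distribution
# with tail index alpha = 1.5: the mean exists, but the variance is infinite,
# so a year's worth of incidents gives a wildly unstable estimate of it.
alpha, x_min = 1.5, 10_000                 # illustrative parameters only
true_mean = alpha * x_min / (alpha - 1)    # = 30,000 for these values

for trial in range(5):
    year_of_incidents = x_min * (1 + rng.pareto(alpha, size=12))  # ~1/month
    print(f"trial {trial}: sample mean = {year_of_incidents.mean():>10,.0f} "
          f"(true mean = {true_mean:,.0f})")
```

Run to run, the yearly sample mean swings by multiples of the true mean, which is exactly the trap of drawing conclusions from one retention window of data.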
Infinite Variance May Break the CLT
There are also situations where variance can be infinite (e.g. losses due to data breaches), where losses can span from $0 to bankruptcy. In this case we treat bankruptcy as equal to infinity, since it represents the loss of all future revenue.
With infinite variance, the CLT breaks down since we cannot guarantee any stabilization towards the normal distribution. In this case we cannot converge to a normal distribution of losses due to data breaches if the company is dead.
This is why it’s quite dangerous to blindly analyze datasets that can have infinite variance (e.g. power-law distributions) with traditional statistical tools. You will inevitably skew your decisions towards the mean or average instead of the company-killers.
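To see why averaging does not rescue us here, compare this sketch with the exponential example earlier: with an infinite-variance Pareto tail, the batch means stay heavily skewed instead of settling into a bell curve. The parameters are again illustrative.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# Tail index 1.5 means a finite mean but infinite variance. Unlike the
# exponential case, the skew of the batch means does not melt away as the
# batches grow, so there is no drift towards a normal distribution.
alpha = 1.5
for batch_size in (50, 1_000, 20_000):
    draws = 1 + rng.pareto(alpha, size=(1_000, batch_size))
    batch_means = draws.mean(axis=1)
    print(f"batch size {batch_size:>6}: "
          f"skewness of batch means = {skew(batch_means):5.1f}")
```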
Implications
Let’s look at an example where we are training a neural network that builds on linear regression:
Linear regression is based on the assumption that the residuals, the differences between observed values and the values predicted by the model, are normally distributed. This means that introducing outliers into a small sample can have a huge impact on the model, which could render our neural network less effective.
Imagine a neural network trained on four weeks of security logs during which you had a major breach. While the signal is great for true positives, we will still have a problem with false positives and false negatives, since the anomalies will be heavily skewed towards the particular TTPs of that incident and we do not have a large enough baseline to adequately weight the anomalous behavior.
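Here is a toy sketch of that effect: a handful of breach-era outliers in a four-week window is enough to drag a least-squares baseline and the anomaly threshold derived from it. All numbers are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical daily metric (e.g. outbound MB per host) over a 4-week window.
days = np.arange(28)
quiet = 500 + 2 * days + rng.normal(0, 20, size=28)   # normal weeks
breach = quiet.copy()
breach[20:24] += 5_000                                 # 4 days of exfiltration

# Least-squares fitting assumes roughly normal residuals, so four outlier
# days are enough to drag the fitted trend and inflate the alert threshold.
for label, y in (("quiet", quiet), ("with breach", breach)):
    slope, intercept = np.polyfit(days, y, deg=1)
    residuals = y - (slope * days + intercept)
    threshold = slope * days[-1] + intercept + 3 * residuals.std()
    print(f"{label:>11}: slope = {slope:7.1f}, "
          f"3-sigma threshold on day 28 = {threshold:9.1f}")
```

With the breach included, both the slope and the "3-sigma" threshold balloon, so future activity that merely resembles the old incident dominates the baseline while quieter malicious behavior slips under it.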
Conclusion
I wanted to write this post not only as a way to learn a bit more about the effects of assuming a normal distribution, but also as a PSA to my fellow security folks adopting statistical and, particularly, ML models. If we are going to get a true edge on attackers using these models, then we need to understand their limitations and how we can mitigate them. Otherwise we will just end up spending more effort to add more complexity, which will generate more noise.