"Your input variables / features must be Gaussian distributed" is a requirement of some machine learning models (especially linear models). But how do you know whether the distribution of a variable is Gaussian? This article covers several methods for checking whether a variable follows a Gaussian distribution.
This article assumes that the reader has a basic understanding of the Gaussian / normal distribution.
In this article, we will use the well-known Iris data from scikit-learn.
First, let's import the required packages.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Convert the data from an array to a DataFrame
X = pd.DataFrame(load_iris()["data"]).copy()
The input features / variables are the columns [0,1,2,3].
Method 1: Histogram
This is the first and simplest way to look at the distribution of a variable. Let's draw a histogram of each variable in the Iris data.
X.hist(figsize=(10,10))
The histograms above show that variables 0 and 1 are close to a Gaussian distribution (1 seems to be the closest), while variables 2 and 3 don't look Gaussian at all. Note that histograms can be misleading (please refer to our previous articles for details).
Method 2: Density plot (KDE plot)
Density plots are another way to plot the distribution of variables. They are similar to histograms, but because they draw a smooth curve, they often show the shape of a distribution more clearly than histograms do.
fig, ax = plt.subplots(2, 2, figsize=(10, 10))
row = col = 0
for n, c in enumerate(X.columns):
    # Move to the next row of the grid after every two plots
    if n % 2 == 0 and n > 0:
        row += 1
        col = 0
    X[c].plot(kind="kde", ax=ax[row, col])
    ax[row, col].set_title(c)
    col += 1
The density plots show that variables 0 and 1 look more Gaussian than the histograms suggested. Variables 2 and 3 also look somewhat close to a Gaussian distribution, except that each has two peaks.
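To make it clearer what a density plot actually computes, here is a minimal sketch of a Gaussian kernel density estimate that uses only the standard library. The function name `gaussian_kde_at`, the toy sample, and the fixed bandwidth are illustrative assumptions. pandas delegates the KDE to SciPy, which picks a bandwidth automatically by default, so the curves plotted above will not match this sketch exactly.

```python
import math

def gaussian_kde_at(sample, x, bandwidth):
    """Estimate the density at x by averaging Gaussian bumps centred on each observation."""
    norm_const = 1.0 / (len(sample) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm_const * sum(
        math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
    )

# Toy sample with two clusters (hypothetical values, not the Iris data)
sample = [1.0, 1.2, 1.4, 1.5, 4.8, 5.0, 5.1, 5.3]
bandwidth = 0.4  # assumed value; smaller bandwidths give a bumpier curve

# Evaluate the estimate on a grid from -2.0 to 8.0, wide enough for both clusters
step = 0.01
grid = [-2.0 + step * i for i in range(1001)]
density = [gaussian_kde_at(sample, g, bandwidth) for g in grid]

# A valid density should integrate to roughly 1 (simple Riemann-sum check)
area = sum(density) * step
print(f"area under the estimated density: {area:.3f}")
```

With two well-separated clusters the estimate has two peaks, which is the same kind of shape seen for variables 2 and 3.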
Method 3: Q-Q plot
A Q-Q (quantile-quantile) plot plots the quantiles of the data against the quantiles of a specified distribution. In this case, the specified distribution is "norm".
In Python, a Q-Q plot can be drawn using the 'probplot' function of 'scipy.stats', as shown below.
from scipy.stats import probplot

for i in X.columns:
    # Draw the Q-Q plot of each variable against the normal distribution
    probplot(x=X[i], dist='norm', plot=plt)
    plt.title(i)
    plt.show()
As can be seen from the Q-Q plots above, variables 0 and 1 closely follow the red line (normal / Gaussian distribution). Variables 2 and 3 move away from the red line in some places, which indicates that they deviate from the Gaussian distribution. Q-Q plots are generally more reliable than histograms and density plots.
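To show what goes into a Q-Q plot, the sketch below computes the points by hand with the standard library and summarises how straight they are with a correlation coefficient. The helper names `qq_points` and `pearson_r`, the simulated samples, and the plotting positions (i - 0.5) / n are assumptions made for illustration. 'probplot' uses its own convention for the theoretical quantiles, so its numbers will differ slightly.

```python
import math
import random
from statistics import NormalDist, mean

def qq_points(sample):
    """Return (theoretical, observed) quantiles for a normal Q-Q plot."""
    observed = sorted(sample)
    n = len(observed)
    # Plotting positions (i - 0.5) / n are one common convention (assumed here)
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return theoretical, observed

def pearson_r(xs, ys):
    """Pearson correlation, used to summarise how close the points are to a line."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

random.seed(0)
# Simulated data (not the Iris data): one normal sample, one right-skewed sample
normal_sample = [random.gauss(5.0, 2.0) for _ in range(500)]
skewed_sample = [random.expovariate(1.0) for _ in range(500)]

r_normal = pearson_r(*qq_points(normal_sample))
r_skewed = pearson_r(*qq_points(skewed_sample))
print(f"normal sample: r = {r_normal:.4f}")
print(f"skewed sample: r = {r_skewed:.4f}")
```

The closer the correlation is to 1, the more closely the points follow a straight line. Note that the mean and scale of the data do not matter here: they only change the intercept and slope of that line.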
Method 4: Shapiro-Wilk test
The Shapiro-Wilk test is a statistical test for normality, which makes it a quantitative way of checking normality. It tests the null hypothesis that the data is drawn from a normal distribution. If the p-value is below the chosen significance level (0.05 in this article), the null hypothesis is rejected and the variable is treated as not Gaussian.
In Python, you can use the 'shapiro' function of 'scipy.stats' to perform the Shapiro-Wilk test, as shown below.
from scipy.stats import shapiro

for i in X.columns:
    result = shapiro(X[i])  # (test statistic, p-value)
    print(f'{i}: {"Not Gaussian" if result[1] < 0.05 else "Gaussian"} {result}')
As can be seen from the above results, only variable 1 is Gaussian.
One disadvantage of the Shapiro-Wilk test is that its p-value may be unreliable once the sample size (or the length of the variable) exceeds 5000.
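One possible way to work around this limit, offered here as a sketch rather than as the author's recommendation, is to run the test on a random subsample of at most 5000 observations. The simulated variable, the seed, and the variable names are assumptions. Keep in mind that different subsamples can lead to different p-values.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)

# Hypothetical large variable: 20,000 draws from a normal distribution
large_variable = rng.normal(loc=10.0, scale=3.0, size=20_000)

# Keep at most 5000 observations, drawn without replacement
max_n = 5000
if large_variable.size > max_n:
    subsample = rng.choice(large_variable, size=max_n, replace=False)
else:
    subsample = large_variable

statistic, p_value = shapiro(subsample)
print(f"n = {subsample.size}, W = {statistic:.4f}, p = {p_value:.4f}")
```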
Method 5: Kolmogorov-Smirnov test
The Kolmogorov-Smirnov test is a statistical test of goodness of fit. It compares two distributions (in this case, the distribution of the data and a Gaussian distribution). The null hypothesis of this test is that the two distributions are the same, that is, there is no difference between them.
In Python, the Kolmogorov-Smirnov test can be performed using the "kstest" function of the "scipy.stats" module, as shown below.
First, we will test a randomly generated sample from a normal distribution.
from scipy.stats import kstest

np.random.seed(11)
normal_dist = np.random.randn(1000)
pd.Series(normal_dist).plot(kind="kde")
result = kstest(normal_dist, "norm")  # (test statistic, p-value)
print(f'{"Not Gaussian" if result[1] < 0.05 else "Gaussian"} {result}')
Now we will test the Iris data.
from scipy.stats import kstest

for i in X.columns:
    result = kstest(X[i].values, "norm")  # (test statistic, p-value)
    print(f'{i}: {"Not Gaussian" if result[1] < 0.05 else "Gaussian"} {result}')
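The above results show that none of the variables is classified as Gaussian. Note that "kstest" with "norm" compares the data against the standard normal distribution (mean 0, standard deviation 1), so a variable that has not been standardized is rejected even when its shape is close to normal.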
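To illustrate how much the reference distribution matters for the Kolmogorov-Smirnov statistic, here is a minimal standard-library sketch that computes the statistic by hand. The function name `ks_statistic` and the simulated feature are assumptions for illustration; this is not how "kstest" is implemented internally.

```python
import random
from statistics import NormalDist, mean, stdev

def ks_statistic(sample, dist):
    """Largest gap between the empirical CDF of sample and the CDF of dist."""
    xs = sorted(sample)
    n = len(xs)
    gaps = []
    for i, x in enumerate(xs, start=1):
        cdf = dist.cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x
        gaps.append(max(i / n - cdf, cdf - (i - 1) / n))
    return max(gaps)

random.seed(11)
# Hypothetical feature: normal in shape, but not centred at 0 with unit variance
feature = [random.gauss(5.8, 0.8) for _ in range(150)]

# Reference 1: the standard normal N(0, 1)
d_raw = ks_statistic(feature, NormalDist(0.0, 1.0))

# Reference 2: a normal that uses the sample mean and standard deviation
d_fitted = ks_statistic(feature, NormalDist(mean(feature), stdev(feature)))

print(f"D against N(0, 1):       {d_raw:.3f}")
print(f"D against fitted normal: {d_fitted:.3f}")
```

With SciPy, the same idea can be expressed by standardizing the variable first or by passing the location and scale through the `args` parameter of "kstest". When the mean and standard deviation are estimated from the same data, the usual p-value is only approximate, so the result is best read alongside the other methods.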
Method 6: D'Agostino and Pearson's method
This method uses skewness and kurtosis to test normality. The null hypothesis of this test is that the data is drawn from a normal distribution.
In Python, you can use the "normaltest" function of the "scipy.stats" module to perform this test, as shown below.
from scipy.stats import normaltest

for i in X.columns:
    result = normaltest(X[i].values)  # (test statistic, p-value)
    print(f'{i}: {"Not Gaussian" if result[1] < 0.05 else "Gaussian"} {result}')
The above results show that the variables 0 and 1 are Gaussian. This test does not expect the distribution to be completely normal, but close to normal.
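To make the two ingredients of this test more concrete, the sketch below computes moment-based skewness and excess kurtosis by hand. Both are 0 for a normal distribution. The function name `skewness_and_excess_kurtosis` and the toy samples are assumptions for illustration. "normaltest" goes further: it converts both quantities into standardized scores and combines them into a single statistic, which is not reproduced here.

```python
def skewness_and_excess_kurtosis(sample):
    """Moment-based (biased) sample skewness and excess kurtosis."""
    n = len(sample)
    centre = sum(sample) / n
    # Central moments of order 2, 3 and 4
    m2 = sum((x - centre) ** 2 for x in sample) / n
    m3 = sum((x - centre) ** 3 for x in sample) / n
    m4 = sum((x - centre) ** 4 for x in sample) / n
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skew, excess_kurtosis

# Toy samples (hypothetical values, not the Iris data)
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_skewed = [1.0, 1.0, 1.0, 2.0, 10.0]

skew_sym, kurt_sym = skewness_and_excess_kurtosis(symmetric)
skew_right, kurt_right = skewness_and_excess_kurtosis(right_skewed)

print(f"symmetric:    skew = {skew_sym:.3f}, excess kurtosis = {kurt_sym:.3f}")
print(f"right-skewed: skew = {skew_right:.3f}, excess kurtosis = {kurt_right:.3f}")
```

A symmetric sample has a skewness of 0, while a long right tail gives a positive skewness.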
Summary
These are some of the many methods used to test the normality of data. I personally prefer to combine all of the above methods to decide whether the distribution of a variable is Gaussian, while keeping in mind the data, the problem and the model being used.
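As a rough sketch of what combining the quantitative methods could look like, the helper below runs the three tests from this article on a single variable and collects the p-values. The function name `normality_report`, the 0.05 threshold as a default, the simulated input, and the choice to standardize the data before calling "kstest" are all assumptions, not part of the original code.

```python
import numpy as np
from scipy.stats import kstest, normaltest, shapiro

def normality_report(values, alpha=0.05):
    """Run three normality tests on one variable and collect the results."""
    values = np.asarray(values, dtype=float)
    # Standardize so that kstest compares against N(0, 1) on an equal footing
    standardized = (values - values.mean()) / values.std(ddof=1)
    p_values = {
        "shapiro": float(shapiro(values).pvalue),
        "kstest": float(kstest(standardized, "norm").pvalue),
        "normaltest": float(normaltest(values).pvalue),
    }
    # The flag is True when normality is NOT rejected at level alpha
    return {name: (p, bool(p >= alpha)) for name, p in p_values.items()}

# Simulated variable (not the Iris data)
rng = np.random.default_rng(42)
report = normality_report(rng.normal(size=300))
for name, (p, looks_gaussian) in report.items():
    print(f"{name:<10} p = {p:.3f}  Gaussian: {looks_gaussian}")
```

The tests can disagree with each other, which is one more reason to read them together with the plots rather than relying on a single number.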
Author: KSV Muralidhar
Deep hub translation group