How to tell whether the distribution of a variable is Gaussian

"Your input variables / features must be Gaussian distributed" is a requirement of some machine learning models (especially linear models). But how do I know that the distribution of variables is Gaussian. This paper focuses on several methods to ensure that the variable distribution is Gaussian distribution.

This article assumes that the reader has some familiarity with the Gaussian / normal distribution.

In this article, we will use the well-known Iris dataset from scikit-learn.

First, let's import the required package.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Convert the data from a NumPy array to a DataFrame
X = pd.DataFrame(load_iris()["data"]).copy()

The input features / variables are labeled 0, 1, 2 and 3.

Method 1: histogram method

This is the first and simplest way to get a feel for the distribution of a variable. Let's draw a histogram of each Iris variable.

X.hist(figsize=(10,10))

The histograms above show that variables 0 and 1 are close to a Gaussian distribution (1 looks the closest), while variables 2 and 3 do not look Gaussian at all. Note that histograms can be misleading (please refer to our previous articles for details).
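One way histograms mislead is through the choice of bin count: the same variable can look smooth or ragged depending on how it is binned. A quick sketch using the first Iris variable (column 0):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

X = pd.DataFrame(load_iris()["data"])

# The same variable, three different bin counts: its apparent shape changes
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, bins in zip(axes, [5, 10, 30]):
    X[0].hist(bins=bins, ax=ax)
    ax.set_title(f"bins={bins}")
plt.show()
```

With only 5 bins the variable looks fairly bell-shaped; with 30 bins the gaps and spikes in the data become visible.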

Method 2: density map (KDE map)

Density plots are another way to visualize the distribution of a variable. They are similar to histograms, but often show the shape of the distribution more clearly.

fig, ax = plt.subplots(2, 2, figsize=(10, 10))
row = col = 0
for n, c in enumerate(X.columns):
    # Move to the next row of subplots after every two columns
    if (n % 2 == 0) and (n > 0):
        row += 1
        col = 0
    X[c].plot(kind="kde", ax=ax[row, col])
    ax[row, col].set_title(c)
    col += 1
plt.show()

Now we can see that variables 0 and 1 look more Gaussian than the histograms suggested. Variables 2 and 3 also look somewhat close to Gaussian, apart from their two peaks (bimodality).

Method 3: Q-Q plot

A Q-Q plot compares the quantiles of the data against the quantiles of a specified theoretical distribution. In this case, the specified distribution is "norm" (normal).

In Python, a Q-Q plot can be drawn with the 'probplot' function from 'scipy.stats', as shown below.

from scipy.stats import probplot

for i in X.columns:
    probplot(x=X[i], dist="norm", plot=plt)
    plt.title(i)
    plt.show()



As the Q-Q plots above show, variables 0 and 1 closely follow the red line (the normal / Gaussian reference). Variables 2 and 3 drift away from the red line in places, which moves them away from a Gaussian distribution. Q-Q plots are more reliable than histograms and density plots.

Method 4: Shapiro-Wilk test

The Shapiro-Wilk test is a quantitative statistical test for normality. Its null hypothesis is that the data was drawn from a normal distribution; if the p-value falls below the chosen significance level (commonly 0.05), we reject the null hypothesis and conclude the data is not normally distributed.

In Python, the Shapiro-Wilk test can be performed with the 'shapiro' function from 'scipy.stats', as shown below.

from scipy.stats import shapiro

for i in X.columns:
    print(f'{i}: {"Not Gaussian" if shapiro(X[i])[1]<0.05 else "Gaussian"}  {shapiro(X[i])}')

As the results above show, only variable 1 passes the test (p > 0.05) and can be considered Gaussian.

One disadvantage of the Shapiro-Wilk test is that its p-value becomes unreliable once the sample size (the length of the variable) exceeds 5000.
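To see this caveat in practice: scipy's 'shapiro' emits a warning when the sample is larger than 5000, even for data that really is Gaussian. A small sketch with synthetic data:

```python
import warnings
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)
big_sample = rng.normal(size=6000)  # truly Gaussian, but n > 5000

# Capture the warning scipy raises about the unreliable p-value
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    stat, p = shapiro(big_sample)

print(f"p-value: {p:.4f}")
print("warnings:", [str(w.message) for w in caught])
```

For samples this large, even tiny and practically irrelevant deviations from normality drive the p-value down, so a visual check (Q-Q plot) is usually more informative.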

Method 5: Kolmogorov-Smirnov test

The Kolmogorov-Smirnov test is a statistical goodness-of-fit test. It compares two distributions (in this case, the empirical distribution of the data against a Gaussian). The null hypothesis of this test is that the two distributions are the same, i.e. that there is no difference between them.

In Python, the Kolmogorov-Smirnov test can be performed with the 'kstest' function from the 'scipy.stats' module, as shown below.

First, we will test the randomly generated normal distribution.

from scipy.stats import kstest

np.random.seed(11)
normal_dist = np.random.randn(1000)
pd.Series(normal_dist).plot(kind="kde")
print(f'{"Not Gaussian" if kstest(normal_dist,"norm")[1]<0.05 else "Gaussian"}  {kstest(normal_dist,"norm")}')

Now we will test Iris data.

from scipy.stats import kstest

for i in X.columns:
    print(f'{i}: {"Not Gaussian" if kstest(X[i].values,"norm")[1]<0.05 else "Gaussian"}  {kstest(X[i].values,"norm")}')

The results above say that none of the variables is Gaussian. This is because, with dist="norm", kstest compares the data against the standard normal distribution N(0, 1), so a variable with a different mean or scale fails the test even if its shape is Gaussian.
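Because kstest's "norm" refers to the standard normal N(0, 1), z-scoring each variable first gives a fairer comparison of shape alone. A quick sketch:

```python
import pandas as pd
from sklearn.datasets import load_iris
from scipy.stats import kstest

X = pd.DataFrame(load_iris()["data"])

for i in X.columns:
    z = (X[i] - X[i].mean()) / X[i].std()  # z-score: mean 0, std 1
    raw = kstest(X[i].values, "norm")
    std = kstest(z.values, "norm")
    print(f"{i}: raw p={raw.pvalue:.4f}  standardized p={std.pvalue:.4f}")
```

After standardization, the p-values reflect how close the shape of each variable is to a Gaussian, rather than just its location and scale.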

Method 6: D'Agostino and Pearson's method

This method tests normality using skewness and kurtosis. The null hypothesis of this test is that the data was drawn from a normal distribution.

In Python, this test can be performed with the 'normaltest' function from the 'scipy.stats' module, as shown below.

from scipy.stats import normaltest

for i in X.columns:
    print(f'{i}: {"Not Gaussian" if normaltest(X[i].values)[1]<0.05 else "Gaussian"}  {normaltest(X[i].values)}')

The results above show that variables 0 and 1 are Gaussian. This test does not require the distribution to be perfectly normal, only close to normal.
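Since this test is built on skewness and kurtosis, it can help to inspect those statistics directly. A sketch (scipy's kurtosis uses Fisher's definition, so a normal distribution has both skewness and kurtosis near 0):

```python
import pandas as pd
from scipy.stats import skew, kurtosis
from sklearn.datasets import load_iris

X = pd.DataFrame(load_iris()["data"])

# Values near 0 for both statistics suggest a roughly normal shape
for i in X.columns:
    print(f"{i}: skewness={skew(X[i]):.3f}  kurtosis={kurtosis(X[i]):.3f}")
```

Large positive or negative skewness indicates an asymmetric distribution; large kurtosis indicates tails that are much heavier or lighter than a Gaussian's.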

Summary

These are some of the many methods used to test the normality of data. I personally prefer to combine all of the above methods when deciding whether a variable's distribution is Gaussian, while keeping in mind the data, the problem, and the model being used.
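The quantitative tests above can be combined into a small helper. A sketch (the 0.05 threshold, the function name, and the choice of tests are assumptions to adapt to your own data):

```python
import pandas as pd
from scipy.stats import shapiro, normaltest, kstest
from sklearn.datasets import load_iris

def normality_report(s: pd.Series, alpha: float = 0.05) -> dict:
    """Run several normality tests on a Series; True means 'looks Gaussian'."""
    z = (s - s.mean()) / s.std()  # kstest's "norm" is the standard normal
    return {
        "shapiro": shapiro(s).pvalue > alpha,
        "dagostino": normaltest(s).pvalue > alpha,
        "ks": kstest(z.values, "norm").pvalue > alpha,
    }

X = pd.DataFrame(load_iris()["data"])
for i in X.columns:
    print(i, normality_report(X[i]))
```

Agreement among the tests (together with a Q-Q plot) gives far more confidence than any single p-value on its own.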

Author: KSV Muralidhar

Deep hub translation group


Posted by Exemption on Tue, 19 Apr 2022 09:05:21 +0930