Python Factor Analysis

1. Origin

Factor analysis was first introduced by the British psychologist C. Spearman in an article on the statistical analysis of intelligence test scores. He found that students' scores in English, French, and classical languages were highly correlated, and proposed that a common factor drove performance in all three courses, which he ultimately named "language ability". Thus factor analysis was born.

2. Basic idea

Factor analysis explores the underlying structure of observed data by studying the interdependence among many variables, and uses a few hypothetical variables to represent that basic structure. The original variables are observable (manifest) variables, while the hypothetical variables are unobservable latent variables, called factors.

For example, in research on enterprise brand image, consumers may evaluate a store through an evaluation system composed of 24 indicators, but what they mainly care about are three aspects: the store environment, the store's service, and the price of goods. Factor analysis can find, from the 24 variables, three latent factors reflecting store environment, service level, and commodity price, and use them to evaluate the store comprehensively.

\[ X_i = \mu_i + a_{i1}F_1 + a_{i2}F_2 + a_{i3}F_3 + e_i \]

\(F_1, F_2, F_3\) are the common factors, and the part \(e_i\) not explained by them is called the special factor.
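More generally, with \(p\) observed variables and \(m\) common factors (\(m < p\)), the model can be written in matrix form as

\[ X = \mu + AF + \varepsilon \]

where \(A = (a_{ij})_{p \times m}\) is the factor loading matrix, \(F = (F_1, \dots, F_m)^T\) is the vector of common factors, and \(\varepsilon\) is the vector of special factors.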

3. Characteristics of factor analysis

  • Number of factor variables < number of original index variables.

  • Factor variables are not a selection from the original variables; they are recombinations based on the information in the original variables, and can reflect most of that information.

  • There is no linear correlation between factor variables.

  • Factor variables are interpretable through naming; each factor is a synthesis of the information carried by some of the original variables.

4. Uses of the algorithm

  • Dimensionality reduction: reduce the number of variables to analyze

  • Classification: group correlated variables/samples into the same category

5. Analysis steps

  • a. Select the analysis variables and standardize them

  • b. Calculate the correlation coefficient matrix between variables, and the eigenvalues and eigenvectors of that matrix

  • c. Reliability testing: use the KMO test and Bartlett's sphericity test to verify whether the variables are suitable for factor analysis.

KMO (Kaiser-Meyer-Olkin) test

The KMO test statistic is an index comparing the simple correlation coefficients and partial correlation coefficients between variables. Its mathematical definition is:

\[ KMO = \frac{\sum\sum_{i \neq j} r_{ij}^2}{\sum\sum_{i \neq j} r_{ij}^2 + \sum\sum_{i \neq j} p_{ij}^2} \]

where \(r_{ij}\) is the simple correlation coefficient between variables \(i\) and \(j\), and \(p_{ij}\) is their partial correlation coefficient.

Suitability for factor analysis by KMO value: above 0.9 is very suitable; above 0.8, suitable; around 0.7, average; around 0.6, not very suitable; below 0.5, extremely unsuitable.
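For reference, here is a minimal numpy sketch of this computation, with the partial correlations derived from the inverse of the correlation matrix (in practice, calculate_kmo from factor_analyzer, used in section 6.2 below, does this for you):

import numpy as np

def kmo_statistic(data):
    # Simple correlation matrix of the variables (columns of data)
    corr = np.corrcoef(data, rowvar=False)
    # Partial correlations from the inverse correlation matrix:
    # p_ij = -inv_ij / sqrt(inv_ii * inv_jj)
    inv_corr = np.linalg.inv(corr)
    d = np.sqrt(np.outer(np.diag(inv_corr), np.diag(inv_corr)))
    partial = -inv_corr / d
    # Sum squared off-diagonal elements only (j != i)
    np.fill_diagonal(corr, 0)
    np.fill_diagonal(partial, 0)
    r2 = (corr ** 2).sum()
    p2 = (partial ** 2).sum()
    return r2 / (r2 + p2)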

Bartlett's sphericity test

The test starts from the correlation coefficient matrix of the original variables. Its null hypothesis \(H_0\) is that the correlation coefficient matrix is an identity matrix, i.e. the main diagonal elements are 1 and the off-diagonal elements are 0 (there is no correlation between the original variables).

The test statistic follows a chi-square distribution. If the chi-square value is large and the corresponding significance (sig.) value is smaller than the given significance level \(\alpha\), the null hypothesis does not hold: the correlation coefficient matrix is unlikely to be an identity matrix, the variables are correlated, and the data are suitable for factor analysis.
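A minimal sketch of the test, using the standard statistic \( \chi^2 = -\left(n - 1 - \frac{2p + 5}{6}\right) \ln |R| \) with \(p(p-1)/2\) degrees of freedom, where \(R\) is the correlation matrix (calculate_bartlett_sphericity from factor_analyzer, used below, performs the same test):

import numpy as np
from scipy import stats

def bartlett_sphericity(data):
    # n samples, p variables
    n, p = data.shape
    corr = np.corrcoef(data, rowvar=False)
    # Chi-square statistic from the determinant of the correlation matrix
    chi_square = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(corr))
    dof = p * (p - 1) / 2
    p_value = stats.chi2.sf(chi_square, dof)
    return chi_square, p_value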

  • d. Extract common factors: keep only factors with eigenvalue (variance) greater than 1, since factors with variance below 1 may contribute very little; alternatively, keep enough factors that the cumulative variance contribution rate reaches about 80% (see the sketch after this list).

  • e. Factor rotation: makes the practical meaning of the extracted factors easier to interpret

Rotations are either orthogonal or oblique; the varimax (maximum variance) method of orthogonal rotation is the one mainly used.

  • f. Calculate factor scores
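A minimal sketch of the factor-count choice in step d, using hypothetical eigenvalues (in section 6.3 below, the real ones come from FactorAnalyzer.get_eigenvalues()):

import numpy as np

ev = np.array([5.1, 2.7, 2.1, 1.8, 1.5, 0.9, 0.6])  # hypothetical eigenvalues, descending

# Kaiser criterion: keep factors with eigenvalue > 1
n_by_kaiser = int((ev > 1).sum())

# Variance criterion: smallest n whose cumulative contribution reaches 80%
cum_ratio = np.cumsum(ev) / ev.sum()
n_by_variance = int(np.searchsorted(cum_ratio, 0.80) + 1)

print(n_by_kaiser, n_by_variance)  # here both give 5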

6. Application example

We use the factor_analyzer library for factor analysis:

pip install factor_analyzer

6.1 Data processing

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# Import data
df = pd.read_csv(r'C:\Users\Desktop\bfi.csv')
df.head()

# Delete irrelevant columns
df.drop(["gender", "education", "age", "Unnamed: 0"], axis=1, inplace=True)
# Check for missing values
df.isnull().sum()

You can see that there are some missing values in the data, which need to be removed.

# Delete missing values
df.dropna(inplace=True)
df.shape

After handling the missing values, the dataset contains 2436 samples × 25 variables.

6.2 Reliability test

# Import required libraries
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_kmo
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity
# Factor analysis reliability test

kmo_all, kmo_model = calculate_kmo(df)
chi_square_value, p_value = calculate_bartlett_sphericity(df)

print("kmo_all:", kmo_all, end="\n\n")
print("kmo_model:", kmo_model, end="\n\n")
print("chi_square_value:", chi_square_value, end="\n\n")
print("p_value:", p_value)

The KMO test gives a KMO statistic of 0.85, indicating that the data are suitable for factor analysis. Meanwhile, the p-value of Bartlett's sphericity test is 0, so the null hypothesis (that the correlation coefficient matrix is an identity matrix) is rejected: the variables are correlated, and the data are suitable for factor analysis.

6.3 Extraction of common factors

Conduct exploratory factor analysis to determine the number of common factors to extract. First, calculate the eigenvalues and eigenvectors of the correlation coefficient matrix:

# Exploratory factor analysis
fa = FactorAnalyzer(25, rotation=None)
fa.fit(df)

# Eigenvalues and eigenvectors of correlation coefficient matrix
ev, v = fa.get_eigenvalues()
ev, v

# By the eigenvalue > 1 criterion, 6 common factors can be extracted

Draw a scree plot to further confirm the number of factors; the point where the curve levels off (the "elbow") suggests how many to keep.

# Draw the scree plot to choose the number of factors
plt.scatter(range(1, df.shape[1] + 1), ev)
plt.plot(range(1, df.shape[1] + 1), ev)
plt.title('Scree Plot')
plt.xlabel('Factors')
plt.ylabel('Eigenvalue')
plt.grid()
plt.show()

6.4 Factor rotation

Build the factor analysis model with the chosen number of factors, and rotate the factors orthogonally with varimax (maximum variance) rotation.

# Establish factor analysis model

fa_six = FactorAnalyzer(6, rotation="varimax")
fa_six.fit(df)

# Output the factor loadings

fa_six.loadings_

# pd.DataFrame(fa_six.loadings_, index=df.columns)

The raw loading matrix is not intuitive; it is hard to see which variables each factor explains well, so we visualize the loadings to show the results.

import seaborn as sns
df_cm = pd.DataFrame(np.abs(fa_six.loadings_), index=df.columns)

plt.figure(figsize = (14,14))
ax = sns.heatmap(df_cm, annot=True, cmap="BuPu")

# Sets the font size for the y axis
ax.yaxis.set_tick_params(labelsize=15)

plt.title('Factor Analysis', fontsize='xx-large')
# Set y-axis label
plt.ylabel('Personality items', fontsize='xx-large')

# Save picture
# plt.savefig(r'C:\Users\Desktop\factorAnalysis.png', dpi=500)

From the figure above, factor 6 has no high loading on any variable and is not easy to interpret, so the number of factors needs adjusting. Select 5 common factors and repeat the steps above:

# Establish a factor analysis model and set the number of common factors to 5
fa_five = FactorAnalyzer(5, rotation="varimax")
fa_five.fit(df)

import seaborn as sns
df_cm = pd.DataFrame(np.abs(fa_five.loadings_), index=df.columns)

plt.figure(figsize = (14,14))
ax = sns.heatmap(df_cm, annot=True, cmap="BuPu")

# Sets the font size for the y axis
ax.yaxis.set_tick_params(labelsize=15)

plt.title('Factor Analysis', fontsize='xx-large')
# Set y-axis label
plt.ylabel('Personality items', fontsize='xx-large')

According to the results above:

Factor 1 has high loadings on variables (N1, N2, N3, N4, N5), so factor 1 can be defined as the neuroticism factor.

Factor 2 has high loadings on variables (E1, E2, E3, E4, E5), so factor 2 can be defined as the extraversion factor.

Factor 3 has high loadings on variables (C1, C2, C3, C4, C5), so factor 3 can be defined as the conscientiousness factor.

Factor 4 has high loadings on variables (A1, A2, A3, A4, A5), so factor 4 can be defined as the agreeableness factor.

Factor 5 has high loadings on variables (O1, O2, O3, O4, O5), so factor 5 can be defined as the openness factor.

# Cumulative contribution of variance
fa_v = fa_five.get_factor_variance()
fa_dt = pd.DataFrame({
    "Eigenvalue": fa_v[0],
    "Variance contribution rate": fa_v[1],
    "Cumulative variance contribution rate": fa_v[2]
})

fa_dt

The eigenvalues of the five factors sum to 42.36% of the total; in other words, the five factors explain 42.36% of the information in all the variables.

6.5 Calculating factor scores

# Calculate factor score
score = fa_five.transform(df)
score

# Calculate a composite score: weight each factor score by its variance contribution rate
x = score @ fa_v[1]
result = pd.DataFrame(x, columns=["Comprehensive score"], index=df.index)
result.sort_values(by="Comprehensive score", ascending=False, inplace=True)
result

Full code:

import numpy as np
import pandas as pd

from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_kmo
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity

import matplotlib.pyplot as plt
%matplotlib inline


# Import data
df = pd.read_csv(r'C:\Users\wy\Desktop\bfi.csv')
df.head()

# Delete irrelevant columns
df.drop(["gender", "education", "age", "Unnamed: 0"], axis=1, inplace=True)

# Check for missing values
df.isnull().sum()

# Delete missing values
df.dropna(inplace=True)

# Factor analysis reliability test
kmo_all, kmo_model = calculate_kmo(df)
chi_square_value, p_value = calculate_bartlett_sphericity(df)

print("kmo_all:", kmo_all, end="\n\n")
print("kmo_model:", kmo_model, end="\n\n")
print("chi_square_value:", chi_square_value, end="\n\n")
print("p_value:", p_value)

# Exploratory factor analysis
fa = FactorAnalyzer(25, rotation=None)
fa.fit(df)

# Eigenvalues and eigenvectors of correlation coefficient matrix
ev, v = fa.get_eigenvalues()
ev, v

# By the eigenvalue > 1 criterion, 6 common factors can be extracted

# Draw the scree plot to choose the number of factors
plt.scatter(range(1, df.shape[1] + 1), ev)
plt.plot(range(1, df.shape[1] + 1), ev)
plt.title('Scree Plot')
plt.xlabel('Factors')
plt.ylabel('Eigenvalue')
plt.grid()
plt.show()


# Establish factor analysis model
fa_six = FactorAnalyzer(6, rotation="varimax")
fa_six.fit(df)

# Output the factor loadings
fa_six.loadings_


# Establish the factor analysis model with 5 common factors

fa_five = FactorAnalyzer(5, rotation="varimax")  # The heatmap for 6 common factors showed Factor 6 had no high loading on any variable, so adjust to 5
fa_five.fit(df)

import seaborn as sns
df_cm = pd.DataFrame(np.abs(fa_five.loadings_), index=df.columns)

plt.figure(figsize = (14,14))
ax = sns.heatmap(df_cm, annot=True, cmap="BuPu")

# Sets the font size for the y axis
ax.yaxis.set_tick_params(labelsize=15)

plt.title('Factor Analysis', fontsize='xx-large')
# Set y-axis label
plt.ylabel('Personality items', fontsize='xx-large')

# Cumulative contribution of variance

fa_v = fa_five.get_factor_variance()
fa_dt = pd.DataFrame({
    "Eigenvalue": fa_v[0],
    "Variance contribution rate": fa_v[1],
    "Cumulative variance contribution rate": fa_v[2]
})

fa_dt

# Calculate factor score
score = fa_five.transform(df)
score

# Calculate a composite score: weight each factor score by its variance contribution rate
x = score @ fa_v[1]
result = pd.DataFrame(x, columns=["Comprehensive score"], index=df.index)
result.sort_values(by="Comprehensive score", ascending=False, inplace=True)
result

