1. Origin
Factor analysis was first introduced by the British psychologist Charles Spearman, in an article presenting a statistical analysis of intelligence test scores. He observed that students' marks in English, French, and the classics were highly correlated, and proposed that a common factor drove performance in all three subjects, which he named "language ability". This work opened the field of factor analysis.
2. Basic thought
Factor analysis explores the underlying structure of observed data by studying the interdependence among many variables, and represents that structure with a small number of hypothetical variables. The original variables are observable (manifest) variables, while the hypothetical variables are unobservable latent variables, called factors.
For example, in research on enterprise brand image, consumers can evaluate every aspect of a store through an evaluation system made up of 24 indicators, but what they mainly care about comes down to three aspects: the store environment, the store's service, and the prices of its goods. Factor analysis can extract, from the 24 variables, three latent factors reflecting store environment, service level, and commodity prices, and use them to evaluate the store comprehensively. For each original indicator $ X_i $:
$ X_i = \mu_i + a_{i1}F_1 + a_{i2}F_2 + a_{i3}F_3 + e_i $
$ F_1, F_2, F_3 $ are the common factors, and the part $ e_i $ they do not account for is called the special factor.
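Written more compactly, the same model for $ p $ observed variables and $ m $ common factors has the matrix form below; the example above is the case $ m = 3 $:
$ X = \mu + AF + e $
where $ X $ is the $ p \times 1 $ vector of observed variables, $ \mu $ its mean vector, $ A = (a_{ij}) $ the $ p \times m $ matrix of factor loadings, $ F $ the $ m \times 1 $ vector of common factors, and $ e $ the vector of special factors.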
3. Characteristics of factor analysis
- The number of factor variables is smaller than the number of original indicator variables.
- Factor variables are not a selection from the original variables; they are recombinations of the information in the original variables, and can reflect most of that information.
- There is no linear correlation between factor variables.
- Factor variables are interpretable by naming: each factor synthesizes and reflects the information of several original variables.
4. Algorithm usage
- Dimensionality reduction: reduce the number of variables to analyze.
- Classification: group correlated variables / samples into the same category.
5. Analysis steps
- a. Select the analysis variables and standardize them.
- b. Compute the correlation coefficient matrix of the variables, and the eigenvalues and eigenvectors of that matrix.
- c. Reliability test: use the KMO test and Bartlett's sphericity test to verify whether the variables are suitable for factor analysis.
KMO (Kaiser-Meyer-Olkin) test
The KMO statistic compares the simple correlation coefficients between variables with their partial correlation coefficients. Its mathematical definition is
$ KMO = \dfrac{\sum\sum_{i \ne j} r_{ij}^2}{\sum\sum_{i \ne j} r_{ij}^2 + \sum\sum_{i \ne j} p_{ij}^2} $
where $ r_{ij} $ are the simple correlations and $ p_{ij} $ the partial correlations between variables.
Suitability for factor analysis: a KMO above 0.9 is excellent; above 0.8, suitable; above 0.7, average; above 0.6, not very suitable; below 0.5, extremely unsuitable.
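As an illustration of this definition, here is a minimal numpy sketch (the helper name kmo_statistic is mine, not a library function) that computes the overall KMO statistic from a samples-by-variables array, obtaining the partial correlations from the inverse of the correlation matrix; the calculate_kmo routine used in section 6 packages the same computation:
import numpy as np

def kmo_statistic(X):
    # Simple correlation matrix of the variables (columns of X)
    R = np.corrcoef(X, rowvar=False)
    # Partial correlations from the inverse of the correlation matrix
    Theta = np.linalg.inv(R)
    d = np.sqrt(np.outer(np.diag(Theta), np.diag(Theta)))
    P = -Theta / d
    # Sum squared off-diagonal simple and partial correlations
    mask = ~np.eye(R.shape[0], dtype=bool)
    r2 = np.sum(R[mask] ** 2)
    p2 = np.sum(P[mask] ** 2)
    return r2 / (r2 + p2)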
Bartlett's sphericity test
This test starts from the correlation coefficient matrix of the original variables. Its null hypothesis $ H_0 $ is that the correlation matrix is the identity matrix, that is, all main-diagonal elements are 1 and all off-diagonal elements are 0 (the original variables are uncorrelated).
The test statistic follows a chi-square distribution. If the chi-square value is large and the corresponding p-value is below the chosen significance level $ \alpha $, the null hypothesis is rejected: the correlation matrix is unlikely to be an identity matrix, the variables are correlated, and the data are suitable for factor analysis.
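For illustration, a minimal sketch of this statistic (the helper name bartlett_sphericity is mine), assuming an n-by-p data array; it uses the standard chi-square approximation $ \chi^2 = -(n - 1 - (2p + 5)/6)\ln|R| $ with $ p(p-1)/2 $ degrees of freedom. The calculate_bartlett_sphericity routine used in section 6 provides the same test:
import numpy as np
from scipy.stats import chi2

def bartlett_sphericity(X):
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    # Chi-square approximation with the standard small-sample correction
    statistic = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
    dof = p * (p - 1) / 2
    p_value = chi2.sf(statistic, dof)
    return statistic, p_value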
- d. Extract common factors: keep only factors with eigenvalue (variance) > 1, since the contribution of factors with eigenvalues below 1 may be very small; alternatively, keep enough factors that their cumulative variance contribution rate reaches about 80% (see the sketch after this list).
- e. Factor rotation: rotation makes the practical meaning of the extracted factors easier to interpret. Rotations are either orthogonal or oblique; the varimax (maximum variance) method of orthogonal rotation is the one mainly used.
- f. Calculate factor scores.
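To make rule d concrete, here is a small numpy sketch (the helper name choose_n_factors is mine) that applies both the eigenvalue > 1 rule and the cumulative variance rule to a data matrix:
import numpy as np

def choose_n_factors(X, cum_threshold=0.80):
    # Correlation matrix of the variables; its eigenvalues sum to p
    R = np.corrcoef(X, rowvar=False)
    eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]   # descending order
    kaiser = int(np.sum(eigvals > 1))                # eigenvalue > 1 rule
    cum = np.cumsum(eigvals) / eigvals.size          # cumulative variance share
    by_variance = int(np.searchsorted(cum, cum_threshold)) + 1
    return kaiser, by_variance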
6. Application examples
The factor_analyzer library is used for factor analysis; install it with pip:
pip install factor_analyzer
6.1 data processing
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Import data
df = pd.read_csv(r'C:\Users\Desktop\bfi.csv')
df.head()

# Delete irrelevant columns
df.drop(["gender", "education", "age", "Unnamed: 0"], axis=1, inplace=True)

# Check for missing values
df.isnull().sum()
You can see that there are some missing values in the data, which need to be deleted
# Delete missing values
df.dropna(inplace=True)
df.shape
After handling missing values, the data set contains 2436 samples × 25 variables.
6.2 reliability test
# Import required libraries
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_kmo
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity

# Factor analysis reliability test
kmo_all, kmo_model = calculate_kmo(df)
chi_square_value, p_value = calculate_bartlett_sphericity(df)
print("kmo_all:", kmo_all, end="\n\n")
print("kmo_model:", kmo_model, end="\n\n")
print("chi_square_value:", chi_square_value, end="\n\n")
print("p_value:", p_value)
The KMO test gives a statistic of 0.85, indicating the data are suitable for factor analysis. At the same time, the p-value of Bartlett's sphericity test is 0, so the null hypothesis (that the correlation coefficient matrix is an identity matrix) is rejected: the variables are correlated and the data are suitable for factor analysis.
6.3 extraction of common factors
Conduct an exploratory factor analysis to determine how many common factors to extract, starting from the eigenvalues of the correlation coefficient matrix.
# Exploratory factor analysis
fa = FactorAnalyzer(25, rotation=None)
fa.fit(df)
# Eigenvalues of the correlation coefficient matrix
ev, v = fa.get_eigenvalues()
ev, v
# By the eigenvalue > 1 rule, 6 common factors can be extracted
Draw a scree plot to further confirm the number of factors.
# Draw the scree plot to choose the number of factors
plt.scatter(range(1, df.shape[1] + 1), ev)
plt.plot(range(1, df.shape[1] + 1), ev)
plt.title('Scree Plot')
plt.xlabel('Factors')
plt.ylabel('Eigenvalue')
plt.grid()
plt.show()
6.4 factor rotation
Fit a factor analysis model with the chosen number of factors and rotate the factors orthogonally with the varimax (maximum variance) method.
# Establish factor analysis model
fa_six = FactorAnalyzer(6, rotation="varimax")
fa_six.fit(df)
# Output the factor loadings
fa_six.loadings_
# pd.DataFrame(fa_six.loadings_, index=df.columns)
The raw loading matrix is hard to read: it is not obvious which variables each factor explains well. Visualizing the loadings as a heatmap makes the result clearer.
import seaborn as sns

df_cm = pd.DataFrame(np.abs(fa_six.loadings_), index=df.columns)
plt.figure(figsize=(14, 14))
ax = sns.heatmap(df_cm, annot=True, cmap="BuPu")
# Set the font size for the y axis
ax.yaxis.set_tick_params(labelsize=15)
plt.title('Factor Analysis', fontsize='xx-large')
# Set y-axis label
plt.ylabel('Personality items', fontsize='xx-large')
# Save picture
# plt.savefig(r'C:\Users\Desktop\factorAnalysis.png', dpi=500)
The heatmap shows that factor 6 has no high loading on any variable and is hard to interpret, so adjust the number of factors to 5 and repeat the steps above:
# Establish a factor analysis model with 5 common factors
fa_five = FactorAnalyzer(5, rotation="varimax")
fa_five.fit(df)

import seaborn as sns
df_cm = pd.DataFrame(np.abs(fa_five.loadings_), index=df.columns)
plt.figure(figsize=(14, 14))
ax = sns.heatmap(df_cm, annot=True, cmap="BuPu")
# Set the font size for the y axis
ax.yaxis.set_tick_params(labelsize=15)
plt.title('Factor Analysis', fontsize='xx-large')
# Set y-axis label
plt.ylabel('Personality items', fontsize='xx-large')
According to the results above:
Factor 1 loads highly on variables (N1, N2, N3, N4, N5), so it can be interpreted as a neuroticism factor.
Factor 2 loads highly on variables (E1, E2, E3, E4, E5), so it can be interpreted as an extraversion factor.
Factor 3 loads highly on variables (C1, C2, C3, C4, C5), so it can be interpreted as a conscientiousness factor.
Factor 4 loads highly on variables (A1, A2, A3, A4, A5), so it can be interpreted as an agreeableness factor.
Factor 5 loads highly on variables (O1, O2, O3, O4, O5), so it can be interpreted as an openness factor.
# Variance explained by each factor and its cumulative contribution
fa_v = fa_five.get_factor_variance()
fa_dt = pd.DataFrame({
    "Characteristic root": fa_v[0],
    "Variance contribution rate": fa_v[1],
    "Cumulative contribution rate of variance": fa_v[2]
})
fa_dt
The eigenvalues of the five factors together account for 42.36% of the total; in other words, the five factors explain 42.36% of the information in all the variables.
6.5 calculate factor score
# Calculate factor scores
score = fa_five.transform(df)
score
# Calculate the comprehensive score: weight each factor score by its variance contribution rate
x = score @ fa_v[1]
result = pd.DataFrame(x, columns=["Comprehensive score"], index=df.index)
result.sort_values(by="Comprehensive score", ascending=False, inplace=True)
result
Full code:
import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_kmo
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Import data
df = pd.read_csv(r'C:\Users\wy\Desktop\bfi.csv')
df.head()

# Delete irrelevant columns
df.drop(["gender", "education", "age", "Unnamed: 0"], axis=1, inplace=True)

# Check for missing values
df.isnull().sum()

# Delete missing values
df.dropna(inplace=True)

# Factor analysis reliability test
kmo_all, kmo_model = calculate_kmo(df)
chi_square_value, p_value = calculate_bartlett_sphericity(df)
print("kmo_all:", kmo_all, end="\n\n")
print("kmo_model:", kmo_model, end="\n\n")
print("chi_square_value:", chi_square_value, end="\n\n")
print("p_value:", p_value)

# Exploratory factor analysis
fa = FactorAnalyzer(25, rotation=None)
fa.fit(df)
# Eigenvalues of the correlation coefficient matrix
ev, v = fa.get_eigenvalues()
ev, v
# By the eigenvalue > 1 rule, six common factors can be extracted

# Draw the scree plot to choose the number of factors
plt.scatter(range(1, df.shape[1] + 1), ev)
plt.plot(range(1, df.shape[1] + 1), ev)
plt.title('Scree Plot')
plt.xlabel('Factors')
plt.ylabel('Eigenvalue')
plt.grid()
plt.show()

# Establish factor analysis model
fa_six = FactorAnalyzer(6, rotation="varimax")
fa_six.fit(df)
# Output the factor loadings
fa_six.loadings_

# The heatmap of six common factors shows Factor 6 has no high loading on any variable,
# so adjust to 5 common factors
fa_five = FactorAnalyzer(5, rotation="varimax")
fa_five.fit(df)

df_cm = pd.DataFrame(np.abs(fa_five.loadings_), index=df.columns)
plt.figure(figsize=(14, 14))
ax = sns.heatmap(df_cm, annot=True, cmap="BuPu")
# Set the font size for the y axis
ax.yaxis.set_tick_params(labelsize=15)
plt.title('Factor Analysis', fontsize='xx-large')
# Set y-axis label
plt.ylabel('Personality items', fontsize='xx-large')

# Variance explained by each factor and its cumulative contribution
fa_v = fa_five.get_factor_variance()
fa_dt = pd.DataFrame({
    "Characteristic root": fa_v[0],
    "Variance contribution rate": fa_v[1],
    "Cumulative contribution rate of variance": fa_v[2]
})
fa_dt

# Calculate factor scores
score = fa_five.transform(df)
score

# Calculate the comprehensive score: weight each factor score by its variance contribution rate
x = score @ fa_v[1]
result = pd.DataFrame(x, columns=["Comprehensive score"], index=df.index)
result.sort_values(by="Comprehensive score", ascending=False, inplace=True)
result