# [R Experiment.9] Principal Component and Factor Analysis

There is no single correct solution; the methods below reflect Xuan Zi's personal preferences and are for reference only. If there are any errors, please point them out in the comments!

9.1 Use the principal component method to discuss the main structure of urban industry. Table 9-4 lists thirteen industries of an industrial department in a city: metallurgy (1), electricity (2), coal (3), chemistry (4), machinery (5), building materials (6), forestry (7), food (8), textiles (9), sewing (10), leather (11), paper (12), and cultural and artistic articles (13), each measured on eight indicators: annual net fixed assets value X1 (10,000 yuan), number of employees X2 (persons), total industrial output value X3 (10,000 yuan), total labor productivity X4 (yuan/person-year), output value per 100 yuan of original fixed assets X5 (yuan), capital interest rate X6 (%), standard fuel consumption X7 (tons), and energy use efficiency X8 (yuan/ton).

(1) Use principal component analysis to extract several principal components from the eight indicators and interpret them;

```
> #Import data (filename follows the data9.2.txt / data9.3.txt convention used below)
> data9.1 <- read.table("data9.1.txt")
> library(psych)
> #Scree plot - preliminary estimate of the number of principal components
> fa.parallel(data9.1[,], fa="pc", n.iter = 100, main = "Scree plot")
```

By the Kaiser-Harris criterion, the first two principal components should generally be retained.
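The Kaiser-Harris criterion keeps components whose eigenvalues of the correlation matrix exceed 1. A minimal base-R sketch of the check (the small correlation matrix below is made up for illustration, not the Table 9-4 data):

```r
# Kaiser-Harris check: count eigenvalues of the correlation matrix above 1.
# The matrix below is an illustrative 3-variable example, not the Table 9-4 data.
R <- matrix(c(1.0, 0.8, 0.1,
              0.8, 1.0, 0.2,
              0.1, 0.2, 1.0), nrow = 3)
ev <- eigen(R)$values
n_keep <- sum(ev > 1)   # number of components to retain
# Eigenvalues of a correlation matrix always sum to the number of variables.
```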

```
> #Extract principal components
> pc <- principal(data9.1[,], nfactors = 2, covar = FALSE, rotate = "none")
> pc
Principal Components Analysis
Call: principal(r = data9.1[, ], nfactors = 2, rotate = "none", covar = FALSE)
PC1   PC2   h2    u2 com
V1  0.84  0.50 0.96 0.041 1.6
V2  0.83  0.47 0.92 0.082 1.6
V3  0.75  0.64 0.97 0.028 2.0
V4 -0.38  0.77 0.73 0.269 1.5
V5 -0.68  0.56 0.79 0.214 1.9
V6 -0.62  0.69 0.86 0.144 2.0
V7  0.38 -0.64 0.56 0.444 1.6
V8  0.10  0.46 0.22 0.775 1.1

PC1  PC2
Proportion Var        0.39 0.36
Cumulative Var        0.39 0.75
Proportion Explained  0.52 0.48
Cumulative Proportion 0.52 1.00

Mean item complexity =  1.7
Test of the hypothesis that 2 components are sufficient.

The root mean square of the residuals (RMSR) is  0.09
with the empirical chi square  5.62  with prob <  0.96
Fit based upon off diagonal values = 0.96
```

As the PC1 and PC2 columns show, the first principal component explains 39% of the variance in the main structure of the city's industry and the second explains 36%, for a cumulative 75%, which is acceptable for a macroeconomic problem.
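As a sanity check, Proportion Var for a component equals the sum of its squared loadings divided by the number of variables. A small sketch using the PC1 loadings copied from the output above:

```r
# PC1 loadings copied from the principal() output above (8 indicators).
pc1_loadings <- c(0.84, 0.83, 0.75, -0.38, -0.68, -0.62, 0.38, 0.10)
prop_var <- sum(pc1_loadings^2) / length(pc1_loadings)
round(prop_var, 2)   # matches the reported 0.39
```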

```
> #When multiple components are extracted, rotate to make the results more interpretable
> pc <- principal(data9.1[,], nfactors = 2, covar = FALSE, rotate = "varimax")
> pc
Principal Components Analysis
Call: principal(r = data9.1[, ], nfactors = 2, rotate = "varimax",
covar = FALSE)
RC1   RC2   h2    u2 com
V1  0.98 -0.09 0.96 0.041 1.0
V2  0.95 -0.11 0.92 0.082 1.0
V3  0.98  0.08 0.97 0.028 1.0
V4  0.15  0.84 0.73 0.269 1.1
V5 -0.22  0.86 0.79 0.214 1.1
V6 -0.09  0.92 0.86 0.144 1.0
V7 -0.08 -0.74 0.56 0.444 1.0
V8  0.35  0.32 0.22 0.775 2.0

RC1  RC2
Proportion Var        0.38 0.37
Cumulative Var        0.38 0.75
Proportion Explained  0.51 0.49
Cumulative Proportion 0.51 1.00
```

After rotation, the cumulative variance explained by the two components is unchanged (75%); only each component's contribution changes: the first principal component now explains 38% of the variance in the main structure of urban industry, and the second explains 37%.

PC1 = 0.98x1 + 0.95x2 + 0.98x3 + 0.15x4 - 0.22x5 - 0.09x6 - 0.08x7 + 0.35x8

PC2 = -0.09x1 - 0.11x2 + 0.08x3 + 0.84x4 + 0.86x5 + 0.92x6 - 0.74x7 + 0.32x8

From the loadings above, principal component one can be interpreted as industrial scale and principal component two as output benefit.

Note: three principal components could also be retained, since the problem does not specify a maximum information-loss rate.
I use two here because explaining more than 70% of the variance is sufficient for a macro-level problem.

(3) Sort and classify the 13 industries by principal component score.
The principal component scores can be obtained by substituting values into the formulas above.
For example, the first principal component score of the metallurgical industry:

```
> pc1 = 0.98*90342 + 0.95*52455 + 0.98*101091 + 0.15*19272 - 0.22*82.0 - 0.09*16.1 - 0.08*197435 + 0.35*0.172
> pc1
[1] 224513.2
```

The other industries are computed in the same way and are omitted here.

```
> #Get principal component scores
> pc <- principal(data9.1[,], nfactors = 2, scores = TRUE)
> pc$scores
RC1        RC2
[1,]  0.93850029 -0.1370301
[2,] -0.67434595 -1.3940307
[3,] -0.63978735 -1.8817249
[4,]  0.62224308  0.4055288
[5,]  2.85780141 -0.4520175
[6,] -0.46651650 -0.9503237
[7,] -0.61469825  0.2202598
[8,] -0.22232063  1.8001172
[9,] -0.07279029  0.7067208
[10,] -0.64167788  1.0737326
[11,] -0.58912765 -0.1177322
[12,] -0.53868266  0.4183384
[13,]  0.04140237  0.3081615
```
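To sort the industries, the scores printed above can simply be ordered by the first component. A minimal sketch, hard-coding the PC1 scores from `pc$scores` (industry numbering as in Table 9-4):

```r
# PC1 scores copied from pc$scores above, in industry order 1..13.
pc1_scores <- c(0.9385, -0.6743, -0.6398, 0.6222, 2.8578, -0.4665, -0.6147,
                -0.2223, -0.0728, -0.6417, -0.5891, -0.5387, 0.0414)
ranking <- order(pc1_scores, decreasing = TRUE)
ranking   # machinery (5) ranks first on industrial scale, then metallurgy (1)
```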

9.2 To investigate the sales volume Y of a certain type of consumer goods in a region, four related variables are considered: X1 - disposable income of residents, X2 - price index of this type of consumer goods, X3 - stock of the consumer goods in society, X4 - average price index of other consumer goods, as shown in Table 9-5. Establish the regression equation between sales volume and the four variables X1, X2, X3, X4 by principal component regression.

1. Find the principal component first

```
> data9.2 <- read.table("data9.2.txt")
> library(psych)
> #Scree plot - preliminary estimate of the number of principal components
> fa.parallel(data9.2[,-5], fa="pc", n.iter = 100, main = "Scree plot")
```

One principal component is expected (Kaiser-Harris criterion).

```
> pc <- principal(data9.2[,-5], nfactors = 1, covar = FALSE, rotate = "none")
> pc
Principal Components Analysis
Call: principal(r = data9.2[, -5], nfactors = 1, rotate = "none", covar = FALSE)
PC1   h2     u2 com
V1 1.00 0.99 0.0078   1
V2 0.99 0.99 0.0149   1
V3 0.99 0.98 0.0221   1
V4 0.99 0.99 0.0114   1

PC1
Proportion Var 0.99
```

Proportion Var = 0.99, so the expectation is correct. The first principal component can be written as

PC1 = x1 + 0.99(x2 + x3 + x4)

Compute the score vector z from the raw data:

```
> x1 <- data9.2$V1
> x2 <- data9.2$V2; x3 <- data9.2$V3
> x4 <- data9.2$V4
> y <- data9.2$V5
> z <- x1 + 0.99*(x2+x3+x4) #R applies arithmetic element-wise to vectors
```

2. Principal component regression

```
> fit <- lm(y ~ z)
> summary(fit)
Call:
lm(formula = y ~ z)

Residuals:
Min      1Q  Median      3Q     Max
-0.6233 -0.3120 -0.0141  0.3308  0.5571

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -11.049619   0.809174  -13.65 7.96e-07 ***
z             0.068333   0.002176   31.41 1.15e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4132 on 8 degrees of freedom
Multiple R-squared:  0.992,	Adjusted R-squared:  0.9909
F-statistic: 986.4 on 1 and 8 DF,  p-value: 1.149e-09
```

From the output, the explanatory variable z passes the t-test, the overall model passes the F-test, and the adjusted R-squared exceeds 0.99, so the model fits well. The fitted equation is:

Y = -11.049619 + 0.068333z
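If desired, the fitted equation can be expanded back onto the original predictors by distributing the coefficient of z. A small sketch using only the estimates above:

```r
# Back-transform the principal component regression coefficients:
# Y = -11.049619 + 0.068333 * (x1 + 0.99*(x2 + x3 + x4))
b_z <- 0.068333
coef_x1 <- b_z            # coefficient on x1
coef_x234 <- b_z * 0.99   # common coefficient on x2, x3, x4
round(coef_x234, 4)       # 0.0676
```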

9.3 To examine students' learning, the school randomly selected the results of 12 students in 5 final examinations; the data are shown in Table 9-6. Analyze this data using factor analysis.

(1) Find out the common factors of the five courses and give reasonable explanations;
1. Determine the number of common factors to extract

```
> data9.3 <- read.table("data9.3.txt", header=T)
> library(psych)
> #fa="both" shows both principal component and common factor results
> fa.parallel(data9.3[,-1], n.obs = 12, fa="both", n.iter = 100)
```

The EFA results clearly call for extracting two factors: in the scree test, the first two eigenvalues (triangles) lie above the elbow and are larger than the mean eigenvalues of 100 simulated random data matrices. (For EFA, the Kaiser-Harris criterion retains eigenvalues greater than 0, not 1.)
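The logic of parallel analysis can be sketched in base R: compare the observed eigenvalues with the average eigenvalues of random data of the same size (the dimensions below match this exercise, but the simulation is only illustrative of what fa.parallel does internally):

```r
# Parallel-analysis idea: keep factors whose observed eigenvalues exceed
# the average eigenvalues of random data with the same dimensions.
set.seed(1)
n <- 12; p <- 5; n_iter <- 100
random_ev <- replicate(n_iter, eigen(cor(matrix(rnorm(n * p), n, p)))$values)
mean_random <- rowMeans(random_ev)   # reference line in the scree plot
# Eigenvalues of any correlation matrix sum to p, so their mean is exactly 1.
```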

2. Extract the common factors

```
> #Extract unrotated factors using principal axis factoring (fm="pa")
> fa <- fa(data9.3[,-1], nfactors = 2, rotate = "none", fm="pa")
> fa
Factor Analysis using method =  pa
Call: fa(r = data9.3[, -1], nfactors = 2, rotate = "none", fm = "pa")
PA1   PA2   h2     u2 com
Politics 0.82 -0.65 1.09 -0.092 1.9
Language 0.84 -0.30 0.80  0.199 1.2
Foreign Language 0.68  0.02 0.46  0.535 1.0
Math 0.90  0.55 1.11 -0.108 1.7
Physical 0.62  0.44 0.58  0.419 1.8

PA1  PA2
Proportion Var        0.61 0.20
Cumulative Var        0.61 0.81
Proportion Explained  0.75 0.25
Cumulative Proportion 0.75 1.00
```

The two factors together account for 81% of the variance in the final-exam results. However, the unrotated loadings are hard to interpret, so we rotate the factors.
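Varimax is an orthogonal rotation, so it leaves each variable's communality unchanged. A base-R sketch using stats::varimax on the (rounded) unrotated loadings printed above:

```r
# Unrotated principal-axis loadings copied (rounded) from the output above
# (rows: Politics, Language, Foreign Language, Math, Physics).
L <- matrix(c(0.82, -0.65,
              0.84, -0.30,
              0.68,  0.02,
              0.90,  0.55,
              0.62,  0.44), ncol = 2, byrow = TRUE)
rot <- varimax(L)   # orthogonal rotation from base R's stats package
# Communalities (row sums of squared loadings) are preserved by the rotation.
round(rowSums(unclass(rot$loadings)^2) - rowSums(L^2), 10)
```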

```
> fa.varimax <- fa(data9.3[,-1], nfactors = 2, rotate = "varimax", fm="pa")
> fa.varimax
Factor Analysis using method =  pa
Call: fa(r = data9.3[, -1], nfactors = 2, rotate = "varimax", fm = "pa")
PA1  PA2   h2     u2 com
Politics 1.04 0.10 1.09 -0.092 1.0
Language 0.81 0.37 0.80  0.199 1.4
Foreign Language 0.48 0.49 0.46  0.535 2.0
Math 0.27 1.02 1.11 -0.108 1.1
Physical 0.14 0.75 0.58  0.419 1.1

PA1  PA2
Proportion Var        0.41 0.40
Cumulative Var        0.41 0.81
Proportion Explained  0.51 0.49
Cumulative Proportion 0.51 1.00
```

The rotated loadings show that politics and language load heavily on the first factor, mathematics and physics load heavily on the second, and foreign language loads fairly evenly on both, suggesting a linguistic-intelligence factor and a logical-intelligence factor.

(2) Compute the factor scores of the samples by the regression or Bartlett method, draw a scatterplot of common factors 1 and 2, and analyze the learning of the 12 students.
Using the rotated loadings, the two common factors can be written as:

PA1 = 1.04x1 + 0.81x2 + 0.48x3 + 0.27x4 + 0.14x5

PA2 = 0.10x1 + 0.37x2 + 0.49x3 + 1.02x4 + 0.75x5

```
> #Calculate factor scores
> x1 <- data9.3[,2]; x2 <- data9.3[,3]; x3 <- data9.3[,4]; x4 <- data9.3[,5]; x5 <- data9.3[,6]
> Pa1 <- 1.04*x1 + 0.81*x2 + 0.48*x3 + 0.27*x4 + 0.14*x5
> Pa2 <- 0.10*x1 + 0.37*x2 + 0.49*x3 + 1.02*x4 + 0.75*x5
> par(mfrow=c(1,2))
> plot(data9.3[,1], Pa1)
> plot(data9.3[,1], Pa2)
```

We find that the 12 students' performance is similar across the two factors: students with high scores on common factor 1 also tend to score high on common factor 2 (much like first-year students before they are split into arts and science classes).

```#Common factor scatterplot