There is no single correct solution. The methods below reflect Xuan Zi's personal preferences and are for reference only. If you spot any errors, please point them out in the comments!
9.1 Discuss the main structure of urban industry with the principal component method. Table 9-4 gives data on thirteen industries of a city's industrial sector: metallurgy (1), electricity (2), coal (3), chemistry (4), machinery (5), building materials (6), forestry (7), food (8), textiles (9), sewing (10), leather (11), paper (12), and cultural and artistic articles (13), with eight indicators each: annual net value of fixed assets X1 (10,000 yuan), number of employees X2 (persons), total industrial output value X3 (10,000 yuan), total labor productivity X4 (yuan/person-year), output value realized per 100 yuan of original fixed assets X5 (yuan), profit-tax rate on capital X6 (%), standard fuel consumption X7 (tons), and energy use efficiency X8 (yuan/ton).
(1) Use principal component analysis to determine the principal components of the eight indicators and interpret them;
> #Import the data
> data9.1 <- read.table("data9.1.txt", header = FALSE)
> library(psych)
> #Scree plot - a preliminary estimate of the number of principal components
> fa.parallel(data9.1[,], fa = "pc", n.iter = 100, main = "Scree plot")
By the Kaiser-Harris criterion, the first two principal components should generally be retained.
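As a numeric cross-check of the scree plot, the eigenvalues of the correlation matrix can be computed directly. A minimal sketch, assuming data9.1 has been read in as above:

# Kaiser-Harris: retain components whose eigenvalues exceed 1
ev <- eigen(cor(data9.1))$values
ev
ev / sum(ev)  # proportion of variance carried by each component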
> #Extract the principal components
> pc <- principal(data9.1[,], nfactors = 2, covar = FALSE, rotate = "none")
> pc
Principal Components Analysis
Call: principal(r = data9.1[, ], nfactors = 2, rotate = "none", covar = FALSE)
Standardized loadings (pattern matrix) based upon correlation matrix
     PC1   PC2   h2    u2 com
V1  0.84  0.50 0.96 0.041 1.6
V2  0.83  0.47 0.92 0.082 1.6
V3  0.75  0.64 0.97 0.028 2.0
V4 -0.38  0.77 0.73 0.269 1.5
V5 -0.68  0.56 0.79 0.214 1.9
V6 -0.62  0.69 0.86 0.144 2.0
V7  0.38 -0.64 0.56 0.444 1.6
V8  0.10  0.46 0.22 0.775 1.1

                       PC1  PC2
SS loadings           3.10 2.90
Proportion Var        0.39 0.36
Cumulative Var        0.39 0.75
Proportion Explained  0.52 0.48
Cumulative Proportion 0.52 1.00

Mean item complexity = 1.7
Test of the hypothesis that 2 components are sufficient.
The root mean square of the residuals (RMSR) is 0.09
 with the empirical chi square 5.62 with prob < 0.96
Fit based upon off diagonal values = 0.96
As the Proportion Var and Cumulative Var rows show, the first principal component explains 39% of the variance in the main structure of the city's industry and the second explains 36%, for a cumulative 75%, which is quite good for a macroeconomic problem.
> #When more than one component is extracted, rotate to make the result easier to interpret
> pc <- principal(data9.1[,], nfactors = 2, covar = FALSE, rotate = "varimax")
> pc
Principal Components Analysis
Call: principal(r = data9.1[, ], nfactors = 2, rotate = "varimax", covar = FALSE)
Standardized loadings (pattern matrix) based upon correlation matrix
     RC1   RC2   h2    u2 com
V1  0.98 -0.09 0.96 0.041 1.0
V2  0.95 -0.11 0.92 0.082 1.0
V3  0.98  0.08 0.97 0.028 1.0
V4  0.15  0.84 0.73 0.269 1.1
V5 -0.22  0.86 0.79 0.214 1.1
V6 -0.09  0.92 0.86 0.144 1.0
V7 -0.08 -0.74 0.56 0.444 1.0
V8  0.35  0.32 0.22 0.775 2.0

                       RC1  RC2
SS loadings           3.03 2.97
Proportion Var        0.38 0.37
Cumulative Var        0.38 0.75
Proportion Explained  0.51 0.49
Cumulative Proportion 0.51 1.00
After rotation, the cumulative variance explained by the two components is unchanged (75%); only the contribution of each component shifts: the first principal component now explains 38% of the variance in the city's industrial structure and the second explains 37%. The rotated components can be written as:
PC1 = 0.98x1 + 0.95x2 + 0.98x3 + 0.15x4 - 0.22x5 - 0.09x6 - 0.08x7 + 0.35x8
PC2 = -0.09x1 - 0.11x2 + 0.08x3 + 0.84x4 + 0.86x5 + 0.92x6 - 0.74x7 + 0.32x8
From the above, the first principal component can be interpreted as industrial scale and the second as output benefit.
Note: three principal components could also be selected for this problem, since it does not specify a maximum acceptable information loss rate.
I use two principal components here because, in my view, explaining more than 70% of the variance is sufficient for a macro-level problem.
(3) Sort and classify 13 industries by principal component score.
The principal component scores can be obtained by substituting each industry's data into the formulas above.
For example, calculating the first principal component score of the metallurgical industry:
> pc1 = 0.98*90342 + 0.95*52455 + 0.98*101091 + 0.15*19272 - 0.22*82.0 - 0.09*16.1 - 0.08*197435 + 0.35*0.172
> pc1
[1] 224513.2
The other industries are computed similarly and are omitted here.
> #Get the principal component scores
> pc <- principal(data9.1[,], nfactors = 2, scores = TRUE)
> pc$scores
              RC1        RC2
 [1,]  0.93850029 -0.1370301
 [2,] -0.67434595 -1.3940307
 [3,] -0.63978735 -1.8817249
 [4,]  0.62224308  0.4055288
 [5,]  2.85780141 -0.4520175
 [6,] -0.46651650 -0.9503237
 [7,] -0.61469825  0.2202598
 [8,] -0.22232063  1.8001172
 [9,] -0.07279029  0.7067208
[10,] -0.64167788  1.0737326
[11,] -0.58912765 -0.1177322
[12,] -0.53868266  0.4183384
[13,]  0.04140237  0.3081615
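Part (3) asks for a sorting and classification, which can be read off the score matrix. A minimal sketch, assuming pc$scores from above and the industry order from the problem statement (note that principal() computes scores from standardized variables, which is why they differ in scale from the hand calculation):

industries <- c("metallurgy", "electricity", "coal", "chemistry", "machinery",
                "building materials", "forestry", "food", "textiles", "sewing",
                "leather", "paper", "cultural and artistic articles")
scores <- data.frame(industry = industries, pc$scores)
# Rank the 13 industries by the first (industrial scale) component
scores[order(scores$RC1, decreasing = TRUE), ]
# A crude classification by the signs of the two component scores
split(scores$industry, interaction(scores$RC1 > 0, scores$RC2 > 0))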
9.2 To investigate how the sales volume Y of a certain type of consumer goods in a certain region relates to the following four variables: X1 - disposable income of residents, X2 - price index of this type of consumer goods, X3 - the amount of these consumer goods held in society, X4 - average price index of other consumer goods, as shown in Table 9-5. Establish the regression equation between sales volume Y and the four variables X1, X2, X3, X4 by principal component regression.
1. First find the principal components
> data9.2 <- read.table("data9.2.txt")
> library(psych)
> #Scree plot - a preliminary estimate of the number of principal components
> fa.parallel(data9.2[,-5], fa = "pc", n.iter = 100, main = "Scree plot")
One principal component is expected (Kaiser-Harris criterion).
> pc <- principal(data9.2[,-5], nfactors = 1, covar = FALSE, rotate = "none")
> pc
Principal Components Analysis
Call: principal(r = data9.2[, -5], nfactors = 1, rotate = "none", covar = FALSE)
Standardized loadings (pattern matrix) based upon correlation matrix
    PC1   h2     u2 com
V1 1.00 0.99 0.0078   1
V2 0.99 0.99 0.0149   1
V3 0.99 0.98 0.0221   1
V4 0.99 0.99 0.0114   1

                PC1
SS loadings    3.94
Proportion Var 0.99

With Proportion Var = 0.99, the expectation is confirmed. The first principal component can be written as PC1 = x1 + 0.99(x2 + x3 + x4). Compute this composite vector z from the raw data:
> x1 <- data9.2$V1
> x2 <- data9.2$V2; x3 <- data9.2$V3
> x4 <- data9.2$V4
> y <- data9.2$V5
> z <- x1 + 0.99*(x2+x3+x4) #R applies arithmetic elementwise to vectors
2. Principal component regression
> fit <- lm(y ~ z)
> summary(fit)
Call:
lm(formula = y ~ z)
Residuals:
    Min      1Q  Median      3Q     Max
-0.6233 -0.3120 -0.0141  0.3308  0.5571
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -11.049619   0.809174  -13.65 7.96e-07 ***
z             0.068333   0.002176   31.41 1.15e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4132 on 8 degrees of freedom
Multiple R-squared: 0.992, Adjusted R-squared: 0.9909
F-statistic: 986.4 on 1 and 8 DF, p-value: 1.149e-09
From this output, the explanatory variable z passes the t-test, the overall model passes the F-test, and the adjusted R-squared exceeds 0.99, so the model explains the data well. The fitted equation is: Y = -11.049619 + 0.068333z
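The fitted equation is in terms of the composite z rather than the original regressors. If coefficients on X1-X4 are wanted, the z coefficient can be pushed back through z's definition. A minimal sketch, assuming x1-x4 and the fitted model fit from the session above:

# Back out coefficients on the original variables from z = x1 + 0.99*(x2+x3+x4)
b <- coef(fit)                         # intercept and coefficient of z
beta <- b[2] * c(1, 0.99, 0.99, 0.99)  # chain rule through z's definition
names(beta) <- c("x1", "x2", "x3", "x4")
beta
# Sanity check: these coefficients reproduce the fitted values
all.equal(as.vector(b[1] + cbind(x1, x2, x3, x4) %*% beta),
          unname(fitted(fit)))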
9.3 To examine students' learning, the school randomly selected 12 students' results on 5 final examinations; the data are shown in Table 9-6. Analyze these data using the factor analysis method.
(1) Find the common factors of the five courses and give them reasonable interpretations;
1. Determine the number of common factors to extract
> data9.3 <- read.table("data9.3.txt", header = T)
> library(psych)
> #Show both the principal component and common factor analysis results
> fa.parallel(data9.3[,-1], n.obs = 12, fa = "both", n.iter = 100)
The EFA results clearly call for extracting two factors: the first two eigenvalues (triangles) in the scree test lie above the bend and are larger than the mean eigenvalues from 100 simulated data matrices. (For EFA, the Kaiser-Harris criterion counts eigenvalues greater than 0, not 1.)
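If you prefer numbers to reading the plot, fa.parallel() also returns the eigenvalues it draws (in the fields fa.values and pc.values, per the psych documentation), so the criterion can be checked directly:

fp <- fa.parallel(data9.3[,-1], fa = "both", n.iter = 100)
fp$fa.values            # EFA eigenvalues
sum(fp$fa.values > 0)   # Kaiser-Harris count for EFA (eigenvalues > 0)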
2. Extract the common factors
> #Extract unrotated factors using principal axis factoring (fm = "pa")
> fa <- fa(data9.3[,-1], nfactors = 2, rotate = "none", fm = "pa")
> fa
Factor Analysis using method = pa
Call: fa(r = data9.3[, -1], nfactors = 2, rotate = "none", fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
                  PA1   PA2   h2     u2 com
Politics         0.82 -0.65 1.09 -0.092 1.9
Language         0.84 -0.30 0.80  0.199 1.2
Foreign Language 0.68  0.02 0.46  0.535 1.0
Math             0.90  0.55 1.11 -0.108 1.7
Physics          0.62  0.44 0.58  0.419 1.8

                       PA1  PA2
SS loadings           3.04 1.01
Proportion Var        0.61 0.20
Cumulative Var        0.61 0.81
Proportion Explained  0.75 0.25
Cumulative Proportion 0.75 1.00
As shown, the two factors together account for 81% of the variance in final exam results. However, the factor loadings are hard to interpret as they stand, so we apply a factor rotation.
> fa.varimax <- fa(data9.3[,-1], nfactors = 2, rotate = "varimax", fm = "pa")
> fa.varimax
Factor Analysis using method = pa
Call: fa(r = data9.3[, -1], nfactors = 2, rotate = "varimax", fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
                  PA1  PA2   h2     u2 com
Politics         1.04 0.10 1.09 -0.092 1.0
Language         0.81 0.37 0.80  0.199 1.4
Foreign Language 0.48 0.49 0.46  0.535 2.0
Math             0.27 1.02 1.11 -0.108 1.1
Physics          0.14 0.75 0.58  0.419 1.1

                       PA1  PA2
SS loadings           2.06 1.98
Proportion Var        0.41 0.40
Cumulative Var        0.41 0.81
Proportion Explained  0.51 0.49
Cumulative Proportion 0.51 1.00
The results show that politics and language load heavily on the first factor, while mathematics and physics load heavily on the second; foreign language loads roughly evenly on both. This suggests a verbal (linguistic) intelligence factor and a logical intelligence factor.
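As a quick visual check of this grouping, psych's fa.diagram() draws each variable next to the factor on which it loads most heavily (assuming fa.varimax from above):

# Path diagram: each course attaches to its dominant rotated factor
fa.diagram(fa.varimax, simple = TRUE)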
(2) Compute the factor scores of the samples by the regression or Bartlett method, draw a scatterplot of common factor 1 against common factor 2 of the factor scores, and analyze the learning of the 12 students.
The two common factors above can be expressed as:
Pa1 = 1.04x1 + 0.81x2 + 0.48x3 + 0.27x4 + 0.14x5
Pa2 = 0.1x1 + 0.37x2 + 0.49x3 + 1.02x4 + 0.75x5
> #Calculate the factor scores
> x1 <- data9.3[,2]; x2 <- data9.3[,3]; x3 <- data9.3[,4]
> x4 <- data9.3[,5]; x5 <- data9.3[,6]
> Pa1 = 1.04*x1 + 0.81*x2 + 0.48*x3 + 0.27*x4 + 0.14*x5
> Pa2 = 0.1*x1 + 0.37*x2 + 0.49*x3 + 1.02*x4 + 0.75*x5
> par(mfrow = c(1,2))
> num <- data9.3[,1]
> plot(num, Pa1)
> plot(num, Pa2)
From the two plots we can see that the 12 students' performance is similar on the two factors: a student with a high score on common factor 1 also tends to score high on common factor 2 (somewhat like the first year of senior high school, before students are divided into arts and science classes).
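The problem actually asks for scores by the regression or Bartlett method; fa() computes these through its scores argument, and the requested factor-1 vs factor-2 scatterplot then comes straight from the score matrix. A minimal sketch, assuming data9.3 as above:

# Bartlett (weighted least squares) factor scores; scores = "regression" also works
fa.b <- fa(data9.3[,-1], nfactors = 2, rotate = "varimax", fm = "pa",
           scores = "Bartlett")
fa.b$scores  # one row of (factor 1, factor 2) scores per student
# Scatterplot of common factor 1 against common factor 2, labeled by student number
plot(fa.b$scores[,1], fa.b$scores[,2],
     xlab = "Common factor 1 (verbal)", ylab = "Common factor 2 (logical)")
text(fa.b$scores[,1], fa.b$scores[,2], labels = data9.3[,1], pos = 3)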
> #Common factor scatterplot
> factor.plot(fa.varimax, labels = rownames(fa.varimax$loadings))
The diagram shows that the two factors vary inversely (strength in one tends to go with weakness in the other), which also indirectly supports the rationality of teaching arts and science students separately.