A basic introduction to dummy variables and how to set them in R

1. Basic introduction to dummy variables

When building a regression model, if the independent variable X is continuous, the regression coefficient β can be interpreted as the average change in the dependent variable Y for each one-unit change in X, with the other independent variables held constant. If X is a dichotomous variable, such as whether someone drinks alcohol (1 = yes, 0 = no), β can be interpreted as the average difference in Y between X=1 (drinkers) and X=0 (non-drinkers), again with the other independent variables held constant.
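
The interpretation for a dichotomous variable can be checked on simulated data. The following is a minimal sketch; the variable names `drink` and `y` and the simulated effect size are assumptions made purely for illustration.

```r
# Simulated data (hypothetical): y depends on a 0/1 drinking indicator
set.seed(1)
drink <- rep(c(0, 1), each = 50)       # 0 = non-drinker, 1 = drinker
y     <- 10 + 3 * drink + rnorm(100)   # the simulated group difference is 3

fit <- lm(y ~ drink)

# With a single 0/1 predictor, the slope equals the difference in group means
beta_hat  <- unname(coef(fit)["drink"])
mean_diff <- mean(y[drink == 1]) - mean(y[drink == 0])
print(c(beta_hat = beta_hat, mean_diff = mean_diff))
```

With other covariates in the model the coefficient becomes an adjusted difference, so it no longer has to match the raw difference in means.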

However, when X is a multi-category variable, such as occupation, education, blood type or disease severity, a single regression coefficient cannot adequately describe how the categories differ from one another or how they affect the dependent variable.

In this situation we usually convert the original multi-category variable into dummy variables. Each dummy variable represents only the difference between two levels (or groups of levels). When the regression model is fitted, each dummy variable gets its own estimated coefficient, which makes the results easier to interpret and of more practical use.

1.1 What is a dummy variable?

A dummy variable, also known as an indicator variable, is, as the name suggests, an artificially constructed variable. It usually takes the value 0 or 1 and indicates whether an observation has a particular attribute. For an independent variable with n categories, one category is usually chosen as the reference, so n-1 dummy variables are generated.

Introducing dummy variables makes the regression model more complex, but it shows the effect of each category of the independent variable on the dependent variable more directly, and it can improve the precision and accuracy of the model.

For example, if occupation is divided into five categories (student, farmer, worker, civil servant and other) and "other" is taken as the reference, four dummy variables X1-X4 are needed:

X1=1, student; X1=0, not a student;
X2=1, farmer; X2=0, not a farmer;
X3=1, worker; X3=0, not a worker;
X4=1, civil servant; X4=0, not a civil servant.

Each occupational category is then coded as follows:

Occupation      X1  X2  X3  X4
Student          1   0   0   0
Farmer           0   1   0   0
Worker           0   0   1   0
Civil servant    0   0   0   1
Other            0   0   0   0
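
The same coding can be generated in R. This is a minimal sketch with a made-up five-row factor; the object names `occupation` and `X` are illustrative, and "other" is placed first in `levels` so that R treats it as the reference.

```r
# Hypothetical occupation factor; "other" comes first so it becomes the reference
occupation <- factor(
  c("student", "farmer", "worker", "civil servant", "other"),
  levels = c("other", "student", "farmer", "worker", "civil servant")
)

# model.matrix() applies 0/1 (treatment) coding by default; drop the intercept column
X <- model.matrix(~ occupation)[, -1]
colnames(X) <- c("X1", "X2", "X3", "X4")
rownames(X) <- as.character(occupation)
print(X)
```

The row for "other" contains only zeros, which is what marks it as the reference category.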

1.2 When do you need to set dummy variables?

(1) Unordered multi-category variables must be converted into dummy variables before they enter the model

For example, blood type is generally divided into four types, A, B, O and AB, and is an unordered multi-category variable. When entering data, we often code the types as 1, 2, 3 and 4 for convenience.

Numerically, the codes 1, 2, 3 and 4 imply an order from small to large. In reality the four blood types have no such order; they are simply distinct categories of equal standing. Entering the codes 1, 2, 3 and 4 into the regression model as a number is therefore unreasonable, and the variable needs to be converted into dummy variables.
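
The difference between the two codings can be seen on simulated data. In this sketch the group means (5, 9, 4, 7) and all object names are hypothetical; the point is only that the numeric codes yield one slope, while the factor yields one coefficient per non-reference blood type.

```r
set.seed(2)
blood_code <- sample(1:4, 200, replace = TRUE)   # 1 = A, 2 = B, 3 = O, 4 = AB
blood      <- factor(blood_code, levels = 1:4, labels = c("A", "B", "O", "AB"))
y          <- c(A = 5, B = 9, O = 4, AB = 7)[as.character(blood)] + rnorm(200)

fit_num    <- lm(y ~ blood_code)   # treats the codes as a quantity: a single slope
fit_factor <- lm(y ~ blood)        # three dummy variables, with A as the reference

names(coef(fit_num))
names(coef(fit_factor))
```

Because a straight line through the codes is a special case of free group means, the factor model never fits worse than the numeric one.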

(2) For ordered multi-category variables, dummy coding should be considered case by case

For example, disease severity is generally divided into mild, moderate and severe, an ordered multi-category variable. We often code it as 1, 2, 3 (equally spaced) or 1, 2, 4 (constant ratio) so that the increasing numbers reflect the ranking of severity.

Note, however, that such coding also assumes that severity itself is equally spaced or follows a constant ratio. Because disease is clinically complex, the difference between mild and moderate need not equal the difference between moderate and severe, so this coding can be unreasonable. In that case the variable can be converted into dummy variables instead.
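
A small simulation shows what the equally spaced scores hide. The data below are invented: the step from mild to moderate is set to 1 and the step from moderate to severe to 6, and the object names are illustrative.

```r
set.seed(3)
severity <- factor(sample(c("mild", "moderate", "severe"), 300, replace = TRUE),
                   levels = c("mild", "moderate", "severe"))
# Hypothetical group means with unequal steps: 10, 11, 17
y <- c(mild = 10, moderate = 11, severe = 17)[as.character(severity)] + rnorm(300)

score     <- as.integer(severity)   # equally spaced scores 1, 2, 3
fit_score <- lm(y ~ score)          # forces both steps to be the same size
fit_dummy <- lm(y ~ severity)       # one coefficient per non-reference level

coef(fit_score)
b <- coef(fit_dummy)
print(b)
```

The dummy coefficients are both measured against "mild", so the second step is the difference between the "severe" and "moderate" coefficients.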

(3) Continuous variables can be converted into dummy variables after discretization

Many people assume that continuous variables can go straight into the regression model, but sometimes they should be transformed in light of their clinical meaning. For example, when age enters the model as a continuous variable, its coefficient is the effect on the dependent variable of a one-year increase in age. That effect is often very small and of little practical significance.

We can instead discretize age into 10-year groups, such as 0-10, 11-20, 21-30, 31-40 and so on, and code the groups as 1, 2, 3, 4 and so on. The regression coefficient can then be interpreted as the effect on the dependent variable of moving up one age group, that is, roughly 10 years.

This coding assumes an approximately linear relationship between age and the dependent variable. That does not always hold. For example, mortality from a disease may be high in the youngest and oldest age groups and relatively low among young and middle-aged adults, giving a U-shaped relationship between age and death. Coding the age groups as 1, 2, 3 and 4 is then unreasonable.

Therefore, when a continuous independent variable has been discretized and the shape of its relationship with the dependent variable is unknown, converting it into dummy variables is worth considering.
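
The U-shaped case can be sketched with simulated data. Everything here is an assumption for illustration: the quadratic outcome, the 20-year groups (wider than the 10-year groups in the text, to keep the output short) and the object names.

```r
set.seed(4)
age <- runif(500, 0, 80)
# Hypothetical U-shaped outcome: highest at both ends of the age range
y <- 0.01 * (age - 40)^2 + rnorm(500)

age_group   <- cut(age, breaks = c(0, 20, 40, 60, 80), include.lowest = TRUE)
group_score <- as.integer(age_group)   # codes 1, 2, 3, 4

fit_linear <- lm(y ~ group_score)   # assumes a linear trend across the groups
fit_dummy  <- lm(y ~ age_group)     # estimates a separate mean for each group

r2_linear <- summary(fit_linear)$r.squared
r2_dummy  <- summary(fit_dummy)$r.squared
print(c(linear = r2_linear, dummy = r2_dummy))
```

With a symmetric U shape the linear score explains almost nothing, while the dummy coding recovers the high means in the outer groups.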

Similarly, when BMI is divided into underweight, normal weight, overweight and obese according to clinical diagnostic criteria, the cut-off points between categories are not equally spaced. Coding the categories as 1, 2, 3 and so on does not reflect reality, so converting them into dummy variables can be considered.

1.3 How to choose the reference group for dummy variables?

As mentioned above, an independent variable with n categories needs n-1 dummy variables. When all n-1 dummy variables equal 0, the observation belongs to the nth category, and that category serves as the reference.

Take the occupation example above, with five categories (student, farmer, worker, civil servant and other) and four dummy variables. For the "other" category every dummy variable equals 0, so "other" is the reference. In the final model, the regression coefficient of each dummy variable represents the effect of that category on the dependent variable relative to the reference.
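
For a model whose only predictor is the categorical variable, this relationship can be verified directly. The sketch below uses simulated data with hypothetical names; the intercept should equal the mean of the reference group and each dummy coefficient the difference between its group mean and the reference mean.

```r
set.seed(5)
occ_levels <- c("other", "student", "farmer", "worker", "civil servant")
occupation <- factor(sample(occ_levels, 250, replace = TRUE), levels = occ_levels)
y <- rnorm(250, mean = 50)   # hypothetical outcome

fit <- lm(y ~ occupation)
group_means <- sapply(split(y, occupation), mean)

# Intercept            = mean of the reference group ("other")
# Dummy coefficient k  = mean of group k minus the reference mean
print(coef(fit))
print(group_means - group_means["other"])
```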

When setting dummy variables, which category should be chosen as the reference?

(1) In general, choose a category that has a specific meaning or sits at one end of an ordering
For example, if marital status is divided into unmarried, married, divorced, widowed and other, "unmarried" can be used as the reference. Education, divided into primary school, middle school, university and graduate school, has a natural order, so "primary school" can be used as the reference to make the regression coefficients easier to interpret.

(2) The clinically normal level can be chosen as the reference
For example, BMI is divided into underweight, normal weight, overweight and obese according to clinical diagnostic criteria. Choosing "normal weight" as the reference means every other category is compared with normal weight, which is more meaningful clinically.

(3) The category the researchers care most about can also be the reference
For example, blood type is divided into A, B, O and AB. If the researchers are most interested in people with type O blood, type O can be used as the reference, and the analysis then shows how the other blood types differ from type O in their effect on the outcome.
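
In R the reference is simply the first level of the factor, so choosing it is a matter of reordering levels. The sketch below applies this to the BMI example; the cut-offs 18.5, 25 and 30 are the commonly used WHO thresholds and are an assumption here, since the text does not state any, and the data are simulated.

```r
set.seed(6)
bmi <- runif(300, 15, 40)   # hypothetical BMI values

# Cut-offs are assumed (WHO convention); right = FALSE gives intervals like [18.5, 25)
bmi_group <- cut(bmi,
                 breaks = c(-Inf, 18.5, 25, 30, Inf),
                 labels = c("underweight", "normal", "overweight", "obese"),
                 right  = FALSE)

levels(bmi_group)[1]   # the default reference is the first level
bmi_group <- relevel(bmi_group, ref = "normal")
levels(bmi_group)      # "normal" now comes first and becomes the reference

colnames(model.matrix(~ bmi_group))
```

After `relevel()` the design matrix has no column for "normal", so every remaining coefficient is a comparison with normal weight.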

1.4 Precautions when setting dummy variables

(1) In principle, the dummy variables of one categorical variable should enter or leave the model together. If some dummy variables of the same categorical variable are statistically significant and others are not, all of them should still be kept in the model so that each coefficient retains its meaning as a comparison with the reference.
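
One way to respect this rule is to judge the categorical variable as a whole instead of dummy by dummy. The following sketch compares nested models with `anova()`; the data, effect sizes and object names are all simulated or hypothetical.

```r
set.seed(7)
blood <- factor(sample(c("A", "B", "O", "AB"), 200, replace = TRUE),
                levels = c("O", "A", "B", "AB"))   # O is the reference
x <- rnorm(200)
y <- 1 + 0.5 * x + c(O = 0, A = 0.1, B = 0.8, AB = 0.2)[as.character(blood)] + rnorm(200)

fit_full    <- lm(y ~ x + blood)   # all three dummy variables enter together
fit_reduced <- lm(y ~ x)           # all three dummy variables leave together

summary(fit_full)$coef             # individual dummies can differ in significance
cmp <- anova(fit_reduced, fit_full)
print(cmp)                         # a single F test for blood type as a whole
```

The F test has 3 degrees of freedom, one per dummy variable, and gives one decision for the whole factor.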

(2) The group chosen as the reference should have a reasonably large sample size. If the reference group is too small, the standard errors of the parameter estimates increase, the confidence intervals widen and precision falls. Comparisons between the other categories and the reference can then produce extremely large or extremely small estimates.
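
The effect of a small reference group on the standard errors can be illustrated with a simulation. In this sketch the group sizes (5, 100, 100), the level names and the helper `se_for_ref()` are hypothetical.

```r
set.seed(8)
# Hypothetical factor: level "rare" has 5 observations, the other two have 100 each
g <- factor(rep(c("rare", "a", "b"), times = c(5, 100, 100)))
y <- rnorm(length(g))

# Standard errors of the two dummy coefficients for a given reference level
se_for_ref <- function(ref) {
  fit <- lm(y ~ relevel(g, ref = ref))
  unname(summary(fit)$coefficients[-1, "Std. Error"])
}

se_rare <- se_for_ref("rare")   # small reference: both comparisons are imprecise
se_a    <- se_for_ref("a")      # large reference: only the one with "rare" is imprecise
print(rbind(ref_rare = se_rare, ref_a = se_a))
```

Each standard error is proportional to sqrt(1/n_reference + 1/n_group), so a small reference group inflates every comparison at once.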

2. Setting dummy variables in R

When R fits a model to data that contain categorical variables (factors), it generally converts them into dummy variables automatically. Some functions, however, such as neuralnet() in the neuralnet package, do no such preprocessing. If the raw data are passed in directly, the error "requires numeric/complex matrix/vector arguments" appears.

In that case, short of dropping these variables, the factor has to be converted manually into 0/1 dummy variables. The functions generally used are model.matrix() and class.ind() from the nnet package.

2.1 Example data

Let's take the German Credit data from the UCI repository as an example.

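# Read the German Credit data set from the UCI website
# stringsAsFactors = TRUE keeps character columns as factors (the default changed to FALSE in R 4.0.0)
data <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data", stringsAsFactors = TRUE)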
head(data)
##    V1 V2  V3  V4   V5  V6  V7 V8  V9  V10 V11  V12 V13  V14  V15 V16  V17 V18  V19  V20 V21
## 1 A11  6 A34 A43 1169 A65 A75  4 A93 A101   4 A121  67 A143 A152   2 A173   1 A192 A201   1
## 2 A12 48 A32 A43 5951 A61 A73  2 A92 A101   2 A121  22 A143 A152   1 A173   1 A191 A201   2
## 3 A14 12 A34 A46 2096 A61 A74  2 A93 A101   3 A121  49 A143 A152   1 A172   2 A191 A201   1
## 4 A11 42 A32 A42 7882 A61 A74  2 A93 A103   4 A122  45 A143 A153   1 A173   2 A191 A201   1
## 5 A11 24 A33 A40 4870 A61 A73  3 A93 A101   4 A124  53 A143 A153   2 A173   2 A191 A201   2
## 6 A14 36 A32 A46 9055 A65 A73  2 A93 A101   4 A124  35 A143 A153   1 A172   2 A192 A201   1
str(data)
## 'data.frame':    1000 obs. of  21 variables:
##  $ V1 : Factor w/ 4 levels "A11","A12","A13",..: 1 2 4 1 1 4 4 2 4 2 ...
##  $ V2 : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ V3 : Factor w/ 5 levels "A30","A31","A32",..: 5 3 5 3 4 3 3 3 3 5 ...
##  $ V4 : Factor w/ 10 levels "A40","A41","A410",..: 5 5 8 4 1 8 4 2 5 1 ...
##  $ V5 : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ V6 : Factor w/ 5 levels "A61","A62","A63",..: 5 1 1 1 1 5 3 1 4 1 ...
##  $ V7 : Factor w/ 5 levels "A71","A72","A73",..: 5 3 4 4 3 3 5 3 4 1 ...
##  $ V8 : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ V9 : Factor w/ 4 levels "A91","A92","A93",..: 3 2 3 3 3 3 3 3 1 4 ...
##  $ V10: Factor w/ 3 levels "A101","A102",..: 1 1 1 3 1 1 1 1 1 1 ...
##  $ V11: int  4 2 3 4 4 4 4 2 4 2 ...
##  $ V12: Factor w/ 4 levels "A121","A122",..: 1 1 1 2 4 4 2 3 1 3 ...
##  $ V13: int  67 22 49 45 53 35 53 35 61 28 ...
##  $ V14: Factor w/ 3 levels "A141","A142",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ V15: Factor w/ 3 levels "A151","A152",..: 2 2 2 3 3 3 2 1 2 2 ...
##  $ V16: int  2 1 1 1 2 1 1 1 1 2 ...
##  $ V17: Factor w/ 4 levels "A171","A172",..: 3 3 2 3 3 2 3 4 2 4 ...
##  $ V18: int  1 1 2 2 2 2 1 1 1 1 ...
##  $ V19: Factor w/ 2 levels "A191","A192": 2 1 1 1 1 2 1 2 1 1 ...
##  $ V20: Factor w/ 2 levels "A201","A202": 1 1 1 1 1 1 1 1 1 1 ...
##  $ V21: int  1 2 1 1 2 1 1 1 1 2 ...

The data set has 21 variables. V21 is the target variable, and V1-V20 are a mix of integer and factor variables. Below, the categorical variable V1 (a factor with 4 levels) and the numeric variables V2, V5 and V8 are used as explanatory variables.
First, load the neuralnet package and fit a model using only the numeric variables; this runs without error.

library("neuralnet")
NNModelAllNum <- neuralnet(V21 ~ V2 + V5 + V8, data)
NNModelAllNum
## Call: neuralnet(formula = V21 ~ V2 + V5 + V8, data = data)
## 
## 1 repetition was calculated.
## 
##         Error Reached Threshold Steps
## 1 104.9993578    0.005128715177    55

When V1 is added as an explanatory variable, the following error occurs:

NNModel <- neuralnet(V21 ~ V1 + V2 + V5 + V8, data)
## Error in neurons[[i]] %*% weights[[i]] : requires numeric/complex matrix/vector arguments

2.2 Four ways to create dummy variables

# (1) stats::model.matrix()  # Construct design matrices
# Convert V1 into three dummy variables: V1A12, V1A13 and V1A14
dummyV1 <- model.matrix(~V1, data)
head(cbind(dummyV1, data$V1))
##   (Intercept) V1A12 V1A13 V1A14  
## 1           1     0     0     0 1
## 2           1     1     0     0 2
## 3           1     0     0     1 4
## 4           1     0     0     0 1
## 5           1     0     0     0 1
## 6           1     0     0     1 4

# model.matrix() passes numeric variables through unchanged and expands only factors (a 2-level factor becomes a single 0/1 column), so all the variables can be put in one formula to build a new data set, modelData, which can then be used for modeling.
modelData <- model.matrix(~V1 + V2 + V5 + V8 + V21, data)
head(modelData)
##   (Intercept) V1A12 V1A13 V1A14 V2   V5 V8 V21
## 1           1     0     0     0  6 1169  4   1
## 2           1     1     0     0 48 5951  2   2
## 3           1     0     0     1 12 2096  2   1
## 4           1     0     0     0 42 7882  2   1
## 5           1     0     0     0 24 4870  3   2
## 6           1     0     0     1 36 9055  2   1
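
Before a matrix like this is handed to a modeling function, the constant "(Intercept)" column usually has to be dropped and a formula built from the new column names. The following is a minimal sketch on a small made-up data frame `df` that stands in for the credit data; `md`, `predictors` and `f` are illustrative names.

```r
# Hypothetical stand-in for the credit data: one 4-level factor plus numeric columns
set.seed(9)
df <- data.frame(
  V1  = factor(sample(c("A11", "A12", "A13", "A14"), 20, replace = TRUE),
               levels = c("A11", "A12", "A13", "A14")),
  V2  = sample(6:48, 20, replace = TRUE),
  V21 = sample(1:2, 20, replace = TRUE)
)

md <- model.matrix(~ V1 + V2 + V21, df)
md <- md[, colnames(md) != "(Intercept)"]   # the constant column is not a predictor

# Build the formula V21 ~ V1A12 + V1A13 + V1A14 + V2 from the column names
predictors <- setdiff(colnames(md), "V21")
f <- reformulate(predictors, response = "V21")
print(f)
is.numeric(md)   # every column is numeric now
```

A formula such as `f`, together with `md` (wrapped in `as.data.frame()` if the function expects a data frame), could then be passed to a modeling function that accepts only numeric input.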

# (2) nnet::class.ind()  #Generates Class Indicator Matrix from a Factor
library("nnet")
dummyV12 <- class.ind(data$V1)
head(cbind(dummyV12, data$V1))
#      A11 A12 A13 A14  
# [1,]   1   0   0   0 1
# [2,]   0   1   0   0 2
# [3,]   0   0   0   1 4
# [4,]   1   0   0   0 1
# [5,]   1   0   0   0 1
# [6,]   0   0   0   1 4
# Note: unlike model.matrix(), this produces four dummy variables, one per level. To avoid multicollinearity, keep only n-1 of them for a categorical variable with n levels.
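
Why one indicator column has to be dropped can be shown with base R alone. In this sketch a full set of indicators is produced with `model.matrix(~ f - 1)` (used here as a stand-in for the class indicator matrix), and the factor `f` is a small made-up example.

```r
f <- factor(c("A11", "A12", "A14", "A11", "A11", "A14", "A13"))

full <- model.matrix(~ f - 1)   # no intercept: one indicator column per level
all(rowSums(full) == 1)         # the indicators always add up to the intercept column

X_bad  <- cbind(Intercept = 1, full)         # intercept + n indicators
X_good <- cbind(Intercept = 1, full[, -1])   # intercept + (n - 1) indicators

c(columns = ncol(X_bad),  rank = qr(X_bad)$rank)    # rank is below the column count
c(columns = ncol(X_good), rank = qr(X_good)$rank)   # full column rank
```

A rank below the number of columns means the coefficients are not uniquely determined, which is the multicollinearity the note refers to.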

# (3) caret::dummyVars()  #Create A Full Set of Dummy Variables
library("caret")
# Dummy variable processing of V1 variable by dummyVars function
a <- dummyVars(~V1,data)
dummyV12 <- predict(a,data)
head(cbind(dummyV12, data$V1))
#   V1.A11 V1.A12 V1.A13 V1.A14  
# 1      1      0      0      0 1
# 2      0      1      0      0 2
# 3      0      0      0      1 4
# 4      1      0      0      0 1
# 5      1      0      0      0 1
# 6      0      0      0      1 4

# Dummy-code the entire data set (factor columns are expanded, numeric columns are passed through)
a <- dummyVars(~.,data)
b <- predict(a,data)
b[1:6,1:10]
#   V1.A11 V1.A12 V1.A13 V1.A14 V2 V3.A30 V3.A31 V3.A32 V3.A33 V3.A34
# 1      1      0      0      0  6      0      0      0      0      1
# 2      0      1      0      0 48      0      0      1      0      0
# 3      0      0      0      1 12      0      0      0      0      1
# 4      1      0      0      0 42      0      0      1      0      0
# 5      1      0      0      0 24      0      0      0      1      0
# 6      0      0      0      1 36      0      0      1      0      0


# (4) dummies::dummy()  #Flexible, efficient creation of dummy variables
library(dummies)
dummyV12 <- dummy(data$V1, sep = ".")
head(cbind(dummyV12,data$V1))
#      V1.A11 V1.A12 V1.A13 V1.A14  
# [1,]      1      0      0      0 1
# [2,]      0      1      0      0 2
# [3,]      0      0      0      1 4
# [4,]      1      0      0      0 1
# [5,]      1      0      0      0 1
# [6,]      0      0      0      1 4

2.3 A small linear regression example

As mentioned above, when R fits a model to data that contain categorical variables (factors), it generally converts them into dummy variables automatically. Here is a simple example.

library(tidyverse)
library(car)

# Load the data
data("Salaries", package = "carData")
str(Salaries)
# 'data.frame':	397 obs. of  6 variables:
#  $ rank         : Factor w/ 3 levels "AsstProf","AssocProf",..: 3 3 1 3 3 2 3 3 3 3 ...
#  $ discipline   : Factor w/ 2 levels "A","B": 2 2 2 2 2 2 2 2 2 2 ...
#  $ yrs.since.phd: int  19 20 4 45 40 6 30 45 21 18 ...
#  $ yrs.service  : int  18 16 3 39 41 6 23 45 20 18 ...
#  $ sex          : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 1 ...
#  $ salary       : int  139750 173200 79750 115000 141500 97000 175000 147765 119250 129000 ...
sample_n(Salaries, 3)  # Inspect the data

# By default, R uses the first level of a factor as the reference group
# Binary variable
contrasts(Salaries$sex)
#           Male        
# Female      0
# Male        1
model <- lm(salary ~ sex, data = Salaries)
summary(model)$coef
#             Estimate Std. Error   t value     Pr(>|t|)
# (Intercept) 101002.41   4809.386 21.001103 2.683482e-66
# sexMale      14088.01   5064.579  2.781674 5.667107e-03

# Modify reference group
Salaries <- Salaries %>% mutate(sex = relevel(sex, ref = "Male"))
contrasts(Salaries$sex)
#        Female
# Male        0
# Female      1

model <- lm(salary ~ sex, data = Salaries)
summary(model)$coef
#              Estimate Std. Error   t value      Pr(>|t|)
# (Intercept) 115090.42   1587.378 72.503463 2.459122e-230
# sexFemale   -14088.01   5064.579 -2.781674  5.667107e-03

# Multi-category variable
res <- model.matrix(~rank, data = Salaries)
head(cbind(res, Salaries$rank))
#   (Intercept) rankAssocProf rankProf  
# 1           1             0        1 3
# 2           1             0        1 3
# 3           1             0        0 1
# 4           1             0        1 3
# 5           1             0        1 3
# 6           1             1        0 2

model2 <- lm(salary ~ yrs.service + res[,-1] + discipline + sex, data = Salaries)
summary(model2)

model3 <- lm(salary ~ yrs.service + rank + discipline + sex, data = Salaries)
summary(model3)
Anova(model3)
head(model.matrix(model3))  # Check the model's design matrix. The rank columns match the dummy variables shown by head(cbind(res, Salaries$rank))
#   (Intercept) yrs.service rankAssocProf rankProf disciplineB sexFemale
# 1           1          18             0        1           1         0
# 2           1          16             0        1           1         0
# 3           1           3             0        0           1         0
# 4           1          39             0        1           1         0
# 5           1          41             0        1           1         0
# 6           1           6             1        0           1         0
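
That manually built dummy variables and automatic factor handling give the same fit can be checked on any data set. The sketch below uses the built-in `iris` data in place of `Salaries` so that it runs without extra packages; the choice of variables is arbitrary.

```r
# iris ships with R; Species is a 3-level factor
res_iris <- model.matrix(~ Species, data = iris)

fit_manual <- lm(Sepal.Length ~ Petal.Width + res_iris[, -1], data = iris)   # hand-made dummies
fit_auto   <- lm(Sepal.Length ~ Petal.Width + Species, data = iris)          # factor used directly

# Same estimates; only the coefficient names differ
all.equal(unname(coef(fit_manual)), unname(coef(fit_auto)))
```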


Posted by ilikemath2002 on Sat, 16 Apr 2022 00:40:32 +0930