R language learning 3: data frame processing

This is a new series in which I will learn the R language together with you. Since I know little about R, the series is written from a learner's perspective.

References: Robert I. Kabacoff, R in Action (second edition); John Cook's excellent blog post, which is mentioned in the book; and, for code conventions, Google's R Style Guide.

Part 1: variable management

Section 1: create variable -- transform()

To create a new variable in a data frame, you can use the data frame's $ operator. If the variable named in df$variable does not already exist in the data frame, a new variable (column) is created.

mydata <- data.frame(
  x1 = c(2, 2, 6, 4),
  x2 = c(3, 4, 2, 8)
)
mydata$sumx <- mydata$x1 + mydata$x2
mydata$meanx <- mydata$sumx / 2

Using the transform() function, you can more easily create new variables. Its usage is:

mydata <- data.frame(
  x1 = c(2, 2, 6, 4),
  x2 = c(3, 4, 2, 8)
)
mydata <- transform(mydata, 
                    sumx = x1 + x2,
                    meanx = (x1 + x2)/2)

Note that transform() does not change the data frame in place. It returns a modified copy, which is why the result is assigned back to mydata above.
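
As a quick check that transform() leaves its input untouched, here is a minimal sketch. The names df and df2 are made up for this illustration and do not come from the book.

```r
# Illustrative data frame (names are made up for this sketch)
df <- data.frame(x1 = c(2, 2, 6, 4), x2 = c(3, 4, 2, 8))

# transform() returns a new data frame; df itself is not modified
df2 <- transform(df, sumx = x1 + x2, meanx = (x1 + x2) / 2)

names(df)   # "x1" "x2"
names(df2)  # "x1" "x2" "sumx" "meanx"
```

If the result is not assigned to anything, the new variables are simply lost.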

Section 2: recoding of variables -- within()

Recoding a variable means creating new values based on its existing values, for example grouping a continuous variable into categories or correcting wrong values.

In order to facilitate the demonstration of the following contents, use the original data frame given in the book:

manager <- c(1, 2, 3, 4, 5)
date <- c("10/24/14", "10/28/14", "10/01/14", "10/12/14", "05/01/14")
country <- c("US", "US", "UK", "UK", "UK")
gender <- c("M", "F", "F", "M", "F")
age <- c(32, 45, 25, 39, 99)
q1 <- c(5, 3, 3, 3, 2)
q2 <- c(4, 5, 5, 3, 2)
q3 <- c(5, 2, 5, 4, 1)
q4 <- c(5, 5, 5, NA, 2)
q5 <- c(5, 5, 2, NA, 1)
leadership <- data.frame(manager, date, country, gender, age,
                         q1, q2, q3, q4, q5, stringsAsFactors = FALSE)

A conditional assignment statement assigns a value only to the positions of a vector where the condition is TRUE. Its usage is

variable[condition] <- expression  # condition is a logical vector
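
To make the pattern concrete, here is a small sketch on a plain vector. The vector score and the use of 99 as an "unknown" code are assumptions made only for this example.

```r
# Hypothetical scores; 99 is used here as a code for "unknown"
score <- c(5, 3, 99, 4, 99)

# score == 99 is a logical vector; only its TRUE positions are assigned
score[score == 99] <- NA

score              # 5 3 NA 4 NA
sum(is.na(score))  # 2
```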

Combined with the within() function, this makes it easy to recode variables in a data frame. The syntax of within() is similar to that of with(), but with() only lets you refer to the variables of the data frame by name, whereas within() also lets you modify the data frame. For example, to create the recoded variable agecat in the leadership data frame:

leadership$age[leadership$age == 99] <- NA # Encode outliers as NA

leadership <- within(leadership,{
  agecat <- NA
  agecat[age > 75] <- "Elder"
  agecat[age <= 75 & age >= 55] <- "Middle Aged"
  agecat[age < 55] <- "Young"
})

Section 3: variable renaming - names()

You can rename variables interactively with the fix() function, or programmatically with the names() function. For example:

names(leadership)[2] <- "testDate"
names(leadership)[6:10] <- c("item1", "item2", "item3", "item4", "item5")

Alternatively, use the rename() function in the plyr package to change a variable name.

library(plyr)  # provides rename()
leadership <- rename(leadership, c(manager="managerID"))
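
If plyr is not installed, the same rename can be done in base R by matching on the old name instead of its position. This is only a sketch on a made-up data frame df, not the book's method.

```r
# Illustrative data frame (not the one from the book)
df <- data.frame(manager = 1:3, q1 = c(5, 3, 3))

# Rename by matching the old name, so the column position does not matter
names(df)[names(df) == "manager"] <- "managerID"

names(df)  # "managerID" "q1"
```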

Part 2: value processing

Section 1: missing value

A missing value is represented by the symbol NA, which stands for Not Available. NA is not comparable: the test x == NA never returns TRUE, so you must use is.na(x) instead. In R, missing values are distinct from Inf (positive infinity), -Inf (negative infinity), and NaN (Not a Number).
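
The differences between these symbols are easiest to see by trying them. The sketch below uses only base R functions; the vector x is made up for the example.

```r
x <- c(1, NA, 3)

x == NA             # NA NA NA: comparing with NA never yields TRUE
is.na(x)            # FALSE TRUE FALSE

is.nan(0 / 0)       # TRUE: 0/0 is NaN
is.infinite(1 / 0)  # TRUE: 1/0 is Inf
is.na(0 / 0)        # TRUE: is.na() also flags NaN
is.na(1 / 0)        # FALSE: Inf is not a missing value
```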

Note that arithmetic expressions and functions that involve missing values also return missing values. Functions that summarize several values generally have an na.rm parameter. To exclude missing values from the calculation, specify na.rm=TRUE (abbreviated to na.rm=T in the example below).

> x <- c(1, 2, NA, 3)
> sum(x)
[1] NA
> sum(x, na.rm=T)
[1] 6

For a data frame, if the missing values are concentrated in a small number of observations, row deletion can be used. Specifically, the na.omit() function deletes every observation that contains a missing value.

newdata <- na.omit(leadership)  # Returns a copy; the original data frame is not affected
newdata
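
To see which rows na.omit() will drop before dropping them, complete.cases() can be used. A minimal sketch on a made-up data frame df:

```r
# Illustrative data frame with missing values in rows 2 and 3
df <- data.frame(id = 1:4, q4 = c(5, NA, 5, 2), q5 = c(5, 5, NA, 1))

complete.cases(df)    # TRUE FALSE FALSE TRUE
clean <- na.omit(df)  # keeps only rows 1 and 4

nrow(df)     # 4: the original is unchanged
nrow(clean)  # 2
```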

Section 2: date value - as.Date()

To get the current date, use the Sys.Date() function. To get the current date and time, use the date() function.

> date()
[1] "Fri Feb 19 13:56:35 2021"
> Sys.Date()
[1] "2021-02-19"

In R, dates are often entered as strings, so you need the as.Date() function to convert them into dates stored in numeric form. Its syntax is

as.Date(x, "input_format")

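Here, input_format is the format in which the date is read. The default is yyyy-mm-dd (year, month, and day). Commonly used format symbols include %d (day as a number), %m (month as a number), %y (two-digit year), %Y (four-digit year), %b and %B (abbreviated and full month name), and %a and %A (abbreviated and full weekday name).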

In this example the dates are in mm/dd/yy format, so the following code should be used:

myformat <- "%m/%d/%y"  # The format is stored as a string
leadership$date <- as.Date(leadership$date, myformat)
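
As a small sketch of how the format string changes what as.Date() accepts, the three calls below all parse the same day. The date strings are made up for the example, and only numeric format symbols are used so that the result does not depend on the locale.

```r
# The default format (yyyy-mm-dd) needs no format argument
d1 <- as.Date("2014-10-24")

# Other layouts need an explicit format string
d2 <- as.Date("10/24/14", "%m/%d/%y")    # two-digit year
d3 <- as.Date("24.10.2014", "%d.%m.%Y")  # four-digit year

d1 == d2 && d2 == d3  # TRUE
```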

To change the format of a date for output, use the format() function, whose usage is

format(x, format="output_format")

For example:

> m <- Sys.Date()
> format(m, format="%Y/%m/%d")
[1] "2021/02/19"

To calculate the difference between two dates, use the difftime() function, whose usage is

difftime(date1, date2, units)

Here, units is the time unit, which can be "auto", "secs", "mins", "hours", "days", or "weeks".

with(leadership, {
  difftime(date[2], date[1], units = "days")
})

Time difference of 4 days
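
The result of difftime() is a difftime object rather than a plain number. If you need the number itself, as.numeric() can extract it. A minimal sketch, with two dates made up to match the example above:

```r
d1 <- as.Date("2014-10-24")
d2 <- as.Date("2014-10-28")

gap <- difftime(d2, d1, units = "days")

as.numeric(gap)                                # 4
as.numeric(difftime(d2, d1, units = "hours"))  # 96
as.numeric(d2 - d1)                            # 4: subtracting two Dates also works
```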

Finally, to convert a date to a string, use the as.character() function.
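
A short sketch of the conversion, using a made-up date d. format() is shown alongside because it also returns a string, but in a layout of your choice.

```r
d <- as.Date("2014-10-24")

s <- as.character(d)
s                      # "2014-10-24"
class(s)               # "character"

format(d, "%m/%d/%y")  # "10/24/14"
```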

Section 3: type conversion

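To test whether a value is of a given type, use the is.datatype() functions; to convert a value to a given type, use the as.datatype() functions. Here datatype stands for a type name such as numeric, character, or logical, giving for example is.numeric() and as.character().

The is.datatype() functions return TRUE or FALSE, so they can be used for flow control.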
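
Here is a minimal sketch of testing and converting types, including an is.*() test used inside an if statement. The variables a and b are made up for the example.

```r
a <- c(1, 2, 3)

is.numeric(a)    # TRUE
is.character(a)  # FALSE

b <- as.character(a)
is.character(b)  # TRUE
b                # "1" "2" "3"

# The TRUE/FALSE result of is.*() can drive control flow
if (is.character(b)) b <- as.numeric(b)
is.numeric(b)    # TRUE
```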

Part 3: data frame processing

Section 1: sorting

Use the order() function to sort vectors. order() does not return the sorted values; it returns the indices that would put the vector in order. It accepts several vectors of equal length as successive sort keys.

> x <- c("a", "e", "d", "d", "c", "b")
> y <- c(6, 5, 4, 3, 2, 1)

> order(x)
[1] 1 6 5 3 4 2
> order(x, y)  # y as the second ranking criterion
[1] 1 6 5 4 3 2
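
Since order() returns positions rather than values, the sorted vector is obtained by indexing with its result. A small sketch using the same x as above:

```r
x <- c("a", "e", "d", "d", "c", "b")

idx <- order(x)
idx     # 1 6 5 3 4 2
x[idx]  # "a" "b" "c" "d" "d" "e"

# Indexing with order() gives the same result as sort()
identical(x[idx], sort(x))  # TRUE
```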

For data frames, order() can likewise be used to sort by several conditions. This works because indexing a data frame with the index vector returned by order() rearranges its rows.

# Sort by gender (ascending), then by age (descending); keep the first five columns
newdata2 <- leadership[with(leadership, order(gender, -age)), 1:5]

> newdata2

  managerID       date country gender age
2         2 2014-10-28      US      F  45
3         3 2014-10-01      UK      F  25
5         5 2014-05-01      UK      F  NA
4         4 2014-10-12      UK      M  39
1         1 2014-10-24      US      M  32

Note that if you want to keep all the columns of the data frame, the comma in [condition, ] cannot be omitted.
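
The same pattern on a small self-contained data frame may be easier to follow. This is only a sketch; df and its contents are made up for the example.

```r
df <- data.frame(
  name   = c("a", "b", "c", "d"),
  gender = c("M", "F", "F", "M"),
  age    = c(32, 45, 25, 39),
  stringsAsFactors = FALSE
)

# gender ascending, then age descending within each gender;
# the trailing comma keeps all columns
sorted <- df[order(df$gender, -df$age), ]

sorted$name  # "b" "c" "d" "a"
sorted$age   # 45 25 39 32
```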

Section 2: data frame merging

Horizontal merge: this is an inner join of two data frames on one or more common variables, performed with the merge() function. Its usage is

total <- merge(dataframeA, dataframeB, by="ID")  # by specifies the common key variable(s)
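
To see what the inner join keeps, here is a minimal sketch with two made-up data frames, staff and scores. It also shows the all.x argument of merge(), which keeps unmatched rows of the first data frame.

```r
# Illustrative data frames sharing the key column ID
staff  <- data.frame(ID = c(1, 2, 3), name = c("Ann", "Bob", "Cy"),
                     stringsAsFactors = FALSE)
scores <- data.frame(ID = c(3, 1, 4), score = c(90, 75, 60))

# Inner join: only IDs present in both data frames (1 and 3) are kept
total <- merge(staff, scores, by = "ID")
total$ID     # 1 3
total$score  # 75 90

# all.x = TRUE keeps every row of staff; unmatched scores become NA
left <- merge(staff, scores, by = "ID", all.x = TRUE)
nrow(left)   # 3
```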

If there is no key to match on and you simply want to place two objects side by side, you can use the cbind() function, which requires the objects to have the same number of rows, sorted in the same order.

Vertical merge: this adds observations to a data frame. You can use the rbind() function, which requires the two data frames to have the same variables (though they need not be in the same order).

total <- rbind(dataframeA, dataframeB)

If the two data frames have different variables, preprocessing is required: either delete the extra variables from dataframeA, or create the additional variables in dataframeB and set their values to NA.
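
Both preprocessing options can be sketched as follows. The data frames dfA and dfB are made up for the example, with dfA holding one extra variable, q2.

```r
dfA <- data.frame(id = 1:2, q1 = c(5, 3), q2 = c(4, 5))
dfB <- data.frame(id = 3:4, q1 = c(3, 2))

# Option 1: drop the extra variable from dfA before stacking
narrow <- rbind(dfA[names(dfB)], dfB)

# Option 2: add the missing variable to dfB, filled with NA
dfB$q2 <- NA
wide <- rbind(dfA, dfB)

dim(narrow)  # 4 2
dim(wide)    # 4 3
wide$q2      # 4 5 NA NA
```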

Section 3: data frame subset

Select variables: use dataframe[, colindex], where colindex is the index of the variables to keep. In fact, since all observations are kept, you can drop the comma and write dataframe[colindex] directly. Column indexes can also be variable names, such as

myvars <- c("q1", "q2", "q3", "q4", "q5")
newdata <- leadership[myvars]

> newdata

  q1 q2 q3 q4 q5
1  5  4  5  5  5
2  3  5  2  5  5
3  3  5  5  5  2
4  3  3  4 NA NA
5  2  2  1  2  1

Delete variables: the following statements delete the variables q4 and q5.

myvars <- names(leadership) %in% c("q4", "q5")
newdata <- leadership[!myvars]

Here, myvars is a logical vector that is TRUE for q4 and q5 and FALSE everywhere else. Negating it therefore gives TRUE (keep) for every variable except q4 and q5, which become FALSE (drop).
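
Printing the intermediate logical vector makes this easier to follow. A minimal sketch on a made-up data frame df, where drop is a hypothetical name for the logical vector:

```r
df <- data.frame(id = 1:2, q3 = c(5, 2), q4 = c(5, NA), q5 = c(5, 1))

drop <- names(df) %in% c("q4", "q5")
drop    # FALSE FALSE TRUE TRUE
!drop   # TRUE TRUE FALSE FALSE

names(df[!drop])  # "id" "q3"
```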

In addition, if you know that q4 and q5 are the 9th and 10th variables, you can simply delete them by position:

newdata <- leadership[c(-9, -10)]

Select observations: logical vectors can also be used to select observations. For example, to select the observations dated between January 1, 2009 and October 20, 2014:

startdate <- as.Date("2009-01-01")
enddate <- as.Date("2014-10-20")
newdata <- leadership[leadership$date >= startdate & leadership$date <= enddate, ]  # The comma must be kept

The simplest way to select a subset of a data frame is the subset() function, which can select variables and observations in a single call. Its format is as follows:

subset(x, subset, select)

Here, x is the data frame, subset is a logical expression indicating which observations to keep, and select specifies which variables to keep.

newdata <- subset(
    leadership, gender == "M" & age > 25,
    select = gender:q4
)

> newdata
  gender age q1 q2 q3 q4
1      M  32  5  4  5  5
4      M  39  3  3  4 NA

As you can see, the variables to keep can be written directly as from:to, where from and to can be variable names rather than numeric indexes.

To randomly select observations, you can use the sample() function, whose usage is

sample(x, size, replace=FALSE)

Here, x is the vector of elements to sample from, size is the number of elements to draw, and replace indicates whether sampling is done with replacement. To sample rows from a data set, it can be used as follows:

newsample <- leadership[sample(1:nrow(leadership), 3, replace=F), ]
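
Because the draw is random, the selected rows differ from run to run. If you need a repeatable sample, set.seed() can be called first. A minimal sketch on a made-up data frame df (the seed value 42 is arbitrary):

```r
df <- data.frame(id = 1:10, value = (1:10)^2)

set.seed(42)  # fixes the random number stream so the draw is repeatable
rows <- sample(seq_len(nrow(df)), 3, replace = FALSE)
picked <- df[rows, ]

nrow(picked)              # 3
anyDuplicated(rows) == 0  # TRUE: without replacement, no row is drawn twice
```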

Section 4: using SQL query

For a large data set, you can query it with SQL statements by using the sqldf() function from the sqldf package.

library(sqldf)
# sqldf() looks up the mtcars data frame by name, so attach() is not needed
search <- "SELECT * FROM mtcars WHERE carb=1 ORDER BY mpg"
newdf <- sqldf(search, row.names=TRUE)

sqldf() also has many optional parameters, such as stringsAsFactors and row.names.

Posted by Spreegem on Mon, 18 Apr 2022 01:25:59 +0930