Links to the other notes in this series:
Study note 1 - R language foundations.
Study note 2 - Advanced data structures.
Study note 3 - Reading data into R.
Study note 4 - Statistical charts.
Study note 5 - Writing R functions and simple control and loop statements.
Study note 6 - Group operations.
Study note 7 - Efficient group operations: dplyr.
Study note 8 - Data iteration.
Study note 9 - Data collation.
Study note 10 - Data restructuring: Tidyverse.
Study note 11 - String operations.
Study note 12 - Probability distributions
R is a statistical programming language, so it handles statistical problems with ease. Probability distributions are at the core of statistics, and R has built-in functions for the common ones.
12.1 Normal distribution
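The normal distribution, also known as the Gaussian distribution, has the probability density function
$$f(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
where μ is the mean and σ is the standard deviation.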
The rnorm function draws random numbers from a normal distribution; the mean and standard deviation can be set in the same call.
Example:
> rnorm(n = 10, mean = 100, sd = 20)
 [1] 110.32908 107.67034 110.30984  85.56377  88.75340
 [6]  86.68074  94.36524 118.92754 107.87610 132.73329
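Because rnorm is random, each call returns different numbers. The following is a minimal sketch of how to make a draw reproducible and check it against the requested parameters; the seed value 42 and the variable name draws are arbitrary choices made for illustration.

```r
# Fix the random seed so the same numbers come back on every run
# (42 is an arbitrary seed chosen for this illustration)
set.seed(42)
draws <- rnorm(n = 10000, mean = 100, sd = 20)

# With a large sample, the sample statistics land close to the parameters
mean(draws)  # close to 100
sd(draws)    # close to 20

# Resetting the seed reproduces exactly the same sample
set.seed(42)
identical(draws, rnorm(n = 10000, mean = 100, sd = 20))
```

Without set.seed, the printed numbers will differ from run to run, which is why the outputs in these notes cannot be reproduced exactly.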
dnorm computes the probability density of the normal distribution.
It returns the height of the density curve at a given value. Because the distribution is continuous, this is a density and not the probability of observing that exact value.
Example:
> randNorm10 <- rnorm(10)
> randNorm10
 [1] -0.007795891  0.915999325  0.513491417 -0.065681391
 [5]  1.166448550  0.916629337 -0.745279887  0.517869510
 [9]  0.337601316 -0.625649967
> dnorm(randNorm10)
 [1] 0.3989302 0.2622477 0.3496666 0.3980827 0.2020501
 [6] 0.2620963 0.3022020 0.3488780 0.3768433 0.3280276
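To see what dnorm computes, the normal density can be written out by hand and compared with the built-in function. This is only a sketch: normal_density is a hypothetical helper written for this note, not a function that ships with R.

```r
# Normal density written out from its definition
# (normal_density is a hypothetical helper, not part of base R)
normal_density <- function(x, mean = 0, sd = 1) {
  1 / (sd * sqrt(2 * pi)) * exp(-(x - mean)^2 / (2 * sd^2))
}

x <- c(-1, 0, 0.5, 2)
by_hand  <- normal_density(x)
built_in <- dnorm(x)

# The two agree up to floating-point error
all.equal(by_hand, built_in)

# A density is not a probability: with a small sd it exceeds 1
dnorm(0, mean = 0, sd = 0.1)
```

The last call returns a value of about 3.99, which could not be a probability. Probabilities come from areas under the density curve.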
Next, we draw a large sample, compute the density at each value, and plot the result
> randNorm <- rnorm(30000)
> randDensity <- dnorm(randNorm)
> library(ggplot2)
> ggplot(data.frame(x=randNorm, y=randDensity)) + aes(x=x, y=y) +
+     geom_point() + labs(x="Random Normal Variables", y="Density")
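The pnorm function computes the cumulative distribution function of the normal distribution, that is, the cumulative probability, which is the area under the density curve up to a given value.
Example (the output comes from a different draw of randNorm10 than the one printed earlier; the same values appear again in the qnorm example):
> pnorm(randNorm10)
 [1] 0.002573215 0.462999528 0.944438605 0.997887002
 [5] 0.497449393 0.255477128 0.117798244 0.136126932
 [9] 0.903424405 0.989672991
By default, pnorm returns the left-tail probability P(X ≤ x). Pass lower.tail = FALSE to get the right tail.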
The probability of falling between two values is the difference between the two cumulative probabilities
Example:
> pnorm(1) - pnorm(0)
[1] 0.3413447
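The same subtraction gives the probability of falling within k standard deviations of the mean. A short sketch follows; within_sd is a hypothetical helper name introduced here for illustration.

```r
# Probability that a standard normal value lies within k standard deviations of 0
# (within_sd is a hypothetical helper written for this note)
within_sd <- function(k) pnorm(k) - pnorm(-k)

round(within_sd(1:3), 4)  # 0.6827 0.9545 0.9973

# Right-tail probability, computed two equivalent ways
pnorm(1, lower.tail = FALSE)
1 - pnorm(1)

# The same idea for a non-standard normal: P(80 < X < 120) with mean 100, sd 20
pnorm(120, mean = 100, sd = 20) - pnorm(80, mean = 100, sd = 20)
```

The last line returns the same value as within_sd(1), because 80 and 120 are exactly one standard deviation either side of the mean.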
Plotting the area under the curve:
First, draw the normal density curve as the base layer
> p <- ggplot(data.frame(x=randNorm, y=randDensity)) + aes(x=x, y=y) +
+     geom_line() + labs(x="x", y="Density")
Then, the seq function generates a sequence of numbers in steps of 0.1, from the minimum of the random sample randNorm up to -1. (seq generates a sequence running from its from argument to its to argument, stepping by its by argument.)
> neg1Seq <- seq(from=min(randNorm), to=-1, by=.1)
After that, build a data frame in which x is neg1Seq and y is the density at each of those points, dnorm(neg1Seq)
> lessThanNeg1 <- data.frame(x=neg1Seq, y=dnorm(neg1Seq))
> head(lessThanNeg1)
          x            y
1 -3.711379 0.0004072403
2 -3.611379 0.0005873033
3 -3.511379 0.0008385543
4 -3.411379 0.0011853783
5 -3.311379 0.0016589750
6 -3.211379 0.0022986865
Next, add a row at each end with y = 0, so that the polygon starts and ends on the x-axis and encloses the area under the curve
> lessThanNeg1 <- rbind(c(min(randNorm), 0),
+                       lessThanNeg1,
+                       c(max(lessThanNeg1$x), 0))
Then use geom_polygon to fill the region
> p + geom_polygon(data=lessThanNeg1, aes(x=x, y=y))
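The shaded region to the left of -1 should have an area equal to pnorm(-1). A minimal sketch to check this numerically, assuming base R's integrate function for the numerical integration (the variable name area is arbitrary):

```r
# Numerically integrate the standard normal density from -Inf to -1
area <- integrate(dnorm, lower = -Inf, upper = -1)

area$value   # area under the density curve to the left of -1
pnorm(-1)    # the same quantity from the cumulative distribution function

# The two agree up to the integration error
abs(area$value - pnorm(-1)) < 1e-4
```

This is the link between the two functions: pnorm is the integral of dnorm.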
This shades a one-sided (left-tail) probability. In the same way, we can shade the probability of falling between two values
> neg1Pos1Seq <- seq(from=-1, to=1, by=.1)
> neg1To1 <- data.frame(x=neg1Pos1Seq, y=dnorm(neg1Pos1Seq))
> head(neg1To1)
     x         y
1 -1.0 0.2419707
2 -0.9 0.2660852
3 -0.8 0.2896916
4 -0.7 0.3122539
5 -0.6 0.3332246
6 -0.5 0.3520653
> neg1To1 <- rbind(c(min(neg1To1$x), 0),
+                  neg1To1,
+                  c(max(neg1To1$x), 0))
> p + geom_polygon(data=neg1To1, aes(x=x, y=y))
We can also reuse the earlier random data to plot the cumulative distribution function of the standard normal distribution
> randProb <- pnorm(randNorm)
> ggplot(data.frame(x=randNorm, y=randProb)) + aes(x=x, y=y) +
+     geom_point() + labs(x="Random Normal Variables", y="Probability")
The inverse of pnorm is qnorm: it takes a cumulative probability and returns the corresponding quantile
Example:
> randNorm10
 [1] -2.797722545 -0.092879796  1.593166745  2.860780500
 [5] -0.006393468 -0.657352548 -1.186065361 -1.097886934
 [9]  1.301313435  2.314249912
> qnorm(pnorm(randNorm10))
 [1] -2.797722545 -0.092879796  1.593166745  2.860780500
 [5] -0.006393468 -0.657352548 -1.186065361 -1.097886934
 [9]  1.301313435  2.314249912
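A common use of qnorm is to look up cut-off values for a given probability. The sketch below uses probabilities chosen only for illustration.

```r
# The median of the standard normal distribution
qnorm(0.5)                             # 0

# Cut-offs that leave 2.5% in each tail (about -1.96 and 1.96)
qnorm(c(0.025, 0.975))

# Quantiles for a non-standard normal: the 97.5% point for mean 100, sd 20
qnorm(0.975, mean = 100, sd = 20)      # equals 100 + 20 * qnorm(0.975)

# Round trip: pnorm undoes qnorm
p_values <- c(0.1, 0.5, 0.9)
all.equal(pnorm(qnorm(p_values)), p_values)
```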
12.2 Binomial distribution
The probability mass function of the binomial distribution is
$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, \ldots, n$$
where n is the number of trials and p is the probability of success in each trial. The mean is np and the variance is np(1-p).
The rbinom function generates random numbers from a binomial distribution
Example:
Each run consists of ten trials, each with a success probability of 0.4. The run is repeated ten times, and the number of successes in each run is returned
> rbinom(n=10, size=10, prob = .4)
 [1] 5 2 4 4 5 4 3 5 4 5
Here size is the number of trials per run, prob is the probability of success, and n is the number of runs
Next, we visualize the binomial distribution
Example:
Simulate 10000 runs, each with 10 trials and a success probability of 0.3
> binomData <- data.frame(Success=rbinom(n=10000, size=10, prob=.3))
> ggplot(binomData, aes(x=Success)) + geom_histogram(binwidth = 1)
The histogram shows that 3 successes is the most frequent outcome, which matches the mean np = 10 × 0.3 = 3
As with the normal distribution, dbinom and pbinom return the probability mass (the exact probability of a given number of successes) and the cumulative probability of the binomial distribution, respectively
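A minimal sketch of dbinom and pbinom, using the same 10 trials and success probability 0.3 as the histogram example. The variable names are arbitrary, and the by-hand calculation is included only to show what the built-in functions compute.

```r
n <- 10    # trials per run
p <- 0.3   # success probability
k <- 0:n   # every possible number of successes

# Binomial probabilities written out by hand, compared with dbinom
by_hand <- choose(n, k) * p^k * (1 - p)^(n - k)
all.equal(by_hand, dbinom(k, size = n, prob = p))

# pbinom is the running total of dbinom
all.equal(cumsum(dbinom(k, size = n, prob = p)), pbinom(k, size = n, prob = p))

dbinom(3, size = n, prob = p)   # P(X = 3), about 0.267
pbinom(3, size = n, prob = p)   # P(X <= 3)

# The most likely count and the mean both equal 3 here
most_likely <- k[which.max(dbinom(k, size = n, prob = p))]
mean_value  <- sum(k * dbinom(k, size = n, prob = p))
```

Unlike dnorm, the values returned by dbinom are true probabilities, because the binomial distribution is discrete.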
12.3 Poisson distribution
The probability mass function of the Poisson distribution is
$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \ldots$$
where λ is both the mean and the variance.
The functions that generate random numbers, densities, cumulative probabilities and quantiles are rpois, dpois, ppois and qpois, respectively.
As λ increases, the Poisson distribution comes to resemble the normal distribution.
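The four Poisson functions can be tried out in a few lines. This is a sketch under arbitrary choices: λ = 5, a seed of 1, and a sample size of 100000 were all picked only for illustration.

```r
lambda <- 5
k <- 0:10

# Poisson probabilities written out by hand, compared with dpois
by_hand <- lambda^k * exp(-lambda) / factorial(k)
all.equal(by_hand, dpois(k, lambda = lambda))

# ppois is the running total of dpois: P(X <= 3)
all.equal(ppois(3, lambda = lambda), sum(dpois(0:3, lambda = lambda)))

# qpois returns the smallest count whose cumulative probability reaches 0.5
qpois(0.5, lambda = lambda)

# In a large simulated sample, the mean and the variance are both close to lambda
set.seed(1)  # arbitrary seed for reproducibility
sample_pois <- rpois(n = 100000, lambda = lambda)
mean(sample_pois)
var(sample_pois)
```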
Example:
Simulate 10000 samples from Poisson distributions with λ = 1, 2, 5, 10 and 20, and draw a histogram for each
> pois1 <- rpois(n=10000, lambda = 1)
> pois2 <- rpois(n=10000, lambda = 2)
> pois5 <- rpois(n=10000, lambda = 5)
> pois10 <- rpois(n=10000, lambda = 10)
> pois20 <- rpois(n=10000, lambda = 20)
> pois <- data.frame(Lambda.1=pois1, Lambda.2=pois2, Lambda.5=pois5,
+                    Lambda.10=pois10, Lambda.20=pois20)
> library(reshape2)
> pois <- melt(data=pois, variable.name = "Lambda", value.name = "x")
> library(stringr)
> pois$Lambda <- as.factor(as.numeric(str_extract(string=pois$Lambda, pattern = "\\d+")))
> head(pois)
  Lambda x
1      1 0
2      1 0
3      1 0
4      1 2
5      1 2
6      1 1
> library(ggplot2)
> ggplot(pois, aes(x=x)) + geom_histogram(binwidth = 1) +
+     facet_wrap(~ Lambda) + ggtitle("Probability Mass Function")
The histograms show that the distribution moves closer to a normal distribution as λ grows
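The trend in the histograms can also be checked without simulation. The sketch below compares the Poisson probabilities with a normal density that has the same mean and variance (mean λ, standard deviation √λ). The helper max_gap and the range of counts it scans are assumptions made for this illustration.

```r
# Largest gap between the Poisson pmf and the matching normal density
# (max_gap is a hypothetical helper written for this illustration)
max_gap <- function(lambda) {
  # Scan counts from 0 to well beyond the mean, where almost all the mass lies
  k <- 0:(lambda + 10 * ceiling(sqrt(lambda)))
  max(abs(dpois(k, lambda = lambda) - dnorm(k, mean = lambda, sd = sqrt(lambda))))
}

lambdas <- c(1, 2, 5, 10, 20)
gaps <- sapply(lambdas, max_gap)

# The gap shrinks as lambda grows
data.frame(lambda = lambdas, max_gap = gaps)
```

This compares point values only, so it is a rough indication of the approximation quality and not a formal measure of distance between the two distributions.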
We can also look at a smoothed density estimate of the same samples
> ggplot(pois, aes(x=x)) +
+     geom_density(aes(group=Lambda, color=Lambda,
+                      fill=Lambda),
+                  adjust=4, alpha=1/2) +
+     scale_color_discrete() + scale_fill_discrete() +
+     ggtitle("Probability Mass Function")