R language beginner learning notes 12 - probability distribution

Note link

Learning notes 1-R language foundation.
Study note 2 - Advanced Data Structures.
Study note 3 - reading data in R language.
Study note 4 - statistical chart.
Learning note 5 - write R language functions and simple control loop statements.
Learning Notes 6 - group operation.
Learning notes 7 - efficient grouping operation: dplyr.
Learning note 8 - Data iteration.
Study notes 9 - data collation.
Learning note 10 - data reconstruction: Tidyverse.
Learning note 11 - string operation.

Study note 12 - probability distribution

R is a statistical programming language, so it handles statistical problems with ease, and probability distributions sit at the core of statistics.

12.1 normal distribution

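The normal distribution, also known as the Gaussian distribution, is defined as:

$$f(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

where μ is the mean and σ is the standard deviation.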

We use the rnorm function to draw random numbers from a normal distribution, setting the mean and standard deviation at the same time

Example:

> rnorm(n = 10, mean = 100, sd = 20)
 [1] 110.32908 107.67034 110.30984  85.56377  88.75340
 [6]  86.68074  94.36524 118.92754 107.87610 132.73329
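Because rnorm draws random numbers, the output above will differ on every run. The following is a minimal sketch of how the draws could be made reproducible and sanity-checked; the seed value 42 and the variable names are arbitrary choices for illustration, not part of the original notes.

```r
# Fix the seed so the same draws come back on every run (42 is arbitrary).
set.seed(42)
draws <- rnorm(n = 10000, mean = 100, sd = 20)

# With many draws, the sample statistics should sit close to the parameters.
sample_mean <- mean(draws)
sample_sd <- sd(draws)
print(c(mean = sample_mean, sd = sample_sd))

# Re-seeding and drawing again reproduces exactly the same vector.
set.seed(42)
same_again <- identical(draws, rnorm(n = 10000, mean = 100, sd = 20))
print(same_again)
```

The sample mean and standard deviation will not equal 100 and 20 exactly, but they should be close for a sample of this size.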

dnorm calculates the probability density of the normal distribution

It returns the height of the density curve at a given value. For a continuous distribution this is a density, not the probability of that exact value

Example:

> randNorm10 <- rnorm(10)
> randNorm10 
 [1] -0.007795891  0.915999325  0.513491417 -0.065681391
 [5]  1.166448550  0.916629337 -0.745279887  0.517869510
 [9]  0.337601316 -0.625649967
> dnorm(randNorm10)
 [1] 0.3989302 0.2622477 0.3496666 0.3980827 0.2020501
 [6] 0.2620963 0.3022020 0.3488780 0.3768433 0.3280276
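To make clear what dnorm computes, the normal density can be written out by hand and compared with the built-in function. This is only a sketch: normal_density is a hypothetical helper name, and the test points in x are arbitrary.

```r
# Normal density written out from its formula:
#   f(x) = 1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
# The defaults mu = 0 and sigma = 1 give the standard normal, like dnorm.
normal_density <- function(x, mu = 0, sigma = 1) {
  1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
}

x <- c(-2, -0.5, 0, 1, 2.5)
by_formula <- normal_density(x)
by_dnorm <- dnorm(x)

# Both approaches agree up to floating-point tolerance.
print(all.equal(by_formula, by_dnorm))

# The same holds for a non-standard mean and standard deviation.
print(all.equal(normal_density(x, mu = 100, sigma = 20),
                dnorm(x, mean = 100, sd = 20)))
```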

Next, we generate a large sample of normal random numbers, compute their densities, and plot them

> randNorm <- rnorm(30000)
> randDensity <- dnorm(randNorm)
> library(ggplot2)
> ggplot(data.frame(x=randNorm, y=randDensity)) + aes(x=x, y=y) + geom_point() + labs(x="Random Normal Variables", y="Density")


The pnorm function calculates the cumulative distribution function of the normal distribution, that is, the cumulative probability, which is the area under the density curve to the left of a given value

Example:

> pnorm(randNorm10)
 [1] 0.002573215 0.462999528 0.944438605 0.997887002
 [5] 0.497449393 0.255477128 0.117798244 0.136126932
 [9] 0.903424405 0.989672991

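By default pnorm returns the left-tail probability P(X ≤ x); passing lower.tail = FALSE returns the right tail instead. Note that randNorm10 was regenerated before this call: the probabilities above correspond to the vector printed in the qnorm example later in this section, not to the ten values shown with dnorm.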

The probability of falling between two values can be calculated by subtracting two probabilities

Example:

> pnorm(1) - pnorm(0)
[1] 0.3413447
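As a cross-check, the same probability can be obtained by numerically integrating the density between the two values, since the cumulative probability is the area under the curve. The sketch below uses base R's integrate function; the variable names are illustrative only.

```r
# Area under the standard normal density between 0 and 1, by numerical integration.
area <- integrate(dnorm, lower = 0, upper = 1)$value

# The same probability from the cumulative distribution function.
by_cdf <- pnorm(1) - pnorm(0)
print(c(integrated = area, from_pnorm = by_cdf))  # both match the 0.3413447 above

# The right-tail probability P(X > 1) via lower.tail = FALSE.
right_tail <- pnorm(1, lower.tail = FALSE)
print(right_tail)
```

Left and right tails at the same point always sum to 1, so pnorm(1, lower.tail = FALSE) equals 1 - pnorm(1).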

These areas can also be drawn:

First, build the base layer with the normal density curve

> p <- ggplot(data.frame(x=randNorm, y=randDensity)) + aes(x=x, y=y) + geom_line() + labs(x="x", y="Density")

Then, the seq function is used to generate a sequence of numbers from the minimum of randNorm up to -1 in steps of 0.1. (seq generates a sequence running from its from argument to its to argument, with step size by)

> neg1Seq <- seq(from=min(randNorm), to=-1, by=.1)

After that, a data frame is built in which x is neg1Seq and y is the density at those points, dnorm(neg1Seq)

> lessThanNeg1 <- data.frame(x=neg1Seq, y=dnorm(neg1Seq))
> head(lessThanNeg1)
          x            y
1 -3.711379 0.0004072403
2 -3.611379 0.0005873033
3 -3.511379 0.0008385543
4 -3.411379 0.0011853783
5 -3.311379 0.0016589750
6 -3.211379 0.0022986865

Next, add a row with y = 0 at each end with rbind, so that the polygon starts and ends on the x-axis and encloses the area under the curve

> lessThanNeg1 <- rbind(c(min(randNorm), 0), 
+                       lessThanNeg1,
+                       c(max(lessThanNeg1$x), 0))

Then use geom_polygon to fill

> p + geom_polygon(data=lessThanNeg1, aes(x=x, y=y))


The shaded area is the left-tail probability of falling below -1. Similarly, we can shade the probability of falling between two values, here -1 and 1

> neg1Pos1Seq <- seq(from=-1, to=1, by=.1)
> neg1To1 <- data.frame(x=neg1Pos1Seq, y=dnorm(neg1Pos1Seq))
> head(neg1To1)
     x         y
1 -1.0 0.2419707
2 -0.9 0.2660852
3 -0.8 0.2896916
4 -0.7 0.3122539
5 -0.6 0.3332246
6 -0.5 0.3520653
> neg1To1 <- rbind(c(min(neg1To1$x), 0),
+                  neg1To1,
+                  c(max(neg1To1$x), 0))
> p + geom_polygon(data=neg1To1, aes(x=x, y=y))
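The two shading examples follow the same pattern: build a sequence, evaluate dnorm on it, then pad both ends with y = 0. That pattern could be wrapped in a small function. The sketch below is one possible way to do it; shade_region is a hypothetical helper name, and it assumes the standard normal curve as in the plots above.

```r
# Hypothetical helper: polygon vertices for the area under the standard
# normal density between `from` and `to`, sampled every `by` units.
shade_region <- function(from, to, by = 0.1) {
  xs <- seq(from = from, to = to, by = by)
  curve <- data.frame(x = xs, y = dnorm(xs))
  # Pad both ends with y = 0 so the polygon closes on the x-axis.
  rbind(data.frame(x = min(xs), y = 0),
        curve,
        data.frame(x = max(xs), y = 0))
}

neg1To1Poly <- shade_region(-1, 1)
print(head(neg1To1Poly, 3))

# With ggplot2 loaded and p defined as above, the region would be drawn with:
# p + geom_polygon(data = neg1To1Poly, aes(x = x, y = y))
```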

We can also use the earlier random data to draw the cumulative distribution function of the standard normal distribution

> randProb <- pnorm(randNorm)
> ggplot(data.frame(x=randNorm, y=randProb)) + aes(x=x, y=y) + geom_point() + labs(x="Random Normal Variables", y="Probability")


The inverse of the pnorm function is qnorm: given a cumulative probability, it returns the corresponding quantile

Example:

> randNorm10
 [1] -2.797722545 -0.092879796  1.593166745  2.860780500
 [5] -0.006393468 -0.657352548 -1.186065361 -1.097886934
 [9]  1.301313435  2.314249912
> qnorm(pnorm(randNorm10))
 [1] -2.797722545 -0.092879796  1.593166745  2.860780500
 [5] -0.006393468 -0.657352548 -1.186065361 -1.097886934
 [9]  1.301313435  2.314249912
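A common use of qnorm is looking up critical values. The sketch below, with illustrative variable names, asks for the quantiles that cut off 2.5%, 50% and 97.5% of the standard normal distribution and confirms the round trip through pnorm.

```r
# Quantiles of the standard normal at three cumulative probabilities.
probs <- c(0.025, 0.5, 0.975)
quantiles <- qnorm(probs)
print(quantiles)  # symmetric about 0; the last value is about 1.959964

# pnorm undoes qnorm, so the original probabilities come back.
print(all.equal(pnorm(quantiles), probs))

# For a non-standard normal, the median equals the mean.
median_100_20 <- qnorm(0.5, mean = 100, sd = 20)
print(median_100_20)
```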

12.2 binomial distribution

Binomial distribution probability mass function:

$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, \dots, n$$

where n is the number of trials and p is the probability of success in each trial. The mean is np and the variance is np(1-p)

We use the rbinom function to generate binomially distributed random numbers

Example:

Each run consists of ten trials, each with success probability 0.4. The run is repeated ten times, and the number of successes in each run is returned

> rbinom(n=10, size=10, prob = .4)
 [1] 5 2 4 4 5 4 3 5 4 5

Here size is the number of trials in each run, prob is the probability of success, and n is the number of runs
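With many runs, the simulated counts should reflect the mean np and variance np(1-p) stated above. The following sketch checks this for size = 10 and prob = 0.4; the seed and the number of runs are arbitrary choices for illustration.

```r
# Simulate many runs of 10 trials each with success probability 0.4.
set.seed(1)  # arbitrary seed, only for reproducibility
successes <- rbinom(n = 10000, size = 10, prob = 0.4)

# Theoretical mean np and variance np(1 - p).
theory_mean <- 10 * 0.4
theory_var <- 10 * 0.4 * (1 - 0.4)

# The sample statistics should land close to the theoretical values.
print(c(sample_mean = mean(successes), theory_mean = theory_mean))
print(c(sample_var = var(successes), theory_var = theory_var))
```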

Next, the binomial distribution is visualized

Example:

10000 runs are generated at random; each run has 10 trials and the success probability is 0.3

> binomData <- data.frame(Success=rbinom(n=10000, size=10, prob=.3))
> ggplot(binomData, aes(x=Success)) + geom_histogram(binwidth = 1)


You can see that 3 successes is the most frequent outcome, which agrees with the mean np = 10 × 0.3 = 3

As with the normal distribution functions, dbinom and pbinom return the probability mass (the exact probability of a given number of successes) and the cumulative probability of the binomial distribution, respectively
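The relationship between the two functions can be checked directly: dbinom should match the binomial probability formula C(n, k) p^k (1-p)^(n-k), and pbinom should be the running sum of dbinom. A minimal sketch for 10 trials with success probability 0.3, using illustrative variable names:

```r
n_trials <- 10
p_success <- 0.3
k <- 0:n_trials

# Exact probability of k successes, from the formula and from dbinom.
by_formula <- choose(n_trials, k) * p_success^k * (1 - p_success)^(n_trials - k)
by_dbinom <- dbinom(k, size = n_trials, prob = p_success)
print(all.equal(by_formula, by_dbinom))

# pbinom accumulates dbinom: P(X <= k) is the sum of P(X = j) for j <= k.
by_pbinom <- pbinom(k, size = n_trials, prob = p_success)
print(all.equal(cumsum(by_dbinom), by_pbinom))

# The most probable number of successes.
most_likely <- k[which.max(by_dbinom)]
print(most_likely)
```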

12.3 Poisson distribution

Poisson distribution probability mass function:

$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \dots$$

where λ is both the mean and the variance

The functions that generate random number, density, distribution and quantile are rpois, dpois, ppois and qpois respectively
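As a quick check of dpois, the Poisson probabilities can be computed by hand from λ^k e^(-λ) / k!, and the mean and variance can be computed from the resulting probabilities. The sketch below uses λ = 5 and truncates the counts at 60, an arbitrary cut-off beyond which the remaining probability is negligible.

```r
lambda <- 5
k <- 0:60  # truncated range; the mass beyond 60 is negligible for lambda = 5

# Probability of each count, from dpois and from the formula.
pmf <- dpois(k, lambda)
by_formula <- lambda^k * exp(-lambda) / factorial(k)
print(all.equal(pmf, by_formula))

# Mean and variance computed from the probabilities; both should be about lambda.
pois_mean <- sum(k * pmf)
pois_var <- sum((k - pois_mean)^2 * pmf)
print(c(mean = pois_mean, variance = pois_var))
```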

As λ increases, the Poisson distribution begins to resemble the normal distribution.
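One way to quantify this is to compare the Poisson cumulative probabilities with those of a normal distribution that has the same mean λ and standard deviation sqrt(λ). The sketch below is an illustration under assumptions not in the original notes: it applies a continuity correction of 0.5 and looks at the largest gap over a range of counts chosen to cover almost all of the probability.

```r
lambdas <- c(1, 2, 5, 10, 20)

# Largest absolute gap between the Poisson CDF and the matching normal CDF.
max_gap <- sapply(lambdas, function(lambda) {
  # Counts from 0 up to well past the mean (about 10 standard deviations).
  k <- 0:(lambda + 10 * ceiling(sqrt(lambda)))
  # k + 0.5 is a continuity correction for approximating a discrete
  # distribution with a continuous one.
  max(abs(ppois(k, lambda) - pnorm(k + 0.5, mean = lambda, sd = sqrt(lambda))))
})

print(data.frame(lambda = lambdas, max_gap = max_gap))
```

The gap for λ = 20 should come out clearly smaller than the gap for λ = 1, which is the numerical counterpart of the histograms looking more bell-shaped.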

Example:

Simulate 10000 samples from the Poisson distribution for each of five values of λ (1, 2, 5, 10 and 20) and draw a histogram for each

> pois1 <- rpois(n=10000, lambda = 1)
> pois2 <- rpois(n=10000, lambda = 2)
> pois5 <- rpois(n=10000, lambda = 5)
> pois10 <- rpois(n=10000, lambda = 10)
> pois20 <- rpois(n=10000, lambda = 20)
> pois <- data.frame(Lambda.1=pois1, Lambda.2=pois2, Lambda.5=pois5, Lambda.10=pois10, Lambda.20=pois20)
> library(reshape2)
> pois <- melt(data=pois, variable.name = "Lambda", value.name = "x")
> library(stringr)
> pois$Lambda <- as.factor(as.numeric(str_extract(string=pois$Lambda, pattern = "\\d+")))
> head(pois)
  Lambda x
1      1 0
2      1 0
3      1 0
4      1 2
5      1 2
6      1 1
> library(ggplot2)
> ggplot(pois, aes(x=x)) + geom_histogram(binwidth = 1) + facet_wrap(~ Lambda) + ggtitle("Probability Mass Function")


It can be seen that, as λ grows, the histogram looks more and more like a normal distribution

We can also overlay smoothed density estimates of the five samples

> ggplot(pois, aes(x=x)) + 
+     geom_density(aes(group=Lambda, color=Lambda, 
+                      fill=Lambda),
+                  adjust=4, alpha=1/2) + 
+     scale_color_discrete() + scale_fill_discrete() + 
+     ggtitle("Probability Mass Function")

12.4 other distribution

Posted by id10t on Mon, 18 Apr 2022 15:34:39 +0930