Probability distributions

A critical aspect of (parametric) statistical analysis is the use of probability distributions, like the normal (Gaussian) distribution. These distributions underlie all of our common (parametric) statistical tests, like t-tests, chi-squared tests, ANOVA, regression, and so forth. R has functions to draw values from all of the common distributions (normal, t, F, chi-squared, binomial, Poisson, etc.), as well as many others.

R implements four families of functions uniformly across these distributions. Three of them let users extract the probability density, the cumulative probability, and the quantiles of a distribution; the fourth draws random values. For example, the dnorm function provides the density of the normal distribution at a specific quantile. The pnorm function provides the cumulative probability of the normal distribution at a specific quantile. The qnorm function provides the quantile of the normal distribution at a specified cumulative probability. The fourth function, rnorm, draws random values from the normal distribution (this is discussed in detail in the random sampling tutorial).

The same functions are also implemented for the other common distributions. For example, the functions for Student's t distribution are dt, pt, and qt. For the chi-squared distribution, they are dchisq, pchisq, and qchisq. Hopefully you see the pattern. The rest of this tutorial walks through how to use these functions.
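To make the naming pattern concrete, here is a minimal sketch that asks three different distributions for the same 0.95 quantile. The degrees-of-freedom values are arbitrary choices for illustration, not values taken from this tutorial.

```r
# Same question (the 0.95 quantile), three distributions.
# Only the suffix after "q" and the distribution parameters change.
q_normal <- qnorm(0.95)             # standard normal
q_t      <- qt(0.95, df = 10)       # Student's t, 10 degrees of freedom (arbitrary)
q_chisq  <- qchisq(0.95, df = 1)    # chi-squared, 1 degree of freedom (arbitrary)

c(normal = q_normal, t = q_t, chisq = q_chisq)
```

The d* and p* functions follow the same convention, so dt, pt, dchisq, and pchisq take the same extra df argument.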

Density functions

The density functions provide the density of a specified distribution at a given quantile. Because the d* functions accept the distribution's parameters as arguments, they can return the density not just for the default form of a distribution but for any parameterization of it. For example, calling:

dnorm(0)

## [1] 0.3989

provides the density of the standard normal distribution (i.e., a normal distribution with mean 0 and standard deviation 1) at the point 0 (i.e., at the distribution's mean). We can retrieve the density at a different value (or vector of values) easily:

dnorm(1)

## [1] 0.242

dnorm(1:3)

## [1] 0.241971 0.053991 0.004432

We can also retrieve densities from a different normal distribution (e.g., one with a different mean or a larger SD):

dnorm(1, mean = -2)

## [1] 0.004432

dnorm(1, mean = 5)

## [1] 0.0001338

dnorm(1, mean = 0, sd = 3)

## [1] 0.1258
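As a check on what dnorm is computing, the normal density can be written out by hand and compared with R's result. The helper normal_density below is a hypothetical name introduced only for this sketch; it is not part of R.

```r
# Normal density written from its textbook formula:
#   f(x) = exp(-(x - mean)^2 / (2 * sd^2)) / (sd * sqrt(2 * pi))
# normal_density is a hypothetical helper, not an R built-in.
normal_density <- function(x, mean = 0, sd = 1) {
  exp(-((x - mean)^2) / (2 * sd^2)) / (sd * sqrt(2 * pi))
}

x <- c(-1, 0, 1, 2.5)

# Both comparisons should report TRUE
same_standard <- all.equal(normal_density(x), dnorm(x))
same_wide     <- all.equal(normal_density(x, mean = 0, sd = 3),
                           dnorm(x, mean = 0, sd = 3))
same_standard
same_wide
```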

Cumulative distribution functions

We are often much more interested in the cumulative distribution (i.e., how much of the distribution is to the left of the indicated value). For this, we can use the p* family of functions. As an example, let's obtain the cumulative distribution function's value from a standard normal distribution at point 0 (i.e., the distribution's mean):

pnorm(0)

## [1] 0.5

Unsurprisingly, the value is .5 because half of the distribution is to the left of 0.

When we conduct statistical significance testing, we compare a value we observe to the cumulative distribution function. As one might recall, the value of 1.65 is the (approximate) critical value for a 90% normal confidence interval. We can see that by requesting:

pnorm(1.65)

## [1] 0.9505

The comparable value for a 95% CI is 1.96:

pnorm(1.96)

## [1] 0.975

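Note how the values are ~.95 and ~.975, respectively, rather than .90 and .95. This is because they are critical values for two-tailed tests, which split the remaining probability equally between the two tails (5% in each tail for a 90% interval, 2.5% in each tail for a 95% interval). If we plug a negative value into the pnorm function, we'll receive the cumulative probability for the left side of the distribution: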

pnorm(-1.96)

## [1] 0.025

Thus, by subtracting the output of pnorm for the negative input from the output for the positive input, we see that 95% of the density is between -1.96 and 1.96 (in the standard normal distribution):

pnorm(1.96) - pnorm(-1.96)

## [1] 0.95
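Because pnorm accumulates the density that dnorm reports, the same 95% can be recovered by numerically integrating dnorm. The sketch below also shows how the idea is commonly applied to get a two-tailed p-value; the observed statistic z is a made-up value used only for illustration.

```r
# Area under the standard normal density between -1.96 and 1.96,
# computed by numerical integration of dnorm
area <- integrate(dnorm, lower = -1.96, upper = 1.96)$value
area

# Two-tailed p-value for a hypothetical observed z statistic:
# probability in both tails at least as far from 0 as z
z <- 2.3   # made-up value for illustration
p_two_tailed <- 2 * pnorm(-abs(z))
p_two_tailed
```

Using -abs(z) means the same line works whether the observed statistic is positive or negative.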

Quantile function

The examples just described relied on the heuristic values of 1.65 and 1.96 as the thresholds for 90% and 95% two-tailed tests. But to find the exact points at which the normal distribution has accumulated a particular cumulative probability, we can use the qnorm function. Essentially, qnorm is the inverse of pnorm. To obtain the critical values for a two-tailed 95% confidence interval, we would plug .025 and .975 into qnorm:

qnorm(c(0.025, 0.975))

## [1] -1.96  1.96

And we could actually nest that call inside a pnorm function to see that pnorm and qnorm are inverses of each other:

pnorm(qnorm(c(0.025, 0.975)))

## [1] 0.025 0.975

For one-tailed tests, we simply specify the cumulative probability. So, for a one-tailed 95% critical value, we would specify:

qnorm(0.95)

## [1] 1.645

We could obtain the other tail by specifying:

qnorm(0.05)

## [1] -1.645

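Or, we could set lower.tail = FALSE, so that the probability is measured from the upper (right) tail of the distribution rather than the lower (left) tail (which is the default). The point with 95% of the distribution to its right is the same as the point with 5% to its left: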

qnorm(0.95, lower.tail = FALSE)

## [1] -1.645
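The effect of lower.tail can be confirmed with a short sketch: a probability p measured from the upper tail corresponds to 1 - p measured from the lower tail, for both qnorm and pnorm.

```r
# With lower.tail = FALSE, the probability is the area to the RIGHT of the quantile
upper_5  <- qnorm(0.05, lower.tail = FALSE)  # 5% of the distribution lies above this point
lower_95 <- qnorm(0.95)                      # 95% of the distribution lies below this point
quantiles_match <- all.equal(upper_5, lower_95)
quantiles_match

# The same argument is available for pnorm
tails_match <- all.equal(pnorm(1.645, lower.tail = FALSE), 1 - pnorm(1.645))
tails_match
```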

As with dnorm, the pnorm and qnorm functions work on arbitrary normal distributions, but their results will be less familiar to us:

pnorm(1.96, mean = 3)

## [1] 0.1492

qnorm(0.95, mean = 3)

## [1] 4.645
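These less familiar numbers are the standard normal results shifted and rescaled. A minimal sketch of the relationship, using the mean of 3 from the calls above and the default standard deviation of 1:

```r
x     <- 1.96
mu    <- 3   # mean used in the calls above
sigma <- 1   # default standard deviation

# Cumulative probability: standardize x, then use the standard normal
p_match <- all.equal(pnorm(x, mean = mu, sd = sigma),
                     pnorm((x - mu) / sigma))

# Quantile: take the standard normal quantile, then rescale and shift
q_match <- all.equal(qnorm(0.95, mean = mu, sd = sigma),
                     mu + sigma * qnorm(0.95))
p_match
q_match
```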

Other distributions

As stated above, R supplies functions analogous to those just described for numerous distributions. Details about all of the distributions can be found in the help files: ?Distributions.

Here are a few examples:

t distribution

Note: The t distribution functions require a df argument, specifying the degrees of freedom.

qt(0.95, df = 1000)

## [1] 1.646

qt(c(0.025, 0.975), df = 1000)

## [1] -1.962  1.962
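With df = 1000 the t critical values are nearly identical to the normal ones. The following sketch, with df values chosen arbitrarily for illustration, shows how the two-tailed 95% critical value shrinks toward the normal value as the degrees of freedom grow.

```r
dfs    <- c(5, 30, 1000)        # arbitrary degrees of freedom for illustration
t_crit <- qt(0.975, df = dfs)   # two-tailed 95% critical values for each df
z_crit <- qnorm(0.975)          # the normal counterpart

# Gap between each t critical value and the normal critical value
data.frame(df = dfs, t_crit = t_crit, gap = t_crit - z_crit)
```

At small df the heavier tails of the t distribution push the critical value noticeably further out than 1.96.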

Binomial distribution

The binomial distribution functions work as above, but require size and prob arguments, specifying the number of trials and the probability of success on each trial. So, if we are modelling a single fair coin flip:

dbinom(0, 1, 0.5)

## [1] 0.5

pbinom(0, 1, 0.5)

## [1] 0.5

qbinom(0.95, 1, 0.5)

## [1] 1

qbinom(c(0.025, 0.975), 1, 0.5)

## [1] 0 1

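Note: Because the binomial is a discrete distribution, the values here might seem strange compared to the above. dbinom returns the probability of exactly that many successes rather than a density, and qbinom returns the smallest whole number of successes whose cumulative probability reaches the requested value.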
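The discreteness is easier to see with more than one trial. The sketch below uses ten fair coin flips (an arbitrary choice for illustration) to show that pbinom is the running sum of dbinom, and that qbinom lands on the first count at which the cumulative probability reaches the target.

```r
size <- 10    # ten flips, chosen arbitrarily for illustration
prob <- 0.5   # fair coin

# The cumulative probabilities are the running sum of the point probabilities
counts    <- 0:size
cdf       <- pbinom(counts, size, prob)
cdf_match <- all.equal(cdf, cumsum(dbinom(counts, size, prob)))
cdf_match

# qbinom returns the smallest count whose cumulative probability is >= 0.95
q <- qbinom(0.95, size, prob)
q
pbinom(q, size, prob)       # at or above 0.95
pbinom(q - 1, size, prob)   # still below 0.95
```

Because the cumulative probability jumps from one count to the next, there is usually no count at which it equals the target exactly, which is why pnorm(qnorm(p)) returns p but pbinom(qbinom(p, size, prob), size, prob) generally does not.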