Basic Univariate Statistics

R is a statistical programming language and environment, so computing statistics is one of its core uses. For any numeric vector, we can calculate a number of statistics. We start with a vector of 100 random draws from a standard normal distribution:

set.seed(1)
a <- rnorm(100)

Simple statistics

minimum

min(a)
## [1] -2.215

maximum

max(a)
## [1] 2.402

We can get the minimum and maximum together with range:

range(a)
## [1] -2.215  2.402

We can also obtain the minimum by sorting the vector (using sort):

sort(a)[1]
## [1] -2.215

And we can obtain the maximum by sorting in the opposite order:

sort(a, decreasing = TRUE)[1]
## [1] 2.402
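
The vector `a` has no missing values, but real data often do. The following is a minimal sketch, using a small made-up vector `b` that is not part of the example above, of how these functions behave when an NA is present and how the `na.rm` argument changes that:

```r
# `b` is a hypothetical vector containing one missing value
b <- c(3, NA, 1, 2)

min(b)                  # NA, because a missing value is present
min(b, na.rm = TRUE)    # 1
max(b, na.rm = TRUE)    # 3
range(b, na.rm = TRUE)  # 1 3

# sort() drops missing values by default, so the sorting approach
# ignores them without any warning
sort(b)[1]              # 1
```

The same `na.rm` argument is accepted by mean, median, var, sd, and quantile.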

To calculate the central tendency, we have several options.

mean

mean(a)
## [1] 0.1089

This is of course equivalent to:

sum(a)/length(a)
## [1] 0.1089

median

median(a)
## [1] 0.1139

In a vector with an even number of elements, this is equivalent to:

(sort(a)[length(a)/2] + sort(a)[length(a)/2 + 1])/2
## [1] 0.1139

In a vector with an odd number of elements, this is equivalent to:

a2 <- a[-1]  # drop first observation of `a`
sort(a2)[(length(a2) + 1)/2]  # middle position; avoids relying on a fractional index being truncated
## [1] 0.1533
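
The two cases can be combined into a single function. This is only a sketch: `manual_median` is a hypothetical helper (not a base R function), and it assumes a non-empty numeric vector without missing values.

```r
# `manual_median` is a hypothetical helper, not part of base R
manual_median <- function(x) {
  s <- sort(x)
  n <- length(s)
  if (n %% 2 == 0) {
    # even length: average the two middle values
    (s[n/2] + s[n/2 + 1])/2
  } else {
    # odd length: take the single middle value
    s[(n + 1)/2]
  }
}

set.seed(1)
a <- rnorm(100)
all.equal(manual_median(a), median(a))          # even length
all.equal(manual_median(a[-1]), median(a[-1]))  # odd length
```

In practice median is the better choice; the helper only makes the definition explicit.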

We can also obtain measures of dispersion.

Variance

var(a)
## [1] 0.8068

This is equivalent to:

sum((a - mean(a))^2)/(length(a) - 1)
## [1] 0.8068

Standard deviation

sd(a)
## [1] 0.8982

This is equivalent to:

sqrt(var(a))
## [1] 0.8982

Or:

sqrt(sum((a - mean(a))^2)/(length(a) - 1))
## [1] 0.8982
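
Because these are floating-point calculations, it is safer to compare the manual and built-in results with all.equal than with `==`. A small sketch of that check follows; the names `manual_var` and `pop_var` are introduced here only for illustration.

```r
set.seed(1)
a <- rnorm(100)
n <- length(a)

# sample variance by hand (divides by n - 1, as var() does)
manual_var <- sum((a - mean(a))^2)/(n - 1)

all.equal(manual_var, var(a))
all.equal(sqrt(manual_var), sd(a))

# dividing by n instead gives the population variance,
# which is always slightly smaller than the sample variance
pop_var <- sum((a - mean(a))^2)/n
pop_var < var(a)
```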

There are also some convenience functions that provide multiple statistics. The fivenum function provides Tukey's five-number summary (minimum, lower hinge, median, upper hinge, and maximum), where the hinges are close to, but not always equal to, Q1 and Q3:

fivenum(a)
## [1] -2.2147 -0.5103  0.1139  0.6934  2.4016
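
The second and fourth values from fivenum can differ from the quartiles that quantile reports with its default settings. A minimal sketch using the small vector `1:10` (chosen here only because the arithmetic is easy to follow):

```r
x <- 1:10

fivenum(x)                  # 1.0 3.0 5.5 8.0 10.0
quantile(x, c(0.25, 0.75))  # 3.25 and 7.75 with the default method
```

For large samples the two usually agree closely, but they are computed by different rules.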

It is also possible to obtain arbitrary percentiles/quantiles from a vector:

quantile(a, 0.1)  # 10% quantile
##    10% 
## -1.053

You can also specify a vector of quantiles:

quantile(a, c(0.025, 0.975))
##   2.5%  97.5% 
## -1.671  1.797
quantile(a, seq(0, 1, by = 0.1))
##      0%     10%     20%     30%     40%     50%     60%     70%     80% 
## -2.2147 -1.0527 -0.6139 -0.3753 -0.0767  0.1139  0.3771  0.5812  0.7713 
##     90%    100% 
##  1.1811  2.4016
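
The 0%, 50%, and 100% quantiles are the minimum, median, and maximum. The sketch below checks this, using the `names = FALSE` argument of quantile to drop the percentage labels so that the results are plain numeric vectors:

```r
set.seed(1)
a <- rnorm(100)

# names = FALSE returns an unnamed numeric vector
q <- quantile(a, c(0, 0.5, 1), names = FALSE)

all.equal(q[2], median(a))
all.equal(q[c(1, 3)], range(a))
```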

Summary

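The summary function, applied to a numeric vector, provides the minimum, first quartile, median, third quartile, and maximum, plus the mean. Its quartiles come from quantile rather than fivenum, so they can differ slightly from the fivenum values: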

summary(a)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -2.210  -0.494   0.114   0.109   0.692   2.400
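
The result of summary is a named vector, so individual statistics can be pulled out by name. This is a minimal sketch; note that, depending on the R version, the stored values may be rounded to a few significant digits, so they are not guaranteed to match mean(a) or median(a) exactly.

```r
set.seed(1)
a <- rnorm(100)

s <- summary(a)
names(s)        # "Min." "1st Qu." "Median" "Mean" "3rd Qu." "Max."

# double brackets return the bare number without the name
s[["Mean"]]
s[["Median"]]
```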

Note: The summary function returns different results if the vector is logical, character, or a factor. For a logical vector, summary returns a tabulation of the values:

summary(as.logical(rbinom(100, 1, 0.5)))
##    Mode   FALSE    TRUE    NA's 
## logical      62      38       0

For a character vector, summary returns just some basic information about the vector:

summary(sample(c("a", "b", "c"), 100, TRUE))
##    Length     Class      Mode 
##       100 character character

For a factor, summary returns a count of each level. Because every value of a is unique, each count here is 1:

summary(factor(a))
##    -2.2146998871775   -1.98935169586337   -1.80495862889104 
##                   1                   1                   1 
##   -1.52356680042976   -1.47075238389927   -1.37705955682861 
##                   1                   1                   1 
##   -1.27659220845804    -1.2536334002391   -1.22461261489836 
##                   1                   1                   1 
##   -1.12936309608079   -1.04413462631653  -0.934097631644252 
##                   1                   1                   1 
##  -0.835628612410047  -0.820468384118015  -0.743273208882405 
##                   1                   1                   1 
##  -0.709946430921815   -0.70749515696212   -0.68875569454952 
##                   1                   1                   1 
##  -0.626453810742332  -0.621240580541804  -0.612026393250771 
##                   1                   1                   1 
##  -0.589520946188072  -0.573265414236886  -0.568668732818502 
##                   1                   1                   1 
##   -0.54252003099165   -0.47815005510862  -0.473400636439312 
##                   1                   1                   1 
##  -0.443291873218433   -0.41499456329968  -0.394289953710349 
##                   1                   1                   1 
##  -0.367221476466509  -0.305388387156356  -0.304183923634301 
##                   1                   1                   1 
##  -0.253361680136508  -0.164523596253587  -0.155795506705329 
##                   1                   1                   1 
##  -0.135178615123832  -0.135054603880824  -0.112346212150228 
##                   1                   1                   1 
##  -0.102787727342996 -0.0593133967111857 -0.0561287395290008 
##                   1                   1                   1 
## -0.0538050405829051 -0.0449336090152309 -0.0392400027331692 
##                   1                   1                   1 
## -0.0161902630989461 0.00110535163162413  0.0280021587806661 
##                   1                   1                   1 
##  0.0743413241516641  0.0745649833651906   0.153253338211898 
##                   1                   1                   1 
##   0.183643324222082   0.188792299514343   0.267098790772231 
##                   1                   1                   1 
##   0.291446235517463   0.329507771815361   0.332950371213518 
##                   1                   1                   1 
##   0.341119691424425    0.36458196213683   0.370018809916288 
##                   1                   1                   1 
##   0.387671611559369   0.389843236411431   0.398105880367068 
##                   1                   1                   1 
##   0.417941560199702   0.475509528899663   0.487429052428485 
##                   1                   1                   1 
##   0.556663198673657   0.558486425565304   0.569719627442413 
##                   1                   1                   1 
##   0.575781351653492   0.593901321217509   0.593946187628422 
##                   1                   1                   1 
##   0.610726353489055    0.61982574789471   0.689739362450777 
##                   1                   1                   1 
##   0.696963375404737   0.700213649514998   0.738324705129217 
##                   1                   1                   1 
##   0.763175748457544   0.768532924515416   0.782136300731067 
##                   1                   1                   1 
##   0.821221195098089   0.881107726454215   0.918977371608218 
##                   1                   1                   1 
##   0.943836210685299    1.06309983727636    1.10002537198388 
##                   1                   1                   1 
##    1.12493091814311    1.16040261569495     1.1780869965732 
##                   1                   1                   1 
##    1.20786780598317    1.35867955152904    1.43302370170104 
##                   1                   1                   1 
##    1.46555486156289    1.51178116845085    1.58683345454085 
##                   1                   1                   1 
##    1.59528080213779    1.98039989850586    2.17261167036215 
##                   1                   1                   1 
##    2.40161776050478 
##                   1
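
A factor summary is more informative when levels repeat. A minimal sketch with a small made-up factor `f` (not part of the example above):

```r
# `f` is a hypothetical factor with repeated values
f <- factor(c("a", "b", "a", "c", "a"))
summary(f)  # counts per level: a = 3, b = 1, c = 1

# table() gives the same counts and also works directly on a character vector
table(c("a", "b", "a", "c", "a"))
```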

A summary of a data frame returns the summary information separately for each column. The result may look different from column to column, depending on the class of each column:

summary(data.frame(a = 1:10, b = 11:20))
##        a               b       
##  Min.   : 1.00   Min.   :11.0  
##  1st Qu.: 3.25   1st Qu.:13.2  
##  Median : 5.50   Median :15.5  
##  Mean   : 5.50   Mean   :15.5  
##  3rd Qu.: 7.75   3rd Qu.:17.8  
##  Max.   :10.00   Max.   :20.0
summary(data.frame(a = 1:10, b = factor(11:20)))
##        a               b    
##  Min.   : 1.00   11     :1  
##  1st Qu.: 3.25   12     :1  
##  Median : 5.50   13     :1  
##  Mean   : 5.50   14     :1  
##  3rd Qu.: 7.75   15     :1  
##  Max.   :10.00   16     :1  
##                  (Other):4
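
To see in advance which kind of summary each column will get, one option is to inspect the column classes. The sketch below uses sapply for this; the data frame name `df` is introduced here only for illustration.

```r
df <- data.frame(a = 1:10, b = factor(11:20))

# class of each column: a is "integer", b is "factor"
sapply(df, class)

# compute a statistic for the numeric columns only
sapply(df[sapply(df, is.numeric)], mean)
```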

A summary of a list returns only the length, class, and mode of each element, which is rarely useful:

summary(list(a = 1:10, b = 1:10))
##   Length Class  Mode   
## a 10     -none- numeric
## b 10     -none- numeric
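
To get numeric summaries of the list elements instead, one approach is to apply summary to each element. A minimal sketch, assuming every element is a numeric vector (the list `l` is made up for this illustration):

```r
l <- list(a = 1:10, b = 11:20)

# a list containing one summary per element
lapply(l, summary)

# sapply simplifies the result to a matrix with one column per element
res <- sapply(l, summary)
res["Mean", ]
```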

A summary of a matrix returns a summary of each column separately (like a data frame):

summary(matrix(1:20, nrow = 4))
##        V1             V2             V3              V4      
##  Min.   :1.00   Min.   :5.00   Min.   : 9.00   Min.   :13.0  
##  1st Qu.:1.75   1st Qu.:5.75   1st Qu.: 9.75   1st Qu.:13.8  
##  Median :2.50   Median :6.50   Median :10.50   Median :14.5  
##  Mean   :2.50   Mean   :6.50   Mean   :10.50   Mean   :14.5  
##  3rd Qu.:3.25   3rd Qu.:7.25   3rd Qu.:11.25   3rd Qu.:15.2  
##  Max.   :4.00   Max.   :8.00   Max.   :12.00   Max.   :16.0  
##        V5      
##  Min.   :17.0  
##  1st Qu.:17.8  
##  Median :18.5  
##  Mean   :18.5  
##  3rd Qu.:19.2  
##  Max.   :20.0
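
The matrix summary above is formatted for printing. If the numbers themselves are needed, one option is to apply summary over the columns, which returns a numeric matrix. This is a sketch; the names `m` and `res` are introduced here only for illustration.

```r
m <- matrix(1:20, nrow = 4)

# MARGIN = 2 applies the function to each column
res <- apply(m, 2, summary)
res["Mean", ]  # column means: 2.5 6.5 10.5 14.5 18.5

# colMeans() gives the same means directly
colMeans(m)
```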