When it comes to performing statistical analysis in R, the most important object type is a dataframe. When we load data into R or use R to conduct statistical tests or build models, we want to have our data as a dataframe. A dataframe is actually a special type of list that has some properties that facilitate using it for data analysis.
To create a dataframe, we use the data.frame
function:
a <- data.frame(1:3)
a
## X1.3
## 1 1
## 2 2
## 3 3
This example is a single vector coerced into being a dataframe.
Our input vector 1:3
is printed as a column and the dataframe has row names:
rownames(a)
## [1] "1" "2" "3"
And the vector has been automatically given a column name:
colnames(a)
## [1] "X1.3"
Note: We can also see the column names of a dataframe using names
:
names(a)
## [1] "X1.3"
Like a matrix, we can see that this dataframe has dimensions:
dim(a)
## [1] 3 1
Which we can observe as row and column dimensions:
nrow(a)
## [1] 3
ncol(a)
## [1] 1
But having a dataframe consisting of one column vector isn't very helpful. In general we want to have multiple columns, where each column is a variable and each row is an observation.
b <- data.frame(1:3, 4:6)
b
## X1.3 X4.6
## 1 1 4
## 2 2 5
## 3 3 6
You can see the similarity to building a list and indeed if we check whether our dataframe is a list, it is:
is.data.frame(b)
## [1] TRUE
is.list(b)
## [1] TRUE
Our new dataframe b
now has two column variables and the same number of rows.
The names of the dataframe are assigned automatically, but we can change them:
names(b)
## [1] "X1.3" "X4.6"
names(b) <- c("var1", "var2")
names(b)
## [1] "var1" "var2"
We can also assign names when we create a dataframe, just as we did with a list:
d <- data.frame(var1 = 1:3, var2 = 4:6)
names(d)
## [1] "var1" "var2"
d
## var1 var2
## 1 1 4
## 2 2 5
## 3 3 6
Indexing dataframes works similarly to both lists and matrices. Even though our dataframe isn't a matrix:
is.matrix(d)
## [1] FALSE
We can still index it in two dimensions like a matrix to extract rows, columns, or elements:
d[1, ] #' row
## var1 var2
## 1 1 4
d[, 2] #' column
## [1] 4 5 6
d[3, 2] #' element
## [1] 6
Because dataframes are actually lists, we can index them just like we would a list: For example, to get a dataframe containing only our first column variable, we can use single brackets:
d[1]
## var1
## 1 1
## 2 2
## 3 3
The same result is possible with named indexing:
d["var1"]
## var1
## 1 1
## 2 2
## 3 3
To get that column variable as a vector instead of a one-column dataframe, we can use double brackets:
d[[1]]
## [1] 1 2 3
And we can also use named indexing as we would in a list:
d[["var1"]]
## [1] 1 2 3
d$var1
## [1] 1 2 3
And, we can combine indexing like we did with a list to get the elements of a column vector:
d[["var2"]][3]
## [1] 6
d$var2[3]
## [1] 6
We can also use -
indexing to exclude columns:
d[, -1]
## [1] 4 5 6
or rows:
d[-2, ]
## var1 var2
## 1 1 4
## 3 3 6
Thus, it is very easy to extract different parts of a dataframe in different ways, depending on what we want to do.
With those indexing rules, it is also very easy to change dataframe elements. For example, to add a column variable, we just need to add a vector with a name:
d$var3 <- 7:9
d
## var1 var2 var3
## 1 1 4 7
## 2 2 5 8
## 3 3 6 9
If we try to add a vector that is shorter than the number of dataframe rows, recycling is invoked:
d$var4 <- 1
d
## var1 var2 var3 var4
## 1 1 4 7 1
## 2 2 5 8 1
## 3 3 6 9 1
If we try to add a vector that is longer than the number of dataframe rows, we get a error:
d$var4 <- 1:4
## Error: replacement has 4 rows, data has 3
So even though a dataframe is like a list, it has the restriction that all columns must have the same length.
We can also remove dataframe columns by setting them equal to NULL:
d
## var1 var2 var3 var4
## 1 1 4 7 1
## 2 2 5 8 1
## 3 3 6 9 1
d$var4 <- NULL
d
## var1 var2 var3
## 1 1 4 7
## 2 2 5 8
## 3 3 6 9
This permanently removes the column variable from the dataframe and reduces its dimensions. To remove rows, you simply using positional indexing as described above and assign the result as itself:
d
## var1 var2 var3
## 1 1 4 7
## 2 2 5 8
## 3 3 6 9
d[-2, ]
## var1 var2 var3
## 1 1 4 7
## 3 3 6 9
d <- d[-2, ]
d
## var1 var2 var3
## 1 1 4 7
## 3 3 6 9
This highlights an important point. Unless we assign using <-
, we are not modifying the dataframe, only changing what is displayed.
If we want to preserve a dataframe and a modified version of it, we can simply assign the modified version a new name:
d
## var1 var2 var3
## 1 1 4 7
## 3 3 6 9
d2 <- d[, -1]
This leaves our original dataframe unchanged:
d
## var1 var2 var3
## 1 1 4 7
## 3 3 6 9
And gives us a new object reflecting the modified dataframe:
d2
## var2 var3
## 1 4 7
## 3 6 9
Another similarly between dataframes and matrices is that we can bind them columnwise:
e1 <- data.frame(1:3, 4:6)
e2 <- data.frame(7:9, 10:12)
cbind(e1, e2)
## X1.3 X4.6 X7.9 X10.12
## 1 1 4 7 10
## 2 2 5 8 11
## 3 3 6 9 12
To bind them rowwise, however, the two dataframes need to have matching names
:
names(e1) <- names(e2) <- c("Var1", "Var2")
rbind(e1, e2)
## Var1 Var2
## 1 1 4
## 2 2 5
## 3 3 6
## 4 7 10
## 5 8 11
## 6 9 12
Dataframes can also be combined using the merge
function.
merge
is powerful, but can also be confusing.
Let's imagine that our two dataframes contain observations for the same three individuals, but in different orders:
e1$id <- 1:3
e2$id <- c(2, 1, 3)
We should also rename the variables in e2
to show that these are unique variables:
names(e2)[1:2] <- c("Var3", "Var4")
If we use cbind
to combine the data, variables from observations in the two dataframes will be mismatched:
cbind(e1, e2)
## Var1 Var2 id Var3 Var4 id
## 1 1 4 1 7 10 2
## 2 2 5 2 8 11 1
## 3 3 6 3 9 12 3
This is where merge comes in handy because we can specify a by
parameter:
e3 <- merge(e1, e2, by = "id")
The result is a single dataframe, with a single id
variable and observations from the two dataframes are matched appropriately.
That was a simple example, but what if our dataframes have different (but overlapping) sets of observations.
e4 <- data.frame(Var5 = 10:1, Var6 = c(5:1, 1:5), id = c(1:2, 4:11))
e4
## Var5 Var6 id
## 1 10 5 1
## 2 9 4 2
## 3 8 3 4
## 4 7 2 5
## 5 6 1 6
## 6 5 1 7
## 7 4 2 8
## 8 3 3 9
## 9 2 4 10
## 10 1 5 11
This new dataframe e4
has two observations common to the previous dataframes (1 and 2) but no observation for 3.
If we merge e3
and e4
, what do we get?
merge(e3, e4, by = "id")
## id Var1 Var2 Var3 Var4 Var5 Var6
## 1 1 1 4 8 11 10 5
## 2 2 2 5 7 10 9 4
The result is all variables (columns) for the two common observations (1 and 2). If we want to include observation 3, we can use:
merge(e3, e4, by = "id", all.x = TRUE)
## id Var1 Var2 Var3 Var4 Var5 Var6
## 1 1 1 4 8 11 10 5
## 2 2 2 5 7 10 9 4
## 3 3 3 6 9 12 NA NA
Note: The all.x
argument refers to which observations from the first dataframe (e3
) we want to preserve.
If we want to include observations 4 to 11, we can use:
merge(e3, e4, by = "id", all.y = TRUE)
## id Var1 Var2 Var3 Var4 Var5 Var6
## 1 1 1 4 8 11 10 5
## 2 2 2 5 7 10 9 4
## 3 4 NA NA NA NA 8 3
## 4 5 NA NA NA NA 7 2
## 5 6 NA NA NA NA 6 1
## 6 7 NA NA NA NA 5 1
## 7 8 NA NA NA NA 4 2
## 8 9 NA NA NA NA 3 3
## 9 10 NA NA NA NA 2 4
## 10 11 NA NA NA NA 1 5
Note: The all.y
argument refers to which observations from the second dataframe (e4
) we want to preserve.
Of course, we can preserve both with either:
merge(e3, e4, by = "id", all.x = TRUE, all.y = TRUE)
## id Var1 Var2 Var3 Var4 Var5 Var6
## 1 1 1 4 8 11 10 5
## 2 2 2 5 7 10 9 4
## 3 3 3 6 9 12 NA NA
## 4 4 NA NA NA NA 8 3
## 5 5 NA NA NA NA 7 2
## 6 6 NA NA NA NA 6 1
## 7 7 NA NA NA NA 5 1
## 8 8 NA NA NA NA 4 2
## 9 9 NA NA NA NA 3 3
## 10 10 NA NA NA NA 2 4
## 11 11 NA NA NA NA 1 5
merge(e3, e4, by = "id", all = TRUE)
## id Var1 Var2 Var3 Var4 Var5 Var6
## 1 1 1 4 8 11 10 5
## 2 2 2 5 7 10 9 4
## 3 3 3 6 9 12 NA NA
## 4 4 NA NA NA NA 8 3
## 5 5 NA NA NA NA 7 2
## 6 6 NA NA NA NA 6 1
## 7 7 NA NA NA NA 5 1
## 8 8 NA NA NA NA 4 2
## 9 9 NA NA NA NA 3 3
## 10 10 NA NA NA NA 2 4
## 11 11 NA NA NA NA 1 5
These two R statements are equivalent.
Note: If we set by=NULL
, we get a potentially unexpected result:
merge(e3, e4, by = NULL)
## id.x Var1 Var2 Var3 Var4 Var5 Var6 id.y
## 1 1 1 4 8 11 10 5 1
## 2 2 2 5 7 10 10 5 1
## 3 3 3 6 9 12 10 5 1
## 4 1 1 4 8 11 9 4 2
## 5 2 2 5 7 10 9 4 2
## 6 3 3 6 9 12 9 4 2
## 7 1 1 4 8 11 8 3 4
## 8 2 2 5 7 10 8 3 4
## 9 3 3 6 9 12 8 3 4
## 10 1 1 4 8 11 7 2 5
## 11 2 2 5 7 10 7 2 5
## 12 3 3 6 9 12 7 2 5
## 13 1 1 4 8 11 6 1 6
## 14 2 2 5 7 10 6 1 6
## 15 3 3 6 9 12 6 1 6
## 16 1 1 4 8 11 5 1 7
## 17 2 2 5 7 10 5 1 7
## 18 3 3 6 9 12 5 1 7
## 19 1 1 4 8 11 4 2 8
## 20 2 2 5 7 10 4 2 8
## 21 3 3 6 9 12 4 2 8
## 22 1 1 4 8 11 3 3 9
## 23 2 2 5 7 10 3 3 9
## 24 3 3 6 9 12 3 3 9
## 25 1 1 4 8 11 2 4 10
## 26 2 2 5 7 10 2 4 10
## 27 3 3 6 9 12 2 4 10
## 28 1 1 4 8 11 1 5 11
## 29 2 2 5 7 10 1 5 11
## 30 3 3 6 9 12 1 5 11
If we leave by blank, the default is to merge based on the common variable names in both dataframes.
We can also separately specify by
for each dataframe:
merge(e3, e4, by.x = "id", by.y = "id")
## id Var1 Var2 Var3 Var4 Var5 Var6
## 1 1 1 4 8 11 10 5
## 2 2 2 5 7 10 9 4
This would be helpful if the identifier variable had a different name in each dataframe.
Note: merge
only works with two dataframes. So, if multiple dataframes need to be merged, it must be done sequentially:
merge(merge(e1, e2), e4)
## id Var1 Var2 Var3 Var4 Var5 Var6
## 1 1 1 4 8 11 10 5
## 2 2 2 5 7 10 9 4