Dataframe rearrangement

In addition to knowing how to index and view dataframes, as is discussed in other tutorials, it is also helpful to be able to adjust the arrangement of dataframes. By this I mean that it is sometimes helpful to split, sample, reorder, reshape, or otherwise change the organization of a dataframe. This tutorial explains a couple of functions that can help with these kinds of tasks. Note: One of the most important things to remember about R dataframes is that it rarely, if ever, matters in what order observations or variables appear in a dataframe. Whereas in SPSS and SAS observations have to be sorted before performing certain operations, R does not require such sorting.

Column order

Sometimes we want to get dataframe columns in a different order from how they're read into the data. In most cases, though, we can just index the dataframe to see the relevant columns rather than reordering, but we can do the reordering if we want. Say we have the following 5-column dataframe:

set.seed(50)
mydf <- data.frame(a = rep(1:2, each = 10), b = rep(1:4, times = 5), c = rnorm(20), 
    d = rnorm(20), e = sample(1:20, 20, FALSE))
head(mydf)
##   a b       c       d  e
## 1 1 1  0.5497 -0.3499 11
## 2 1 2 -0.8416 -0.5869  1
## 3 1 3  0.0330 -1.5899  7
## 4 1 4  0.5241  1.6896  8
## 5 1 1 -1.7276  0.5636  9
## 6 1 2 -0.2779  2.6676 19

To view the columns in a different order, we can simply index the dataframe differently either by name or column position:

head(mydf[, c(3, 4, 5, 1, 2)])
##         c       d  e a b
## 1  0.5497 -0.3499 11 1 1
## 2 -0.8416 -0.5869  1 1 2
## 3  0.0330 -1.5899  7 1 3
## 4  0.5241  1.6896  8 1 4
## 5 -1.7276  0.5636  9 1 1
## 6 -0.2779  2.6676 19 1 2
head(mydf[, c("c", "d", "e", "a", "b")])
##         c       d  e a b
## 1  0.5497 -0.3499 11 1 1
## 2 -0.8416 -0.5869  1 1 2
## 3  0.0330 -1.5899  7 1 3
## 4  0.5241  1.6896  8 1 4
## 5 -1.7276  0.5636  9 1 1
## 6 -0.2779  2.6676 19 1 2

We can save the adjusted column order if we want:

mydf <- mydf[, c(3, 4, 5, 1, 2)]
head(mydf)
##         c       d  e a b
## 1  0.5497 -0.3499 11 1 1
## 2 -0.8416 -0.5869  1 1 2
## 3  0.0330 -1.5899  7 1 3
## 4  0.5241  1.6896  8 1 4
## 5 -1.7276  0.5636  9 1 1
## 6 -0.2779  2.6676 19 1 2
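When a dataframe has many columns, writing out the full ordering by hand gets tedious. A minimal sketch of reordering programmatically, using a small made-up dataframe (not mydf), might look like this:

```r
# A small example dataframe with columns deliberately out of order:
df <- data.frame(b = 1:3, c = 4:6, a = 7:9)

# Put columns in alphabetical order by sorting the names:
df_sorted <- df[, sort(names(df))]
names(df_sorted)  # "a" "b" "c"

# Move one column (here "a") to the front, keeping the rest in place:
df_front <- df[, c("a", setdiff(names(df), "a"))]
names(df_front)  # "a" "b" "c"
```

The `setdiff` trick is handy because it never requires listing every other column by name.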

Row order

Changing row order works the same way as changing column order: we simply index the dataframe in a different way. For example, to reverse the order of the dataframe, we can write:

mydf[nrow(mydf):1, ]
##          c        d  e a b
## 20 -0.3234  0.39322  5 2 4
## 19 -1.1660  0.40619  4 2 3
## 18 -0.7653  1.83968  3 2 2
## 17 -0.1568  0.41620 20 2 1
## 16 -0.3629  0.01910  2 2 4
## 15 -0.4555 -1.09605 14 2 3
## 14  0.1957  0.59725  6 2 2
## 13 -0.4986 -1.13045 13 2 1
## 12  0.5548 -0.85142 10 2 4
## 11  0.2952  0.19902 12 2 3
## 10 -1.4457  0.02867 17 1 2
## 9   0.9756  0.56875 18 1 1
## 8  -0.5909 -0.36212 16 1 4
## 7   0.3608  0.35653 15 1 3
## 6  -0.2779  2.66763 19 1 2
## 5  -1.7276  0.56358  9 1 1
## 4   0.5241  1.68956  8 1 4
## 3   0.0330 -1.58988  7 1 3
## 2  -0.8416 -0.58690  1 1 2
## 1   0.5497 -0.34993 11 1 1

And then we can save this new order if we want:

mydf <- mydf[nrow(mydf):1, ]

Rarely, however, do we want to just reorder by hand. Instead we might want to reorder according to the values of a column. One's intuition might be to use the sort function, since it sorts a vector:

mydf$e
##  [1]  5  4  3 20  2 14  6 13 10 12 17 18 16 15 19  9  8  7  1 11
sort(mydf$e)
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

But trying to run sort(mydf) will produce an error. Instead, we need to use the order function, which returns the indices that would put a vector in sorted order. Confusing? Let's see how it works:

order(mydf$e)
##  [1] 19  5  3  2  1  7 18 17 16  9 20 10  8  6 14 13 11 12 15  4

That doesn't look like a sorted vector, but this is because the values being shown are the indices of the vector, not the values themselves. If we index the mydf$e vector by the output of order(mydf$e), it will be in the order we're expecting:

mydf$e[order(mydf$e)]
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

We can apply this same logic to sorting a dataframe. We simply pick which column we want to order by and then use the output of order as a row index. Let's compare the reordered dataframe to the original:

head(mydf[order(mydf$e), ])
##          c       d e a b
## 2  -0.8416 -0.5869 1 1 2
## 16 -0.3629  0.0191 2 2 4
## 18 -0.7653  1.8397 3 2 2
## 19 -1.1660  0.4062 4 2 3
## 20 -0.3234  0.3932 5 2 4
## 14  0.1957  0.5972 6 2 2
head(mydf)  # original
##          c       d  e a b
## 20 -0.3234  0.3932  5 2 4
## 19 -1.1660  0.4062  4 2 3
## 18 -0.7653  1.8397  3 2 2
## 17 -0.1568  0.4162 20 2 1
## 16 -0.3629  0.0191  2 2 4
## 15 -0.4555 -1.0960 14 2 3

Of course, we could save the reordered dataframe just as above:

mydf <- mydf[order(mydf$e), ]
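order also accepts multiple sort keys and a decreasing argument, which the single-column example above doesn't show. A sketch on a small made-up dataframe (not mydf):

```r
df <- data.frame(g = c(2, 1, 2, 1), x = c(5, 3, 1, 4))

# Sort by g ascending, breaking ties by x descending
# (negating a numeric key reverses its direction):
by_g_then_x_desc <- df[order(df$g, -df$x), ]
by_g_then_x_desc$x  # 4 3 5 1

# Or sort a single key in decreasing order directly:
desc_x <- df[order(df$x, decreasing = TRUE), ]
desc_x$x  # 5 4 3 1
```

The negation trick only works for numeric keys; for factors or character vectors, sorting some keys ascending and others descending takes a bit more work (e.g., ordering by each key separately).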

Subset of rows

Another common operation is to look at a subset of dataframe rows. For example, we might want to look at just the rows where mydf$a==1. Remembering the rules for indexing a dataframe, we can simply index according to a logical rule:

mydf[mydf$a == 1, ]
##          c        d  e a b
## 2  -0.8416 -0.58690  1 1 2
## 3   0.0330 -1.58988  7 1 3
## 4   0.5241  1.68956  8 1 4
## 5  -1.7276  0.56358  9 1 1
## 1   0.5497 -0.34993 11 1 1
## 7   0.3608  0.35653 15 1 3
## 8  -0.5909 -0.36212 16 1 4
## 10 -1.4457  0.02867 17 1 2
## 9   0.9756  0.56875 18 1 1
## 6  -0.2779  2.66763 19 1 2

And to get the rows where mydf$a==2, we can do quite the same operation:

mydf[mydf$a == 2, ]
##          c       d  e a b
## 16 -0.3629  0.0191  2 2 4
## 18 -0.7653  1.8397  3 2 2
## 19 -1.1660  0.4062  4 2 3
## 20 -0.3234  0.3932  5 2 4
## 14  0.1957  0.5972  6 2 2
## 12  0.5548 -0.8514 10 2 4
## 11  0.2952  0.1990 12 2 3
## 13 -0.4986 -1.1304 13 2 1
## 15 -0.4555 -1.0960 14 2 3
## 17 -0.1568  0.4162 20 2 1

We can also combine logical rules to get a further subset of values:

mydf[mydf$a == 1 & mydf$b == 4, ]
##         c       d  e a b
## 4  0.5241  1.6896  8 1 4
## 8 -0.5909 -0.3621 16 1 4

And we need not restrict ourselves to equality comparisons:

mydf[mydf$a == 1 & mydf$b > 2, ]
##         c       d  e a b
## 3  0.0330 -1.5899  7 1 3
## 4  0.5241  1.6896  8 1 4
## 7  0.3608  0.3565 15 1 3
## 8 -0.5909 -0.3621 16 1 4
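Another useful operator for building logical indexes is %in%, which tests membership in a set of values. A sketch on a small made-up dataframe (not mydf):

```r
df <- data.frame(a = c(1, 1, 2, 2), b = c(1, 4, 2, 4))

# Rows where b is either 2 or 4:
in_set <- df[df$b %in% c(2, 4), ]

# Negate with ! to get the complement:
not_in_set <- df[!(df$b %in% c(2, 4)), ]
```

This is more compact (and less error-prone) than chaining many == comparisons with |.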

R also supplies a subset function, which can be used to select subsets of rows, subsets of columns, or both. It works like so:

# subset of rows:
subset(mydf, a == 1)
##          c        d  e a b
## 2  -0.8416 -0.58690  1 1 2
## 3   0.0330 -1.58988  7 1 3
## 4   0.5241  1.68956  8 1 4
## 5  -1.7276  0.56358  9 1 1
## 1   0.5497 -0.34993 11 1 1
## 7   0.3608  0.35653 15 1 3
## 8  -0.5909 -0.36212 16 1 4
## 10 -1.4457  0.02867 17 1 2
## 9   0.9756  0.56875 18 1 1
## 6  -0.2779  2.66763 19 1 2
subset(mydf, a == 1 & b > 2)
##         c       d  e a b
## 3  0.0330 -1.5899  7 1 3
## 4  0.5241  1.6896  8 1 4
## 7  0.3608  0.3565 15 1 3
## 8 -0.5909 -0.3621 16 1 4
# subset of columns:
subset(mydf, select = c("a", "b"))
##    a b
## 2  1 2
## 16 2 4
## 18 2 2
## 19 2 3
## 20 2 4
## 14 2 2
## 3  1 3
## 4  1 4
## 5  1 1
## 12 2 4
## 1  1 1
## 11 2 3
## 13 2 1
## 15 2 3
## 7  1 3
## 8  1 4
## 10 1 2
## 9  1 1
## 6  1 2
## 17 2 1
# subset of rows and columns:
subset(mydf, a == 1 & b > 2, select = c("c", "d"))
##         c       d
## 3  0.0330 -1.5899
## 4  0.5241  1.6896
## 7  0.3608  0.3565
## 8 -0.5909 -0.3621

Indexing and subset are equivalent for these tasks, but the indexing syntax is more general.
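One convenience subset offers that plain indexing doesn't: the select argument also accepts unquoted column names, including negative selection to drop columns. A sketch on a small made-up dataframe (not mydf):

```r
df <- data.frame(a = 1:3, b = 4:6, c = 7:9)

# Keep columns by unquoted name:
kept <- subset(df, select = c(a, c))

# Drop a column instead of listing the ones to keep:
dropped <- subset(df, select = -b)
```

Note that because subset evaluates its arguments nonstandardly, it is best suited to interactive use; inside functions, plain indexing is safer.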

Splitting a dataframe

In one of the above examples, we extracted two separate dataframes: one for mydf$a==1 and one for mydf$a==2. We can actually achieve that result using a single line of code involving the split function, which returns a list of dataframes, separated out by a grouping factor:

split(mydf, mydf$a)
## $`1`
##          c        d  e a b
## 2  -0.8416 -0.58690  1 1 2
## 3   0.0330 -1.58988  7 1 3
## 4   0.5241  1.68956  8 1 4
## 5  -1.7276  0.56358  9 1 1
## 1   0.5497 -0.34993 11 1 1
## 7   0.3608  0.35653 15 1 3
## 8  -0.5909 -0.36212 16 1 4
## 10 -1.4457  0.02867 17 1 2
## 9   0.9756  0.56875 18 1 1
## 6  -0.2779  2.66763 19 1 2
## 
## $`2`
##          c       d  e a b
## 16 -0.3629  0.0191  2 2 4
## 18 -0.7653  1.8397  3 2 2
## 19 -1.1660  0.4062  4 2 3
## 20 -0.3234  0.3932  5 2 4
## 14  0.1957  0.5972  6 2 2
## 12  0.5548 -0.8514 10 2 4
## 11  0.2952  0.1990 12 2 3
## 13 -0.4986 -1.1304 13 2 1
## 15 -0.4555 -1.0960 14 2 3
## 17 -0.1568  0.4162 20 2 1

We can also split by multiple factors, e.g., a dataframe for every unique combination of mydf$a and mydf$b:

split(mydf, list(mydf$a, mydf$b))
## $`1.1`
##         c       d  e a b
## 5 -1.7276  0.5636  9 1 1
## 1  0.5497 -0.3499 11 1 1
## 9  0.9756  0.5687 18 1 1
## 
## $`2.1`
##          c       d  e a b
## 13 -0.4986 -1.1304 13 2 1
## 17 -0.1568  0.4162 20 2 1
## 
## $`1.2`
##          c        d  e a b
## 2  -0.8416 -0.58690  1 1 2
## 10 -1.4457  0.02867 17 1 2
## 6  -0.2779  2.66763 19 1 2
## 
## $`2.2`
##          c      d e a b
## 18 -0.7653 1.8397 3 2 2
## 14  0.1957 0.5972 6 2 2
## 
## $`1.3`
##        c       d  e a b
## 3 0.0330 -1.5899  7 1 3
## 7 0.3608  0.3565 15 1 3
## 
## $`2.3`
##          c       d  e a b
## 19 -1.1660  0.4062  4 2 3
## 11  0.2952  0.1990 12 2 3
## 15 -0.4555 -1.0960 14 2 3
## 
## $`1.4`
##         c       d  e a b
## 4  0.5241  1.6896  8 1 4
## 8 -0.5909 -0.3621 16 1 4
## 
## $`2.4`
##          c       d  e a b
## 16 -0.3629  0.0191  2 2 4
## 20 -0.3234  0.3932  5 2 4
## 12  0.5548 -0.8514 10 2 4

Having our dataframes stored inside another object might seem inconvenient, but it actually is very useful because we can use functions like lapply to perform an operation on every dataframe in the list. For example, we could get the summary of every variable in each of two subsets of the dataframe in a single line of code:

lapply(split(mydf, mydf$a), summary)
## $`1`
##        c                d                e               a    
##  Min.   :-1.728   Min.   :-1.590   Min.   : 1.00   Min.   :1  
##  1st Qu.:-0.779   1st Qu.:-0.359   1st Qu.: 8.25   1st Qu.:1  
##  Median :-0.122   Median : 0.193   Median :13.00   Median :1  
##  Mean   :-0.244   Mean   : 0.299   Mean   :12.10   Mean   :1  
##  3rd Qu.: 0.483   3rd Qu.: 0.568   3rd Qu.:16.75   3rd Qu.:1  
##  Max.   : 0.976   Max.   : 2.668   Max.   :19.00   Max.   :1  
##        b       
##  Min.   :1.00  
##  1st Qu.:1.25  
##  Median :2.00  
##  Mean   :2.30  
##  3rd Qu.:3.00  
##  Max.   :4.00  
## 
## $`2`
##        c                d                 e               a    
##  Min.   :-1.166   Min.   :-1.1304   Min.   : 2.00   Min.   :2  
##  1st Qu.:-0.488   1st Qu.:-0.6338   1st Qu.: 4.25   1st Qu.:2  
##  Median :-0.343   Median : 0.2961   Median : 8.00   Median :2  
##  Mean   :-0.268   Mean   : 0.0793   Mean   : 8.90   Mean   :2  
##  3rd Qu.: 0.108   3rd Qu.: 0.4137   3rd Qu.:12.75   3rd Qu.:2  
##  Max.   : 0.555   Max.   : 1.8397   Max.   :20.00   Max.   :2  
##        b       
##  Min.   :1.00  
##  1st Qu.:2.00  
##  Median :3.00  
##  Mean   :2.70  
##  3rd Qu.:3.75  
##  Max.   :4.00
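When we only want a single statistic per group rather than a full summary, sapply over the split list (or tapply, which combines the two steps) is more compact. A sketch on a small made-up dataframe (not mydf):

```r
df <- data.frame(g = c(1, 1, 2, 2), x = c(2, 4, 6, 8))

# Mean of x within each level of g, via split + sapply:
group_means <- sapply(split(df$x, df$g), mean)
group_means  # named vector: 3 for group 1, 7 for group 2

# tapply does the split and apply in one call:
tapply(df$x, df$g, mean)
```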

Sampling and permutations

Another common task is random sampling or permutation of rows in a dataframe. For example, we might want to build a regression model on a random subset of cases (a “training set”) and then test the model on the remaining cases (a “test set”). Or, we might want to look at a random sample of the observations (perhaps to speed up a very time-consuming analysis). Let's consider the case of sampling for “training” and “test” sets. To obtain a random sample, we have two choices. We can either sample a specified number of rows or we can use a logical index to sample rows based on a specified probability. Both use the sample function. To look at, e.g., exactly five randomly selected rows from our dataframe as the training set, we can do the following:

s <- sample(1:nrow(mydf), 5, FALSE)
s
## [1] 11 19 12  1 17

Note: The third argument (FALSE) refers to whether sampling should be done with replacement. We can then use that directly as a row index:

mydf[s, ]
##          c        d  e a b
## 1   0.5497 -0.34993 11 1 1
## 6  -0.2779  2.66763 19 1 2
## 11  0.2952  0.19902 12 2 3
## 2  -0.8416 -0.58690  1 1 2
## 10 -1.4457  0.02867 17 1 2

To see the test set, we simply drop all rows not in s:

mydf[-s, ]
##          c       d  e a b
## 16 -0.3629  0.0191  2 2 4
## 18 -0.7653  1.8397  3 2 2
## 19 -1.1660  0.4062  4 2 3
## 20 -0.3234  0.3932  5 2 4
## 14  0.1957  0.5972  6 2 2
## 3   0.0330 -1.5899  7 1 3
## 4   0.5241  1.6896  8 1 4
## 5  -1.7276  0.5636  9 1 1
## 12  0.5548 -0.8514 10 2 4
## 13 -0.4986 -1.1304 13 2 1
## 15 -0.4555 -1.0960 14 2 3
## 7   0.3608  0.3565 15 1 3
## 8  -0.5909 -0.3621 16 1 4
## 9   0.9756  0.5687 18 1 1
## 17 -0.1568  0.4162 20 2 1

An alternative is to get a random 20% of the rows but not require that to be exactly five observations. To do that, we make 20 random draws (i.e., a number of draws equal to the number of rows in our dataframe) from a binomial distribution with probability .2:

s2 <- rbinom(nrow(mydf), 1, 0.2)
s2
##  [1] 1 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0

Note that s2 is a numeric vector of 0s and 1s, not a logical vector, so we cannot use it directly as a row index. Doing so treats the values as row positions: each 1 selects the first row (over and over), and each 0 selects nothing:

mydf[s2, ]
##           c       d e a b
## 2   -0.8416 -0.5869 1 1 2
## 2.1 -0.8416 -0.5869 1 1 2
## 2.2 -0.8416 -0.5869 1 1 2
## 2.3 -0.8416 -0.5869 1 1 2

Instead, we need to convert s2 to a logical index, for example with a comparison:

mydf[s2 == 1, ]
##          c       d  e a b
## 2  -0.8416 -0.5869  1 1 2
## 19 -1.1660  0.4062  4 2 3
## 14  0.1957  0.5972  6 2 2
## 7   0.3608  0.3565 15 1 3

And again see the test set as those observations not in s2.

mydf[!s2, ]
##          c        d  e a b
## 16 -0.3629  0.01910  2 2 4
## 18 -0.7653  1.83968  3 2 2
## 20 -0.3234  0.39322  5 2 4
## 3   0.0330 -1.58988  7 1 3
## 4   0.5241  1.68956  8 1 4
## 5  -1.7276  0.56358  9 1 1
## 12  0.5548 -0.85142 10 2 4
## 1   0.5497 -0.34993 11 1 1
## 11  0.2952  0.19902 12 2 3
## 13 -0.4986 -1.13045 13 2 1
## 15 -0.4555 -1.09605 14 2 3
## 8  -0.5909 -0.36212 16 1 4
## 10 -1.4457  0.02867 17 1 2
## 9   0.9756  0.56875 18 1 1
## 6  -0.2779  2.66763 19 1 2
## 17 -0.1568  0.41620 20 2 1

Note: Here we can use !s2 because the ! operator coerces the 0/1 vector to logical and negates it (0s become TRUE, 1s become FALSE), whereas above we used -s because s was a vector of row positions and negative indices drop those rows.
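Finally, a full permutation of the rows (a shuffle) is just sampling every row position without replacement: sample(n) with a single integer n returns a random permutation of 1:n. A sketch on a small made-up dataframe (not mydf):

```r
df <- data.frame(x = 1:5)
set.seed(1)  # for reproducibility

# sample(nrow(df)) is a random permutation of the row positions;
# drop = FALSE keeps the result a dataframe even with one column:
shuffled <- df[sample(nrow(df)), , drop = FALSE]

# Same rows, new order:
sort(shuffled$x)  # 1 2 3 4 5
```

Permuting rows like this is the basis of permutation tests and of shuffling data before cross-validation.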