One of the most common analytic tasks is creating variables. For example, we have some variable that we need to use in the analysis, but we want it to have a mean of zero or be confined to [0,1]. Alternatively, we might have a large number of indicators that we need to aggregate into a single variable.
When we used R as a calculator, we learned that R is “vectorized”. This means that when we call a function like add (+), it adds each respective element of two vectors together. For example:
(1:3) + (10:12)
## [1] 11 13 15
This returns a three-element vector in which each element is the sum of the corresponding elements of the two vectors. We should also remember R's tendency to use “recycling”:
(1:3) + 10
## [1] 11 12 13
Here, the second vector only has one element, so R assumes that you want to add 10 to each element of the first vector (as opposed to adding 10 to the first element and nothing to the second and third elements). This is really helpful for preparing data vectors because it means we can use mathematical operators (addition, subtraction, multiplication, division, powers, logs, etc.) for their intuitive purposes when trying to create new variables rather than having to rely on obscure function names. But R also has a number of other functions for building variables.
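As a minimal sketch of how vectorized arithmetic and recycling give us a variable with a mean of zero or one confined to [0,1], consider the following. The vector x and its values are hypothetical and chosen only for illustration.

```r
# Hypothetical example values; any numeric vector works the same way
x <- c(2, 4, 6, 8, 10)

# Center: the single mean value is recycled and subtracted from every element
centered <- x - mean(x)
centered   # -4 -2 0 2 4

# Rescale to [0, 1]: shift by the minimum, then divide by the range
rescaled <- (x - min(x)) / (max(x) - min(x))
rescaled   # 0.00 0.25 0.50 0.75 1.00
```

Note that the rescaling step divides by zero if every value of x is identical, so a constant variable needs separate handling.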
Let's examine all of these features using some made-up data. In this case, we'll create a dataframe of indicator variables (coded 0 and 1) and build them into various scales.
set.seed(1)
n <- 30
mydf <- data.frame(
  x1 = rbinom(n, 1, 0.5),
  x2 = rbinom(n, 1, 0.1),
  x3 = rbinom(n, 1, 0.5),
  x4 = rbinom(n, 1, 0.8),
  x5 = 1,
  x6 = sample(c(0, 1, NA), n, TRUE)
)
Let's use str and summary to get a quick sense of the data:
str(mydf)
## 'data.frame': 30 obs. of 6 variables:
## $ x1: int 0 0 1 1 0 1 1 1 1 0 ...
## $ x2: int 0 0 0 0 0 0 0 0 0 0 ...
## $ x3: int 1 0 0 0 1 0 0 1 0 1 ...
## $ x4: int 1 1 1 0 1 1 1 1 0 1 ...
## $ x5: num 1 1 1 1 1 1 1 1 1 1 ...
## $ x6: num NA 1 1 0 NA 1 1 0 0 1 ...
summary(mydf)
## x1 x2 x3 x4 x5
## Min. :0.000 Min. :0 Min. :0.0 Min. :0.000 Min. :1
## 1st Qu.:0.000 1st Qu.:0 1st Qu.:0.0 1st Qu.:1.000 1st Qu.:1
## Median :0.000 Median :0 Median :0.0 Median :1.000 Median :1
## Mean :0.467 Mean :0 Mean :0.4 Mean :0.833 Mean :1
## 3rd Qu.:1.000 3rd Qu.:0 3rd Qu.:1.0 3rd Qu.:1.000 3rd Qu.:1
## Max. :1.000 Max. :0 Max. :1.0 Max. :1.000 Max. :1
##
## x6
## Min. :0.000
## 1st Qu.:0.000
## Median :1.000
## Mean :0.591
## 3rd Qu.:1.000
## Max. :1.000
## NA's :8
All variables are coded 0 or 1, x5 is all 1's, and x6 contains some missing (NA) values.
The easiest scales are those that add or subtract variables. Let's try that quickly:
mydf$x1 + mydf$x2
## [1] 0 0 1 1 0 1 1 1 1 0 0 0 1 0 1 0 1 1 0 1 1 0 1 0 0 0 0 0 1 0
mydf$x1 + mydf$x2 + mydf$x3
## [1] 1 0 1 1 1 1 1 2 1 1 0 1 1 0 1 1 2 1 1 2 1 1 1 0 1 0 1 0 1 0
mydf$x1 + mydf$x2 - mydf$x3
## [1] -1 0 1 1 -1 1 1 0 1 -1 0 -1 1 0 1 -1 0 1 -1 0 1 -1 1
## [24] 0 -1 0 -1 0 1 0
One way to save some typing is to use the with function, which simply tells R which dataframe to look in for variables:
with(mydf, x1 + x2 - x3)
## [1] -1 0 1 1 -1 1 1 0 1 -1 0 -1 1 0 1 -1 0 1 -1 0 1 -1 1
## [24] 0 -1 0 -1 0 1 0
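The calls above only print their results. A minimal sketch of keeping a scale as a new column might look like the following; the dataframe toy and the column name scale are hypothetical names used only for illustration.

```r
# Hypothetical three-row dataframe of indicators
toy <- data.frame(x1 = c(1, 0, 1), x2 = c(0, 0, 1), x3 = c(1, 1, 0))

# with() returns one value per row, so the result fits into a new column
toy$scale <- with(toy, x1 + x2 - x3)
toy$scale   # 0 -1 2
```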
A faster way to take a row sum is to use rowSums:
rowSums(mydf)
## [1] NA 3 4 2 NA 4 4 4 2 4 3 3 3 2 NA 4 5 4 NA 5 NA 4 3
## [24] 2 NA 3 3 NA 3 NA
Because we have missing data, any row that contains an NA results in a sum of NA. We could either skip that column:
rowSums(mydf[, 1:5])
## [1] 3 2 3 2 3 3 3 4 2 3 2 3 3 1 3 3 4 3 2 4 2 3 3 2 3 2 3 2 3 2
or use the na.rm=TRUE argument to skip NA values when calculating the sum:
rowSums(mydf, na.rm = TRUE)
## [1] 3 3 4 2 3 4 4 4 2 4 3 3 3 2 3 4 5 4 2 5 2 4 3 2 3 3 3 2 3 2
or we could look at a reduced dataset, eliminating all rows from the result that have a missing value:
rowSums(na.omit(mydf))
## 2 3 4 6 7 8 9 10 11 12 13 14 16 17 18 20 22 23 24 26 27 29
## 3 4 2 4 4 4 2 4 3 3 3 2 4 5 4 5 4 3 2 3 3 3
but this last option can create problems if we try to store the result back into our original data (since it has fewer elements than the original dataframe has rows).
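One way around the length mismatch, sketched here under the assumption that incomplete rows should simply stay missing, is to index the complete rows explicitly with complete.cases. The dataframe toy and the column name total are hypothetical.

```r
# Hypothetical dataframe with one incomplete row
toy <- data.frame(a = c(1, 0, 1), b = c(NA, 1, 1))

# TRUE for rows without any missing values: FALSE TRUE TRUE
ok <- complete.cases(toy)

# Start with a full-length column of NA, then fill only the complete rows,
# using the same index on both sides of the assignment
toy$total <- NA_real_
toy$total[ok] <- rowSums(toy[ok, c("a", "b")])
toy$total   # NA 1 2
```

Because the new column always has one element per row, it can be stored in the original dataframe, whereas the shorter result of rowSums(na.omit(toy)) cannot.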
We can also multiply (or divide) across variables. For these indicator variables, that applies an AND logic to tell us if all of the variables are 1:
with(mydf, x3 * x4 * x5)
## [1] 1 0 0 0 1 0 0 1 0 1 0 1 0 0 0 1 1 0 0 1 0 1 0 0 1 0 1 0 0 0
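To make the AND logic concrete, here is a small sketch with two hypothetical indicator vectors, a and b. It also shows the elementwise maximum as one possible way to express the complementary OR logic, which is an addition for illustration and not something used elsewhere in this tutorial.

```r
# Hypothetical indicator vectors (not taken from mydf)
a <- c(1, 0, 1, 0)
b <- c(1, 1, 0, 0)

# AND: the product is 1 only where both indicators are 1
both <- a * b                                 # 1 0 0 0

# The same result written as an explicit logical test
both_logical <- as.numeric(a == 1 & b == 1)   # 1 0 0 0

# OR: the elementwise maximum is 1 where at least one indicator is 1
either <- pmax(a, b)                          # 1 1 1 0
```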
We might also want to take an average value across all the columns, which we could do by hand:
with(mydf, x1 + x2 + x3 + x4 + x5 + x6)/6
## [1] NA 0.5000 0.6667 0.3333 NA 0.6667 0.6667 0.6667 0.3333 0.6667
## [11] 0.5000 0.5000 0.5000 0.3333 NA 0.6667 0.8333 0.6667 NA 0.8333
## [21] NA 0.6667 0.5000 0.3333 NA 0.5000 0.5000 NA 0.5000 NA
or use the rowSums function from earlier:
rowSums(mydf)/6
## [1] NA 0.5000 0.6667 0.3333 NA 0.6667 0.6667 0.6667 0.3333 0.6667
## [11] 0.5000 0.5000 0.5000 0.3333 NA 0.6667 0.8333 0.6667 NA 0.8333
## [21] NA 0.6667 0.5000 0.3333 NA 0.5000 0.5000 NA 0.5000 NA
or use the even simpler rowMeans function:
rowMeans(mydf)
## [1] NA 0.5000 0.6667 0.3333 NA 0.6667 0.6667 0.6667 0.3333 0.6667
## [11] 0.5000 0.5000 0.5000 0.3333 NA 0.6667 0.8333 0.6667 NA 0.8333
## [21] NA 0.6667 0.5000 0.3333 NA 0.5000 0.5000 NA 0.5000 NA
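All three versions of the average return NA for any row with a missing value. If averaging over only the available items is acceptable for the analysis (an assumption that depends on the scale), rowMeans also accepts na.rm = TRUE. The sketch below uses a hypothetical dataframe toy to show how that differs from dividing a sum by a fixed number of columns.

```r
# Hypothetical dataframe with missing values in two rows
toy <- data.frame(a = c(1, 0, 1), b = c(NA, 1, 1), c = c(1, 1, NA))

# Default: any NA in a row makes the row mean NA
strict <- rowMeans(toy)                     # NA 0.667 NA

# na.rm = TRUE averages over the non-missing items in each row
available <- rowMeans(toy, na.rm = TRUE)    # 1 0.667 1

# Dividing an NA-skipping sum by the column count treats each NA as a 0
fixed_denominator <- rowSums(toy, na.rm = TRUE) / ncol(toy)   # 0.667 0.667 0.667
```

The last two results differ in the rows with missing values, so the choice of denominator matters whenever data are incomplete.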
If we want to calculate some other kind of function, like the variance, we can use the apply function:
apply(mydf, 1, var) # the `1` refers to rows
## [1] NA 0.3000 0.2667 0.2667 NA 0.2667 0.2667 0.2667 0.2667 0.2667
## [11] 0.3000 0.3000 0.3000 0.2667 NA 0.2667 0.1667 0.2667 NA 0.1667
## [21] NA 0.2667 0.3000 0.2667 NA 0.3000 0.3000 NA 0.3000 NA
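As with the row sums, rows containing an NA come back as NA. Extra arguments given to apply are passed on to the function being applied, so na.rm = TRUE can be handed to var. This is a sketch using a hypothetical dataframe toy.

```r
# Hypothetical dataframe with missing values in two rows
toy <- data.frame(a = c(1, 0, 1), b = c(NA, 1, 1), c = c(1, 1, NA))

# Default: any NA in a row makes that row's variance NA
row_var <- apply(toy, 1, var)

# na.rm = TRUE is passed through apply() to var()
row_var_narm <- apply(toy, 1, var, na.rm = TRUE)
row_var_narm   # 0 0.3333 0
```

A row needs at least two non-missing values for its variance to be defined, so very sparse rows would still return NA.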
We can also make calculations for columns (though this is less common in rectangular data unless we're trying to create summary statistics):
colSums(mydf)
## x1 x2 x3 x4 x5 x6
## 14 0 12 25 30 NA
colMeans(mydf)
## x1 x2 x3 x4 x5 x6
## 0.4667 0.0000 0.4000 0.8333 1.0000 NA
apply(mydf, 2, var) # the `2` refers to columns
## x1 x2 x3 x4 x5 x6
## 0.2575 0.0000 0.2483 0.1437 0.0000 NA
sapply(mydf, var) # another way to apply a function to columns
## x1 x2 x3 x4 x5 x6
## 0.2575 0.0000 0.2483 0.1437 0.0000 NA
Sometimes we need to build a scale with a different formula for different subsets of a dataset. For example, we might want to calculate a scale one way for men and a different way for women. We can use indexing to achieve this. We can start by creating a placeholder variable (filled with zeros) that has the right number of elements (i.e., the number of rows in our dataframe):
newvar <- numeric(nrow(mydf))
Then we can store values into this conditional on a variable from our dataframe:
newvar[mydf$x1 == 1] <- with(mydf[mydf$x1 == 1, ], x2 + x3)
newvar[mydf$x1 == 0] <- with(mydf[mydf$x1 == 0, ], x3 + x4 + x5)
The key to making that work is using the same index on the new variable as on the original data. Doing otherwise produces a warning about mismatched lengths and, worse, stores values taken from the wrong rows:
newvar[mydf$x1 == 1] <- with(mydf, x2 + x3)
## Warning: number of items to replace is not a multiple of replacement
## length
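The two indexed assignments above can also be written as a single vectorized expression with ifelse, which picks one formula or the other element by element and so avoids the mismatched-index problem. This is an alternative sketch, not the approach used above; the dataframe toy is hypothetical.

```r
# Hypothetical dataframe with the same column names as mydf
toy <- data.frame(
  x1 = c(1, 0, 1, 0),
  x2 = c(0, 1, 1, 0),
  x3 = c(1, 1, 0, 0),
  x4 = c(1, 0, 1, 1),
  x5 = 1
)

# Indexing approach: the same logical index appears on both sides
newvar <- numeric(nrow(toy))
newvar[toy$x1 == 1] <- with(toy[toy$x1 == 1, ], x2 + x3)
newvar[toy$x1 == 0] <- with(toy[toy$x1 == 0, ], x3 + x4 + x5)

# ifelse approach: both formulas are evaluated for every row,
# and the test decides which result is kept
newvar2 <- with(toy, ifelse(x1 == 1, x2 + x3, x3 + x4 + x5))

newvar             # 1 2 1 2
all.equal(newvar, newvar2)
```

One difference to keep in mind: ifelse computes both formulas for all rows before selecting, which is fine for simple arithmetic but can matter if one formula is expensive or invalid for some rows.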