Scale construction

One of the most common analytic tasks is creating variables. For example, we may have a variable that we need to use in an analysis, but we want it to have a mean of zero or to be confined to the interval [0,1]. Alternatively, we might have a large number of indicators that we need to aggregate into a single variable. When we used R as a calculator, we learned that R is “vectorized”. This means that when we apply an operator like addition (+) to two vectors, R adds their corresponding elements together. For example:

(1:3) + (10:12)

## [1] 11 13 15

This returns a three-element vector containing the sum of each pair of corresponding elements. We should also remember R's tendency to use “recycling”:

(1:3) + 10

## [1] 11 12 13

Here, the second vector only has one element, so R assumes that you want to add 10 to each element of the first vector (as opposed to adding 10 to the first element and nothing to the second and third elements). This is really helpful for preparing data vectors because it means we can use mathematical operators (addition, subtraction, multiplication, division, powers, logs, etc.) for their intuitive purposes when trying to create new variables rather than having to rely on obscure function names. But R also has a number of other functions for building variables.
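As a minimal sketch of the two transformations mentioned at the start (a mean of zero, or values confined to [0,1]), vectorized arithmetic and recycling are enough. The vector `x` and the names `x_centered` and `x_unit` are made up for illustration and are not part of the data used below:

```r
# Hypothetical example vector (not part of the tutorial's data)
x <- c(2, 4, 6, 8, 10)

# Center: subtract the mean from every element (the single mean is recycled)
x_centered <- x - mean(x)
x_centered
## [1] -4 -2  0  2  4

# Rescale to [0, 1]: subtract the minimum, then divide by the range
x_unit <- (x - min(x)) / (max(x) - min(x))
x_unit
## [1] 0.00 0.25 0.50 0.75 1.00
```

If `x` contained missing values, `mean`, `min`, and `max` would each need `na.rm = TRUE` for this to return anything other than NA.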

Let's examine all of these features using some made-up data. In this case, we'll create a dataframe of indicator variables (coded 0 and 1) and build them into various scales.

set.seed(1)
n <- 30
mydf <- data.frame(x1 = rbinom(n, 1, 0.5), x2 = rbinom(n, 1, 0.1), x3 = rbinom(n, 
    1, 0.5), x4 = rbinom(n, 1, 0.8), x5 = 1, x6 = sample(c(0, 1, NA), n, TRUE))

Let's use str and summary to get a quick sense of the data:

str(mydf)

## 'data.frame':    30 obs. of  6 variables:
##  $ x1: int  0 0 1 1 0 1 1 1 1 0 ...
##  $ x2: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ x3: int  1 0 0 0 1 0 0 1 0 1 ...
##  $ x4: int  1 1 1 0 1 1 1 1 0 1 ...
##  $ x5: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ x6: num  NA 1 1 0 NA 1 1 0 0 1 ...

summary(mydf)

##        x1              x2          x3            x4              x5   
##  Min.   :0.000   Min.   :0   Min.   :0.0   Min.   :0.000   Min.   :1  
##  1st Qu.:0.000   1st Qu.:0   1st Qu.:0.0   1st Qu.:1.000   1st Qu.:1  
##  Median :0.000   Median :0   Median :0.0   Median :1.000   Median :1  
##  Mean   :0.467   Mean   :0   Mean   :0.4   Mean   :0.833   Mean   :1  
##  3rd Qu.:1.000   3rd Qu.:0   3rd Qu.:1.0   3rd Qu.:1.000   3rd Qu.:1  
##  Max.   :1.000   Max.   :0   Max.   :1.0   Max.   :1.000   Max.   :1  
##                                                                       
##        x6       
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :1.000  
##  Mean   :0.591  
##  3rd Qu.:1.000  
##  Max.   :1.000  
##  NA's   :8

All variables are coded 0 or 1, x2 happens to be all 0's in this sample, x5 is all 1's, and x6 contains some missing data (NA) values.

Simple scaling

The easiest scales are those that add or subtract variables. Let's try that quickly:

mydf$x1 + mydf$x2

##  [1] 0 0 1 1 0 1 1 1 1 0 0 0 1 0 1 0 1 1 0 1 1 0 1 0 0 0 0 0 1 0

mydf$x1 + mydf$x2 + mydf$x3

##  [1] 1 0 1 1 1 1 1 2 1 1 0 1 1 0 1 1 2 1 1 2 1 1 1 0 1 0 1 0 1 0

mydf$x1 + mydf$x2 - mydf$x3

##  [1] -1  0  1  1 -1  1  1  0  1 -1  0 -1  1  0  1 -1  0  1 -1  0  1 -1  1
## [24]  0 -1  0 -1  0  1  0

One way to save some typing is to use the with function, which simply tells R which dataframe to look in for variables:

with(mydf, x1 + x2 - x3)

##  [1] -1  0  1  1 -1  1  1  0  1 -1  0 -1  1  0  1 -1  0  1 -1  0  1 -1  1
## [24]  0 -1  0 -1  0  1  0

A faster way to take a rowsum is to use rowSums:

rowSums(mydf)

##  [1] NA  3  4  2 NA  4  4  4  2  4  3  3  3  2 NA  4  5  4 NA  5 NA  4  3
## [24]  2 NA  3  3 NA  3 NA

Because we have missing data, any row that contains an NA results in a sum of NA. We could either skip that column:

rowSums(mydf[, 1:5])

##  [1] 3 2 3 2 3 3 3 4 2 3 2 3 3 1 3 3 4 3 2 4 2 3 3 2 3 2 3 2 3 2

or use the na.rm=TRUE argument to skip NA values when calculating the sum:

rowSums(mydf, na.rm = TRUE)

##  [1] 3 3 4 2 3 4 4 4 2 4 3 3 3 2 3 4 5 4 2 5 2 4 3 2 3 3 3 2 3 2

or we could look at a reduced dataset, eliminating all rows from the result that have a missing value:

rowSums(na.omit(mydf))

##  2  3  4  6  7  8  9 10 11 12 13 14 16 17 18 20 22 23 24 26 27 29 
##  3  4  2  4  4  4  2  4  3  3  3  2  4  5  4  5  4  3  2  3  3  3

but this last option can create problems if we try to store the result back into our original data (since it has fewer elements than the original dataframe has rows).
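One way to keep the result aligned with the original rows is to leave NA in place for incomplete rows rather than dropping them. The sketch below uses a small made-up data frame `d` (its column names and the new columns `total` and `total2` are hypothetical) and shows two ways of doing this:

```r
# Hypothetical three-row data frame with one missing value
d <- data.frame(a = c(1, 0, 1), b = c(1, NA, 0))

# rowSums() without na.rm returns NA for incomplete rows, so the result
# has one element per row and can be stored directly
d$total <- rowSums(d[, c("a", "b")])

# Same idea with explicit indexing: fill in only the complete rows
ok <- complete.cases(d[, c("a", "b")])
d$total2 <- NA_real_
d$total2[ok] <- rowSums(d[ok, c("a", "b")])

d
##   a  b total total2
## 1 1  1     2      2
## 2 0 NA    NA     NA
## 3 1  0     1      1
```

The second form is more verbose, but it generalizes to calculations that have no built-in way of returning NA for incomplete rows.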

We can also multiply (or divide) across variables. For these indicator variables, multiplication applies AND logic, so the product is 1 only when all of the variables are 1:

with(mydf, x3 * x4 * x5)

##  [1] 1 0 0 0 1 0 0 1 0 1 0 1 0 0 0 1 1 0 0 1 0 1 0 0 1 0 1 0 0 0
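The complementary OR logic (1 if at least one indicator is 1) can be sketched in a similar way. The vectors `a` and `b` below are made up for illustration, and `pmax` is only one of several ways to do this:

```r
# Hypothetical indicator vectors, coded 0/1
a <- c(1, 0, 0, 1)
b <- c(1, 1, 0, 0)

and_ab <- a * b              # AND: 1 only where both are 1
and_ab
## [1] 1 0 0 0

or_ab <- pmax(a, b)          # OR: 1 where at least one is 1
or_ab
## [1] 1 1 0 1

or_ab2 <- as.numeric(a | b)  # the same OR result via the logical operator
or_ab2
## [1] 1 1 0 1
```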

We might also want to take an average value across all the columns, which we could do by hand:

with(mydf, x1 + x2 + x3 + x4 + x5 + x6)/6

##  [1]     NA 0.5000 0.6667 0.3333     NA 0.6667 0.6667 0.6667 0.3333 0.6667
## [11] 0.5000 0.5000 0.5000 0.3333     NA 0.6667 0.8333 0.6667     NA 0.8333
## [21]     NA 0.6667 0.5000 0.3333     NA 0.5000 0.5000     NA 0.5000     NA

or use the rowSums function from earlier:

rowSums(mydf)/6

##  [1]     NA 0.5000 0.6667 0.3333     NA 0.6667 0.6667 0.6667 0.3333 0.6667
## [11] 0.5000 0.5000 0.5000 0.3333     NA 0.6667 0.8333 0.6667     NA 0.8333
## [21]     NA 0.6667 0.5000 0.3333     NA 0.5000 0.5000     NA 0.5000     NA

or use the even simpler rowMeans function:

rowMeans(mydf)

##  [1]     NA 0.5000 0.6667 0.3333     NA 0.6667 0.6667 0.6667 0.3333 0.6667
## [11] 0.5000 0.5000 0.5000 0.3333     NA 0.6667 0.8333 0.6667     NA 0.8333
## [21]     NA 0.6667 0.5000 0.3333     NA 0.5000 0.5000     NA 0.5000     NA
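As with rowSums, rowMeans accepts na.rm = TRUE. In that case each row's mean is taken over its non-missing values only, so the denominator can differ from row to row. A small sketch with a made-up data frame `d` (all names here are hypothetical):

```r
# Hypothetical data frame with scattered missing values
d <- data.frame(a = c(1, 0, 1), b = c(1, NA, 0), c = c(1, 1, NA))

m_all <- rowMeans(d)                 # NA for any row with a missing value
m_all
## [1]  1 NA NA

m_obs <- rowMeans(d, na.rm = TRUE)   # mean of the observed values in each row
m_obs
## [1] 1.0 0.5 0.5

# Number of non-missing items per row, i.e. the denominator used above
n_obs <- rowSums(!is.na(d))
```

Whether averaging over a varying number of items is appropriate depends on the scale, so `n_obs` is worth inspecting before relying on `m_obs`.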

If we want to calculate some other kind of function, like the variance, we can use the apply function:

apply(mydf, 1, var)  # the `1` refers to rows

##  [1]     NA 0.3000 0.2667 0.2667     NA 0.2667 0.2667 0.2667 0.2667 0.2667
## [11] 0.3000 0.3000 0.3000 0.2667     NA 0.2667 0.1667 0.2667     NA 0.1667
## [21]     NA 0.2667 0.3000 0.2667     NA 0.3000 0.3000     NA 0.3000     NA

We can also make calculations for columns (though this is less common in rectangular data unless we're trying to create summary statistics):

colSums(mydf)

## x1 x2 x3 x4 x5 x6 
## 14  0 12 25 30 NA

colMeans(mydf)

##     x1     x2     x3     x4     x5     x6 
## 0.4667 0.0000 0.4000 0.8333 1.0000     NA

apply(mydf, 2, var)  # the `2` refers to columns

##     x1     x2     x3     x4     x5     x6 
## 0.2575 0.0000 0.2483 0.1437 0.0000     NA

sapply(mydf, var)  # another way to apply a function to columns

##     x1     x2     x3     x4     x5     x6 
## 0.2575 0.0000 0.2483 0.1437 0.0000     NA
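Both apply and sapply pass any additional arguments on to the function being applied, which is one way to deal with the NA results above. The sketch below uses a made-up data frame `d`, and the result names are hypothetical:

```r
# Hypothetical data frame with one missing value in column b
d <- data.frame(a = c(1, 0, 1, 0), b = c(1, NA, 0, 0), c = c(0, 1, 1, 0))

# Without na.rm, the variance of column b is NA
col_var <- sapply(d, var)

# Extra arguments after the function name are handed to var()
col_var_narm <- sapply(d, var, na.rm = TRUE)

# The same works for rows with apply(); row 2 then uses its two observed values
row_var_narm <- apply(d, 1, var, na.rm = TRUE)
```

Here `col_var_narm` has no missing entries, while `col_var` is NA for column `b`.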

Using indexing in building scales

Sometimes we need to build a scale with a different formula for different subsets of a dataset. For example, we might want to calculate a scale one way for men and a different way for women. We can use indexing to achieve this. We start by creating a placeholder variable (filled with zeros) with the right number of elements (i.e., the number of rows in our dataframe):

newvar <- numeric(nrow(mydf))

Then we can store values into this conditional on a variable from our dataframe:

newvar[mydf$x1 == 1] <- with(mydf[mydf$x1 == 1, ], x2 + x3)
newvar[mydf$x1 == 0] <- with(mydf[mydf$x1 == 0, ], x3 + x4 + x5)

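The key to making that work is using the same index on the new variable as on the original data. Doing otherwise produces a warning about mismatched lengths, and R still carries out the assignment using only the first values of the replacement, so the stored values no longer line up with the intended rows: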

newvar[mydf$x1 == 1] <- with(mydf, x2 + x3)

## Warning: number of items to replace is not a multiple of replacement
## length
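An alternative that avoids matching indices by hand is ifelse, which evaluates both formulas for every row and picks one according to a condition. This is a sketch with a made-up data frame `d` and a hypothetical grouping variable `g`, not the approach used above:

```r
# Hypothetical data: g is the grouping indicator, x2-x4 are 0/1 items
d <- data.frame(g  = c(1, 0, 1, 0),
                x2 = c(0, 1, 1, 0),
                x3 = c(1, 1, 1, 0),
                x4 = c(1, 0, 1, 1))

# Indexing approach: the same index on both sides of each assignment
by_index <- numeric(nrow(d))
by_index[d$g == 1] <- with(d[d$g == 1, ], x2 + x3)
by_index[d$g == 0] <- with(d[d$g == 0, ], x3 + x4)

# ifelse approach: one expression, always as long as the data
by_ifelse <- with(d, ifelse(g == 1, x2 + x3, x3 + x4))

by_index
## [1] 1 1 2 1
identical(by_index, by_ifelse)
## [1] TRUE
```

One difference worth keeping in mind is that ifelse returns NA wherever the condition itself is NA, whereas the indexing approach needs the grouping variable to be free of missing values.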