R Objects and Environment

One of the most confusing aspects of R for users of other statistical software is that any number of objects can be available in the R environment at once. You are not constrained to a single rectangular dataset. The flip side is that it can be hard to see what data is actually loaded into memory at any point in time. Here we discuss some tools for understanding the R working environment.

Listing objects

Let's start by clearing our workspace:

# rm(list=ls())

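The command is commented out here so that it is not run by accident; remove the # to execute it. In RGui the same action should be available from the menu under Misc > Remove all objects. Then create some R objects: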

set.seed(1)
x <- rbinom(50, 1, 0.5)
y <- ifelse(x == 1, rnorm(sum(x == 1), 1, 1), rnorm(sum(x != 1), 2, 1))
mydf <- data.frame(x = x, y = y)

Once we have a number of objects stored in memory, we can look at all of them using ls:

ls()
##   [1] "a"             "allout"        "amat"          "b"            
##   [5] "between"       "bmat"          "c"             "change"       
##   [9] "cmat"          "coef.mi"       "coefs.amelia"  "d"            
##  [13] "d2"            "df1"           "df2"           "e"            
##  [17] "e1"            "e2"            "e3"            "e4"           
##  [21] "englebert"     "f"             "FUN"           "g1"           
##  [25] "g2"            "grandm"        "grandse"       "grandvar"     
##  [29] "height"        "imp"           "imp.amelia"    "imp.mi"       
##  [33] "imp.mice"      "lm"            "lm.amelia.out" "lm.mi.out"    
##  [37] "lm.mice.out"   "lmfit"         "lmp"           "localfit"     
##  [41] "localp"        "logodds"       "logodds_lower" "logodds_se"   
##  [45] "logodds_upper" "m"             "m1"            "m2"           
##  [49] "m2a"           "m2b"           "m3a"           "m3b"          
##  [53] "me"            "me_se"         "means"         "mmdemo"       
##  [57] "mydf"          "myformula"     "n"             "newdata"      
##  [61] "newdata1"      "newdata2"      "newdf"         "newvar"       
##  [65] "out"           "p1"            "p2"            "p2a"          
##  [69] "p2b"           "p3a"           "p3b"           "p3b.fitted"   
##  [73] "part1"         "part2"         "pool.mice"     "ppcurve"      
##  [77] "s"             "s.amelia"      "s.mi"          "s.mice"       
##  [81] "s.orig"        "s.real"        "s2"            "search"       
##  [85] "ses"           "ses.amelia"    "tmpdf"         "tmpsplit"     
##  [89] "tr"            "w"             "weight"        "within"       
##  [93] "x"             "X"             "x1"            "x2"           
##  [97] "X2"            "x3"            "y"             "y1"           
## [101] "y1s"           "y2"            "y2s"           "y3"           
## [105] "y3s"           "z"             "z1"            "z2"

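This shows us all of the objects that are currently saved. The listing is long because this session still holds objects from earlier work (the rm call above is commented out); in a freshly cleared workspace only mydf, x, and y would appear. If we do another operation but do not save the result: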

2 + 2
## [1] 4
ls()
##   [1] "a"             "allout"        "amat"          "b"            
##   [5] "between"       "bmat"          "c"             "change"       
##   [9] "cmat"          "coef.mi"       "coefs.amelia"  "d"            
##  [13] "d2"            "df1"           "df2"           "e"            
##  [17] "e1"            "e2"            "e3"            "e4"           
##  [21] "englebert"     "f"             "FUN"           "g1"           
##  [25] "g2"            "grandm"        "grandse"       "grandvar"     
##  [29] "height"        "imp"           "imp.amelia"    "imp.mi"       
##  [33] "imp.mice"      "lm"            "lm.amelia.out" "lm.mi.out"    
##  [37] "lm.mice.out"   "lmfit"         "lmp"           "localfit"     
##  [41] "localp"        "logodds"       "logodds_lower" "logodds_se"   
##  [45] "logodds_upper" "m"             "m1"            "m2"           
##  [49] "m2a"           "m2b"           "m3a"           "m3b"          
##  [53] "me"            "me_se"         "means"         "mmdemo"       
##  [57] "mydf"          "myformula"     "n"             "newdata"      
##  [61] "newdata1"      "newdata2"      "newdf"         "newvar"       
##  [65] "out"           "p1"            "p2"            "p2a"          
##  [69] "p2b"           "p3a"           "p3b"           "p3b.fitted"   
##  [73] "part1"         "part2"         "pool.mice"     "ppcurve"      
##  [77] "s"             "s.amelia"      "s.mi"          "s.mice"       
##  [81] "s.orig"        "s.real"        "s2"            "search"       
##  [85] "ses"           "ses.amelia"    "tmpdf"         "tmpsplit"     
##  [89] "tr"            "w"             "weight"        "within"       
##  [93] "x"             "X"             "x1"            "x2"           
##  [97] "X2"            "x3"            "y"             "y1"           
## [101] "y1s"           "y2"            "y2s"           "y3"           
## [105] "y3s"           "z"             "z1"            "z2"

This result is not visible with ls, because it was never assigned to a name. Essentially, it disappears into the ether.
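The same behavior can be checked in a scratch environment, so that nothing in your own workspace is touched. This is only a minimal sketch; the names scratch and four are made up for illustration.

```r
# A scratch environment stands in for the workspace (hypothetical name).
scratch <- new.env()

# Evaluate an expression without assigning it: nothing is stored.
evalq(2 + 2, envir = scratch)
ls(scratch)

# Assign the result to a name and it shows up in the listing.
evalq(four <- 2 + 2, envir = scratch)
ls(scratch)

# exists() checks for one name without listing everything.
exists("four", envir = scratch, inherits = FALSE)
```

Calling ls with an environment argument lists that environment instead of the workspace, which is why the long listing above does not reappear here.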

Viewing individual objects

Now we can look at any of these objects just by calling their name:

x
##  [1] 0 0 1 1 0 1 1 1 1 0 0 0 1 0 1 0 1 1 0 1 1 0 1 0 0 0 0 0 1 0 0 1 0 0 1
## [36] 1 1 0 1 0 1 1 1 1 1 1 0 0 1 1
y
##  [1]  2.3411  0.8706 -0.4708  0.5218  1.6328  2.3587  0.8972  1.3877
##  [9]  0.9462  1.9608  2.6897  2.0280  0.9407  2.1888  1.7632  3.4656
## [17]  0.7466  1.6970  2.4755  0.3112  0.2925  1.0659  1.7685  2.3411
## [25]  0.8706  3.4330  3.9804  1.6328  0.8442  2.5697  1.8649  1.4179
## [33]  1.9608  2.6897  1.3877  0.9462 -0.3771  0.1950  0.6057  2.1533
## [41]  2.1000  1.7632  0.8355  0.7466  1.6970  1.5567  2.3411  0.8706
## [49]  1.3646  1.7685
mydf
##    x       y
## 1  0  2.3411
## 2  0  0.8706
## 3  1 -0.4708
## 4  1  0.5218
## 5  0  1.6328
## 6  1  2.3587
## 7  1  0.8972
## 8  1  1.3877
## 9  1  0.9462
## 10 0  1.9608
## 11 0  2.6897
## 12 0  2.0280
## 13 1  0.9407
## 14 0  2.1888
## 15 1  1.7632
## 16 0  3.4656
## 17 1  0.7466
## 18 1  1.6970
## 19 0  2.4755
## 20 1  0.3112
## 21 1  0.2925
## 22 0  1.0659
## 23 1  1.7685
## 24 0  2.3411
## 25 0  0.8706
## 26 0  3.4330
## 27 0  3.9804
## 28 0  1.6328
## 29 1  0.8442
## 30 0  2.5697
## 31 0  1.8649
## 32 1  1.4179
## 33 0  1.9608
## 34 0  2.6897
## 35 1  1.3877
## 36 1  0.9462
## 37 1 -0.3771
## 38 0  0.1950
## 39 1  0.6057
## 40 0  2.1533
## 41 1  2.1000
## 42 1  1.7632
## 43 1  0.8355
## 44 1  0.7466
## 45 1  1.6970
## 46 1  1.5567
## 47 0  2.3411
## 48 0  0.8706
## 49 1  1.3646
## 50 1  1.7685

The first two objects (x and y) are vectors, so they simply print to the console. The third object (mydf) is a dataframe, so its contents are printed as columns with row numbers. If we call one of the columns from the dataframe, it will look just like a vector:

mydf$x
##  [1] 0 0 1 1 0 1 1 1 1 0 0 0 1 0 1 0 1 1 0 1 1 0 1 0 0 0 0 0 1 0 0 1 0 0 1
## [36] 1 1 0 1 0 1 1 1 1 1 1 0 0 1 1

This looks the same as just calling the x object and indeed they are the same:

mydf$x == x
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [15] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [29] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [43] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
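Fifty TRUE values are tedious to scan by eye. As a small sketch, the element-wise comparison can be collapsed into a single answer with all, or replaced by identical, which also compares type and attributes. The data are recreated here so the block runs on its own.

```r
set.seed(1)
x <- rbinom(50, 1, 0.5)
mydf <- data.frame(x = x)

# TRUE only if every element matches
all(mydf$x == x)

# Stricter: same values, same type, same attributes
identical(mydf$x, x)
```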

But if we change one of the objects, it only affects the object we changed:

x <- rbinom(50, 1, 0.5)
mydf$x == x
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE
## [12] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE
## [23]  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE
## [34] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE
## [45] FALSE  TRUE  TRUE FALSE FALSE  TRUE

So by storing something new into x we change x, but not mydf$x, because the column is a separate copy that was made when the dataframe was created.
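A smaller version of the same point, using short made-up vectors so the values are easy to follow:

```r
x <- c(0L, 1L, 1L)
mydf <- data.frame(x = x)

# Reassigning the standalone vector leaves the column untouched.
x <- c(1L, 0L, 0L)
mydf$x

# To change the column, assign to it explicitly.
mydf$x <- x
identical(mydf$x, x)
```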

Object class

We sometimes want to know what kind of object something is. We can see this with class:

class(x)
## [1] "integer"
class(y)
## [1] "numeric"
class(mydf)
## [1] "data.frame"

We can also use class on the columns of a dataframe:

class(mydf$x)
## [1] "integer"
class(mydf$y)
## [1] "numeric"

This is helpful, but it doesn't tell us a lot about the objects (i.e., it's not a very good summary). We can, however, see more detail using some other functions.
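Before moving on, here is a brief sketch of two related conveniences: sapply applies class to every column at once, and the is.* family and inherits answer yes/no questions about an object's type. The small dataframe is made up for illustration.

```r
mydf <- data.frame(x = c(0L, 1L, 1L), y = c(2.3, 0.9, -0.5))

# Class of every column in one call
sapply(mydf, class)

# Yes/no questions about type
is.numeric(mydf$y)
is.data.frame(mydf)
inherits(mydf, "data.frame")
```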

str

One way to get very detailed information about an object is with str (i.e., structure):

str(x)
##  int [1:50] 1 1 0 0 1 0 1 0 0 0 ...

This output tells us that this is an object of class “integer”, with length 50, and it shows the first few values.

str(y)
##  num [1:50] 2.341 0.871 -0.471 0.522 1.633 ...

This output tells us that this is an object of class “numeric”, with length 50, and it shows the first few values.

str(mydf)
## 'data.frame':    50 obs. of  2 variables:
##  $ x: int  0 0 1 1 0 1 1 1 1 0 ...
##  $ y: num  2.341 0.871 -0.471 0.522 1.633 ...

This output tells us that this is an object of class “data.frame”, with 50 observations on two variables. It then provides the same type of details for each variable that we would see by calling str(mydf$x), etc. directly. Using str on dataframes is therefore a very helpful and compact way to look at your data. More about this later.
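str earns its keep when columns have different types. The following sketch uses a hypothetical dataframe named survey (not part of the example data above) that mixes integer, numeric, factor, and character columns.

```r
# Hypothetical data, for illustration only
survey <- data.frame(
  id    = 1:4,
  score = c(2.5, 3.1, 4.0, 1.8),
  group = factor(c("a", "b", "a", "b")),
  note  = c("ok", "late", "ok", "ok"),
  stringsAsFactors = FALSE
)

# One line per column, each with its type and first few values
str(survey)

# str() also works on lists, where printing the whole object can be unwieldy
str(list(n = 4L, cols = names(survey)))
```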

summary

To see more details we may want to use some other functions. One particularly helpful function is summary, which provides some basic details about an object. For the two vectors, this will give us summary statistics.

summary(x)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    1.00    0.56    1.00    1.00
summary(y)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -0.471   0.871   1.630   1.550   2.140   3.980

For the dataframe, it will give us summary statistics for everything in the dataframe:

summary(mydf)
##        x              y         
##  Min.   :0.00   Min.   :-0.471  
##  1st Qu.:0.00   1st Qu.: 0.871  
##  Median :1.00   Median : 1.633  
##  Mean   :0.54   Mean   : 1.549  
##  3rd Qu.:1.00   3rd Qu.: 2.140  
##  Max.   :1.00   Max.   : 3.980

Note how the same kinds of statistics are reported but the layout differs. (The mean of x also differs, 0.56 versus 0.54, because we replaced x above while mydf$x still holds the original values.) The layout differs because R prints slightly different things depending on the class of the input object. If you want to look “under the hood”, you will see that summary is a generic function backed by a set of class-specific functions. When you call summary, R picks a “method” depending on the class of the object. For our examples, the methods called are summary.default and summary.data.frame, which differ in what they print to the console for vectors and dataframes, respectively.
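The dispatch described above can be imitated with a toy generic. This is only a sketch: describe and its two methods are invented names, not part of R, but the mechanism (UseMethod choosing a function by class) is the one summary relies on.

```r
# A hypothetical generic: UseMethod() picks a method based on class(obj)
describe <- function(obj, ...) UseMethod("describe")

# Fallback used when no class-specific method exists
describe.default <- function(obj, ...) {
  paste("an object of length", length(obj))
}

# Method used for objects of class "data.frame"
describe.data.frame <- function(obj, ...) {
  paste("a data frame with", nrow(obj), "rows and", ncol(obj), "columns")
}

describe(c(0, 1, 1))
describe(data.frame(x = 1:3, y = 4:6))
```

The naming pattern generic.class is what ties a method to its generic, just as with summary.default and summary.data.frame.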

Conveniently, we can also save any output of a function as a new object. So here we can save the summary of x as a new object:

sx <- summary(x)

And do the same for mydf:

smydf <- summary(mydf)

We can then see that these new objects also have classes:

class(sx)
## [1] "summaryDefault" "table"
class(smydf)
## [1] "table"

And, as you might be figuring out, an object's class determines how it is printed to the console. Again, looking “under the hood”, this is because there are separate print methods for each object class (see print.data.frame for how a dataframe is printed and print.table for how the summary of a dataframe is printed). This can create some confusion, though, because it means that what is printed is a reflection of the underlying object but is not actually the object. A bit existential, right? Because calling objects shows a printed rendition of an object, we can sometimes get confused about what that object actually is. This is where str can again be helpful:

str(sx)
## Classes 'summaryDefault', 'table'  Named num [1:6] 0 0 1 0.56 1 1
##   ..- attr(*, "names")= chr [1:6] "Min." "1st Qu." "Median" "Mean" ...
str(smydf)
##  'table' chr [1:6, 1:2] "Min.   :0.00  " "1st Qu.:0.00  " ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:6] "" "" "" "" ...
##   ..$ : chr [1:2] "      x" "      y"

Here we see that the summary of x and summary of mydf are both tables. summary(x) is a one-dimensional table, whereas summary(mydf) is a two-dimensional table (because it shows multiple variables). Because these objects are tables, it actually means we can index them like any other table:

sx[1]
## Min. 
##    0
sx[2:3]
## 1st Qu.  Median 
##       0       1
smydf[, 1]
##                                                                     
## "Min.   :0.00  " "1st Qu.:0.00  " "Median :1.00  " "Mean   :0.54  " 
##                                   
## "3rd Qu.:1.00  " "Max.   :1.00  "
smydf[1:3, ]
##        x                y           
##  "Min.   :0.00  " "Min.   :-0.471  "
##  "1st Qu.:0.00  " "1st Qu.: 0.871  "
##  "Median :1.00  " "Median : 1.633  "

This can be confusing because sx and smydf do not look like objects we can index, but that is because the way they are printed doesn't reflect the underlying structure of the objects.
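Since the summary of a vector carries names, it can also be indexed by name, which tends to be less error-prone than counting positions. A short sketch with a made-up vector v:

```r
v <- c(1, 2, 3, 4, 10)
sv <- summary(v)

# The labels shown in the printed output are the element names
names(sv)
sv["Mean"]

# Strip the table wrapper to get a plain number for arithmetic
as.numeric(sv["Mean"]) + 1

# The summary of a data frame stores formatted text, not numbers
sdf <- summary(data.frame(v = v))
is.character(sdf)
```

The last line is a reminder that values pulled out of the dataframe summary are strings such as "Min.   :0.00  ", so for calculations it is usually simpler to summarize the column itself.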

Structure of other objects

It can be helpful to look at another example to see how what is printed can be confusing. Let's conduct a t-test on our data and see the result:

t.test(mydf$x, mydf$y)
## 
##  Welch Two Sample t-test
## 
## data:  mydf$x and mydf$y
## t = -6.745, df = 75.45, p-value = 2.714e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.3067 -0.7109
## sample estimates:
## mean of x mean of y 
##     0.540     1.549

The result is a bunch of details about the t.test. Like above, we can save this object:

myttest <- t.test(mydf$x, mydf$y)

Then we can call the object again whenever we want without repeating the calculation:

myttest
## 
##  Welch Two Sample t-test
## 
## data:  mydf$x and mydf$y
## t = -6.745, df = 75.45, p-value = 2.714e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.3067 -0.7109
## sample estimates:
## mean of x mean of y 
##     0.540     1.549

If we try to run summary on this, we get some weirdness:

summary(myttest)
##             Length Class  Mode     
## statistic   1      -none- numeric  
## parameter   1      -none- numeric  
## p.value     1      -none- numeric  
## conf.int    2      -none- numeric  
## estimate    2      -none- numeric  
## null.value  1      -none- numeric  
## alternative 1      -none- character
## method      1      -none- character
## data.name   1      -none- character

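This happens because there is no summary method for the result of t.test. R falls back on summary.default, which just lists the length, class, and mode of each component. To see why, we need to look at the class and structure of our myttest object: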

class(myttest)
## [1] "htest"

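This says it is of class “htest” (short for hypothesis test), which is the class R uses for the results of t.test and many of its other classical tests. Not intuitive, but that's what it is.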
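Whether a method exists for a given class can be checked directly with getS3method. The sketch below assumes a default session with no add-on packages loaded, since a package could register its own method.

```r
# NULL is returned when no summary method is registered for "htest"
is.null(getS3method("summary", "htest", optional = TRUE))

# A print method does exist, which is why the object prints so neatly
is.function(getS3method("print", "htest", optional = TRUE))

# Any t.test result carries the class
inherits(t.test(c(1, 2, 3, 4), c(2, 4, 6, 9)), "htest")
```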

str(myttest)
## List of 9
##  $ statistic  : Named num -6.75
##   ..- attr(*, "names")= chr "t"
##  $ parameter  : Named num 75.4
##   ..- attr(*, "names")= chr "df"
##  $ p.value    : num 2.71e-09
##  $ conf.int   : atomic [1:2] -1.307 -0.711
##   ..- attr(*, "conf.level")= num 0.95
##  $ estimate   : Named num [1:2] 0.54 1.55
##   ..- attr(*, "names")= chr [1:2] "mean of x" "mean of y"
##  $ null.value : Named num 0
##   ..- attr(*, "names")= chr "difference in means"
##  $ alternative: chr "two.sided"
##  $ method     : chr "Welch Two Sample t-test"
##  $ data.name  : chr "mydf$x and mydf$y"
##  - attr(*, "class")= chr "htest"

This is more interesting. The output tells us that myttest is a list of 9 objects. If we compare this to the output of myttest, we will see that when we call myttest, R is printing the underlying list in a pretty fashion for us. But because myttest is a list, it means that we can access any of the values in the list simply by calling them. So the list consists of statistic, parameter, p.value, etc. Let's look at some of them:

myttest$statistic
##      t 
## -6.745
myttest$p.value
## [1] 2.714e-09
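A few more ways of getting at the pieces, sketched on simulated data (a, b, and tt are made-up names) so that the block runs on its own:

```r
set.seed(1)
a <- rnorm(20)
b <- rnorm(20, mean = 1)
tt <- t.test(a, b)

# Every element that can be extracted
names(tt)

# [[ ]] with a name is equivalent to $
tt[["estimate"]]

# The interval, plus the confidence level stored as an attribute
tt$conf.int
attr(tt$conf.int, "conf.level")

# Drop the "t" label to get a bare number
unname(tt$statistic)
```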

The ability to extract these values from the underlying object (in addition to seeing them printed to the console in pretty form) means that we can easily use objects again and again to, e.g., combine results of multiple tests into a simplified table or use values from one test elsewhere in our analysis. As a simple example, let's compare the p-values of the same t.test under different hypotheses (two-sided, which is the default, and each of the one-sided alternatives):

myttest2 <- t.test(mydf$x, mydf$y, alternative = "greater")
myttest3 <- t.test(mydf$x, mydf$y, alternative = "less")
myttest$p.value
## [1] 2.714e-09
myttest2$p.value
## [1] 1
myttest3$p.value
## [1] 1.357e-09

This is much easier than having to copy and paste the p-value from each of the outputs, and because these objects are stored in memory, we can access them at any point later in this session.
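Taking this one step further, the extracted values can be gathered into a small results table. The sketch below uses simulated data and made-up names (a, b, tests, results); it is one possible way to do this, not the only one.

```r
set.seed(1)
a <- rnorm(30)
b <- rnorm(30, mean = 1)

# Run the same test under each alternative hypothesis
alternatives <- c("two.sided", "greater", "less")
tests <- lapply(alternatives, function(alt) t.test(a, b, alternative = alt))

# Pull one value per test into the columns of a data frame
results <- data.frame(
  alternative = alternatives,
  t           = sapply(tests, function(tt) unname(tt$statistic)),
  p.value     = sapply(tests, function(tt) tt$p.value)
)
results
```

Keeping the tests in a list means that further columns (for example, the bounds of conf.int) can be added later without rerunning anything.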