Dataframe Structure

Dataframes are central to using R for any kind of data analysis. One of the most frustrating aspects of R for new users is that, unlike Excel, or even SPSS or Stata, it is not terribly easy to look at and modify data in a spreadsheet-like format. In the tutorial on dataframes as a class, you should have learned a bit about what dataframes are and how to index and modify them. Here we discuss several ways to look at dataframes.

print, summary, and str

Looking at dataframes in R is actually pretty easy. Because a dataframe is an R object, we can simply print it to the console by calling its name. Let's create a dataframe and try this:

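# stringsAsFactors = TRUE keeps e a factor (this was the default before R 4.0.0)
mydf <- data.frame(a = rbinom(100, 1, 0.5), b = rnorm(100), c = rnorm(100), 
    d = rnorm(100), e = sample(LETTERS, 100, TRUE), stringsAsFactors = TRUE)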
mydf
##     a        b         c        d e
## 1   0 -0.65302  1.617287  0.35947 O
## 2   0 -1.56067  1.374269  1.16150 U
## 3   0 -0.88265  0.561109  1.50842 Y
## 4   0 -0.64753  1.414186 -1.33762 N
## 5   0 -0.94923 -0.964017 -2.29660 E
## 6   0  1.12688 -0.616431 -1.96846 F
## 7   0  1.72761  0.008532 -0.73825 W
## 8   1 -0.29763  1.572682 -0.19632 T
## 9   0 -0.24442  0.053971  2.59850 Z
## 10  0 -0.84921 -0.189399 -1.13353 K
## 11  0  0.11510 -0.043527 -1.89618 B
## 12  1  0.70786  0.024526 -1.08325 S
## 13  0 -0.92021 -3.408887 -0.70295 E
## 14  0  1.13397 -0.029900  0.55542 V
## 15  0  0.04453  0.373467 -0.61795 S
## 16  1  1.47634  0.944661 -0.36271 Z
## 17  0  1.62780 -0.603154 -0.07608 J
## 18  0  0.78341 -0.591424  0.36601 Q
## 19  0 -0.07220 -1.497778  0.70145 Y
## 20  0 -1.32925  1.888501  1.05821 J
## 21  0  1.08259 -2.293813  0.49702 C
## 22  1  0.73256 -0.552174 -0.72288 B
## 23  1 -0.30210  0.576488 -0.94125 R
## 24  1 -0.39293 -0.186210  0.82022 J
## 25  0 -2.64254  1.022059 -1.40601 T
## 26  0 -0.22410  1.673398 -2.00373 Z
## 27  0  1.95346 -1.285846  1.67366 V
## 28  0 -0.58287  0.930812 -1.99689 P
## 29  1  1.06114  0.512845 -0.96299 N
## 30  1  0.75882 -0.544033 -0.87342 Z
## 31  0  0.58825 -0.537684  0.27048 H
## 32  0 -0.43292  0.762805 -0.18115 V
## 33  0 -0.09822  0.144783 -1.51522 O
## 34  1  1.38690  0.202230 -0.92736 S
## 35  0 -1.31040 -1.456198  2.06663 W
## 36  0 -0.67905  1.053445  0.11093 G
## 37  0  1.20022 -0.397525  0.10330 Q
## 38  1  0.99828  0.810732 -0.43627 I
## 39  1 -0.55813  0.300040  0.82089 C
## 40  1  0.19107 -0.732265  1.64319 L
## 41  1 -0.93658 -0.803333  0.65210 R
## 42  0  1.71818 -0.259426  1.72735 L
## 43  0  0.79274 -1.577459 -2.33531 Y
## 44  1 -0.17978  0.387909 -0.04763 T
## 45  0 -1.27127 -0.731157 -0.23587 J
## 46  1  0.36220 -1.182620 -1.58457 H
## 47  1  2.26727  1.503092 -1.20872 D
## 48  0 -0.56679 -1.205823  0.30645 Q
## 49  0  1.18184  0.274242 -0.25508 H
## 50  0 -0.43997 -1.203856  0.03733 Z
## 51  0 -0.21525  0.175392  1.54721 T
## 52  0  0.17862  2.041101  0.48442 Y
## 53  1  2.82008  1.209535 -0.67040 X
## 54  0 -0.02909 -0.379774  0.13640 X
## 55  1 -0.52543 -0.976383 -0.44816 W
## 56  1  0.92736 -0.066320 -1.38853 A
## 57  0  0.81235 -1.163808  0.02140 N
## 58  1 -1.63686 -0.670042 -0.55861 P
## 59  1 -1.45887 -0.257498  0.66978 K
## 60  1  0.36716  0.092494 -0.59397 M
## 61  1  0.50476 -1.691161  0.13602 U
## 62  1 -0.53350 -0.781128  0.39872 T
## 63  0  0.13419 -1.218642  0.43340 X
## 64  0  0.68213 -0.262076 -0.57323 U
## 65  0 -2.09181  1.600879  0.16202 L
## 66  1 -1.35759  0.271196 -1.45684 R
## 67  0 -0.64975  0.404372 -0.44506 V
## 68  0 -0.33656 -0.662692  0.20784 R
## 69  1 -1.19379 -1.547217 -1.40629 Y
## 70  1  0.48648 -1.117218 -0.12517 R
## 71  0 -1.03210 -0.369793 -0.74953 X
## 72  0  0.34542  0.494358 -1.19533 Z
## 73  1  0.41408  0.264469 -2.49834 O
## 74  1 -0.20288 -0.076575  0.29039 X
## 75  0 -0.18147  0.019607 -1.31953 K
## 76  0 -0.57495  0.778011 -2.20197 I
## 77  1 -1.69877  0.636596 -0.33592 L
## 78  1 -2.07330  1.766734  2.43636 C
## 79  0  0.29462 -0.991969 -0.66017 B
## 80  1  0.29372 -0.573212  0.46335 C
## 81  0  0.85411 -0.371477 -0.06186 W
## 82  1  0.70678  0.274230  0.14330 K
## 83  0 -0.86584  0.313496 -0.82688 W
## 84  1  0.84311 -1.478058  0.25956 S
## 85  0 -1.11050 -0.501903 -2.30398 H
## 86  1  0.23547  2.010354 -0.88391 R
## 87  1  0.04245 -0.928369 -0.75509 U
## 88  0  1.09768 -1.806275 -0.64789 B
## 89  0 -0.85865  1.339204  0.42920 W
## 90  1  0.49483  1.133309  0.51501 W
## 91  0 -2.17343 -1.207055 -0.43024 D
## 92  0  1.56411  0.560760  1.52356 Y
## 93  1  0.23590 -1.444402 -0.48720 Y
## 94  0 -0.58226 -0.188818 -0.26365 W
## 95  1  0.33818 -0.462813 -0.65003 P
## 96  1 -0.25738  1.953699  1.68336 O
## 97  0  1.15532 -0.168700 -0.48666 S
## 98  0 -0.88605 -0.596704 -0.39284 D
## 99  0  1.03949  0.944495  0.02210 K
## 100 0  0.47307 -0.616859  0.72329 M

This output is fine but kind of inconvenient. It doesn't fit on one screen, we can't modify anything, and - if we had more variables and/or more observations - it would be pretty difficult to find anything this way. Note: Calling the dataframe by name is the same as printing it, so mydf is the same as print(mydf). As we already know, we can use summary to see a more compact version of the dataframe:

summary(mydf)
##        a             b                 c                d         
##  Min.   :0.0   Min.   :-2.6425   Min.   :-3.409   Min.   :-2.498  
##  1st Qu.:0.0   1st Qu.:-0.6506   1st Qu.:-0.731   1st Qu.:-0.876  
##  Median :0.0   Median : 0.0067   Median :-0.123   Median :-0.259  
##  Mean   :0.4   Mean   : 0.0081   Mean   :-0.072   Mean   :-0.231  
##  3rd Qu.:1.0   3rd Qu.: 0.7650   3rd Qu.: 0.565   3rd Qu.: 0.406  
##  Max.   :1.0   Max.   : 2.8201   Max.   : 2.041   Max.   : 2.599  
##                                                                   
##        e     
##  W      : 8  
##  Y      : 7  
##  R      : 6  
##  Z      : 6  
##  K      : 5  
##  S      : 5  
##  (Other):63

Now, instead of all the data, we see six summary statistics (the five-number summary plus the mean) for each numeric or integer variable and a tabulation of the most frequent levels of mydf$e, which is a factor variable (you can confirm this with class(mydf$e)). We can also use str to see a different kind of compact summary:

str(mydf)
## 'data.frame':    100 obs. of  5 variables:
##  $ a: int  0 0 0 0 0 0 0 1 0 0 ...
##  $ b: num  -0.653 -1.561 -0.883 -0.648 -0.949 ...
##  $ c: num  1.617 1.374 0.561 1.414 -0.964 ...
##  $ d: num  0.359 1.161 1.508 -1.338 -2.297 ...
##  $ e: Factor w/ 26 levels "A","B","C","D",..: 15 21 25 14 5 6 23 20 26 11 ...

This output has the advantage of additionally showing variable classes and the first few values of each variable, but doesn't provide a numeric summary of the data. Thus summary and str complement each other rather than provide duplicate information. Remember, too, that dataframes also carry a “names” attribute, so we can see just the names of our variables using:

names(mydf)
## [1] "a" "b" "c" "d" "e"

This is especially useful when a dataframe is very wide (i.e., has a large number of variables), because even the compact output of summary and str can become unwieldy with more than 20 or so variables.
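
As a rough sketch of how one might get oriented in a wide dataframe, the example below builds a hypothetical 50-variable dataframe (widedf is an invented name, not part of this tutorial's data) and inspects its shape and names without printing everything. The list.len argument to str limits how many variables are displayed.

```r
# Hypothetical wide dataframe: 10 observations of 50 numeric variables
widedf <- as.data.frame(matrix(rnorm(500), nrow = 10, ncol = 50))

dim(widedf)                   # number of rows and number of columns
head(names(widedf))           # only the first few variable names
table(sapply(widedf, class))  # count of variables by class
str(widedf, list.len = 5)     # structure of the first 5 variables only
```

Combining names with head in this way keeps the output short even when there are hundreds of variables.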

head and tail

Two frequently neglected functions in R are head and tail. These offer exactly what their names suggest, the top and bottom few values of an object (six by default):

head(mydf)
##   a       b       c       d e
## 1 0 -0.6530  1.6173  0.3595 O
## 2 0 -1.5607  1.3743  1.1615 U
## 3 0 -0.8826  0.5611  1.5084 Y
## 4 0 -0.6475  1.4142 -1.3376 N
## 5 0 -0.9492 -0.9640 -2.2966 E
## 6 0  1.1269 -0.6164 -1.9685 F

Note the similarity between these values and those reported in str(mydf).

tail(mydf)
##     a       b       c       d e
## 95  1  0.3382 -0.4628 -0.6500 P
## 96  1 -0.2574  1.9537  1.6834 O
## 97  0  1.1553 -0.1687 -0.4867 S
## 98  0 -0.8861 -0.5967 -0.3928 D
## 99  0  1.0395  0.9445  0.0221 K
## 100 0  0.4731 -0.6169  0.7233 M

Both head and tail accept an additional argument, n, specifying how many values to display:

head(mydf, 2)
##   a      b     c      d e
## 1 0 -0.653 1.617 0.3595 O
## 2 0 -1.561 1.374 1.1615 U
head(mydf, 15)
##    a        b         c       d e
## 1  0 -0.65302  1.617287  0.3595 O
## 2  0 -1.56067  1.374269  1.1615 U
## 3  0 -0.88265  0.561109  1.5084 Y
## 4  0 -0.64753  1.414186 -1.3376 N
## 5  0 -0.94923 -0.964017 -2.2966 E
## 6  0  1.12688 -0.616431 -1.9685 F
## 7  0  1.72761  0.008532 -0.7382 W
## 8  1 -0.29763  1.572682 -0.1963 T
## 9  0 -0.24442  0.053971  2.5985 Z
## 10 0 -0.84921 -0.189399 -1.1335 K
## 11 0  0.11510 -0.043527 -1.8962 B
## 12 1  0.70786  0.024526 -1.0832 S
## 13 0 -0.92021 -3.408887 -0.7029 E
## 14 0  1.13397 -0.029900  0.5554 V
## 15 0  0.04453  0.373467 -0.6180 S

These functions are therefore very helpful for looking quickly at a dataframe. They can also be applied to individual variables inside of a dataframe:

head(mydf$a)
## [1] 0 0 0 0 0 0
tail(mydf$e)
## [1] P O S D K M
## Levels: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
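
The n argument can also be negative, in which case head returns everything except the last few values and tail returns everything except the first few. A minimal sketch, using a small made-up vector and dataframe (x and df are illustrative names only):

```r
x <- 1:10
head(x, -2)   # all but the last two values
tail(x, -2)   # all but the first two values

# The same idea applies to the rows of a dataframe
df <- data.frame(id = 1:5, value = c(2.5, 3.1, 4.7, 1.2, 0.8))
head(df, -1)  # drops the last row
```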

edit and fix

R provides two ways to edit an R dataframe (or matrix) in a spreadsheet-like fashion. They look the same, but are different! Both can be used to look at data in a spreadsheet-like way, but editing with them produces drastically different results. Note: One point of confusion is that calling edit or fix on a non-dataframe object opens a completely different text editing window that can be used to modify vectors, functions, etc. If you try to edit or fix something and don't see a spreadsheet, the object you're trying to edit is not rectangular (i.e., not a dataframe or matrix).

edit

The first of these is edit, which opens an R dataframe as a spreadsheet. The data can then be directly edited. When the spreadsheet window is closed, the resulting dataframe is returned to the user (and printed to the console). The printed result is a reminder that edit didn't actually change the mydf object. In other words, when we edit a dataframe, we are actually copying the dataframe, changing its values, and then returning the copy to the console. The original mydf is unchanged. If we want to use this modified dataframe, we need to save it as a new R object.

fix

The second data editing function is fix. This is probably the more intuitive function. Like edit, fix opens the spreadsheet editor. But, when the window is closed, the result is used to replace the dataframe. Thus fix(mydf) replaces mydf with the edited data.
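
The contrast between the two functions can be sketched as follows. Both calls open a window and wait for input, so this example wraps them in a check for an interactive session; the small dataframe and the name newdf are made up for illustration.

```r
mydf <- data.frame(a = 1:3, b = c("x", "y", "z"))

if (interactive()) {
  # edit() returns a modified copy, so the result must be assigned;
  # mydf itself is left untouched
  newdf <- edit(mydf)

  # fix() writes the edited data back over mydf in the workspace,
  # roughly equivalent to: mydf <- edit(mydf)
  fix(mydf)
}
```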

edit and fix can seem like a good idea. And if they are used simply to look at data, they're a great additional tool (along with summary, str, head, tail, and indexing). But (!!!!) using edit or fix is a non-reproducible way of conducting data analysis. If we want to replace values in a dataframe, it is better (from the perspective of reproducible science) to write out the code that performs those replacements so that you or someone else can rerun it in the future to achieve the same results. So, in short, use edit and fix, but don't abuse them.
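
As an illustration of the reproducible alternative, the sketch below makes the kinds of changes one might otherwise make by hand in the spreadsheet editor. The seed, the cutoff of -2, and the choice of row 5 are arbitrary assumptions for the example, not recommendations.

```r
set.seed(1)  # arbitrary seed so the simulated data can be regenerated
mydf <- data.frame(a = rbinom(100, 1, 0.5), b = rnorm(100))

# Hypothetical rule: treat values of b below -2 as missing
mydf$b[mydf$b < -2] <- NA

# Correct a single cell: row 5 of variable a
mydf$a[5] <- 1
```

Because every change is written down, rerunning the script on the same data yields the same modified dataframe.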