Loading Data

In order to use R for data analysis, we need to get our data into R. Unfortunately, because R lacks a graphical user interface, loading data is not particularly intuitive for those used to working with other statistical software. This tutorial explains how to load data into R as a dataframe object.

General Notes

As a preliminary note, one common source of confusion is that R reads character data, by default, as factor. In other words, when your data contain alphanumeric character strings (e.g., names of countries, free-response survey answers), R will read those data in as factor variables rather than character variables. This behavior can be changed in almost all of the techniques described below by setting the argument stringsAsFactors=FALSE.
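For example, here is a minimal illustration using read.csv on a tiny made-up dataset supplied as inline text (read.csv is discussed in detail below):

mydf <- read.csv(text = "country,value\nGhana,1\nKenya,2", stringsAsFactors = FALSE)
class(mydf$country)  # would be "factor" without stringsAsFactors = FALSE
## [1] "character"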

A second point of difficulty for beginners is that R offers no obvious visual way to load data. Lacking a full graphical user interface, there is no “open” button to read in a dataset. The closest thing to this is the file.choose function. If you don't know the name or location of a file you want to load, you can use file.choose() to open a dialog window that lets you select a file. The result, however, is just a character string containing the name and full path of the file; no action is taken with regard to the file itself. If, for example, you want to load a comma-separated value file (described below), you could make a call like the following:

# read.csv(file.choose())

This will first open the file-choosing dialog window and, when you select a file, R will then process that file with read.csv and return a dataframe. While file.choose is convenient for working interactively with R, it is generally better to write filenames directly into your code to maximize reproducibility.
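If you do use file.choose, separating the two steps makes it explicit that the function only returns a path and reads nothing by itself (the object name below is arbitrary):

# myfile <- file.choose()   # opens the dialog and returns the chosen path
# mydf <- read.csv(myfile)  # nothing is read until the path is passed to read.csv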

Built-in Data

One of the neat little features of R is that it comes with some built-in datasets, and many add-on packages supply additional datasets to demonstrate their functionality. We can access these datasets with the data() function. Here we'll just print the first few datasets:

head(data()$results)
##      Package LibPath                              Item       
## [1,] "car"   "C:/Program Files/R/R-3.0.2/library" "AMSsurvey"
## [2,] "car"   "C:/Program Files/R/R-3.0.2/library" "Adler"    
## [3,] "car"   "C:/Program Files/R/R-3.0.2/library" "Angell"   
## [4,] "car"   "C:/Program Files/R/R-3.0.2/library" "Anscombe" 
## [5,] "car"   "C:/Program Files/R/R-3.0.2/library" "Baumann"  
## [6,] "car"   "C:/Program Files/R/R-3.0.2/library" "Bfox"     
##      Title                                        
## [1,] "American Math Society Survey Data"          
## [2,] "Experimenter Expectations"                  
## [3,] "Moral Integration of American Cities"       
## [4,] "U. S. State Public-School Expenditures"     
## [5,] "Methods of Teaching Reading Comprehension"  
## [6,] "Canadian Women's Labour-Force Participation"

Datasets in the datasets package are pre-loaded with R and can simply be called by name from the R console. For example, we can see the “Monthly Airline Passenger Numbers 1949-1960” dataset by simply calling:

AirPassengers
##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1949 112 118 132 129 121 135 148 148 136 119 104 118
## 1950 115 126 141 135 125 149 170 170 158 133 114 140
## 1951 145 150 178 163 172 178 199 199 184 162 146 166
## 1952 171 180 193 181 183 218 230 242 209 191 172 194
## 1953 196 196 236 235 229 243 264 272 237 211 180 201
## 1954 204 188 235 227 234 264 302 293 259 229 203 229
## 1955 242 233 267 269 270 315 364 347 312 274 237 278
## 1956 284 277 317 313 318 374 413 405 355 306 271 306
## 1957 315 301 356 348 355 422 465 467 404 347 305 336
## 1958 340 318 362 348 363 435 491 505 404 359 310 337
## 1959 360 342 406 396 420 472 548 559 463 407 362 405
## 1960 417 391 419 461 472 535 622 606 508 461 390 432
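A dataset can also be loaded explicitly by name with the data() function, which can additionally list the datasets available in a given package:

data(AirPassengers)          # load (or re-load) a dataset into the workspace
data(package = "datasets")   # list the datasets in the datasets package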

To obtain detailed information about a dataset, you can simply access its documentation: ?AirPassengers. We generally want to work with our own data, however, rather than some arbitrary built-in dataset, so we'll have to load data into R.

Manual data entry

Because a dataframe is just a collection of data vectors, we can always enter data by hand into the R console. For example, let's say we have two variables (height and weight) measured on each of six observations. We can enter these simply by typing them into the console and combining them into a dataframe, like:

height <- c(165, 170, 163, 182, 175, 190)
weight <- c(45, 60, 70, 80, 63, 72)
mydf <- cbind.data.frame(height, weight)

We can then call our dataframe by name:

mydf
##   height weight
## 1    165     45
## 2    170     60
## 3    163     70
## 4    182     80
## 5    175     63
## 6    190     72

R also provides a function called scan that allows us to type data into a special prompt. For example, we might want to read in six values of gender for our observations above; we could do that by typing mydf$gender <- scan(n = 6, what = numeric()) and entering the six values, one per line, when prompted. But entering data manually in this fashion is inefficient and doesn't make sense if we already have data saved in an external file.
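For completeness, the same result can be achieved non-interactively by assigning a vector directly (the gender codes below are made up for illustration):

mydf$gender <- c(1, 2, 2, 1, 2, 1)  # one value per observation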

Loading tabular data

The easiest data to load into R comes in tabular file formats, like comma-separated value (CSV) or tab-separated value (TSV) files. These can easily be created using a spreadsheet editor (like Microsoft Excel), a text editor (like Notepad), or exported from many other computer programs (including all statistical packages).

read.table and its variants

The general function for reading these kinds of data is called read.table. Two other functions, read.csv and read.delim, provide convenient wrappers for reading CSV and TSV files, respectively. (Note: read.csv2 and read.delim2 provide slightly different wrappers designed for reading data that uses a semicolon rather than comma separator and a comma rather than a period as the decimal point.) Reading in data that is in CSV format is easy. For example, let's read in the following file, which contains some data about patient admissions for five patients:

patient,dob,entry,discharge,fee,sex
001,10/21/1946,12/12/2004,12/14/2004,8000,1
002,05/01/1980,07/08/2004,08/08/2004,12000,2
003,01/01/1960,01/01/2004,01/04/2004,9000,2
004,06/23/1998,11/11/2004,12/25/2004,15123,1

We can read these data in from the console by copying and pasting them into a command like the following:

mydf <- read.csv(text = "\npatient,dob,entry,discharge,fee,sex\n001,10/21/1946,12/12/2004,12/14/2004,8000,1\n002,05/01/1980,07/08/2004,08/08/2004,12000,2\n003,01/01/1960,01/01/2004,01/04/2004,9000,2\n004,06/23/1998,11/11/2004,12/25/2004,15123,1")
mydf
##   patient        dob      entry  discharge   fee sex
## 1       1 10/21/1946 12/12/2004 12/14/2004  8000   1
## 2       2 05/01/1980 07/08/2004 08/08/2004 12000   2
## 3       3 01/01/1960 01/01/2004 01/04/2004  9000   2
## 4       4 06/23/1998 11/11/2004 12/25/2004 15123   1

Or, we can read them from the local file directly:

mydf <- read.csv("../Data/patient.csv")

Reading them in either way will produce the exact same dataframe. If the data were tab- or semicolon-separated, the call would be exactly the same except for the use of read.delim and read.csv2, respectively.
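Because these functions are thin wrappers around read.table, an equivalent call can spell out the defaults that read.csv sets for us:

mydf <- read.table("../Data/patient.csv", sep = ",", header = TRUE)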

Note: Any time we read data into R, we need to store it as a variable, otherwise it will simply be printed to the console and we won't be able to do anything with it. You can name dataframes whatever you want.

scan and readLines

Occasionally, we need to read in data as a vector of character strings rather than as delimited data to make a dataframe. For example, we might have a file that contains textual data (e.g., from a news story) and we want to read in each word or each line of the file as a separate element of a vector in order to perform some kind of text processing on it. To do this kind of analysis we can use one of two functions. The scan function we used above to manually enter data at the console can also be used to read data in from a file, as can another function called readLines. We can see how the two functions work by first writing some miscellaneous text to a file (using cat) and then reading in that content:

cat("TITLE", "A first line of text", "A second line of text", "The last line of text", 
    file = "ex.data", sep = "\n")

We can use scan to read in the data as a vector of words:

scan("ex.data", what = "character")
##  [1] "TITLE"  "A"      "first"  "line"   "of"     "text"   "A"     
##  [8] "second" "line"   "of"     "text"   "The"    "last"   "line"  
## [15] "of"     "text"

The scan function accepts additional arguments, such as n to specify the number of entries to read from the file and sep to specify how to divide the file into the separate entries of the resulting vector:

scan("ex.data", what = "character", sep = "\n")
## [1] "TITLE"                 "A first line of text"  "A second line of text"
## [4] "The last line of text"
scan("ex.data", what = "character", n = 1, sep = "\n")
## [1] "TITLE"

We can do the same thing with readLines, which assumes that we want to read each line as a complete string rather than separating the file contents in some way:

readLines("ex.data")
## [1] "TITLE"                 "A first line of text"  "A second line of text"
## [4] "The last line of text"

It also accepts an n argument:

readLines("ex.data", n = 2)
## [1] "TITLE"                "A first line of text"

Let's delete the file we created just to clean up:

unlink("ex.data")  # tidy up

Reading .RData data

R has its own file format called .RData that can be used to store data for use in R. It is fairly rare to encounter data in this format, but reading it into R is - as one might expect - very easy. You simply need to call load('thefile.RData') and the objects stored in the file will be loaded into memory in R. One context in which you might use an .RData file is when saving your R workspace. When you quit R (using q()), R asks if you want to save your workspace. If you select “yes”, R stores all of the objects currently in memory to a .RData file. That file can then be loaded in a subsequent R session to pick up quite literally where you left off.
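As a quick, self-contained illustration (using the mydf dataframe created earlier), we can save an object to an .RData file, remove it from memory, and restore it with load:

save(mydf, file = "mydf.RData")  # write the object to an .RData file
rm(mydf)                         # remove it from the workspace
load("mydf.RData")               # restore it
unlink("mydf.RData")             # tidy up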

Loading “Foreign” data

Because many people use statistical packages like SAS, SPSS, and Stata for statistical analysis, much of the data available in the world is saved in proprietary file formats created and owned by the companies that publish that software. This is bad because those formats change or are superseded fairly often (e.g., Stata has revised its .dta format several times, and files saved in a newer format cannot be opened by older versions of the software or by tools that have not been updated). This creates problems for reproducibility because not everyone has access to Stata (or to SPSS or SAS), and storing data in these formats makes it harder to share data and ties the data to specific software owned by specific companies. Editorializing aside, R can import data from a variety of proprietary file formats. Doing so requires one of the recommended add-on packages, called foreign. Let's load it here:

library(foreign)

The foreign package can be used to import data from a variety of proprietary formats, including Stata .dta files (using the read.dta function), Octave text data files (using read.octave), SPSS .sav files (using read.spss), SAS permanent .sas7bdat files (using read.ssd), SAS XPORT .stx or .xpt files (using read.xport), Systat .syd files (using read.systat), and Minitab portable .mtp files (using read.mtp). Note: The foreign package sometimes has trouble with SPSS formats, but these files can also be opened with the spss.get function from the Hmisc package or with one of several functions from the memisc package (spss.fixed.file, spss.portable.file, and spss.system.file). We can try loading some “foreign” data stored in Stata format:

englebert <- read.dta("../Data/EnglebertPRQ2000.dta")
## Warning: cannot read factor labels from Stata 5 files

We can then look at the loaded data using any of our usual object examination functions:

dim(englebert)  # dimensions
## [1] 50 27
head(englebert)  # first few rows
##        country wbcode indep paris london brussels lisbon commit exprop
## 1       ANGOLA    AGO  1975     0      0        0      1  3.820   5.36
## 2        BENIN    BEN  1960     1      0        0      0  4.667   6.00
## 3     BOTSWANA    BWA  1966     0      1        0      0  6.770   7.73
## 4 BURKINA FASO    BFA  1960     1      0        0      0  5.000   4.45
## 5      BURUNDI    BDI  1962     0      0        1      0  6.667   7.00
## 6     CAMEROON    CMR  1960     1      0        0      0  6.140   6.45
##   corrupt instqual buroqual goodgov ruleolaw pubadmin     growth  lcon
## 1   5.000   2.7300    4.470   4.280    3.970     4.73 -0.0306405 6.594
## 2   1.333   3.0000    2.667   3.533    4.556     2.00 -0.0030205 6.949
## 3   6.590   8.3300    6.140   7.110    7.610     6.36  0.0559447 6.358
## 4   6.060   5.3000    4.170   5.000    4.920     5.11 -0.0000589 6.122
## 5   3.000   0.8333    4.000   4.300    4.833     3.50 -0.0036746 6.461
## 6   4.240   4.5500    6.670   5.610    5.710     5.45  0.0147910 6.463
##   lconsq      i     g vlegit hlegit  elf hieafvm hieafvs warciv language
## 1  43.49  3.273 34.22      0 0.5250 0.78    1.00    0.00     24      4.2
## 2  48.29  6.524 22.79      0 0.6746 0.62    2.67    0.47      0      5.3
## 3  40.42 22.217 27.00      1 0.9035 0.51    2.00    0.00      0      3.1
## 4  37.48  7.858 17.86      0 0.5735 0.68    1.25    0.97      0      4.8
## 5  41.75  4.939 13.71      1 0.9800 0.04    3.00    0.00      8      0.6
## 6  41.77  8.315 20.67      0 0.8565 0.89    1.50    0.76      0      8.3
names(englebert)  # column/variable names
##  [1] "country"  "wbcode"   "indep"    "paris"    "london"   "brussels"
##  [7] "lisbon"   "commit"   "exprop"   "corrupt"  "instqual" "buroqual"
## [13] "goodgov"  "ruleolaw" "pubadmin" "growth"   "lcon"     "lconsq"  
## [19] "i"        "g"        "vlegit"   "hlegit"   "elf"      "hieafvm" 
## [25] "hieafvs"  "warciv"   "language"
str(englebert)  # object structure
## 'data.frame':    50 obs. of  27 variables:
##  $ country : chr  "ANGOLA" "BENIN" "BOTSWANA" "BURKINA FASO" ...
##  $ wbcode  : chr  "AGO" "BEN" "BWA" "BFA" ...
##  $ indep   : num  1975 1960 1966 1960 1962 ...
##  $ paris   : num  0 1 0 1 0 1 0 1 1 1 ...
##  $ london  : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ brussels: num  0 0 0 0 1 0 0 0 0 0 ...
##  $ lisbon  : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ commit  : num  3.82 4.67 6.77 5 6.67 ...
##  $ exprop  : num  5.36 6 7.73 4.45 7 ...
##  $ corrupt : num  5 1.33 6.59 6.06 3 ...
##  $ instqual: num  2.73 3 8.33 5.3 0.833 ...
##  $ buroqual: num  4.47 2.67 6.14 4.17 4 ...
##  $ goodgov : num  4.28 3.53 7.11 5 4.3 ...
##  $ ruleolaw: num  3.97 4.56 7.61 4.92 4.83 ...
##  $ pubadmin: num  4.73 2 6.36 5.11 3.5 ...
##  $ growth  : num  -3.06e-02 -3.02e-03 5.59e-02 -5.89e-05 -3.67e-03 ...
##  $ lcon    : num  6.59 6.95 6.36 6.12 6.46 ...
##  $ lconsq  : num  43.5 48.3 40.4 37.5 41.8 ...
##  $ i       : num  3.27 6.52 22.22 7.86 4.94 ...
##  $ g       : num  34.2 22.8 27 17.9 13.7 ...
##  $ vlegit  : num  0 0 1 0 1 0 1 0 0 0 ...
##  $ hlegit  : num  0.525 0.675 0.904 0.573 0.98 ...
##  $ elf     : num  0.78 0.62 0.51 0.68 0.04 ...
##  $ hieafvm : num  1 2.67 2 1.25 3 ...
##  $ hieafvs : num  0 0.47 0 0.97 0 ...
##  $ warciv  : num  24 0 0 0 8 0 0 0 29 0 ...
##  $ language: num  4.2 5.3 3.1 4.8 0.6 ...
##  - attr(*, "datalabel")= chr ""
##  - attr(*, "time.stamp")= chr "25 Mar 2000 18:07"
##  - attr(*, "formats")= chr  "%21s" "%9s" "%9.0g" "%9.0g" ...
##  - attr(*, "types")= int  148 133 102 102 102 102 102 102 102 102 ...
##  - attr(*, "val.labels")= chr  "" "" "" "" ...
##  - attr(*, "var.labels")= chr  "Name of country" "World Bank three-letter code" "Date of independence" "Colonization by France" ...
##  - attr(*, "version")= int 5
summary(englebert)  # summary
##    country             wbcode              indep          paris     
##  Length:50          Length:50          Min.   :  -4   Min.   :0.00  
##  Class :character   Class :character   1st Qu.:1960   1st Qu.:0.00  
##  Mode  :character   Mode  :character   Median :1962   Median :0.00  
##                                        Mean   :1921   Mean   :0.38  
##                                        3rd Qu.:1968   3rd Qu.:1.00  
##                                        Max.   :1993   Max.   :1.00  
##                                        NA's   :2                    
##      london        brussels        lisbon        commit         exprop    
##  Min.   :0.00   Min.   :0.00   Min.   :0.0   Min.   :1.68   Min.   :2.00  
##  1st Qu.:0.00   1st Qu.:0.00   1st Qu.:0.0   1st Qu.:4.00   1st Qu.:4.50  
##  Median :0.00   Median :0.00   Median :0.0   Median :5.00   Median :6.05  
##  Mean   :0.34   Mean   :0.06   Mean   :0.1   Mean   :4.94   Mean   :5.90  
##  3rd Qu.:1.00   3rd Qu.:0.00   3rd Qu.:0.0   3rd Qu.:6.04   3rd Qu.:6.88  
##  Max.   :1.00   Max.   :1.00   Max.   :1.0   Max.   :8.00   Max.   :9.33  
##                                              NA's   :7      NA's   :7     
##     corrupt        instqual        buroqual         goodgov    
##  Min.   :0.00   Min.   :0.833   Min.   : 0.667   Min.   :1.95  
##  1st Qu.:3.00   1st Qu.:3.180   1st Qu.: 3.130   1st Qu.:3.99  
##  Median :4.39   Median :3.790   Median : 3.940   Median :4.87  
##  Mean   :4.38   Mean   :4.154   Mean   : 4.239   Mean   :4.72  
##  3rd Qu.:5.79   3rd Qu.:5.340   3rd Qu.: 5.300   3rd Qu.:5.53  
##  Max.   :8.71   Max.   :8.330   Max.   :10.000   Max.   :7.40  
##  NA's   :7      NA's   :7       NA's   :7        NA's   :7     
##     ruleolaw       pubadmin        growth            lcon     
##  Min.   :2.33   Min.   :1.25   Min.   :-0.038   Min.   :5.53  
##  1st Qu.:4.33   1st Qu.:3.25   1st Qu.:-0.005   1st Qu.:6.32  
##  Median :5.02   Median :4.17   Median : 0.002   Median :6.60  
##  Mean   :5.00   Mean   :4.31   Mean   : 0.004   Mean   :6.67  
##  3rd Qu.:5.97   3rd Qu.:5.49   3rd Qu.: 0.013   3rd Qu.:7.01  
##  Max.   :7.61   Max.   :9.36   Max.   : 0.056   Max.   :8.04  
##  NA's   :7      NA's   :7      NA's   :6        NA's   :6     
##      lconsq           i               g            vlegit     
##  Min.   :30.6   Min.   : 1.40   Min.   :11.1   Min.   :0.000  
##  1st Qu.:39.9   1st Qu.: 5.41   1st Qu.:18.7   1st Qu.:0.000  
##  Median :43.5   Median : 9.86   Median :22.9   Median :0.000  
##  Mean   :44.8   Mean   :10.25   Mean   :23.9   Mean   :0.213  
##  3rd Qu.:49.1   3rd Qu.:14.32   3rd Qu.:27.8   3rd Qu.:0.000  
##  Max.   :64.6   Max.   :25.62   Max.   :44.2   Max.   :1.000  
##  NA's   :6      NA's   :6       NA's   :6      NA's   :3      
##      hlegit           elf           hieafvm        hieafvs     
##  Min.   :0.000   Min.   :0.040   Min.   :0.67   Min.   :0.000  
##  1st Qu.:0.330   1st Qu.:0.620   1st Qu.:1.52   1st Qu.:0.000  
##  Median :0.582   Median :0.715   Median :1.84   Median :0.480  
##  Mean   :0.572   Mean   :0.651   Mean   :1.86   Mean   :0.503  
##  3rd Qu.:0.850   3rd Qu.:0.827   3rd Qu.:2.00   3rd Qu.:0.790  
##  Max.   :1.000   Max.   :0.930   Max.   :3.00   Max.   :1.490  
##  NA's   :4       NA's   :12      NA's   :12     NA's   :12     
##      warciv        language    
##  Min.   : 0.0   Min.   : 0.10  
##  1st Qu.: 0.0   1st Qu.: 1.90  
##  Median : 0.0   Median : 4.00  
##  Mean   : 6.2   Mean   : 6.53  
##  3rd Qu.: 8.0   3rd Qu.: 8.30  
##  Max.   :38.0   Max.   :27.70  
##                 NA's   :9
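Reading other formats follows the same pattern; for instance, an SPSS file (the file name here is hypothetical) can usually be read into a dataframe with a call like:

# mydf <- read.spss("../Data/somefile.sav", to.data.frame = TRUE)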

If you ever encounter trouble importing foreign data formats into R, a good option is to use a piece of software called StatTransfer, which can convert between dozens of different file formats. Using StatTransfer to convert a file format into a CSV or R .RData format will essentially guarantee that it is readable by R.

Reading Excel files

Sometimes we need to read data in from Excel. In almost every situation, it is easiest to use Excel itself to convert such a file into a CSV file first and then load it into R using read.csv. That said, there are several packages designed to read Excel formats directly, but all have disadvantages.
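If converting to CSV is not an option, a package-based approach can work; for example, a minimal sketch assuming the gdata package (which requires Perl) is installed, using a hypothetical file name:

# library(gdata)
# mydf <- read.xls("../Data/patient.xls", sheet = 1)  # hypothetical Excel file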

Notes on other data situations

Sometimes one encounters data in formats that are neither traditional, text-based tabular formats (like CSV or TSV) nor proprietary statistical formats (like .dta, .sav, etc.). For example, you may encounter data recorded in an XML markup format or saved in “fixed-width format”. So long as the data is human-readable (i.e., text), you will be able to find or write R code to deal with these files and convert them to an R dataframe. Depending on the file format, this may be time consuming, but it is almost always possible.

XML files can easily be read using the XML package. Indeed, its functions xmlToDataFrame and xmlToList easily convert almost any well-formed XML document into a dataframe or list, respectively.
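As a minimal sketch (with a made-up XML snippet in which each patient node becomes one row):

library(XML)
xmltext <- "<records><patient><id>1</id><fee>8000</fee></patient><patient><id>2</id><fee>12000</fee></patient></records>"
xmlToDataFrame(xmlParse(xmltext, asText = TRUE))
##   id   fee
## 1  1  8000
## 2  2 12000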

Fixed-width file formats are some of the hardest to deal with. These files, typically created during the 20th century, are digitized versions of data originally stored on punch cards. For example, much of the pre-2000 public opinion data archived at the Roper Center for Public Opinion Research's iPoll databank is stored in fixed-width format. These files store data as rows of numbers without variable names or value delimiters (like commas or tabs), so a detailed codebook is required to translate them into human- or computer-readable data. For example, the following 14 lines represent the first two records of a public opinion data file from 1998:

000003204042898                    248 14816722  1124 13122292122224442 2 522  1
0000032222222444444444444444144444444424424                                    2
000003          2     1    1    2    312922 3112422222121222          42115555 3
00000355554115           553722211212221122222222352   42       4567   4567    4
000003108 41 52 612211                    1                229                 5
000003                                                                         6
000003    20                                                01.900190 0198     7
000012212042898                    248 14828523  1113 1312212111111411142 5213 1
0000122112221111141244412414114224444444144                                    2
000012          1     2    1    2    11212213123112232322113          31213335 3
00001255333115           666722222222221122222226642   72       4567   4567    4
000012101261 511112411                    1                212                 5
000012                                                                         6
000012    32                                                01.630163 0170     7

Clearly, these data are not easily interpretable even though there is some obvious pattern to them. As long as we have a codebook indicating what each position means, we can use the read.fwf function (from base R) to translate such a file into a dataframe. The code for a real survey file is tedious, so there isn't space to demonstrate it fully here, but a toy example below shows the mechanics.
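The field widths below are invented (not taken from the actual codebook); read.fwf simply splits each row at fixed character positions:

# read a 6-character id and a 2-digit code from fixed-width text
read.fwf(textConnection("00000324\n00001232"), widths = c(6, 2),
         col.names = c("id", "code"))
##   id code
## 1  3   24
## 2 12   32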