In order to use R for data analysis, we need to get our data into R. Unfortunately, because R lacks a graphical user interface, loading data is not particularly intuitive for those used to working with other statistical software. This tutorial explains how to load data into R as a dataframe object.
As a preliminary note, one thing about R that causes a fair amount of confusion is that, by default, R reads character data as factors. In other words, when your data contain alphanumeric character strings (e.g., names of countries, free-response survey answers), R will read those data in as factor variables rather than character variables. With almost any of the techniques described below, this behavior can be changed by setting a stringsAsFactors = FALSE argument. (As of R 4.0.0, stringsAsFactors = FALSE is the default, so this mainly matters in older versions of R.)
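To see the difference, we can read the same small CSV both ways; the data here are invented purely for illustration:

```r
csv <- "country,response\nGhana,agree\nKenya,disagree"

# With stringsAsFactors = TRUE (the pre-4.0 default), strings become factors
f <- read.csv(text = csv, stringsAsFactors = TRUE)
class(f$country)
## [1] "factor"

# With stringsAsFactors = FALSE, they stay as plain character vectors
ch <- read.csv(text = csv, stringsAsFactors = FALSE)
class(ch$country)
## [1] "character"
```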
A second point of difficulty for beginners is that R offers no obvious visual way to load data. Lacking a full graphical user interface, there is no “open” button to read in a dataset. The closest thing is the file.choose function. If you don't know the name or location of a file you want to load, you can call file.choose() to open a dialog window that will let you select a file. The return value, however, is just a character string containing the name and full path of the file; no action is taken with regard to that file. If, for example, you want to load a comma-separated value file (described below), you could make a call like the following:
read.csv(file.choose())
This will first open the file selection dialog window and, when you select a file, R will then process that file with read.csv and return a dataframe.
While file.choose is a convenient function for working with R interactively, it is generally better to write filenames into your code explicitly, to maximize reproducibility.
One of the neat little features of R is that it comes with some built-in datasets, and many add-on packages supply additional datasets to demonstrate their functionality. We can access a list of these datasets with the data() function. Here we'll just print the first few:
head(data()$results)
## Package LibPath Item
## [1,] "car" "C:/Program Files/R/R-3.0.2/library" "AMSsurvey"
## [2,] "car" "C:/Program Files/R/R-3.0.2/library" "Adler"
## [3,] "car" "C:/Program Files/R/R-3.0.2/library" "Angell"
## [4,] "car" "C:/Program Files/R/R-3.0.2/library" "Anscombe"
## [5,] "car" "C:/Program Files/R/R-3.0.2/library" "Baumann"
## [6,] "car" "C:/Program Files/R/R-3.0.2/library" "Bfox"
## Title
## [1,] "American Math Society Survey Data"
## [2,] "Experimenter Expectations"
## [3,] "Moral Integration of American Cities"
## [4,] "U. S. State Public-School Expenditures"
## [5,] "Methods of Teaching Reading Comprehension"
## [6,] "Canadian Women's Labour-Force Participation"
Datasets in the datasets package are pre-loaded with R and can simply be called by name from the R console. For example, we can see the “Monthly Airline Passenger Numbers 1949-1960” dataset by simply calling:
AirPassengers
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1949 112 118 132 129 121 135 148 148 136 119 104 118
## 1950 115 126 141 135 125 149 170 170 158 133 114 140
## 1951 145 150 178 163 172 178 199 199 184 162 146 166
## 1952 171 180 193 181 183 218 230 242 209 191 172 194
## 1953 196 196 236 235 229 243 264 272 237 211 180 201
## 1954 204 188 235 227 234 264 302 293 259 229 203 229
## 1955 242 233 267 269 270 315 364 347 312 274 237 278
## 1956 284 277 317 313 318 374 413 405 355 306 271 306
## 1957 315 301 356 348 355 422 465 467 404 347 305 336
## 1958 340 318 362 348 363 435 491 505 404 359 310 337
## 1959 360 342 406 396 420 472 548 559 463 407 362 405
## 1960 417 391 419 461 472 535 622 606 508 461 390 432
To obtain detailed information about a dataset, you can simply access its documentation: ?AirPassengers.
We generally want to work with our own data, however, rather than some arbitrary dataset, so we'll have to load data into R.
Because a dataframe is just a collection of data vectors, we can always enter data by hand at the R console. For example, let's say we have two variables (height and weight) measured on each of six observations. We can enter these simply by typing them into the console and combining them into a dataframe, like:
height <- c(165, 170, 163, 182, 175, 190)
weight <- c(45, 60, 70, 80, 63, 72)
mydf <- cbind.data.frame(height, weight)
We can then call our dataframe by name:
mydf
## height weight
## 1 165 45
## 2 170 60
## 3 163 70
## 4 182 80
## 5 175 63
## 6 190 72
R also provides a function called scan that allows us to type data into a special prompt. For example, we might want to read in six values of gender for our observations above, and we could do that by typing mydf$gender <- scan(n = 6, what = integer()) and entering the six values, one per line, when prompted. (The what argument takes an example of the desired type, such as integer() or character(); passing the string "numeric" would actually read the values as character data, because "numeric" is itself a character string.)
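When no one is at the keyboard, scan can instead read from a string via its text argument, which is convenient for scripted examples; the gender codes below are invented:

```r
# Read six integer codes from a string instead of an interactive prompt;
# quiet = TRUE suppresses the "Read 6 items" message
gender <- scan(text = "1 2 2 1 1 2", what = integer(), quiet = TRUE)
gender
## [1] 1 2 2 1 1 2
```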
But entering data manually in this fashion is inefficient and doesn't make sense if we already have data saved in an external file.
The easiest data to load into R comes in tabular file formats, like comma-separated value (CSV) or tab-separated value (TSV) files. These can easily be created using a spreadsheet editor (like Microsoft Excel), a text editor (like Notepad), or exported from many other computer programs (including all statistical packages).
read.table and its variants

The general function for reading these kinds of data is called read.table. Two other functions, read.csv and read.delim, provide convenient wrappers for reading CSV and TSV files, respectively. (Note: read.csv2 and read.delim2 provide slightly different wrappers: read.csv2 reads semicolon-separated data and read.delim2 reads tab-separated data, both treating a comma rather than a period as the decimal point.)
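For instance, read.csv2 handles the semicolon-separated, comma-decimal style common in continental Europe; this small example is invented:

```r
# Semicolon as the field separator, comma as the decimal point
eu <- read.csv2(text = "name;score\nA;3,5\nB;4,25")
eu$score
## [1] 3.50 4.25
```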
Reading in data that is in CSV format is easy. For example, let's read in the following file, which contains some data about patient admissions for five patients:
patient,dob,entry,discharge,fee,sex
001,10/21/1946,12/12/2004,12/14/2004,8000,1
002,05/01/1980,07/08/2004,08/08/2004,12000,2
003,01/01/1960,01/01/2004,01/04/2004,9000,2
004,06/23/1998,11/11/2004,12/25/2004,15123,1
We can read these data in from the console by copying and pasting them into a command like the following:
mydf <- read.csv(text = "
patient,dob,entry,discharge,fee,sex
001,10/21/1946,12/12/2004,12/14/2004,8000,1
002,05/01/1980,07/08/2004,08/08/2004,12000,2
003,01/01/1960,01/01/2004,01/04/2004,9000,2
004,06/23/1998,11/11/2004,12/25/2004,15123,1")
mydf
## patient dob entry discharge fee sex
## 1 1 10/21/1946 12/12/2004 12/14/2004 8000 1
## 2 2 05/01/1980 07/08/2004 08/08/2004 12000 2
## 3 3 01/01/1960 01/01/2004 01/04/2004 9000 2
## 4 4 06/23/1998 11/11/2004 12/25/2004 15123 1
Or, we can read them from the local file directly:
mydf <- read.csv("../Data/patient.csv")
Reading them in either way will produce the exact same dataframe. If the data were tab- or semicolon-separated, the call would be exactly the same except for the use of read.delim or read.csv2, respectively.
Note: Any time we read data into R, we need to store the result in a variable; otherwise it will simply be printed to the console and we won't be able to do anything with it. You can name dataframes whatever you want.
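One thing to notice in the output above is that the patient codes lost their leading zeros ("001" became 1) because that column was converted to numeric. When a column's formatting matters, read.csv's colClasses argument lets us force its type; here is a sketch using the first two patients, where NA means "use the default conversion" for a column:

```r
# Force the first column to character so the leading zeros survive
mydf <- read.csv(text = "
patient,dob,entry,discharge,fee,sex
001,10/21/1946,12/12/2004,12/14/2004,8000,1
002,05/01/1980,07/08/2004,08/08/2004,12000,2",
    colClasses = c("character", NA, NA, NA, NA, NA))
mydf$patient
## [1] "001" "002"
```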
scan and readLines

Occasionally, we need to read in data as a vector of character strings rather than as delimited data to make a dataframe. For example, we might have a file that contains textual data (e.g., from a news story) and we want to read in each word or each line of the file as a separate element of a vector in order to perform some kind of text processing on it.
To do this kind of analysis we can use one of two functions. The scan
function we used above to manually enter data at the console can also be used to read data in from a file, as can another function called readLines
.
We can see how the two functions work by first writing some miscellaneous text to a file (using cat
) and then reading in that content:
cat("TITLE", "A first line of text", "A second line of text", "The last line of text",
file = "ex.data", sep = "\n")
We can use scan
to read in the data as a vector of words:
scan("ex.data", what = "character")
## [1] "TITLE" "A" "first" "line" "of" "text" "A"
## [8] "second" "line" "of" "text" "The" "last" "line"
## [15] "of" "text"
The scan function accepts additional arguments, such as n to specify the number of items to read from the file and sep to specify how to divide the file into separate entries in the resulting vector:
scan("ex.data", what = "character", sep = "\n")
## [1] "TITLE" "A first line of text" "A second line of text"
## [4] "The last line of text"
scan("ex.data", what = "character", n = 1, sep = "\n")
## [1] "TITLE"
We can do the same thing with readLines
, which assumes that we want to read each line as a complete string rather than separating the file contents in some way:
readLines("ex.data")
## [1] "TITLE" "A first line of text" "A second line of text"
## [4] "The last line of text"
It also accepts an n
argument:
readLines("ex.data", n = 2)
## [1] "TITLE" "A first line of text"
Let's delete the file we created, just to clean up:
unlink("ex.data") # tidy up
R has its own file format, .RData, that can be used to store data for use in R. It is fairly rare to encounter data in this format, but reading it into R is, as one might expect, very easy. You simply need to call load('thefile.RData') and the objects stored in the file will be loaded into memory in R.
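A minimal sketch of the round trip, using save to create an .RData file in a temporary location:

```r
mydf <- data.frame(height = c(165, 170), weight = c(45, 60))
tmp <- tempfile(fileext = ".RData")
save(mydf, file = tmp)   # write the object to an .RData file

rm(mydf)                 # drop it from the workspace
load(tmp)                # 'mydf' is restored under its original name
mydf$height
## [1] 165 170
unlink(tmp)              # tidy up
```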
One context in which you might use an .RData file is when saving your R workspace. When you quit R (using q()), R asks if you want to save your workspace. If you select “yes”, R stores all of the objects currently in memory in a .RData file. This file can then be loaded in a subsequent R session to pick up, quite literally, exactly where you left off.
Because many people use statistical packages like SAS, SPSS, and Stata for statistical analysis, much of the data available in the world is saved in proprietary file formats created and owned by the companies that publish that software. This is unfortunate because those formats regularly become obsolete (e.g., Stata has changed its .dta format across versions, so files written by newer releases cannot be read by older software or by tools that only support the older format). This creates problems for reproducibility: not everyone has access to Stata (or to SPSS or SAS), so storing data in these formats makes data harder to share and ties it to specific software owned by specific companies. Editorializing aside, R can import data from a variety of proprietary file formats. Doing so requires one of the recommended add-on packages, called foreign. Let's load it here:
library(foreign)
The foreign package can be used to import data from a variety of proprietary formats, including Stata .dta files (using the read.dta function), Octave text data files (using read.octave), SPSS .sav files (using read.spss), SAS permanent .sas7bdat files (using read.ssd, which requires a local SAS installation), SAS XPORT .xpt files (using read.xport), Systat .syd files (using read.systat), and Minitab Portable Worksheet .mtp files (using read.mtp).
Note: The **foreign** package sometimes has trouble with SPSS formats, but these files can also be opened with the spss.get function from the **Hmisc** package or with one of several functions from the **memisc** package (spss.fixed.file, spss.portable.file, and spss.system.file).
We can try loading some “foreign” data stored in Stata format:
englebert <- read.dta("../Data/EnglebertPRQ2000.dta")
## Warning: cannot read factor labels from Stata 5 files
We can then look at the loaded data using any of our usual object examination functions:
dim(englebert) # dimensions
## [1] 50 27
head(englebert) # first few rows
## country wbcode indep paris london brussels lisbon commit exprop
## 1 ANGOLA AGO 1975 0 0 0 1 3.820 5.36
## 2 BENIN BEN 1960 1 0 0 0 4.667 6.00
## 3 BOTSWANA BWA 1966 0 1 0 0 6.770 7.73
## 4 BURKINA FASO BFA 1960 1 0 0 0 5.000 4.45
## 5 BURUNDI BDI 1962 0 0 1 0 6.667 7.00
## 6 CAMEROON CMR 1960 1 0 0 0 6.140 6.45
## corrupt instqual buroqual goodgov ruleolaw pubadmin growth lcon
## 1 5.000 2.7300 4.470 4.280 3.970 4.73 -0.0306405 6.594
## 2 1.333 3.0000 2.667 3.533 4.556 2.00 -0.0030205 6.949
## 3 6.590 8.3300 6.140 7.110 7.610 6.36 0.0559447 6.358
## 4 6.060 5.3000 4.170 5.000 4.920 5.11 -0.0000589 6.122
## 5 3.000 0.8333 4.000 4.300 4.833 3.50 -0.0036746 6.461
## 6 4.240 4.5500 6.670 5.610 5.710 5.45 0.0147910 6.463
## lconsq i g vlegit hlegit elf hieafvm hieafvs warciv language
## 1 43.49 3.273 34.22 0 0.5250 0.78 1.00 0.00 24 4.2
## 2 48.29 6.524 22.79 0 0.6746 0.62 2.67 0.47 0 5.3
## 3 40.42 22.217 27.00 1 0.9035 0.51 2.00 0.00 0 3.1
## 4 37.48 7.858 17.86 0 0.5735 0.68 1.25 0.97 0 4.8
## 5 41.75 4.939 13.71 1 0.9800 0.04 3.00 0.00 8 0.6
## 6 41.77 8.315 20.67 0 0.8565 0.89 1.50 0.76 0 8.3
names(englebert) # column/variable names
## [1] "country" "wbcode" "indep" "paris" "london" "brussels"
## [7] "lisbon" "commit" "exprop" "corrupt" "instqual" "buroqual"
## [13] "goodgov" "ruleolaw" "pubadmin" "growth" "lcon" "lconsq"
## [19] "i" "g" "vlegit" "hlegit" "elf" "hieafvm"
## [25] "hieafvs" "warciv" "language"
str(englebert) # object structure
## 'data.frame': 50 obs. of 27 variables:
## $ country : chr "ANGOLA" "BENIN" "BOTSWANA" "BURKINA FASO" ...
## $ wbcode : chr "AGO" "BEN" "BWA" "BFA" ...
## $ indep : num 1975 1960 1966 1960 1962 ...
## $ paris : num 0 1 0 1 0 1 0 1 1 1 ...
## $ london : num 0 0 1 0 0 0 0 0 0 0 ...
## $ brussels: num 0 0 0 0 1 0 0 0 0 0 ...
## $ lisbon : num 1 0 0 0 0 0 1 0 0 0 ...
## $ commit : num 3.82 4.67 6.77 5 6.67 ...
## $ exprop : num 5.36 6 7.73 4.45 7 ...
## $ corrupt : num 5 1.33 6.59 6.06 3 ...
## $ instqual: num 2.73 3 8.33 5.3 0.833 ...
## $ buroqual: num 4.47 2.67 6.14 4.17 4 ...
## $ goodgov : num 4.28 3.53 7.11 5 4.3 ...
## $ ruleolaw: num 3.97 4.56 7.61 4.92 4.83 ...
## $ pubadmin: num 4.73 2 6.36 5.11 3.5 ...
## $ growth : num -3.06e-02 -3.02e-03 5.59e-02 -5.89e-05 -3.67e-03 ...
## $ lcon : num 6.59 6.95 6.36 6.12 6.46 ...
## $ lconsq : num 43.5 48.3 40.4 37.5 41.8 ...
## $ i : num 3.27 6.52 22.22 7.86 4.94 ...
## $ g : num 34.2 22.8 27 17.9 13.7 ...
## $ vlegit : num 0 0 1 0 1 0 1 0 0 0 ...
## $ hlegit : num 0.525 0.675 0.904 0.573 0.98 ...
## $ elf : num 0.78 0.62 0.51 0.68 0.04 ...
## $ hieafvm : num 1 2.67 2 1.25 3 ...
## $ hieafvs : num 0 0.47 0 0.97 0 ...
## $ warciv : num 24 0 0 0 8 0 0 0 29 0 ...
## $ language: num 4.2 5.3 3.1 4.8 0.6 ...
## - attr(*, "datalabel")= chr ""
## - attr(*, "time.stamp")= chr "25 Mar 2000 18:07"
## - attr(*, "formats")= chr "%21s" "%9s" "%9.0g" "%9.0g" ...
## - attr(*, "types")= int 148 133 102 102 102 102 102 102 102 102 ...
## - attr(*, "val.labels")= chr "" "" "" "" ...
## - attr(*, "var.labels")= chr "Name of country" "World Bank three-letter code" "Date of independence" "Colonization by France" ...
## - attr(*, "version")= int 5
summary(englebert) # summary
## country wbcode indep paris
## Length:50 Length:50 Min. : -4 Min. :0.00
## Class :character Class :character 1st Qu.:1960 1st Qu.:0.00
## Mode :character Mode :character Median :1962 Median :0.00
## Mean :1921 Mean :0.38
## 3rd Qu.:1968 3rd Qu.:1.00
## Max. :1993 Max. :1.00
## NA's :2
## london brussels lisbon commit exprop
## Min. :0.00 Min. :0.00 Min. :0.0 Min. :1.68 Min. :2.00
## 1st Qu.:0.00 1st Qu.:0.00 1st Qu.:0.0 1st Qu.:4.00 1st Qu.:4.50
## Median :0.00 Median :0.00 Median :0.0 Median :5.00 Median :6.05
## Mean :0.34 Mean :0.06 Mean :0.1 Mean :4.94 Mean :5.90
## 3rd Qu.:1.00 3rd Qu.:0.00 3rd Qu.:0.0 3rd Qu.:6.04 3rd Qu.:6.88
## Max. :1.00 Max. :1.00 Max. :1.0 Max. :8.00 Max. :9.33
## NA's :7 NA's :7
## corrupt instqual buroqual goodgov
## Min. :0.00 Min. :0.833 Min. : 0.667 Min. :1.95
## 1st Qu.:3.00 1st Qu.:3.180 1st Qu.: 3.130 1st Qu.:3.99
## Median :4.39 Median :3.790 Median : 3.940 Median :4.87
## Mean :4.38 Mean :4.154 Mean : 4.239 Mean :4.72
## 3rd Qu.:5.79 3rd Qu.:5.340 3rd Qu.: 5.300 3rd Qu.:5.53
## Max. :8.71 Max. :8.330 Max. :10.000 Max. :7.40
## NA's :7 NA's :7 NA's :7 NA's :7
## ruleolaw pubadmin growth lcon
## Min. :2.33 Min. :1.25 Min. :-0.038 Min. :5.53
## 1st Qu.:4.33 1st Qu.:3.25 1st Qu.:-0.005 1st Qu.:6.32
## Median :5.02 Median :4.17 Median : 0.002 Median :6.60
## Mean :5.00 Mean :4.31 Mean : 0.004 Mean :6.67
## 3rd Qu.:5.97 3rd Qu.:5.49 3rd Qu.: 0.013 3rd Qu.:7.01
## Max. :7.61 Max. :9.36 Max. : 0.056 Max. :8.04
## NA's :7 NA's :7 NA's :6 NA's :6
## lconsq i g vlegit
## Min. :30.6 Min. : 1.40 Min. :11.1 Min. :0.000
## 1st Qu.:39.9 1st Qu.: 5.41 1st Qu.:18.7 1st Qu.:0.000
## Median :43.5 Median : 9.86 Median :22.9 Median :0.000
## Mean :44.8 Mean :10.25 Mean :23.9 Mean :0.213
## 3rd Qu.:49.1 3rd Qu.:14.32 3rd Qu.:27.8 3rd Qu.:0.000
## Max. :64.6 Max. :25.62 Max. :44.2 Max. :1.000
## NA's :6 NA's :6 NA's :6 NA's :3
## hlegit elf hieafvm hieafvs
## Min. :0.000 Min. :0.040 Min. :0.67 Min. :0.000
## 1st Qu.:0.330 1st Qu.:0.620 1st Qu.:1.52 1st Qu.:0.000
## Median :0.582 Median :0.715 Median :1.84 Median :0.480
## Mean :0.572 Mean :0.651 Mean :1.86 Mean :0.503
## 3rd Qu.:0.850 3rd Qu.:0.827 3rd Qu.:2.00 3rd Qu.:0.790
## Max. :1.000 Max. :0.930 Max. :3.00 Max. :1.490
## NA's :4 NA's :12 NA's :12 NA's :12
## warciv language
## Min. : 0.0 Min. : 0.10
## 1st Qu.: 0.0 1st Qu.: 1.90
## Median : 0.0 Median : 4.00
## Mean : 6.2 Mean : 6.53
## 3rd Qu.: 8.0 3rd Qu.: 8.30
## Max. :38.0 Max. :27.70
## NA's :9
If you ever encounter trouble importing foreign data formats into R, a good option is a piece of software called Stat/Transfer, which can convert between dozens of different file formats. Using Stat/Transfer to convert a file into CSV or R .RData format will essentially guarantee that it is readable by R.
Sometimes we need to read data in from Excel. In almost every situation, it is easiest to use Excel itself to convert this kind of file into a comma-separated CSV file first and then load it into R using read.csv. That said, there are several packages designed to read Excel formats directly, but all have disadvantages. For example, the gdata package provides a read.xls function that can read Excel .xls files, but it requires having Perl installed on your machine.

Sometimes one encounters data in formats that are neither traditional text-based tabular formats (like CSV or TSV) nor proprietary statistical formats (like .dta, .sav, etc.). For example, you sometimes encounter data that is recorded in an XML markup format or that is saved in “fixed-width format”, and so forth. So long as the data is human-readable (i.e., text), you will be able to find or write R code to deal with these files and convert them to an R dataframe. Depending on the file format, this may be time consuming, but everything is possible.
XML files can easily be read using the XML package. Indeed, its functions xmlToDataFrame and xmlToList easily convert almost any well-formed XML document into a dataframe or list, respectively.
Fixed-width file formats are some of the hardest file formats to deal with. These files, typically built during the 20th Century, are digitized versions of data that was originally stored on punch cards. For example, much of the pre-2000 public opinion data archived at the Roper Center for Public Opinion Research's iPoll databank is stored in fixed width format. These formats store data as rows of numbers without variable names, value delimiters (like the comma or tab), and require a detailed codebook to translate them into human- or computer-readable data. For example, the following 14 lines represent the first two records of a public opinion data file from 1998:
000003204042898 248 14816722 1124 13122292122224442 2 522 1
0000032222222444444444444444144444444424424 2
000003 2 1 1 2 312922 3112422222121222 42115555 3
00000355554115 553722211212221122222222352 42 4567 4567 4
000003108 41 52 612211 1 229 5
000003 6
000003 20 01.900190 0198 7
000012212042898 248 14828523 1113 1312212111111411142 5213 1
0000122112221111141244412414114224444444144 2
000012 1 2 1 2 11212213123112232322113 31213335 3
00001255333115 666722222222221122222226642 72 4567 4567 4
000012101261 511112411 1 212 5
000012 6
000012 32 01.630163 0170 7
Clearly, these data are not easily interpretable, despite the obvious pattern they follow. As long as we have a codebook indicating what each number means, we can use the read.fwf function (from base R) to translate this file into a dataframe. For a real survey file the code is tedious, so there isn't space to demonstrate it in full here, but know that it is possible.
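To give the flavor on a manageable scale, here is a sketch of read.fwf on an invented three-column file, where the widths argument plays the role of the codebook:

```r
# Write a tiny fixed-width file: id occupies 3 characters, age 2, score 3
cat("00145100",
    "00232087",
    file = "fwf.data", sep = "\n")

mydf <- read.fwf("fwf.data", widths = c(3, 2, 3),
                 col.names = c("id", "age", "score"))
mydf
##   id age score
## 1  1  45   100
## 2  2  32    87
unlink("fwf.data")  # tidy up
```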