Saving R Data

We frequently need to save our data after we have worked on it for some time (e.g., because we've created, rescaled, or deleted variables, created a subset of our original data, modified the data in a time- or processor-intensive way, or simply need to share a subset of the data). In most statistical packages, this is done automatically: those packages open a file and “destructively” make changes to the original file. This can be convenient, but it is also problematic. If I change a file and don't save the original, my work is no longer reproducible from the original file. It essentially builds a step into the scientific workflow that is not explicitly recorded.

R does things differently. When opening a data file in R, the data are read into memory and the link between those data in memory and the original file is severed. Changes made to the data are kept only in R and they are lost if R is closed without the data being saved. This is usually fine because good workflow involves writing scripts that work from the original data, make any necessary changes, and then produce output. But, for the reasons stated above, we might want to save our working data for use later on. R provides at least four ways to do this.

Note: All of these methods overwrite an existing file of the same name by default. This means that writing a file over an existing file is “destructive,” so it's a good idea to check that your filename isn't already in use with list.files(). By default, the file is written to your working directory (getwd()), but it can be written elsewhere if you supply a file path rather than just a name.
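As a minimal sketch of the overwrite check described above, the snippet below tests whether a file name is already in use before writing. The helper name safe_to_write and the file name are hypothetical and chosen only for illustration; file.exists() is used here as a convenient companion to list.files().

```r
# Hypothetical helper: TRUE only if nothing exists at this path yet
safe_to_write <- function(path) {
  !file.exists(path)
}

target <- file.path(tempdir(), "example-output.RData")  # illustrative path
x <- 1:10

# The same check by name, using list.files() on the target directory
basename(target) %in% list.files(dirname(target))

if (safe_to_write(target)) {
  save(x, file = target)
} else {
  message("File already exists, choose another name: ", target)
}

unlink(target)  # clean up the example file
```

Writing to tempdir() keeps the example away from your working directory; in real use you would pass your own file name or path.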

All of these methods work with an R dataframe, so we'll create a simple one just for the sake of demonstration:

set.seed(1)
mydf <- data.frame(x = rnorm(100), y = rnorm(100), z = rnorm(100))

save

The most flexible way to save data objects from R uses the save function. By default, save writes an R object (or multiple R objects) to an R-readable binary file that can be opened using load. Because save can store multiple objects (including one's entire current workspace), it provides a very flexible way to “pick up where you left off.” For example, using save.image('myworkspace.RData'), you could save every object in your current R workspace, and then load('myworkspace.RData') later to get all of those objects back (loaded packages are not restored). But it is also a convenient way to write data to a file that you plan to use again in R. Because it saves R objects “as-is,” there's no need to worry about problems reading in the data or needing to change structure or variable names: the file is saved (and will load) exactly as it looks in R. The dataframe will even have the same name (i.e., in our example, the loaded object will be called mydf). The .RData file format is also compressed by default, so it typically takes up less room than a comparable comma-separated value (CSV) file containing the same data. To write our dataframe using save, we simply supply the name of the dataframe and the destination file:

save(mydf, file = "saveddf.RData")

Note that the file name is not important (so long as it does not overwrite another file in your working directory). If you load the file using load, the R object mydf will appear in your workspace. Let's remove the file so as not to leave a mess:

unlink("saveddf.RData")
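The round trip through save and load can be sketched as follows. The second object (note), the temporary file, and the separate environment are assumptions made for illustration; loading into a new environment simply avoids overwriting anything in the current workspace.

```r
set.seed(1)
mydf <- data.frame(x = rnorm(100), y = rnorm(100), z = rnorm(100))
note <- "second object stored in the same file"  # hypothetical extra object

rdata_file <- tempfile(fileext = ".RData")  # temporary file, illustrative only
save(mydf, note, file = rdata_file)

# Load into a fresh environment so the current workspace is left untouched
restored <- new.env()
loaded_names <- load(rdata_file, envir = restored)
loaded_names  # load() returns the names of the objects it restored

# Compare the restored dataframe with the original
identical(mydf, restored$mydf)

unlink(rdata_file)  # clean up
```

Calling load(rdata_file) without the envir argument would instead place mydf and note directly in the workspace, replacing any objects with those names.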

dput (and dget)

Sometimes we want to be able to write our data in a way that makes it exactly reproducible (like save), but we also want to be able to read the file ourselves. Because save creates a binary file, we can only open the file in R (or another piece of software that reads .RData files). If we want, for example, to be able to look at or change the file in a text editor, we need it in another format. One R-specific solution for this is dput. The dput function saves data as an R expression. This means that the contents of the resulting file can actually be copied and pasted into the R console. This is especially helpful if you want to share (part of) your data with someone else. Indeed, when you ask data-related questions on StackOverflow, you are generally expected to supply your data using dput to make it easy for people to help you. We can also simply write the output of dput to the console to see what it looks like. Let's try that before writing it to a file:

dput(mydf)
## structure(list(x = c(-0.626453810742332, 0.183643324222082, -0.835628612410047, 
## 1.59528080213779, 0.329507771815361, -0.820468384118015, 0.487429052428485, 
## 0.738324705129217, 0.575781351653492, -0.305388387156356, 1.51178116845085, 
## 0.389843236411431, -0.621240580541804, -2.2146998871775, 1.12493091814311, 
## -0.0449336090152309, -0.0161902630989461, 0.943836210685299, 
## 0.821221195098089, 0.593901321217509, 0.918977371608218, 0.782136300731067, 
## 0.0745649833651906, -1.98935169586337, 0.61982574789471, -0.0561287395290008, 
## -0.155795506705329, -1.47075238389927, -0.47815005510862, 0.417941560199702, 
## 1.35867955152904, -0.102787727342996, 0.387671611559369, -0.0538050405829051, 
## -1.37705955682861, -0.41499456329968, -0.394289953710349, -0.0593133967111857, 
## 1.10002537198388, 0.763175748457544, -0.164523596253587, -0.253361680136508, 
## 0.696963375404737, 0.556663198673657, -0.68875569454952, -0.70749515696212, 
## 0.36458196213683, 0.768532924515416, -0.112346212150228, 0.881107726454215, 
## 0.398105880367068, -0.612026393250771, 0.341119691424425, -1.12936309608079, 
## 1.43302370170104, 1.98039989850586, -0.367221476466509, -1.04413462631653, 
## 0.569719627442413, -0.135054603880824, 2.40161776050478, -0.0392400027331692, 
## 0.689739362450777, 0.0280021587806661, -0.743273208882405, 0.188792299514343, 
## -1.80495862889104, 1.46555486156289, 0.153253338211898, 2.17261167036215, 
## 0.475509528899663, -0.709946430921815, 0.610726353489055, -0.934097631644252, 
## -1.2536334002391, 0.291446235517463, -0.443291873218433, 0.00110535163162413, 
## 0.0743413241516641, -0.589520946188072, -0.568668732818502, -0.135178615123832, 
## 1.1780869965732, -1.52356680042976, 0.593946187628422, 0.332950371213518, 
## 1.06309983727636, -0.304183923634301, 0.370018809916288, 0.267098790772231, 
## -0.54252003099165, 1.20786780598317, 1.16040261569495, 0.700213649514998, 
## 1.58683345454085, 0.558486425565304, -1.27659220845804, -0.573265414236886, 
## -1.22461261489836, -0.473400636439312), y = c(-0.620366677224124, 
## 0.0421158731442352, -0.910921648552446, 0.158028772404075, -0.654584643918818, 
## 1.76728726937265, 0.716707476017206, 0.910174229495227, 0.384185357826345, 
## 1.68217608051942, -0.635736453948977, -0.461644730360566, 1.43228223854166, 
## -0.650696353310367, -0.207380743601965, -0.392807929441984, -0.319992868548507, 
## -0.279113302976559, 0.494188331267827, -0.177330482269606, -0.505957462114257, 
## 1.34303882517041, -0.214579408546869, -0.179556530043387, -0.100190741213562, 
## 0.712666307051405, -0.0735644041263263, -0.0376341714670479, 
## -0.681660478755657, -0.324270272246319, 0.0601604404345152, -0.588894486259664, 
## 0.531496192632572, -1.51839408178679, 0.306557860789766, -1.53644982353759, 
## -0.300976126836611, -0.528279904445006, -0.652094780680999, -0.0568967778473925, 
## -1.91435942568001, 1.17658331201856, -1.664972436212, -0.463530401472386, 
## -1.11592010504285, -0.750819001193448, 2.08716654562835, 0.0173956196932517, 
## -1.28630053043433, -1.64060553441858, 0.450187101272656, -0.018559832714638, 
## -0.318068374543844, -0.929362147453702, -1.48746031014148, -1.07519229661568, 
## 1.00002880371391, -0.621266694796823, -1.38442684738449, 1.86929062242358, 
## 0.425100377372448, -0.238647100913033, 1.05848304870902, 0.886422651374936, 
## -0.619243048231147, 2.20610246454047, -0.255027030141015, -1.42449465021281, 
## -0.144399601954219, 0.207538339232345, 2.30797839905936, 0.105802367893711, 
## 0.456998805423414, -0.077152935356531, -0.334000842366544, -0.0347260283112762, 
## 0.787639605630162, 2.07524500865228, 1.02739243876377, 1.2079083983867, 
## -1.23132342155804, 0.983895570053379, 0.219924803660651, -1.46725002909224, 
## 0.521022742648139, -0.158754604716016, 1.4645873119698, -0.766081999604665, 
## -0.430211753928547, -0.926109497377437, -0.17710396143654, 0.402011779486338, 
## -0.731748173119606, 0.830373167981674, -1.20808278630446, -1.04798441280774, 
## 1.44115770684428, -1.01584746530465, 0.411974712317515, -0.38107605110892
## ), z = c(0.409401839650934, 1.68887328620405, 1.58658843344197, 
## -0.330907800682766, -2.28523553529247, 2.49766158983416, 0.667066166765493, 
## 0.5413273359637, -0.0133995231459087, 0.510108422952926, -0.164375831769667, 
## 0.420694643254513, -0.400246743977644, -1.37020787754746, 0.987838267454879, 
## 1.51974502549955, -0.308740569225614, -1.25328975560769, 0.642241305677824, 
## -0.0447091368939791, -1.73321840682484, 0.00213185968026965, 
## -0.630300333928146, -0.340968579860405, -1.15657236263585, 1.80314190791747, 
## -0.331132036391221, -1.60551341225308, 0.197193438739481, 0.263175646405474, 
## -0.985826700409291, -2.88892067167955, -0.640481702565115, 0.570507635920485, 
## -0.05972327604261, -0.0981787440052344, 0.560820728620116, -1.18645863857947, 
## 1.09677704427424, -0.00534402827816569, 0.707310667398079, 1.03410773473746, 
## 0.223480414915304, -0.878707612866019, 1.16296455596733, -2.00016494478548, 
## -0.544790740001725, -0.255670709156989, -0.166121036765006, 1.02046390878411, 
## 0.136221893102778, 0.407167603423836, -0.0696548130129049, -0.247664341619331, 
## 0.69555080661964, 1.1462283572158, -2.40309621489187, 0.572739555245841, 
## 0.374724406778655, -0.425267721556076, 0.951012807576816, -0.389237181718379, 
## -0.284330661799574, 0.857409778079803, 1.7196272991206, 0.270054900937229, 
## -0.42218400978764, -1.18911329485959, -0.33103297887901, -0.939829326510021, 
## -0.258932583118785, 0.394379168221572, -0.851857092023863, 2.64916688109488, 
## 0.156011675665079, 1.13020726745494, -2.28912397984011, 0.741001157195439, 
## -1.31624516045156, 0.919803677609141, 0.398130155451956, -0.407528579269772, 
## 1.32425863017727, -0.70123166924692, -0.580614304240536, -1.00107218102542, 
## -0.668178606753393, 0.945184953373082, 0.433702149545162, 1.00515921767704, 
## -0.390118664053679, 0.376370291774648, 0.244164924486494, -1.42625734238254, 
## 1.77842928747545, 0.134447660933676, 0.765598999157864, 0.955136676908982, 
## -0.0505657014422701, -0.305815419766971)), .Names = c("x", "y", 
## "z"), row.names = c(NA, -100L), class = "data.frame")

As you can see, the output is a complicated R expression (using the structure function), which includes all of the data values, the variable names, row names, and the class of the object. If you were to copy and paste this output into a new R session, you would have essentially the same dataframe as the one we created here. We can write this to a file (with any extension) by specifying a file argument:

dput(mydf, "saveddf.txt")

I would tend to use the .txt (text file) extension, so that it will be easily openable in any text editor, but you can use any extension. Note: Unlike save and load, which store an R object and then restore it using the same name, dput does not store the name of the R object. So, if we want to load the dataframe again (using dget), we need to assign the result to a variable:

mydf2 <- dget("saveddf.txt")

Additionally, and again unlike save, dput only stores values up to a finite level of precision (about 15 significant digits by default). So while our original mydf and the read-back-in dataframe mydf2 look very similar, they differ slightly due to the rules of floating point values (a basic element of computer programming that you don't need to understand in detail here):

head(mydf)
##         x        y       z
## 1 -0.6265 -0.62037  0.4094
## 2  0.1836  0.04212  1.6889
## 3 -0.8356 -0.91092  1.5866
## 4  1.5953  0.15803 -0.3309
## 5  0.3295 -0.65458 -2.2852
## 6 -0.8205  1.76729  2.4977
head(mydf2)
##         x        y       z
## 1 -0.6265 -0.62037  0.4094
## 2  0.1836  0.04212  1.6889
## 3 -0.8356 -0.91092  1.5866
## 4  1.5953  0.15803 -0.3309
## 5  0.3295 -0.65458 -2.2852
## 6 -0.8205  1.76729  2.4977
mydf == mydf2
##            x     y     z
##   [1,] FALSE FALSE FALSE
##   [2,] FALSE FALSE FALSE
##   [3,] FALSE FALSE FALSE
##   [4,] FALSE FALSE FALSE
##   [5,] FALSE FALSE FALSE
##   [6,] FALSE FALSE FALSE
##   [7,] FALSE FALSE  TRUE
##   [8,] FALSE FALSE FALSE
##   [9,] FALSE FALSE FALSE
##  [10,]  TRUE FALSE FALSE
##  [11,] FALSE  TRUE FALSE
##  [12,] FALSE  TRUE FALSE
##  [13,] FALSE FALSE FALSE
##  [14,]  TRUE FALSE FALSE
##  [15,] FALSE FALSE FALSE
##  [16,] FALSE FALSE FALSE
##  [17,] FALSE FALSE FALSE
##  [18,] FALSE FALSE FALSE
##  [19,] FALSE  TRUE FALSE
##  [20,] FALSE FALSE FALSE
##  [21,] FALSE FALSE FALSE
##  [22,] FALSE FALSE FALSE
##  [23,]  TRUE FALSE FALSE
##  [24,] FALSE FALSE FALSE
##  [25,] FALSE FALSE FALSE
##  [26,] FALSE FALSE FALSE
##  [27,] FALSE FALSE FALSE
##  [28,] FALSE FALSE FALSE
##  [29,] FALSE FALSE FALSE
##  [30,] FALSE FALSE FALSE
##  [31,] FALSE FALSE FALSE
##  [32,] FALSE FALSE FALSE
##  [33,] FALSE FALSE FALSE
##  [34,] FALSE FALSE FALSE
##  [35,] FALSE FALSE FALSE
##  [36,] FALSE FALSE FALSE
##  [37,] FALSE FALSE FALSE
##  [38,] FALSE FALSE FALSE
##  [39,] FALSE FALSE FALSE
##  [40,] FALSE FALSE FALSE
##  [41,] FALSE FALSE FALSE
##  [42,] FALSE FALSE FALSE
##  [43,] FALSE FALSE FALSE
##  [44,] FALSE FALSE FALSE
##  [45,] FALSE FALSE FALSE
##  [46,] FALSE FALSE FALSE
##  [47,] FALSE FALSE FALSE
##  [48,] FALSE FALSE FALSE
##  [49,] FALSE FALSE FALSE
##  [50,] FALSE FALSE FALSE
##  [51,] FALSE FALSE FALSE
##  [52,] FALSE FALSE FALSE
##  [53,] FALSE FALSE FALSE
##  [54,] FALSE FALSE FALSE
##  [55,] FALSE FALSE FALSE
##  [56,]  TRUE FALSE FALSE
##  [57,] FALSE FALSE FALSE
##  [58,] FALSE FALSE FALSE
##  [59,] FALSE FALSE FALSE
##  [60,] FALSE FALSE FALSE
##  [61,] FALSE FALSE FALSE
##  [62,] FALSE FALSE FALSE
##  [63,] FALSE  TRUE FALSE
##  [64,] FALSE FALSE FALSE
##  [65,] FALSE FALSE FALSE
##  [66,] FALSE FALSE FALSE
##  [67,] FALSE FALSE FALSE
##  [68,] FALSE FALSE FALSE
##  [69,] FALSE FALSE FALSE
##  [70,] FALSE FALSE FALSE
##  [71,] FALSE FALSE FALSE
##  [72,] FALSE FALSE FALSE
##  [73,] FALSE FALSE FALSE
##  [74,] FALSE FALSE  TRUE
##  [75,] FALSE FALSE FALSE
##  [76,] FALSE FALSE FALSE
##  [77,]  TRUE FALSE FALSE
##  [78,] FALSE FALSE FALSE
##  [79,] FALSE FALSE FALSE
##  [80,] FALSE FALSE FALSE
##  [81,] FALSE FALSE FALSE
##  [82,] FALSE FALSE FALSE
##  [83,] FALSE FALSE FALSE
##  [84,] FALSE FALSE FALSE
##  [85,] FALSE FALSE FALSE
##  [86,] FALSE  TRUE FALSE
##  [87,] FALSE FALSE FALSE
##  [88,] FALSE FALSE FALSE
##  [89,] FALSE FALSE FALSE
##  [90,] FALSE FALSE FALSE
##  [91,] FALSE FALSE FALSE
##  [92,] FALSE FALSE FALSE
##  [93,] FALSE FALSE FALSE
##  [94,] FALSE FALSE FALSE
##  [95,] FALSE FALSE FALSE
##  [96,] FALSE FALSE FALSE
##  [97,] FALSE FALSE FALSE
##  [98,] FALSE FALSE FALSE
##  [99,] FALSE FALSE FALSE
## [100,] FALSE FALSE  TRUE

Thus, a dataframe saved using save is exactly the same when reloaded into R whereas the one saved using dput is the same up to a lesser degree of precision.
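To see how small these differences are, the sketch below compares a strict test (identical) with a tolerance-based one (all.equal). It also tries the "hexNumeric" deparse option, which writes numbers in hexadecimal notation; treating that option as a way to get an exact round trip is an assumption to verify on your own R version, and the temporary file names are illustrative only.

```r
set.seed(1)
mydf <- data.frame(x = rnorm(100), y = rnorm(100), z = rnorm(100))

default_file <- tempfile(fileext = ".txt")
exact_file <- tempfile(fileext = ".txt")

# Default dput: numbers are written as rounded decimal text
dput(mydf, file = default_file)
mydf2 <- dget(default_file)

identical(mydf, mydf2)  # strict, bit-for-bit comparison
all.equal(mydf, mydf2)  # comparison within a small numerical tolerance

# Assumed alternative: "hexNumeric" writes numbers in hexadecimal notation
dput(mydf, file = exact_file,
     control = c("keepNA", "keepInteger", "showAttributes", "hexNumeric"))
mydf3 <- dget(exact_file)
all(mydf == mydf3)      # elementwise comparison of the values

unlink(c(default_file, exact_file))  # clean up
```

The trade-off is readability: hexadecimal numbers are exact but much harder to inspect in a text editor than the default decimal output.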

Let's clean up that file so as not to leave a mess:

unlink("saveddf.txt")

dump (and source)

Similar to dput, the dump function writes the dput output to a file. Indeed, it writes the same kind of representation we saw above on the console. But, instead of writing an R expression that we have to assign to a variable name later, dump preserves the name of our dataframe. Thus it is a blend between dput and save (but mostly it is like dput). dump also uses a default filename, "dumpdata.R", making it a shorter command to write and one that is less likely to be destructive (except to previous data dumps). Let's see how it works:

dump("mydf")

Note: We specify the dataframe name as a character string because this is written to the file so that when we load the "dumpdata.R" file, the dataframe has the same name as it does right now. We can load this dataframe into memory from the file using source:

source("dumpdata.R", echo = TRUE)
## 
## > mydf <-
## + structure(list(x = c(-0.626453810742332, 0.183643324222082, -0.835628612410047, 
## + 1.59528080213779, 0.329507771815361, -0.820468384118015 .... [TRUNCATED]

As you'll see in the (truncated) output of source, the file looks just like dput but includes mydf <- at the beginning, meaning it is storing the dput-like output into the mydf object in R memory. Note: dump can also take an arbitrary file name in its file argument (like save and dput).

Let's clean up that file so as not to leave a mess:

unlink("dumpdata.R")
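As a sketch of the file argument and of dumping more than one object at once, the example below writes two objects to a temporary file and reads them back with source. The second object (cutoff), the temporary file, and the separate environment are assumptions made for illustration.

```r
set.seed(1)
mydf <- data.frame(x = rnorm(100), y = rnorm(100), z = rnorm(100))
cutoff <- 0.5  # hypothetical second object

dump_file <- tempfile(fileext = ".R")  # illustrative file name
dump(c("mydf", "cutoff"), file = dump_file)

# Evaluate the file in a separate environment to avoid clobbering the workspace
restored <- new.env()
source(dump_file, local = restored)
ls(restored)  # names of the objects recreated from the file

# Compare within numerical tolerance, since the file stores decimal text
all.equal(mydf, restored$mydf)

unlink(dump_file)  # clean up
```

Because the object names are stored in the file, source recreates them under those names, which is why sourcing into a separate environment can be safer than sourcing into a workspace that already contains a mydf.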

write.csv and write.table

One of the easiest ways to save an R dataframe is to write it to a comma-separated value (CSV) file. CSV files are human-readable (e.g., in a text editor) and can be opened by essentially any statistical or spreadsheet software (Excel, Stata, SPSS, SAS, etc.), making them one of the best formats for data sharing. Saving a dataframe as CSV is easy. You simply need to use the write.csv function with the name of the dataframe and the name of the file you want to write to. Let's see how it works:

write.csv(mydf, file = "saveddf.csv")

That's all there is to it. R also allows you to save files in other CSV-like formats. For example, sometimes we want to save data using a different separator such as a tab (i.e., to create a tab-separated value file or TSV). The TSV is, for example, the default file format used by The Dataverse Network online data repository. To write to a TSV we use a related function write.table and specify the sep argument:

write.table(mydf, file = "saveddf.tsv", sep = "\t")

Note: We use the \t symbol to represent a tab (a standard common to many programming languages). We could also specify another character as a separator, such as | or ; or . but commas and tabs are the most common. Also note that, just like dput, writing to a CSV or another delimited-format file necessarily includes some loss of precision, which may or may not be problematic for your particular use case.

Let's clean up our files just so we don't leave a mess:

unlink("saveddf.csv")
unlink("saveddf.tsv")
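To check what actually comes back from a delimited file, a minimal round-trip sketch follows. The use of row.names = FALSE and the temporary file names are choices made here for illustration; by default write.csv and write.table also write the row names as an extra column.

```r
set.seed(1)
mydf <- data.frame(x = rnorm(100), y = rnorm(100), z = rnorm(100))

csv_file <- tempfile(fileext = ".csv")
tsv_file <- tempfile(fileext = ".tsv")

# row.names = FALSE avoids writing the row numbers as an extra column
write.csv(mydf, file = csv_file, row.names = FALSE)
write.table(mydf, file = tsv_file, sep = "\t", row.names = FALSE)

from_csv <- read.csv(csv_file)
from_tsv <- read.delim(tsv_file)  # read.delim expects tab-separated input

# Compare within numerical tolerance, since the files store decimal text
all.equal(mydf, from_csv)
all.equal(mydf, from_tsv)

unlink(c(csv_file, tsv_file))  # clean up
```

Using all.equal rather than identical reflects the loss of precision mentioned above: the values agree to many digits but are not guaranteed to match bit for bit.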

Writing to “foreign” file formats

The foreign package, which we can use to load “foreign” file formats, also includes a write.foreign function that can be used to write an R dataframe in a form that other statistical packages can read. Supported formats include SPSS, Stata, and SAS.
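A minimal sketch of write.foreign is shown below, assuming the foreign package is installed (the example is skipped otherwise). write.foreign writes two files: a plain-text data file and a code file that the target program runs to read that data. The temporary file names and the choice of SPSS as the target are illustrative assumptions.

```r
if (requireNamespace("foreign", quietly = TRUE)) {
  set.seed(1)
  mydf <- data.frame(x = rnorm(100), y = rnorm(100), z = rnorm(100))

  data_file <- tempfile(fileext = ".txt")  # raw data as plain text
  code_file <- tempfile(fileext = ".sps")  # SPSS syntax that reads the data

  foreign::write.foreign(mydf, datafile = data_file, codefile = code_file,
                         package = "SPSS")

  # Confirm that both files were written
  wrote_files <- file.exists(c(data_file, code_file))
  wrote_files

  unlink(c(data_file, code_file))  # clean up
}
```

Changing the package argument to "Stata" or "SAS" targets those programs instead; the code file then contains the corresponding syntax.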