Recently I’ve been hard at work modeling leaf temperature in evergreen forests. For various reasons, ecologists love to write R packages. This means that existing data access APIs and modeling code will most likely be available as R packages. This is a good thing - R is easy to read, quick to learn, and has a vibrant community of volunteer maintainers. I strongly believe that research, especially data-intensive work, should be reproducible. In this vein I wanted to take my project, refactor it as an R package, and then publish it on GitHub alongside the (forthcoming) manuscript.

How hard could that refactoring be? The answer: quite hard. R makes some fundamental design choices that make it difficult to go from a researcher’s typical R script to a package. In this post I’ll discuss a few reasons why this is so, and suggest a better way to package research.

The beginning

Let’s say you are a researcher and you work in R. You organize your projects around a series of .R files structured something like this:

library(somePackage)
library(anotherPackage)
library(yetAnotherPackage)

# ... do some kind of analysis ...

# Save results
write_csv(spicy_result, "data_out/spicy_result.csv")

This structure supports the interactive workflow people love about R. In this mode, you rarely source a whole file. Instead you go line by line, inspect outputs and plots, and iterate on an initial idea. In other words, you are free to put code that does different things all in one file. As you iterate on one file, the stack of library() calls at the top gets pretty large, even if you aren’t necessarily using all of those libraries later on (when was the last time you saw an R tutorial use ::?). Your script is less a self-contained program and more a sketchpad that you return to as your analysis evolves.

Suppose further that you use a directory structure to separate your data, scripts, and other outputs in a project. I like a structure like this:

.
├── scripts/
│   ├── util.R
│   └── models/
│       └── fit_my_model.R
├── data_in/
│   └── some_raw_data.csv
└── data_out/
    └── model_fit.csv

This is very good practice to keep yourself organized.

An important thing to note here is that fit_my_model.R might need access to functions in util.R to work. Assuming the working directory is the root of the project, you make these available with source("scripts/util.R") at the top of fit_my_model.R.
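Concretely, the top of fit_my_model.R ends up looking something like this (the package name is carried over from the sketch above):

# Top of scripts/models/fit_my_model.R (working directory = project root)
source("scripts/util.R")   # makes the helper functions in util.R available here
library(somePackage)       # plus whatever packages this script needs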

Issue 1: Everything needs to be a function now

Step 1 in making a package from a jumble of standalone scripts is to start converting your code into functions. Although this is definitely work, there are two reasons why converting code to functions isn’t so bad. First, you can get off to a pretty good start by just surrounding a code block in function() {...} (RStudio even has a shortcut for this). Second, writing functions encourages modularity, keeps you from repeating yourself, and can get you to think harder about how you design your workflow. Still, depending on how late in your development process you decide to make a package, rewriting as functions can take a while.
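As a rough before-and-after, reusing the fit_my_model.R example (the formula and column names are made up for illustration):

# Before: a block you would run line by line
model_data <- read.csv("data_in/some_raw_data.csv")
fit <- lm(leaf_temp ~ air_temp, data = model_data)

# After: the same block surrounded by function() {...}, with the hard-coded
# path and formula promoted to arguments
fit_my_model <- function(data_path, model_formula) {
  model_data <- read.csv(data_path)
  lm(model_formula, data = model_data)
}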

Issue 2: library() and source() are frowned upon

When writing standalone scripts, library() and source() are effectively an informal dependency mechanism. Your script depends on an external package if it attaches that package via library() at the top of the file. Likewise, your script depends on an external file via source(). In an R package, dependency rules are laid out formally in the DESCRIPTION file. Dependencies also have different “levels”. In order of importance, your dependencies can be listed under Depends (the dependency is attached when your package is loaded), Imports (you call functions from the dependency; its namespace is loaded but never attached to the user’s search path), or Suggests (the dependency is perhaps useful, but not required).
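The usethis package can record each level for you. A minimal sketch, reusing the placeholder package names from the script above:

# Each call adds the package to the corresponding field in DESCRIPTION
usethis::use_package("somePackage", type = "Depends")         # attached when your package is loaded
usethis::use_package("anotherPackage", type = "Imports")      # functions called, namespace not attached
usethis::use_package("yetAnotherPackage", type = "Suggests")  # optional, not required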

Given this functionality, R packages should never use library(). Likewise, source() will never work in a package because you won’t know where your source code lives on the end user’s machine. I suppose you could get around that by hiding code in inst/ and then using system.file() to source it, but hacks like that are definitely not a good development pattern.
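For completeness, the workaround would look roughly like this (the package and file names are hypothetical), which is hopefully enough to show why it isn’t a pattern to build on:

# Files under inst/ are installed alongside the package, so system.file() can
# locate them at runtime. It works, but it sidesteps the namespace machinery.
source(system.file("extra", "util.R", package = "myLeafPackage"))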

So now we have to go through all our scripts and delete the library() and source() calls. Not too much work. But doing so raises another issue…

Issue 3: Now you have to import all your external code

This is the first issue that took me a while to fix in my project. If you can’t make an external function available through library(), you have to list it in your package’s DESCRIPTION file and make it available accordingly. If you list an external package under Depends, you are done: packages in Depends are attached when your package is loaded, so it is as if you had called library() yourself. However, you can’t just spam every external package you used under Depends. There are general reasons not to (everything in Depends lands on the user’s search path, where it can mask other functions), but on a more practical level, CRAN will reject packages that have too many dependencies.

So how much work is it to make external functions available through Imports? The usethis package has a helpful function, use_import_from(), that will take care of the bookkeeping in DESCRIPTION and NAMESPACE. The problem is that you now have to call use_import_from() manually for every external function your package uses. If you, like me, have a large codebase with many mandatory dependencies, this becomes a tedious process of running your code, seeing where it fails, adding an import, running again, and so on until you are almost-but-not-100%-sure that you have imported everything correctly.
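In practice the loop looks something like this; the function names below are just stand-ins for whatever your code happens to call:

# One call per external function; each adds an @importFrom directive and makes
# sure the package is listed under Imports in DESCRIPTION
usethis::use_import_from("anotherPackage", "some_function")
usethis::use_import_from("yetAnotherPackage", "another_function")
# ...and repeat until R CMD check stops complaining about functions it can't find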

Issue 4: All your .R’s have to live in one folder

Ok, so you’ve done quite a bit of refactoring already. At this point you can probably devtools::install() your package and do some proper testing. But you may find that some of your own functions are not available. Why is this? In my case, it came down to the directory structure of the package. R packages have a mandatory structure. At minimum, an R package contains the following:

.
├── DESCRIPTION
├── NAMESPACE
├── LICENSE
└── R/

Here, DESCRIPTION, NAMESPACE, and LICENSE are fairly standard boilerplate files that can be auto-generated by the devtools package. Nothing too bad there. We hit a problem with the R/ directory. This folder contains all of your files of R code, and it is not allowed to have any subdirectories. No, really. A slightly amusing implication: files named aaa.R and zzz.R are conventionally used to hold the miscellaneous code that helps a package function, the names chosen so they sort to the very top and bottom of the flat directory listing and are easy to find. It definitely isn’t my place to comment on whether this was a good design choice. It is my place to argue that disallowing subfolders is a difference in convention between R users and R package developers that slows down package development. It also prevents developers from organizing code into submodules, as Python allows, a structure that encourages modular programming.

Of all the issues here, this is perhaps the easiest one to fix. Simply dump all your files into R/, make your filenames a little more descriptive, and you will be good to go.
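A minimal sketch of that flattening, assuming the directory layout from earlier; prefixing the filenames is just one way to remember which subdirectory a file came from:

# R/ cannot be nested, so encode the old subdirectory in the filename instead
file.copy("scripts/util.R", "R/util.R")
file.copy("scripts/models/fit_my_model.R", "R/models_fit_my_model.R")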

Conclusion: Use notebooks

All told I spent about a week refactoring my project into an R package. Was it worth it? Probably not. The package barely works and I don’t plan on submitting it to CRAN. Does the package reproduce my science? Sure, if someone is motivated enough to get everything working.

So if R packages are not the best vehicle for reproducibility, is there a better alternative? In my opinion, R notebooks, with the addition of functions in external files, are the best way to make your analysis reproducible. I have been skeptical of R notebooks because RStudio encourages you to knit them to another file format at the end of your analysis, and the knitting process often brings up issues unrelated to the content of the notebook. The development environment for R notebooks is also seriously lacking. In RStudio only about a quarter of the screen is occupied by the notebook editor pane, results disappear when a code chunk is modified but not rerun, markdown chunks have to be rendered by the knitting process, and R objects are not displayed in a human-readable format by default (necessitating helpers like knitr::kable()). By contrast, in the Python ecosystem, JupyterLab dedicates the whole screen to the notebook, results persist even when code is modified or rearranged, you can seamlessly toggle between raw and rendered markdown without going through a knitting step, and Python objects “know” how to display nicely in notebook format (e.g. DataFrames). Part of why Jupyter notebooks are so popular is that they blend the IDE with the finished product. R notebooks, in their current state, do not.

Why are R notebooks the best option available despite these disadvantages? Because at least two alternatives, writing a package and sending someone a Git repo full of .R files, are much worse. In a notebook you don’t have to follow rules about namespaces, so library() and source() are right at home; by extension, you can spread external files across subdirectories however you choose and reach their functions with source(). Not everything has to be a function; notebooks encourage the sketchpad style of development that is endemic to standalone scripts. And even if you have to deal with the knitting process, notebooks capture output alongside code, so your work can be reproduced.
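As a rough sketch of what I mean, a notebook for the project above can be as small as this (summarize_leaf_temps() is a hypothetical helper living in scripts/util.R):

---
title: "Leaf temperature analysis"
output: html_notebook
---

```{r setup}
library(readr)             # for read_csv()/write_csv()
source("scripts/util.R")   # subdirectories and source() are fine here
```

```{r analysis}
raw <- read_csv("data_in/some_raw_data.csv")
spicy_result <- summarize_leaf_temps(raw)   # hypothetical helper from util.R
write_csv(spicy_result, "data_out/spicy_result.csv")
```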

TL;DR: is writing an R package to document research results worth it? No, unless you think your project will evolve into a tool. For documenting results, a notebook is your best bet.