Compartmentalized Documented Extendible Reproducible Robust

In Part 1, I will discuss how to make an R package, with specific emphasis on data R packages, and show you how to document functions and data with Roxygen2.

By the end of this workshop, you will be able to build your own mini R package using RStudio. I’ll show you how to host it on GitHub with a nice little landing page.

If/when you want to go into R packaging in more depth, see Hadley Wickham’s book R Packages.

Why a package?

Shorter answer

  • The package framework really helps you write robust code and well documented code.
  • It makes it easy to bundle data with code.
  • It make it easy to version and document your data.

Longer answer

An R package is an easy and the standard way to organize your R code, document your code, and share your code with other people. Why use an R package rather than just make a bunch of scripts with your data in a folder?

  • Reproducibility and documentation In the long-run, you will save yourself much work if you organize and document your code. Rather than writing a series of scripts that you copy and alter for each project, you think about how to make your scripts into functions.
  • You want to share your code If you are making code to that can be used for different data, rather than only your specific problem, then you want to make a package so that you can share your code.
  • Robust data sharing! Putting your data in a dedicated data package allows you to version your data (so everyone knows they are using the most up to date data), document your data, track data changes, provide data releases (with archives), provide easy visualizations of the data, and any other packages can load that data package and have access to the data.
  • You want to make an application If you want to make a shiny application, having your code in a package will help.

Set-up for PSAW III

No set-up necessary. You only need a browser as we’ll be working using RStudio Cloud.

For RStudio users who want to do this on their computer

You’ll need to install {roxygen2}, {knitr}, {rmarkdown} at least.

Mac users: You don’t need to do anything. Package building should just work.

Windows users: Try running this code and see what happens. You need to install {devtools} package if you don’t have it.

# install.packages("devtools")
devtools::install_github("RVerse-Tutorials/TestPackage")

If that code complains, then you need to install RTools. Note there is a different RTools for R 4.0.0 (released in April 2020) versus earlier R releases. Look for the little link for earlier versions of RTools if you don’t have 4.0.0 installed. Technically, it says you only need RTools to install packages with C/C++ so you might be fine. Personally, I always install RTools on my Windows machines since I install packages with C/C++ sometimes. But to keep things simple, try building a package without RTools and see if it works.

A simple package

On RStudio Cloud

Open this link, TestPackage

Click to see how to do this in RStudio your computer

This makes a basic package using RStudio on your computer but it won’t be exactly like TestPackage.

  1. Open RStudio
  2. In the upper right hand corner, click the blue cube with R, and click New Project.
  3. Click ‘New Directory’ and choose R package.
  4. Name your package TestPackage and select the directory where to put it.
  5. Click Create Package.
  6. Click on the ‘Build’ tab in the upper right, and click ‘Install and Restart’. Your package should build and load.
  7. Click on click ‘check’. Your package should pass all the checks without errors.

Parts of an R package

The essentials

2 files and a directory.

  • DESCRIPTION This file has the meta-data about your package. Name and what packages it depends on. Most of it is self-explanatory. The Depends: and Imports: lines specify any functions from other packages that you use in your functions.

  • NAMESPACE This file indicates what needs to be exposed to users for your R package. For our course, you won’t need to edit as {roxygen2} takes care of it.

  • R directory This is where all your R code goes for your package.

Basic add-ons

  • man A directory for documentation. You won’t need to write this. It will be added automatically by {roxygen2}.

  • data A directory for data files saved in RData format with the ending .rda or .RData. Nothing else!

Other add-ons

  • inst folder for misc stuff

  • inst\extdata folder for external data.

  • data-raw A directory for raw data files that produced the data files in data folder.

  • .Rbuildignore optional, but in practice you will always need this.

Let’s add to TestPackage

Double-check that Project Options are correct

  1. Click on Tools > Project Options > Build Tools

  2. Make sure Generate documentation with Roxygen is checked. Don’t see that? Then you need to install the {roxygen2} package.

  3. Click Configure next to the Roxygen line. Make sure all the checkboxes are checked. The last 2 won’t be by default.

Add a function to TestPackage

  1. Create a new R script file. File > New File > R Script.

  2. Paste this code into the script and save as hello.R in the R directory.

#' Hello!
#' 
#' This just says hello.
#' @export
hello <- function(){ cat("HELLO") }
  1. Click Install and Restart from the Build tab.

  2. Click Check from the Build tab to make sure we didn’t make any errors.

Use your new function

Learn about your function with

?hello

Use your function with

hello()

Add data

  1. Add a folder called data

  2. Run these lines from the command line.

WWW2 <- WWWusage^2
save(WWW2, file="data/WWW2.rda")
  1. Click Install and Restart from the Build tab

  2. Now your data are available from your package. Type

WWW2
littleforecast(WWW2)

at the command line.

The key components

The DESCRIPTION file

Open the file named DESCRIPTION. Most of it is self-explanatory.

  • Depends: means the user will have all the commands of that package at the command line.
  • Imports: is any other R packages that your package needs in order to work but it’s functions won’t be available at the command line (unless you choose).
Package: TestPackage
Title: This Is A Toy Package
Version: 1.3
Author: Eli Holmes
Maintainer: <eli.holmes@noaa.gov>
Description: This is a super simple toy package for students to copy and experiment with for the short course.
Depends: R (>= 3.4.1)
Imports: forecast, ggplot2
License: GPL-2
Encoding: UTF-8
LazyData: true
RoxygenNote: 7.1.2

The packages on the Depends and Imports lines are required to be installed in order to install your package. If the user doesn’t have these packages, then they will be installed when installing the package. When you try to Build and Install, R will complain and throw an error if you are missing packages.

The NAMESPACE file

This file has the commands to export the functions (in the R folder) to the command line for use. If you don’t have a function here, the user will need to use ::: to access the function.

We will use {roxygen2} to make our NAMESPACE file.

exportPattern("^[[:alpha:]]+")
export(littleforecast)

The first line means “export all functions”. The next line is exporting the littleforecast function.

How does {roxygen2} know to export a function? Add this to the documentation code at the top of your functions.

#' @export

The R Directory: Function code

This is where functions are put and our data documentation files. Each file is a separate function. You can put multiple functions in one file, but that can get confusing unless they are small functions. The top of the function has documentation in {roxygen2} format.

#' @title A little foo function
#'
#' @description This little function does this.
#'
#' @param arg1 what this argument is
#' @export foo
foo <- function(arg1){
# The work
return(<what you want to return to user>)
}

.Rbuildignore

Though not required, in practice you will need to tell R what not to include in your package. RStudio will make this for you but you need to check it and add more stuff.

^.*\.Rproj$
^\.Rproj\.user$
^TestPackage\.Rcheck$
^TestPackage.*\.tar\.gz$
^TestPackage.*\.tgz$
.github
.git

Let’s add another function to TestPackage

  1. Create a new R script file. File > New File > R Script.

  2. Paste this code into the script and save as hello.R in the R directory.

#' dplyr example
#' 
#' This adds a new function that needs {dplyr}
#' @param col which column to average
#' @export
irisaverages <- function(col = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")){
col <- match.arg(col)
iris$col <- iris[[col]]
iris %>% dplyr::group_by(Species) %>% 
   dplyr::summarize(mean = mean(col))
}

We now use {dplyr} and %>% (pipe).

We can either add {dplyr} to Depends in our DESCRIPTION file but that would load the whole {dplyr} library and maybe we don’t want to do that.

We can add {dplyr} to Imports but how to get %>%? Add a file import_packages.R to the R folder (the name of the file is unimportant).

#' @importFrom magrittr %>%
NULL

How would I ever remember this?? Sadly if your use piping, you’ll gets lots of practice with this.

Data package in R

Open this link, ToyDataPackage

A data package can be exactly the same as a code package except that you don’t have much in the R folder and you have a lot in the data folder. A “data” package is just dedicated to data. There is nothing else very special about it.

New parts of the package

  1. A data-raw folder. Optional but can but used for the raw data.
  2. data folder. Required. Your data save as rda or RData.
  3. An R script in the R folder with documentation for your data.

Example

cars2 <- read.csv("data-raw/cars2.csv", row.names=1)
save(cars2, file="data/cars2.rda")

And then in the R folder is data-cars2.R. Tip, it is good to give your data documentation scripts a clear name tag to distinguish them from functions.

#' @title a dataset of horsepower for different cars
#'
#' @description First 4 columns of the mtcars dataset.
#'
#' \itemize{
#'   \item mpg. miles per gallon
#'   \item cyl. cylinders
#'   \item disp. displacement
#'   \item hp. horse poser
#' }
#'
#' @docType data
#' @name cars2
#' @usage data(cars2)
#' @references R base package.
#' @format A data frame.
#' @keywords datasets
NULL

Note, in the latest Roxygen2, you don’t need the @name but that only works if you use LazyData: true in your DESCRIPTION file. For a pure data package, you might not want to do that.

  1. Click Install and Restart from the Build tab.

Let’s use our new data package in a R Markdown document.

---
title: "Untitled"
output: html_document
---

```{r}
library(ToyDataPackage)
```

```{r, echo=FALSE}
library(ToyDataPackage)
data(cars2)
knitr::kable(cars2, 
             caption=paste("This is version", packageVersion("ToyDataPackage')))
```

File names, data names and documentation

rda file names

The rda filename in the data folder is what is used to load data. For example, let’s say you have

save(cars1, cars2, file="data/carsdata.rda")

So 2 data objects saved to one rda file. To load both data objects, you use

data(carsdata)

Data documentation

What do I document: cars1, cars2 or carsdata? You can actually do whatever you want.

Do this to show this documentation with ?cars2.

#' @title a dataset of horsepower for different cars
#'
#' @docType data
#' @name cars2
NULL

Do this to show this documentation with ?cars1, ?cars2, and ?carsdata

#' @title some datasets of horsepower for different cars
#'
#' @docType data
#' @name carsdata
#' @aliases cars1 cars2
NULL

Do this to show this documentation with ?carsdata.

#' @title three datasets of horsepower for different cars
#'
#' @docType data
#' @name carsdata
NULL

This will only work for data that are exported. That means Lazydata: true and what is loaded from data(carsdata).

#' @title three datasets of horsepower for different cars
#'
#' @docType data
"cars2"

So this fails since it is not carsdata that is exported. That is just the name of the data file.

#' @title three datasets of horsepower for different cars
#'
#' @docType data
"carsdata"

Changes to that workflow

  • Can I have multiple data objects saved to one rda file? Yes. Use @alias in your Roxygen2 file (in the R folder) to use the same documentation for each data object or create separate documentation files for each data object.
  • My data is not csv files. That was just an example. For documentation purposes, it is now recommended to use data-raw so that you have the raw data and the rda files in the data directory. You can put whatever you want into data-raw.
  • Won’t including the raw data make my R package huge? Yes, it would. In your .Rbuildignore file, add the line ^data-raw$ to not include that in a build.
  • Would you always put raw data in data-raw? No. Another common place is inst\extdata. Which one you use is up to you. I use extdata more as a sandbox and it will have all sorts of info used to make the data files.
  • My data are huge or the raw data should not be on any repo. I don’t want them part of the package at all. That’s fine. You only need the data folder.

Comment on Roxygen2 headers for data

The R Packages section on documenting data shows you how to write your Roxygen2 code to document data.

But keep the following in mind. The Roxygen2 code that is shown in the R Packages book above is the “new” style which needs LazyData: true in your DESCRIPTION file. Here’s how the new Roxygen2 code looks. Notice no @name mydata and NULL at the bottom is replaced with "mydata".

#' @title My Data
#'
#' @description My dataset on diamonds and here is more info.
#'
#' \itemize{
#'   \item price. price in US dollars
#'   \item carat. weight of the diamond
#' }
#'
#' @docType data
#' @usage data(mydata)
#' @format A data frame with 10 rows and 2 variables
"mydata"

If you changed LazyData: false, all that Roxygen2 code is going to fail. So I personally would never use the new Roxygen2 “shortcut”.

Why would you ever set LazyData: false? Because some of your data have the same name. I make R data packages with 100s of datasets with the exact same structure and same name. I use them like so where dat is a character string name of my data:

plot_data <- function(dat){
data(dat)
gpplot(salmon, aes(x=year, y=spawners)) + geom_point()
}

All my data are stored with the name salmon not with the data file name.

So like so:

salmon <- read.csv(file="data-raw/columbia-river-chinook-esu.csv")
save(salmon, file="data/columbia-river-chinook-esu.rda")

I don’t ever want to refer to the Columbia River data as columbia-river-chinook-esu. In my workflow, that wouldn’t make sense.

But in other applications, it often makes sense to give your data a specific name, like sst or nooksack-river or thedata. In that case, the style in the R Packages section on documenting data is fine.

Examples of real data packages

  • SardineForecast A complex package with lots of different types of data.
  • VRData This is a data repo that I am developing for NWFSC application.

Creating your own R package

The base package

  • Create a new R package in RStudio via New Project in upper right corner.
  • Once you select New Project, select New Directory > R package. That will create a template for your.
  • Install the {roxygen2} package if needed.
  • Open Tools > Project Options > Build Tools. Then click the checkbox to create documentation with roxygen2. Next click Configure and make sure all the checkboxes are checked (all of them)
  • Delect NAMESPACE since you are going to have {roxygen2} make this.
  • From the Build tab, click ‘Install and Restart’. Now the {roxygen2} info will be added to your DESCRIPTION file and {roxygen2} will make your NAMESPACE
  • Go into the R folder and replace the hello.R code with this so you have a better starting template. Later you will develop your own {roxygen2} template so you don’t have to remember how to document a function.
#' This is the title
#'
#' This is the description
#'
#' @param arg1 describe what arg1 is
#'
#' @export
#' @keywords functions
hello <- function(arg1) {
  print("Hello, world!")
}

Add functions to your package

  • Add .R files to the R folder with your {roxygen2} header.

Add data to your package

  • Create data folder.
  • Add .rda (or .RData doesn’t matter which) files to the data folder.
  • Add documentation files for your data objects in the R folder. The file should have .R ending but the filename doesn’t matter. I suggest something like data-dataname.R. Using data- tag will help you find all your data documentation files.

If you use LazyData: true and your data all have unique names, your data documentation file (e.g. data-dataname.R which you put in the R folder) can take the form:

#' title
#'
#' data description
#'
"dataname"

where dataname is the name of the data object. Note {roxygen2} will throw an error if dataname is not a loaded data object in your data folder.

If you use LazyData: false, your data do not have unique names, or you want more flexibility in creating your data documentation, use something like this for your data documentation file:

#' title
#'
#' data description
#'
#' @docType data
#' @name dataname
NULL

The @name part specifies what name will call up help when the users types ?dataname. dataname does not need to exist as an object. You can use any text you want.

Want the same documentation to come up for multiple names? For example, the same documentation page to appear for ?species1 and ?species2 and ?surveydata? Use @aliases like so

#' @name surveydata
#' @aliases species1 species2

Note, if you want to keep your raw data and code to convert that into the rda files with the package, put in data-raw. To keep that directory out of the package that users install, add data-raw to the .Rbuildignore file.

Checking your package

  • Run Check from the Build tab regularly. It is best not to put this off too long or else you’ll create a lot more work for yourself as you track down one warning or error after another.
  • Don’t let yourself suffer trying to figure out what the warnings and errors mean. Most are simple namespace issues that are easy to fix once you know how. Ask someone who builds packages. The NMFS R UG Group Chat is a good place.

NWFSC Math Bio Program