Compartmentalized | Documented | Extendible | Reproducible | Robust |
In Part 1, I will discuss how to make an R package, with specific emphasis on data R packages, and show you how to document functions and data with Roxygen2.
By the end of this workshop, you will be able to build your own mini R package using RStudio. I’ll show you how to host it on GitHub with a nice little landing page.
If/when you want to go into R packaging in more depth, see Hadley Wickham’s book R Packages.
An R package is the standard, and an easy, way to organize your R code, document it, and share it with other people. Why use an R package rather than just make a bunch of scripts with your data in a folder?
No set-up necessary. You only need a browser as we’ll be working using RStudio Cloud.
You’ll need to install {roxygen2}, {knitr}, {rmarkdown} at least.
Mac users: You don’t need to do anything. Package building should just work.
Windows users: Try running this code and see what happens. You need to install {devtools} package if you don’t have it.
# install.packages("devtools")
devtools::install_github("RVerse-Tutorials/TestPackage")
If that code complains, then you need to install RTools. Note there is a different RTools for R 4.0.0 (released in April 2020) versus earlier R releases. Look for the little link for earlier versions of RTools if you don’t have 4.0.0 installed. Technically, it says you only need RTools to install packages with C/C++ so you might be fine. Personally, I always install RTools on my Windows machines since I install packages with C/C++ sometimes. But to keep things simple, try building a package without RTools and see if it works.
Open this link, TestPackage
This makes a basic package using RStudio on your computer but it won’t be exactly like TestPackage.
TestPackage
and select the directory where to put it.
2 files and a directory.
DESCRIPTION This file has the meta-data about your package: its name and what packages it depends on. Most of it is self-explanatory. The Depends: and Imports: lines specify the other packages whose functions you use in your functions.
NAMESPACE This file indicates what needs to be exposed to users for your R package. For our course, you won’t need to edit it, as {roxygen2} takes care of it.
R directory This is where all your R code goes for your package.
man A directory for documentation. You won’t need to write this. It will be added automatically by {roxygen2}.
data A directory for data files saved in RData format with the ending .rda or .RData. Nothing else!
inst A folder for misc stuff.
inst/extdata A folder for external data.
data-raw A directory for raw data files that produced the data files in the data folder.
.Rbuildignore Optional, but in practice you will always need this.
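If it helps to see that layout concretely, here is a minimal sketch that builds the same skeleton in a temporary directory with base R. The file contents are placeholders made up for illustration (the Version value, for instance) and do not make a complete package; in practice RStudio and {roxygen2} create these files for you.

```r
# Build the skeleton in a temporary directory so nothing in your project is touched.
pkg <- file.path(tempdir(), "TestPackage")
dirs <- c("R", "man", "data", "data-raw", file.path("inst", "extdata"))
for (d in dirs) {
  dir.create(file.path(pkg, d), recursive = TRUE, showWarnings = FALSE)
}

# Placeholder files (assumed contents, for illustration only).
writeLines(c("Package: TestPackage", "Version: 0.1"), file.path(pkg, "DESCRIPTION"))
writeLines("exportPattern(\"^[[:alpha:]]+\")", file.path(pkg, "NAMESPACE"))
writeLines("^data-raw$", file.path(pkg, ".Rbuildignore"))

# Show everything that was created, including the hidden .Rbuildignore.
list.files(pkg, all.files = TRUE, recursive = TRUE, include.dirs = TRUE)
```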
Click on Tools > Project Options > Build Tools
Make sure Generate documentation with Roxygen is checked. Don’t see that? Then you need to install the {roxygen2} package.
Click Configure next to the Roxygen line. Make sure all the checkboxes are checked. The last 2 won’t be by default.
Create a new R script file. File > New File > R Script.
Paste this code into the script and save as hello.R in the R directory.
#' Hello!
#'
#' This just says hello.
#' @export
hello <- function(){ cat("HELLO") }
Click Install and Restart from the Build tab.
Click Check from the Build tab to make sure we didn’t make any errors.
Learn about your function with
?hello
Use your function with
hello()
Add a folder called data
Run these lines from the command line.
WWW2 <- WWWusage^2
save(WWW2, file="data/WWW2.rda")
Click Install and Restart from the Build tab
Now your data are available from your package. Type
WWW2
littleforecast(WWW2)
at the command line.
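If you want to convince yourself of what ends up inside the rda file, here is a small sketch that saves WWW2 to a temporary folder (standing in for the package's data folder, which is an assumption for the example) and loads it back into a fresh environment.

```r
# WWWusage ships with R in the datasets package.
WWW2 <- WWWusage^2

# A temporary stand-in for the package's data folder.
data_dir <- file.path(tempdir(), "data")
dir.create(data_dir, showWarnings = FALSE)
rda_file <- file.path(data_dir, "WWW2.rda")
save(WWW2, file = rda_file)

# load() returns the names of the objects stored in the file.
env <- new.env()
loaded <- load(rda_file, envir = env)
loaded
```

The object comes back under the name it was saved with (WWW2), not under the file name. That distinction matters later when one rda file holds several objects.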
Open the file named DESCRIPTION. Most of it is self-explanatory.
Depends: means the user will have all the commands of that package at the command line.
Imports: is any other R packages that your package needs in order to work but whose functions won’t be available at the command line (unless you choose).
Package: TestPackage
Title: This Is A Toy Package
Version: 1.3
Author: Eli Holmes
Maintainer: <eli.holmes@noaa.gov>
Description: This is a super simple toy package for students to copy and experiment with for the short course.
Depends: R (>= 3.4.1)
Imports: forecast, ggplot2
License: GPL-2
Encoding: UTF-8
LazyData: true
RoxygenNote: 7.1.2
The packages on the Depends and Imports lines are required to be installed in order to install your package. If the user doesn’t have these packages, then they will be installed when installing the package. When you try to Build and Install, R will complain and throw an error if you are missing packages.
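To see what R reads from those two lines, here is a small sketch that uses base R's read.dcf() on a cut-down copy of the DESCRIPTION file above. The missing_pkgs check is only an illustration of the idea; it is not what the installer actually runs.

```r
# A cut-down copy of the DESCRIPTION above, written to a temporary file.
desc_lines <- c(
  "Package: TestPackage",
  "Title: This Is A Toy Package",
  "Version: 1.3",
  "Depends: R (>= 3.4.1)",
  "Imports: forecast, ggplot2"
)
desc_file <- tempfile("DESCRIPTION")
writeLines(desc_lines, desc_file)

# read.dcf() returns a character matrix with one column per field.
desc <- read.dcf(desc_file)
imports <- trimws(strsplit(desc[1, "Imports"], ",")[[1]])

# Which imported packages are not installed on this machine?
installed <- vapply(imports, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- imports[!installed]
missing_pkgs
```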
The NAMESPACE file has the commands to export the functions (in the R folder) to the command line for use. If you don’t have a function here, the user will need to use ::: to access the function.
We will use {roxygen2} to make our NAMESPACE file.
exportPattern("^[[:alpha:]]+")
export(littleforecast)
The first line means “export all functions”: more precisely, every object whose name starts with a letter. The next line is exporting the littleforecast function.
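A quick way to check what the exportPattern() regular expression picks up is to try it on a few names with grepl(). The name .internal_helper is hypothetical, included only to show a name that does not match.

```r
# The same regular expression as in exportPattern() above.
pattern <- "^[[:alpha:]]+"

# .internal_helper is a made-up name that starts with a dot, not a letter.
object_names <- c("hello", "littleforecast", ".internal_helper")
exported <- grepl(pattern, object_names)

# TRUE means the name would match the export pattern.
setNames(exported, object_names)
```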
How does {roxygen2} know to export a function? Add this to the documentation code at the top of your functions.
#' @export
The R folder is where functions and our data documentation files are put. Usually each file holds a separate function. You can put multiple functions in one file, but that can get confusing unless they are small functions. The top of the function has documentation in {roxygen2} format.
#' @title A little foo function
#'
#' @description This little function does this.
#'
#' @param arg1 what this argument is
#' @export foo
foo <- function(arg1){
  # The work goes here
  result <- arg1  # placeholder: replace with what you want to return to the user
  return(result)
}
.Rbuildignore
Though not required, in practice you will need to tell R what not to include in your package. RStudio will make this for you but you need to check it and add more stuff.
^.*\.Rproj$
^\.Rproj\.user$
^TestPackage\.Rcheck$
^TestPackage.*\.tar\.gz$
^TestPackage.*\.tgz$
^\.github$
^\.git$
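Each line in .Rbuildignore is a regular expression that R matches, case-insensitively, against file paths relative to the top of the package. The sketch below mimics that with grepl() on a few made-up paths (R/digits.R is hypothetical) to show why anchoring with ^ and $ matters. It is an approximation for illustration, not the code R CMD build runs.

```r
# Made-up paths relative to the top of the package.
paths <- c("TestPackage.Rproj", ".Rproj.user", ".git", ".github",
           "R/digits.R", "R/hello.R")

# Return the paths that a given ignore pattern would match.
ignored_by <- function(pattern) {
  paths[grepl(pattern, paths, perl = TRUE, ignore.case = TRUE)]
}

rproj      <- ignored_by("^.*\\.Rproj$")  # only the .Rproj file
unanchored <- ignored_by(".git")          # "." matches any character
anchored   <- ignored_by("^\\.git$")      # only the .git directory
unanchored
```

The unanchored pattern also matches R/digits.R, because the "." stands for any character and "igit" appears in the file name. Anchored patterns avoid dropping files by accident.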
Create a new R script file. File > New File > R Script.
Paste this code into the script and save it in the R directory under a new name (for example irisaverages.R; reusing hello.R would overwrite the earlier function).
#' dplyr example
#'
#' This adds a new function that needs {dplyr}
#' @param col which column to average
#' @export
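irisaverages <- function(col = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")){
  # Validate the choice; defaults to the first option
  col <- match.arg(col)
  # Copy the chosen column into a column literally named "col"
  iris$col <- iris[[col]]
  iris %>% dplyr::group_by(Species) %>%
    dplyr::summarize(mean = mean(col))
}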
We now use {dplyr} and %>% (pipe).
We could add {dplyr} to Depends in our DESCRIPTION file but that would load the whole {dplyr} library and maybe we don’t want to do that.
We can add {dplyr} to Imports but how to get %>%? Add {magrittr} to Imports as well, and add a file import_packages.R to the R folder (the name of the file is unimportant).
#' @importFrom magrittr %>%
NULL
How would I ever remember this?? Sadly, if you use piping, you’ll get lots of practice with this.
Open this link, ToyDataPackage
A data package can be exactly the same as a code package except that you don’t have much in the R folder and you have a lot in the data folder. A “data” package is just dedicated to data. There is nothing else very special about it.
data-raw folder. Optional but can be used for the raw data.
data folder. Required. Your data saved as rda or RData.
R folder with documentation for your data.
cars2 <- read.csv("data-raw/cars2.csv", row.names=1)
save(cars2, file="data/cars2.rda")
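Here is a self-contained sketch of that raw-to-rda step that you can run anywhere. It assumes, following the documentation shown for cars2, that the raw file holds the first 4 columns of mtcars, and it uses temporary folders as stand-ins for data-raw and data.

```r
# Temporary stand-ins for the package's data-raw and data folders.
raw_dir  <- file.path(tempdir(), "data-raw")
data_dir <- file.path(tempdir(), "data")
dir.create(raw_dir, showWarnings = FALSE)
dir.create(data_dir, showWarnings = FALSE)

# Stand-in raw file: first 4 columns of mtcars, car names in the first column.
write.csv(mtcars[, 1:4], file.path(raw_dir, "cars2.csv"))

# The same two steps as above: read the raw csv, save the object as rda.
cars2 <- read.csv(file.path(raw_dir, "cars2.csv"), row.names = 1)
save(cars2, file = file.path(data_dir, "cars2.rda"))
str(cars2)
```

Keeping a script like this in data-raw means the rda file can always be rebuilt from the raw data.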
And then in the R folder is data-cars2.R. Tip: it is good to give your data documentation scripts a clear name tag to distinguish them from functions.
#' @title a dataset of horsepower for different cars
#'
#' @description First 4 columns of the mtcars dataset.
#'
#' \itemize{
#' \item mpg. miles per gallon
#' \item cyl. cylinders
#' \item disp. displacement
#' \item hp. horsepower
#' }
#'
#' @docType data
#' @name cars2
#' @usage data(cars2)
#' @references R base package.
#' @format A data frame.
#' @keywords datasets
NULL
Note, in the latest Roxygen2, you don’t need the @name but that only works if you use LazyData: true in your DESCRIPTION file. For a pure data package, you might not want to do that.
Let’s use our new data package in an R Markdown document.
---
title: "Untitled"
output: html_document
---
```{r}
library(ToyDataPackage)
```
```{r, echo=FALSE}
library(ToyDataPackage)
data(cars2)
knitr::kable(cars2,
  caption=paste("This is version", packageVersion("ToyDataPackage")))
```
rda file names
The rda filename in the data folder is what is used to load data. For example, let’s say you have
save(cars1, cars2, file="data/carsdata.rda")
So 2 data objects are saved to one rda file. To load both data objects, you use
data(carsdata)
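You can check the two-objects-in-one-file behavior without building a package. In this sketch cars1 is a hypothetical object (the first 2 columns of mtcars), and load() on a temporary file stands in for data(carsdata), which needs an installed package.

```r
# cars1 is made up for the example; cars2 follows the earlier definition.
cars1 <- mtcars[, 1:2]
cars2 <- mtcars[, 1:4]

# Save both objects into a single rda file.
rda_file <- file.path(tempdir(), "carsdata.rda")
save(cars1, cars2, file = rda_file)

# Load into a fresh environment; load() returns the object names it restored.
env <- new.env()
loaded <- load(rda_file, envir = env)
loaded
```

The file name carsdata never appears among the loaded objects. It is only the handle used to fetch them.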
What do I document: cars1, cars2 or carsdata? You can actually do whatever you want.
Do this to show this documentation with ?cars2.
#' @title a dataset of horsepower for different cars
#'
#' @docType data
#' @name cars2
NULL
Do this to show this documentation with ?cars1, ?cars2, and ?carsdata.
#' @title some datasets of horsepower for different cars
#'
#' @docType data
#' @name carsdata
#' @aliases cars1 cars2
NULL
Do this to show this documentation with ?carsdata.
#' @title three datasets of horsepower for different cars
#'
#' @docType data
#' @name carsdata
NULL
The following form, with the object name in quotes in place of NULL, will only work for data that are exported. That means LazyData: true, and the name must be an object that is loaded from data(carsdata).
#' @title three datasets of horsepower for different cars
#'
#' @docType data
"cars2"
So this next one fails, since it is not carsdata that is exported. That is just the name of the data file.
#' @title three datasets of horsepower for different cars
#'
#' @docType data
"carsdata"
rda file? Yes. Use @aliases in your Roxygen2 file (in the R folder) to use the same documentation for each data object or create separate documentation files for each data object.
data-raw so that you have the raw data and the rda files in the data directory. You can put whatever you want into data-raw.
.Rbuildignore file, add the line ^data-raw$ to not include that in a build.
data-raw? No. Another common place is inst/extdata. Which one you use is up to you. I use extdata more as a sandbox and it will have all sorts of info used to make the data files.
data folder.
hello.R code with this so you have a better starting template. Later you will develop your own {roxygen2} template so you don’t have to remember how to document a function.
#' This is the title
#'
#' This is the description
#'
#' @param arg1 describe what arg1 is
#'
#' @export
#' @keywords functions
hello <- function(arg1) {
print("Hello, world!")
}
.R files to the R folder with your {roxygen2} header.
data folder.
.rda (or .RData, doesn’t matter which) files to the data folder.
.R ending but the filename doesn’t matter. I suggest something like data-dataname.R. Using a data- tag will help you find all your data documentation files.
If you use LazyData: true and your data all have unique names, your data documentation file (e.g. data-dataname.R which you put in the R folder) can take the form:
#' title
#'
#' data description
#'
"dataname"
where dataname is the name of the data object. Note {roxygen2} will throw an error if dataname is not a loaded data object in your data folder.
If you use LazyData: false, your data do not have unique names, or you want more flexibility in creating your data documentation, use something like this for your data documentation file:
#' title
#'
#' data description
#'
#' @docType data
#' @name dataname
NULL
The @name part specifies what name will call up help when the user types ?dataname. dataname does not need to exist as an object. You can use any text you want.
Want the same documentation to come up for multiple names? For example, the same documentation page to appear for ?species1 and ?species2 and ?surveydata? Use @aliases like so
#' @name surveydata
#' @aliases species1 species2
Note, if you want to keep your raw data and the code to convert that into the rda files with the package, put them in data-raw. To keep that directory out of the package that users install, add ^data-raw$ to the .Rbuildignore file.
Comment on Roxygen2 headers for data
The R Packages section on documenting data shows you how to write your Roxygen2 code to document data.
But keep the following in mind. The Roxygen2 code that is shown in the R Packages book above is the “new” style which needs LazyData: true in your DESCRIPTION file. Here’s how the new Roxygen2 code looks: there is no @name mydata, and the NULL at the bottom is replaced with "mydata".
If you changed to LazyData: false, all that Roxygen2 code is going to fail. So I personally would never use the new Roxygen2 “shortcut”.
Why would you ever set LazyData: false? Because some of your data have the same name. I make R data packages with 100s of datasets with the exact same structure and same name. I use them via dat, a character string name of my data.
All my data are stored with the name salmon, not with the data file name, so I don’t ever want to refer to the Columbia River data as columbia-river-chinook-esu. In my workflow, that wouldn’t make sense.
But in other applications, it often makes sense to give your data a specific name, like sst or nooksack-river or thedata. In that case, the style in the R Packages section on documenting data is fine.
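The same-name workflow described above can be sketched roughly as follows. This is my own minimal illustration under stated assumptions, not the author's actual code: the data values are invented, the helper load_salmon() is hypothetical, and temporary files stand in for the package's data folder. In an installed package with LazyData: false, I believe data(list = dat, package = "YourPackage") plays the role of the load() call here.

```r
# Temporary stand-in for a data package's data folder.
data_dir <- file.path(tempdir(), "salmon-data")
dir.create(data_dir, showWarnings = FALSE)

# Two files, each storing an object with the same name: salmon.
# The numbers are invented for the example.
salmon <- data.frame(year = 2001:2003, count = c(10, 12, 9))
save(salmon, file = file.path(data_dir, "columbia-river-chinook-esu.rda"))
salmon <- data.frame(year = 2001:2003, count = c(4, 6, 5))
save(salmon, file = file.path(data_dir, "nooksack-river.rda"))
rm(salmon)

# Hypothetical helper: dat names the data file, the object is always salmon.
load_salmon <- function(dat, dir = data_dir) {
  env <- new.env()
  load(file.path(dir, paste0(dat, ".rda")), envir = env)
  env$salmon
}

dat <- "nooksack-river"
salmon <- load_salmon(dat)
salmon
```

Downstream code only ever refers to salmon, so the same analysis runs unchanged on any of the datasets by switching dat.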