Note: in the Q&A (and more briefly in the lecture), I talked about data from the data frame specified in ggplot(data=df) flowing into geom_...(data=df2). That doesn’t actually happen. See the notes below.
The lecture introduces you to the way ggplot2 is structured, because it is quite different from base plotting in R and takes some getting used to. Once you are ready to go more in depth, here is some material that I have found useful.
I am not going to cover data wrangling with the tidyverse, but that is also a key skill for data science. I am not covering it because I don’t think it is well suited to a 1-hour lecture. Here are some links to get you started:
data=df
Note: in the Q&A, I was talking about problems with inheriting data frame information when you have multiple data frames in your ggplot() code. Basically, I implied that your ggplot code doesn’t always respect something like geom_line(data=df1) and would instead use the data frame defined in the ggplot() part of your code. That is not true.
In my example it appeared to do so, but it was actually using val, a variable that I had defined earlier in my environment.
So this code will properly throw an error because df2 does not have y. Note that I use data=df in the ggplot() call so you can clearly see what the data is in each function call.
library(ggplot2)
df <- data.frame(x=1:10, y=1:10, x2=1:10+2, y2=1:10+2)
df2 <- data.frame(x=df$x+10, y2=df$y)
ggplot(data=df, aes(x=x, y=y)) + geom_point(data=df2)
## Error in FUN(X[[i]], ...): object 'y' not found
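If I define y in my environment, then R will use that. But it doesn’t use the x that I also defined, because the layer’s data frame (df2, which has an x column) is searched first. The environment is only searched for names that are not columns in that data frame.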
y <- 1
x <- 1
ggplot(data=df, aes(x=x, y=y)) + geom_point(data=df2)
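One way to confirm which values a layer actually used is to pull out the built layer data with layer_data(). This is a minimal sketch, assuming the same df and df2 as above; p and built are just names I picked for illustration.

```r
library(ggplot2)

df <- data.frame(x=1:10, y=1:10, x2=1:10+2, y2=1:10+2)
df2 <- data.frame(x=df$x+10, y2=df$y)

# Variables in the environment with the same names as the aesthetics
y <- 1
x <- 1

p <- ggplot(data=df, aes(x=x, y=y)) + geom_point(data=df2)

# The data that the first (and only) layer ended up plotting
built <- layer_data(p, 1)

# x comes from the x column of df2, not from the x in the environment
all(built$x == df2$x)

# y is not a column of df2, so it comes from the y in the environment
all(built$y == 1)
```

Checking the built data like this is a quick way to spot a layer that is silently picking up a variable from your environment.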
Being more specific in my aes() call, by using .data, tells R not to go looking for x and y outside the data frame specified by the data argument, and the code properly throws an error.
ggplot(data=df, aes(x=.data$x, y=.data$y)) + geom_point(data=df2)
## Error: Column `y` not found in `.data`
Note what is happening in this code:
* geom_point() has new data defined with data=df2.
* But it doesn’t have its needed aesthetics (x and y), so it inherits them from ggplot().
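It can also help to know when the error happens. The sketch below, assuming the same df and df2, shows that creating the plot object is fine and the error only appears when the plot is built (which is what printing does). The names p and msg are mine.

```r
library(ggplot2)

df <- data.frame(x=1:10, y=1:10, x2=1:10+2, y2=1:10+2)
df2 <- data.frame(x=df$x+10, y2=df$y)

# Creating the plot object does not evaluate the aesthetics, so no error yet
p <- ggplot(data=df, aes(x=.data$x, y=.data$y)) + geom_point(data=df2)

# The aesthetics are evaluated when the plot is built.
# tryCatch() captures the error message instead of stopping the script.
msg <- tryCatch({
  ggplot_build(p)
  "no error"
}, error=function(e) conditionMessage(e))

msg
```

This is why a bug like this can sit in a script unnoticed until the line that prints or saves the plot.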
Sometimes specifying your aes() in your geom_...() part using .data is not enough, because other aesthetics are flowing through that you don’t want. In that case, inherit.aes=FALSE can be helpful. Here I don’t want aes(color="a") to be inherited.
df <- data.frame(x=1:10, y=1:10, x2=1:10+2, y2=1:10+2)
df2 <- data.frame(x=df$x+10, y2=df$y)
ggplot(data=df, aes(x=.data$x, y=.data$y, color="a")) +
geom_point(data=df2, aes(x=.data$x, y=.data$y2), inherit.aes=FALSE)
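To see what inherit.aes=FALSE changes, here is a rough comparison of the same layer with and without it, assuming the df and df2 above. The object names (p0, p_inherit, p_clean) are my own.

```r
library(ggplot2)

df <- data.frame(x=1:10, y=1:10, x2=1:10+2, y2=1:10+2)
df2 <- data.frame(x=df$x+10, y2=df$y)

p0 <- ggplot(data=df, aes(x=.data$x, y=.data$y, color="a"))

# Same layer twice: once inheriting from ggplot(), once not
p_inherit <- p0 + geom_point(data=df2, aes(x=.data$x, y=.data$y2))
p_clean <- p0 + geom_point(data=df2, aes(x=.data$x, y=.data$y2), inherit.aes=FALSE)

# Colour used for the points in each version
colour_inherit <- unique(layer_data(p_inherit, 1)$colour)
colour_clean <- unique(layer_data(p_clean, 1)$colour)

# FALSE: the inheriting layer picked up color="a", the other kept the default
identical(colour_inherit, colour_clean)
```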
Once you have multiple data frames in your code, it is very easy to introduce bugs that take hours to track down. It is really hard to keep track of which data frame is being inherited and how the aesthetics are being inherited.
Look at this code. What data is geom_point() using? What aesthetics are geom_line() and geom_point() using? This code works, but writing code like this can drive you crazy later on.
df1=data.frame(abc=1:10, xyz=1:10+rep(1:2,5), xyz2=1:10)
df2=data.frame(abc=1:10, xyz=0.5*df1$xyz, xyz2=0.25*df1$xyz, name=letters[rep(1:2,5)])
ggplot(df1, aes(x=abc, y=xyz)) +
geom_line(data=df2, aes(color=name)) +
geom_point(aes(y=xyz2))
* geom_line() is geom_line(data=df2, aes(x=abc, y=xyz, color=name)). The aes(x=abc, y=xyz) is inherited from ggplot() and aes(color=name) is specified in the call.
* geom_point() is geom_point(data=df1, aes(x=abc, y=xyz2)). data=df1 and aes(x=abc) are inherited from ggplot() and aes(y=xyz2) is specified in the call. aes(color=name) is not inherited from the geom_line() above.

What happens if I had written very similar code, but like this?
df1=data.frame(abc=1:10, xyz=1:10+rep(1:2,5), xyz2=1:10)
df2=data.frame(abc=1:10, xyz=0.5*df1$xyz, xyz2=0.25*df1$xyz, name=letters[rep(1:2,5)])
ggplot(df2, aes(x=abc, y=xyz, color=name)) +
geom_line() +
geom_point(data=df1, aes(y=xyz2))
## Error in FUN(X[[i]], ...): object 'name' not found
Why does this throw an error now?
* geom_line() is geom_line(data=df2, aes(x=abc, y=xyz, color=name)). It inherited the data and aes() from ggplot().
* geom_point() is geom_point(data=df1, aes(x=abc, y=xyz2, color=name)). aes(x=abc, color=name) is inherited from ggplot(), while data=df1 and aes(y=xyz2) are specified in the call. But name is not in df1, so it throws an error.

What’s a more robust way to write code when you have multiple data frames in the code? Here is one way: the data are clear and there is no danger of aesthetics being inherited.
ggplot() +
  geom_line(data=df2, aes(x=.data$abc, y=.data$xyz, color=.data$name)) +
  geom_point(data=df1, aes(x=.data$abc, y=.data$xyz2))
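If you want to double-check that each layer really used the data frame you intended, you can compare the built layer data against your data frames. This is only a sketch using the same pattern as above; line_data and point_data are names I made up.

```r
library(ggplot2)

df1 <- data.frame(abc=1:10, xyz=1:10+rep(1:2,5), xyz2=1:10)
df2 <- data.frame(abc=1:10, xyz=0.5*df1$xyz, xyz2=0.25*df1$xyz, name=letters[rep(1:2,5)])

p <- ggplot() +
  geom_line(data=df2, aes(x=.data$abc, y=.data$xyz, color=.data$name)) +
  geom_point(data=df1, aes(x=.data$abc, y=.data$xyz2))

line_data <- layer_data(p, 1)   # what geom_line() plotted
point_data <- layer_data(p, 2)  # what geom_point() plotted

# geom_line() used xyz from df2 (sorted, since row order may differ by group)
all.equal(sort(line_data$y), sort(df2$xyz))

# geom_point() used xyz2 from df1
all.equal(point_data$y, as.numeric(df1$xyz2))
```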
On Stack Overflow, answers often say that you should avoid multiple data frames and put all the data in one data frame that you specify in ggplot(). In my experience, that is often impractical. There are many cases where using multiple data frames is the best solution. BTW, I do try to avoid using multiple data frames in my ggplot code! That is a good policy, just not always efficient.
facet when you have multiple data frames

Note that without a data frame in ggplot(), facet_...() functions won’t work. You can add a data frame with the column name that you want to facet on, but the behavior might be non-intuitive. In this case, we have a data frame df3 in ggplot() but it is not used elsewhere. The facet looks for any data frame that has the name column (df2 does) and plots its data in separate panels. df1 doesn’t have name, so it appears in every panel. This could be useful if there was a plot you needed to show in every panel and other data that was separated across panels.
library(patchwork)
library(ggplot2)
df1=data.frame(abc=1:5, xyz=1:5, name2="a")
df2=data.frame(abc=1:10, xyz=1:10, name=c(rep(letters[1:3],3),"d"))
df3=data.frame(name="a")
p1 <- ggplot(df3) +
geom_point(data=df1, aes(x=.data$abc, y=.data$xyz)) +
geom_col(data=df2, aes(x=.data$abc, y=.data$xyz), alpha=0.5)
p1 + facet_wrap(~.data$name)
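To check the panel behavior described above without looking at the plot, you can count the rows each layer contributes to each panel. This is a sketch with the same three data frames; note that I wrote the facet as facet_wrap(~name) here, and pts and bars are my own names.

```r
library(ggplot2)

df1 <- data.frame(abc=1:5, xyz=1:5, name2="a")
df2 <- data.frame(abc=1:10, xyz=1:10, name=c(rep(letters[1:3],3),"d"))
df3 <- data.frame(name="a")

p <- ggplot(df3) +
  geom_point(data=df1, aes(x=.data$abc, y=.data$xyz)) +
  geom_col(data=df2, aes(x=.data$abc, y=.data$xyz), alpha=0.5) +
  facet_wrap(~name)

pts <- layer_data(p, 1)   # from df1, which has no name column
bars <- layer_data(p, 2)  # from df2, which has a name column

# Rows per panel: df1 is repeated in full in every panel, df2 is split up
table(pts$PANEL)
table(bars$PANEL)
```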
Note to those who write R packages: you need to always use .data in ggplot2 code, even if you don’t have multiple data frames. Otherwise R CMD check will flag the bare column names as undefined global variables. So you have to write your calls like this:
ggplot(df, aes(x=.data$abc, y=.data$xyz)) + geom_point()
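Inside a package, that call usually lives in a function. Here is a minimal sketch of what that might look like, assuming the package uses roxygen2 and lists ggplot2 and rlang under Imports. The function name plot_abc() is hypothetical.

```r
# Hypothetical package function; plot_abc() is a name made up for illustration.
# The roxygen2 tag imports the .data pronoun from rlang into the package
# namespace so that R CMD check knows where .data comes from.

#' @importFrom rlang .data
plot_abc <- function(df) {
  ggplot2::ggplot(df, ggplot2::aes(x=.data$abc, y=.data$xyz)) +
    ggplot2::geom_point()
}

# Usage
p <- plot_abc(data.frame(abc=1:3, xyz=c(2, 4, 6)))
```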
inherit.aes=FALSE
Another option for making your code more robust is to use inherit.aes = FALSE so that you don’t inadvertently inherit aesthetic information. That can help with bizarre behavior like this. I’m trying to get the plot on the right, but the one on the left is all wrong.
library(patchwork)
library(ggplot2)
val <- mtcars$mpg
x <- mtcars$hp
df1 <- data.frame(x=x, val=val, name=rep(letters[1:2],length(x)/2))
df2 <- data.frame(x=x, val2=val+10)
df3 <- data.frame(x=x+3, val=val-10, name="a")
p1 <- ggplot(df3, aes(x=x, y=val, color="a")) +
geom_point(data=df1, aes(x=x, y=val, color=name)) +
geom_line(data=df2, aes(x=x, y=val)) +
ggtitle("Yipes")
p2 <- ggplot(df3, aes(x=x, y=val, color="a")) +
geom_point(data=df1, aes(x=x, y=val, color=name)) +
geom_line(data=df2, aes(x=x, y=val), inherit.aes=FALSE) +
ggtitle("Correct")
p1 + p2
For those of you who want to connect RStudio to GitHub (so you can push to GitHub), here are some notes from the chat today. You don’t need to do this if you use GitHub Desktop; this is if you want to push to GitHub from within RStudio.
BTW, whenever I want to double-check my repo origin, I open the file config in the .git folder. You’ll see your repo info listed there. You can also change it there, and this is the first place I look if my repo is having trouble connecting to GitHub.
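If you would rather look at that file from the R console, something like the following should work. It is only a sketch and assumes your working directory is the top level of a Git repository; cfg_file and cfg are names I picked.

```r
# Assumes the working directory is the top level of a Git repository
cfg_file <- file.path(".git", "config")

if (file.exists(cfg_file)) {
  cfg <- readLines(cfg_file)
  # Show the remote section header(s) and the url line(s)
  print(grep("remote|url", cfg, value=TRUE))
} else {
  cfg <- character(0)
  message("No .git/config found in ", getwd())
}
```

From a terminal, git remote -v shows the same remote URLs.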