Visualizing Distributions with Raincloud Plots with ggplot2

Posted by Cédric on Sunday, June 6, 2021

Introduction

A few weeks ago, I came across this tweet:

Quickly, people settled that (a) violin plots are not novel at all but were introduced 23 years ago and (b) providing an overview of the data distribution and even the raw data itself might be a good addition. A suitable chart hybrid, consisting of a combination of boxplots, violin plots, and jittered points, is called a raincloud plot.

Raincloud plots were presented in 2019 as an approach to overcome issues of hiding the true data distribution when plotting bars with errorbars—also known as dynamite plots)—or boxplots. Instead, raincloud plots combine several chart types to visualize the raw data, the distribution of the data as density, and key summary statistics at the same time.

Since I am still working (part–time) as a researcher, I also regularly review scientific articles myself. And in contrast to the reviewer above, I argue completely opposite and always share my concerns in case bar charts or boxplots are not suited to convey the study results. Here is a fun meme I made a few months ago when I encountered so–called “dynamite plots” in a paper I was reviewing (plus here is a reworked version with Shrek instead):

The use of other chart types such as violin and raincloud plots to show the distribution or even the raw data is a topic I am since years pretty passionate about. I always cover the topic in detail during my workshops and I have regularly used raincloud plots for various occasions: during my PhD to show that the model fitting was appropriate across simulations (code), as contribution to the #SWDchallenge to illustrate differences in temperatures in Berlin across months (code), or as contribution to #TidyTuesday to visualize the distribution of bill ratios in brush–tailed penguins (code and code):

Some examples for which I used raincloud or violin plots to explicitly highlight the distributions of the values.

The paper itself that introduces raincloud plots comes with a “multi-platform tool for robust data visualization” which consists of a collection of codes and tutorials to draw raincloud plots in R, Python, or Matlab. In January 2021, a revised version was submitted together with a “fully functional R-package” called {raincloudplots}.

However, there may be two drawbacks: First, the package wraps ggplot functionality into one function instead of adding a geom_ and/or stat_ layers (and I definitely prefer to be able to adjust every detail of my plot in traditional ggplot habit). Secondly, the package is (not yet?) on CRAN and needs to be installed from GitHub (which can be problematic in a work context); it is also not available for R version 4 yet. Because of that I also don’t provide examples here, but you might want to check the tutorial of that package.

Here, I show you numerous other ways to create violin and raincloud plots in {ggplot2}. The tutorial is based on a collection of code snippets I shared with the author of the above mentioned tweet.


Aim of this Tutorial 🏁

Visualizing distributions as box-and-whisker plots is common practice, at least for researchers. Even though boxplots are great in summarizing the data, an issue is that the underlying data structure is hidden. In this short tutorial I show you why boxplots can be problematic, how to improve them, and alternative approaches that can be used to show both, summary statistics as well as the true distribution of the raw data.

I show you how to plot different version of violin–boxplot combinations and raincloud plots with the {ggplot2} package. Some use functionality from extension packages (that are hosted on CRAN): two of my favorite packages (1, 2) namely {ggforce} and {ggdist}, plus the packages {gghalves} and {ggbeeswarm}.

Expand to see the preparation steps.
## INSTALL PACKAGES ----------------------------------------------
## install CRAN packages if needed
pckgs <- c("ggplot2", "systemfonts", "ggforce", 
           "ggdist", "ggbeeswarm", "devtools")
new_pckgs <- pckgs[!(pckgs %in% installed.packages()[,"Package"])]
if(length(new_pckgs)) install.packages(new_pckgs)

## install gghalves from GitHub if needed
if(!require(gghalves)) {  
  devtools::install_github('erocoar/gghalves')
}

## INSTALL PACKAGES ----------------------------------------------
library(ggplot2) ## plotting
library(systemfonts) ## custom fonts

## CUSTOM THEME --------------------------------------------------
## overwrite default ggplot2 theme
theme_set(
  theme_minimal(
    ## increase size of all text elements
    base_size = 18, 
    ## set custom font family for all text elements
    base_family = "Oswald")
)

## overwrite other defaults of theme_minimal()
theme_update(
  ## remove major horizontal grid lines
  panel.grid.major.x = element_blank(),
  ## remove all minor grid lines
  panel.grid.minor = element_blank(),
  ## remove axis titles
  axis.title = element_blank(),
  ## larger axis text for x
  axis.text.x = element_text(size = 16),
  ## add soem white space around the plot
  plot.margin = margin(rep(8, 4))
)

Boxplots: The Reviewer’s Dream

Thanks to my “Evolution of a ggplot” blogpost, I am already known as someone who is not a fan of boxplots. And before that impression settles, that’s not correct. Boxplots are great! Boxpots are an artwork combining many summary statistics into one chart type. But in my opinion they are not always helpful1. They also have a high potential of misleading your audience—and yourself. Why? Let’s plot some box-and-whisker plots:

ggplot(data, aes(x = group, y = value)) +
  geom_boxplot(fill = "grey92")

So tell me: How big is the sample size? Are there underlying patterns in the data? Difficult?

Sure, adding a note on the sample size might be considered good practice but it still doesn’t tell you much about the actual pattern.

Expand to show code.
## function to return median and labels
n_fun <- function(x){
  return(data.frame(y = median(x) - 1.25, 
                    label = paste0("n = ",length(x))))
}

ggplot(data, aes(x = group, y = value)) +
  geom_boxplot(fill = "grey92") +
  ## use summary function to add text labels
  stat_summary(
    geom = "text",
    fun.data = n_fun,
    family = "Oswald",
    size = 5
  )

We Can Do Better: Add Raw Data

An obvious improvement is to add the data points. Since we know already that Group 2 consists of 200 observations, let’s use jitter strips instead of a traditional strip plot:

ggplot(data, aes(x = group, y = value)) +
  geom_boxplot(fill = "grey92") +
  ## use either geom_point() or geom_jitter()
  geom_point(
    ## draw bigger points
    size = 2,
    ## add some transparency
    alpha = .3,
    ## add some jittering
    position = position_jitter(
      ## control randomness and range of jitter
      seed = 1, width = .2
    )
  )

Oh, the patterns are very different! Values of Group 1 are uniformly distributed. The values in Group 2 are clustered with a distinct gap around the group’s median! And the few observations of Group 3 are all integers.

We can improve the look a bit further by plotting the raw data points according to their distribution with either ggbeeswarm::geom_quasirandom() or ggforce::geom_sina():

Expand to show codes.
ggplot(data, aes(x = group, y = value)) +
  geom_boxplot(fill = "grey92") +
  ggbeeswarm::geom_quasirandom(
    ## draw bigger points
    size = 1.5,
    ## add some transparency
    alpha = .4,
    ## control range of the beeswarm
    width = .2
  ) 

ggplot(data, aes(x = group, y = value)) +
  geom_boxplot(fill = "grey92") +
  ggforce::geom_sina(
    ## draw bigger points
    size = 1.5,
    ## add some transparency
    alpha = .4,
    ## control range of the sina plot
    maxwidth = .5
  ) 

Violin Plots: The Reviewer’s Nighmare?

Violin plots can be used to visualize the distribution of numeric variables. It’s basically a mirrored density curve, representing the number of data points along a continuous axis.

ggplot(data, aes(x = group, y = value)) +
  geom_violin(fill = "grey92")

By default, the violin plot can look a bit odd. The default setting (scale = "area") is misleading. Group 1 looks almost the same as Group 3, while consisting of four times as many observations. Also, the default standard deviation of the smoothing kernel is not optimal in our case since it hides the true pattern by smoothing out areas without any data.

We can manipulate both defaults by setting the width to the number of observations via scale = "count" and adjusting the bandwidth (bw). Aesthetically, I prefer to remove the default outline:

Expand to show code.
ggplot(data, aes(x = group, y = value)) +
  geom_violin(
    fill = "grey72", 
    ## remove outline
    color = NA, 
    ## width of violins mapped to # observations
    scale = "count", 
    ## custom bandwidth (smoothing)
    bw = .4
  )

The violin plot allows an explicit representation of the distribution but doesn’t provide summary statistics. To get the best of both worlds, it is often mixed with a boxplot—either a complete boxplot with whiskers and outliers or only the box indicating the median and interquartile range (IQR):

Expand to show codes.
ggplot(data, aes(x = group, y = value)) +
  geom_violin(
    fill = "grey72", 
    color = NA, 
    scale = "count", 
    bw = .5
  ) +
  geom_boxplot(
    ## remove white filling
    fill = NA, 
    ## reduce width
    width = .1
  )

ggplot(data, aes(x = group, y = value)) +
  geom_violin(
    fill = "grey72", 
    color = NA, 
    scale = "count", 
    bw = .5
  ) +
  geom_boxplot(
    ## remove white filling
    fill = NA, 
    ## reduce width
    width = .1,
    ## remove whiskers
    coef = 0, 
    ## remove outliers
    outlier.color = NA ## `outlier.shape = NA` works as well
  )

You might wonder: why should you use violins instead of boxplots with superimposed raw observations? Well, in case you have many more observations, all approaches of plotting raw data become difficult. In terms of readability as well as in terms of computation(al time). Violin plots are a good alternative in such a case, and even better in combination with some summary stats. But we also can combine all three…


Raincloud Plots: A Great Alternative

Raincloud plots can be used to visualize raw data, the distribution of the data, and key summary statistics at the same time. Actually, it is a hybrid plot consisting of a halved violin plot, a boxplot, and the raw data as some kind of scatter.

ggplot(data, aes(x = group, y = value)) + 
  ## add half-violin from {ggdist} package
  ggdist::stat_halfeye(
    ## custom bandwidth
    adjust = .5, 
    ## adjust height
    width = .6, 
    ## move geom to the right
    justification = -.2, 
    ## remove slab interval
    .width = 0, 
    point_colour = NA
  ) + 
  geom_boxplot(
    width = .12, 
    ## remove outliers
    outlier.color = NA ## `outlier.shape = NA` works as well
  ) +
  ## add dot plots from {ggdist} package
  ggdist::stat_dots(
    ## orientation to the left
    side = "left", 
    ## move geom to the left
    justification = 1.1, 
    ## adjust grouping (binning) of observations 
    binwidth = .25
  ) + 
  ## remove white space on the left
  coord_cartesian(xlim = c(1.2, NA))

Here, I combine two layers from the {ggdist} package, namely stat_dots() to draw the rain and stat_halfeye() to draw the cloud. Both are plotted with some justification to place them next to each other and make room for the boxplot. I also remove the slab interval from the halfeye by setting .width to zero and point_colour to NA. The plot needs some manual styling and the values for justification and the number of bins depends a lot on the data. To get rid of the white space on the left, we simply add a limit the x axis.

Expand to show plot without limit adjustment.

One can also solely rely on layers from the {ggdist} package by using the default halfeye which consists of a density curve and a slab interval:

ggplot(data, aes(x = group, y = value)) + 
  ggdist::stat_halfeye(
    adjust = .5,
    width = .6, 
    ## set slab interval to show IQR and 95% data range
    .width = c(.5, .95)
  ) + 
  ggdist::stat_dots(
    side = "left", 
    dotsize = .8, 
    justification = 1.05, 
    binwidth = .3
  ) +
  coord_cartesian(xlim = c(1.2, NA))

While I love the reduced design and the possibility to indicate two different ranges (here the interquartile range and the 95% quantile) I admit that this alternative is less intuitive and potentially even misleading since they look more like credible intervals than boxplots. Maybe a bit like the minimal boxplots proposed by Edward Tufte but still I definitely would add a note to be sure the reader understands what the slabs show.

Of course, one could also add a true jitter instead of a dot plot or even a barcode. Now, I use geom_half_dotplot() from the {gghalves} package. Why? Because it comes with the possibility to add some justification which is not possible for the default layers geom_point() and geom_jitter():

ggplot(data, aes(x = group, y = value)) + 
  ggdist::stat_halfeye(
    adjust = .5, 
    width = .6, 
    .width = 0, 
    justification = -.2, 
    point_colour = NA
  ) + 
  geom_boxplot(
    width = .15, 
    outlier.shape = NA
  ) +
  ## add justified jitter from the {gghalves} package
  gghalves::geom_half_point(
    ## draw jitter on the left
    side = "l", 
    ## control range of jitter
    range_scale = .4, 
    ## add some transparency
    alpha = .3
  ) +
  coord_cartesian(xlim = c(1.2, NA), clip = "off")

Note that the {gghalves} packages adds also some jitter along the y axis which is far from optimal. The packages provides a half–boxplot alternative as well but I personally will never consider or recommend these as an option2. Also, note that the {gghalves} package is not available on CRAN (yet) as well.

A good alternative may to place the jittered points on top of the boxplot by using geom_jitter() (or geom_point()):

ggplot(data, aes(x = group, y = value)) + 
  ggdist::stat_halfeye(
    adjust = .5, 
    width = .6, 
    .width = 0, 
    justification = -.3, 
    point_colour = NA) + 
  geom_boxplot(
    width = .25, 
    outlier.shape = NA
  ) +
  geom_point(
    size = 1.3,
    alpha = .3,
    position = position_jitter(
      seed = 1, width = .1
    )
  ) + 
  coord_cartesian(xlim = c(1.2, NA), clip = "off")

As a last alternative, I also like to use barcode strips instead of jittered points:

ggplot(data, aes(x = group, y = value)) + 
  ggdist::stat_halfeye(
    adjust = .5, 
    width = .6, 
    .width = 0, 
    justification = -.2, 
    point_colour = NA
  ) + 
  geom_boxplot(
    width = .15, 
    outlier.shape = NA
  ) +
  geom_point(
    ## draw horizontal lines instead of points
    shape = 95,
    size = 10,
    alpha = .2
  ) +
  coord_cartesian(xlim = c(1.2, NA), clip = "off")

As you can see, it’s not optimal here since the actual values of Group 3 are hard to spot because they fall exactly along the summary stats provided by the box plot. One could use geom_half_point(), however, I want to avoid the added jittering along the y axis.

Expand to show barcode version with {gghalves}.
ggplot(data, aes(x = group, y = value)) + 
  ggdist::stat_halfeye(
    adjust = .5, 
    width = .6, 
    .width = 0, 
    justification = -.2, 
    point_colour = NA
  ) + 
  geom_boxplot(
    width = .15, 
    outlier.shape = NA
  ) +
  ## add justified jitter from the {gghalves} package
  gghalves::geom_half_point(
    ## draw bar codes on the left
    side = "l", 
    ## draw horizontal lines instead of points
    shape = 95,
    ## remove jitter along x axis
    range_scale = 0,
    size = 10, 
    alpha = .2
  ) +
  coord_cartesian(xlim = c(1.2, NA), clip = "off")


Next: Let Your Plot Shine

Stay tuned, in the upcoming follow-up post I am going to show you how to polish such a raincloud plot and, if you like, how to turn it into a colorful version with annotations and added illustrations:


Footnotes

1 In my opinion, in case your audience is not well trained in statistical concepts and visualizations, consider using something else than a boxplot. Or make sure to explain it sufficiently.
I personally doubt that the general audience is well aware of how to interpret boxplots or how they can be misleading. And even within your institution or university: give it a try, go around (once the lockdown is over or remotly) and ask your colleagues what the thick line in the middle of a boxplot represents. Go back

2 Why? Because the boxplot itself can be reduced easily in width without you getting into trouble. At the same time, I believe that these “half–boxplots” have an uncommon look and thus the potential to confuse readers. Go back