##### Table of Content

- Introduction
- Aim of this Tutorial 🏁
- Boxplots: The Reviewer’s Dream
- We Can Do Better: Add Raw Data
- Violin Plots: The Reviewer’s Nighmare?
- Raincloud Plots: A Great Alternative
- Let Your Plot Shine

## Introduction

A few weeks ago, I came across this tweet:

Twitter friends - I love violin plots but a reviewer is concerned that “their novelty might distract from the message”.

— Alexandra Lautarescu (@AleLautarescu) March 22, 2021

Are there better alternatives, or ways to clarify/address this?

Quickly, people settled that **(a) violin plots are not novel at all** but were introduced 23 years ago and **(b) providing an overview of the data distribution and even the raw data itself might be a good addition**. A suitable chart hybrid, consisting of a combination of boxplots, violin plots, and jittered points, is called a ** raincloud plot**.

Raincloud plots were presented in 2019 as an approach to overcome issues of hiding the true data distribution when plotting bars with errorbars—also known as *dynamite plots*)—or boxplots. Instead, raincloud plots combine several chart types to visualize the raw data, the distribution of the data as density, and key summary statistics at the same time.

Since I am still working (part–time) as a researcher, I also regularly review scientific articles myself. And in contrast to the reviewer above, I argue completely opposite and always share my concerns in case bar charts or boxplots are not suited to convey the study results. Here is a fun meme I made a few months ago when I encountered so–called “dynamite plots” in a paper I was reviewing (plus here is a reworked version with Shrek instead):

That moment when you review a journal submission and you see dynamite plots: rage and joy at the same time! #dataviz #datavisualization #barbarplot #DoBetter

— Cédric Scherer (@CedScherer) March 26, 2021

The use of other chart types such as violin and raincloud plots to show the distribution or even the raw data is a topic I am since years pretty passionate about. I always cover the topic in detail during my workshops and I have regularly used raincloud plots for various occasions: during my PhD to show that the model fitting was appropriate across simulations (code), as contribution to the #SWDchallenge to illustrate differences in temperatures in Berlin across months (code), or as contribution to #TidyTuesday to visualize the distribution of bill ratios in brush–tailed penguins (code and code):

The paper itself that introduces raincloud plots comes with a “multi-platform tool for robust data visualization” which consists of a collection of codes and tutorials to draw raincloud plots in R, Python, or Matlab. In January 2021, a revised version was submitted together with a “fully functional R-package” called `{raincloudplots}`

.

However, there may be two drawbacks: First, the package wraps ggplot functionality into one function instead of adding a `geom_`

and/or `stat_`

layers (and I definitely prefer to be able to adjust every detail of my plot in traditional ggplot habit). Secondly, the package is (not yet?) on CRAN and needs to be installed from GitHub (which can be problematic in a work context); it is also not available for R version 4 yet. Because of that I also don’t provide examples here, but you might want to check the tutorial of that package.

Here, I show you numerous other ways to create violin and raincloud plots in `{ggplot2}`

. The tutorial is based on a collection of code snippets I shared with the author of the above mentioned tweet.

## Aim of this Tutorial 🏁

Visualizing distributions as box-and-whisker plots is common practice, at least for researchers. Even though boxplots are great in summarizing the data, an issue is that the underlying data structure is hidden. In this short tutorial I show you why boxplots can be problematic, how to improve them, and alternative approaches that can be used to show both, summary statistics as well as the true distribution of the raw data.

I show you how to plot different version of violin–boxplot combinations and raincloud plots with the `{ggplot2}`

package. Some use functionality from extension packages (that are hosted on CRAN): two of my favorite packages (1, 2) namely `{ggforce}`

and `{ggdist}`

, plus the packages `{gghalves}`

and `{ggbeeswarm}`

.

##
*Expand to see the preparation steps.*

```
## INSTALL PACKAGES ----------------------------------------------
## install CRAN packages if needed
pckgs <- c("ggplot2", "systemfonts", "ggforce",
"ggdist", "ggbeeswarm", "devtools")
new_pckgs <- pckgs[!(pckgs %in% installed.packages()[,"Package"])]
if(length(new_pckgs)) install.packages(new_pckgs)
## install gghalves from GitHub if needed
if(!require(gghalves)) {
devtools::install_github('erocoar/gghalves')
}
## INSTALL PACKAGES ----------------------------------------------
library(ggplot2) ## plotting
library(systemfonts) ## custom fonts
## CUSTOM THEME --------------------------------------------------
## overwrite default ggplot2 theme
theme_set(
theme_minimal(
## increase size of all text elements
base_size = 18,
## set custom font family for all text elements
base_family = "Oswald")
)
## overwrite other defaults of theme_minimal()
theme_update(
## remove major horizontal grid lines
panel.grid.major.x = element_blank(),
## remove all minor grid lines
panel.grid.minor = element_blank(),
## remove axis titles
axis.title = element_blank(),
## larger axis text for x
axis.text.x = element_text(size = 16),
## add soem white space around the plot
plot.margin = margin(rep(8, 4))
)
```

## Boxplots: The Reviewer’s Dream

Thanks to my “Evolution of a ggplot” blogpost, I am already known as someone who is not a fan of boxplots. And before that impression settles, that’s not correct. **Boxplots are great! Boxpots are an artwork combining many summary statistics into one chart type.** But in my opinion they are not always helpful^{1}. They also have a high potential of misleading your audience—and yourself. Why? Let’s plot some box-and-whisker plots:

```
ggplot(data, aes(x = group, y = value)) +
geom_boxplot(fill = "grey92")
```

So tell me: How big is the sample size? Are there underlying patterns in the data? Difficult?

Sure, adding a note on the sample size might be considered good practice but it still doesn’t tell you much about the actual pattern.

##
*Expand to show code.*

```
## function to return median and labels
n_fun <- function(x){
return(data.frame(y = median(x) - 1.25,
label = paste0("n = ",length(x))))
}
ggplot(data, aes(x = group, y = value)) +
geom_boxplot(fill = "grey92") +
## use summary function to add text labels
stat_summary(
geom = "text",
fun.data = n_fun,
family = "Oswald",
size = 5
)
```

## We Can Do Better: Add Raw Data

An obvious improvement is to add the data points. Since we know already that Group 2 consists of 200 observations, let’s use jitter strips instead of a traditional strip plot:

```
ggplot(data, aes(x = group, y = value)) +
geom_boxplot(fill = "grey92") +
## use either geom_point() or geom_jitter()
geom_point(
## draw bigger points
size = 2,
## add some transparency
alpha = .3,
## add some jittering
position = position_jitter(
## control randomness and range of jitter
seed = 1, width = .2
)
)
```

Oh, the patterns are very different! Values of Group 1 are uniformly distributed. The values in Group 2 are clustered with a distinct gap around the group’s median! And the few observations of Group 3 are all integers.

We can improve the look a bit further by plotting the raw data points according to their distribution with either `ggbeeswarm::geom_quasirandom()`

or `ggforce::geom_sina()`

:

##
*Expand to show codes.*

```
ggplot(data, aes(x = group, y = value)) +
geom_boxplot(fill = "grey92") +
ggbeeswarm::geom_quasirandom(
## draw bigger points
size = 1.5,
## add some transparency
alpha = .4,
## control range of the beeswarm
width = .2
)
ggplot(data, aes(x = group, y = value)) +
geom_boxplot(fill = "grey92") +
ggforce::geom_sina(
## draw bigger points
size = 1.5,
## add some transparency
alpha = .4,
## control range of the sina plot
maxwidth = .5
)
```

## Violin Plots: The Reviewer’s Nighmare?

Violin plots can be used to visualize the distribution of numeric variables. It’s basically a mirrored density curve, representing the number of data points along a continuous axis.

```
ggplot(data, aes(x = group, y = value)) +
geom_violin(fill = "grey92")
```

By default, the violin plot can look a bit odd. The default setting (`scale = "area"`

) is misleading. Group 1 looks almost the same as Group 3, while consisting of four times as many observations. Also, the default standard deviation of the smoothing kernel is not optimal in our case since it hides the true pattern by smoothing out areas without any data.

We can manipulate both defaults by setting the width to the number of observations via `scale = "count"`

and adjusting the bandwidth (`bw`

). Aesthetically, I prefer to remove the default outline:

##
*Expand to show code.*

```
ggplot(data, aes(x = group, y = value)) +
geom_violin(
fill = "grey72",
## remove outline
color = NA,
## width of violins mapped to # observations
scale = "count",
## custom bandwidth (smoothing)
bw = .4
)
```

The violin plot allows an explicit representation of the distribution but doesn’t provide summary statistics. To get the best of both worlds, it is often mixed with a boxplot—either a complete boxplot with whiskers and outliers or only the box indicating the median and interquartile range (IQR):

##
*Expand to show codes.*

```
ggplot(data, aes(x = group, y = value)) +
geom_violin(
fill = "grey72",
color = NA,
scale = "count",
bw = .5
) +
geom_boxplot(
## remove white filling
fill = NA,
## reduce width
width = .1
)
ggplot(data, aes(x = group, y = value)) +
geom_violin(
fill = "grey72",
color = NA,
scale = "count",
bw = .5
) +
geom_boxplot(
## remove white filling
fill = NA,
## reduce width
width = .1,
## remove whiskers
coef = 0,
## remove outliers
outlier.color = NA ## `outlier.shape = NA` works as well
)
```

You might wonder: why should you use violins instead of boxplots with superimposed raw observations? Well, in case you have many more observations, all approaches of plotting raw data become difficult. In terms of readability as well as in terms of computation(al time). Violin plots are a good alternative in such a case, and even better in combination with some summary stats. But we also can combine all three…

## Raincloud Plots: A Great Alternative

Raincloud plots can be used to visualize raw data, the distribution of the data, and key summary statistics at the same time. Actually, it is a hybrid plot consisting of a halved violin plot, a boxplot, and the raw data as some kind of scatter.

```
ggplot(data, aes(x = group, y = value)) +
## add half-violin from {ggdist} package
ggdist::stat_halfeye(
## custom bandwidth
adjust = .5,
## adjust height
width = .6,
## move geom to the right
justification = -.2,
## remove slab interval
.width = 0,
point_colour = NA
) +
geom_boxplot(
width = .12,
## remove outliers
outlier.color = NA ## `outlier.shape = NA` works as well
) +
## add dot plots from {ggdist} package
ggdist::stat_dots(
## orientation to the left
side = "left",
## move geom to the left
justification = 1.1,
## adjust grouping (binning) of observations
binwidth = .25
) +
## remove white space on the left
coord_cartesian(xlim = c(1.2, NA))
```

Here, I combine two layers from the `{ggdist}`

package, namely `stat_dots()`

to draw the rain and `stat_halfeye()`

to draw the cloud. Both are plotted with some justification to place them next to each other and make room for the boxplot. I also remove the slab interval from the halfeye by setting `.width`

to zero and `point_colour`

to `NA`

. The plot needs some manual styling and the values for justification and the number of bins depends a lot on the data. To get rid of the white space on the left, we simply add a limit the x axis.

##
*Expand to show plot without limit adjustment.*

One can also solely rely on layers from the `{ggdist}`

package by using the default halfeye which consists of a density curve and a slab interval:

```
ggplot(data, aes(x = group, y = value)) +
ggdist::stat_halfeye(
adjust = .5,
width = .6,
## set slab interval to show IQR and 95% data range
.width = c(.5, .95)
) +
ggdist::stat_dots(
side = "left",
dotsize = .8,
justification = 1.05,
binwidth = .3
) +
coord_cartesian(xlim = c(1.2, NA))
```

While I love the reduced design and the possibility to indicate two different ranges (here the interquartile range and the 95% quantile) I admit that this alternative is less intuitive and potentially even misleading since they look more like credible intervals than boxplots. Maybe a bit like the minimal boxplots proposed by Edward Tufte but still I definitely would add a note to be sure the reader understands what the slabs show.

Of course, one could also add a true jitter instead of a dot plot or even a barcode. Now, I use `geom_half_dotplot()`

from the `{gghalves}`

package. Why? Because it comes with the possibility to add some justification which is not possible for the default layers `geom_point()`

and `geom_jitter()`

:

```
ggplot(data, aes(x = group, y = value)) +
ggdist::stat_halfeye(
adjust = .5,
width = .6,
.width = 0,
justification = -.2,
point_colour = NA
) +
geom_boxplot(
width = .15,
outlier.shape = NA
) +
## add justified jitter from the {gghalves} package
gghalves::geom_half_point(
## draw jitter on the left
side = "l",
## control range of jitter
range_scale = .4,
## add some transparency
alpha = .3
) +
coord_cartesian(xlim = c(1.2, NA), clip = "off")
```

Note that the `{gghalves}`

packages adds also some jitter along the y axis which is far from optimal. The packages provides a half–boxplot alternative as well but I personally will never consider or recommend these as an option^{2}. Also, note that the `{gghalves}`

package is not available on CRAN (yet) as well.

A good alternative may to place the jittered points on top of the boxplot by using `geom_jitter()`

(or `geom_point()`

):

```
ggplot(data, aes(x = group, y = value)) +
ggdist::stat_halfeye(
adjust = .5,
width = .6,
.width = 0,
justification = -.3,
point_colour = NA) +
geom_boxplot(
width = .25,
outlier.shape = NA
) +
geom_point(
size = 1.3,
alpha = .3,
position = position_jitter(
seed = 1, width = .1
)
) +
coord_cartesian(xlim = c(1.2, NA), clip = "off")
```

As a last alternative, I also like to use barcode strips instead of jittered points:

```
ggplot(data, aes(x = group, y = value)) +
ggdist::stat_halfeye(
adjust = .5,
width = .6,
.width = 0,
justification = -.2,
point_colour = NA
) +
geom_boxplot(
width = .15,
outlier.shape = NA
) +
geom_point(
## draw horizontal lines instead of points
shape = 95,
size = 10,
alpha = .2
) +
coord_cartesian(xlim = c(1.2, NA), clip = "off")
```

As you can see, it’s not optimal here since the actual values of Group 3 are hard to spot because they fall exactly along the summary stats provided by the box plot. One could use `geom_half_point()`

, however, I want to avoid the added jittering along the y axis.

##
*Expand to show barcode version with* `{gghalves}`

.

```
ggplot(data, aes(x = group, y = value)) +
ggdist::stat_halfeye(
adjust = .5,
width = .6,
.width = 0,
justification = -.2,
point_colour = NA
) +
geom_boxplot(
width = .15,
outlier.shape = NA
) +
## add justified jitter from the {gghalves} package
gghalves::geom_half_point(
## draw bar codes on the left
side = "l",
## draw horizontal lines instead of points
shape = 95,
## remove jitter along x axis
range_scale = 0,
size = 10,
alpha = .2
) +
coord_cartesian(xlim = c(1.2, NA), clip = "off")
```

## Next: Let Your Plot Shine

Stay tuned, in the upcoming follow-up post I am going to show you how to polish such a raincloud plot and, if you like, how to turn it into a colorful version with annotations and added illustrations:

*Footnotes*

^{1} In my opinion, in case your audience is not well trained in statistical concepts and visualizations, consider using something else than a boxplot. Or make sure to explain it sufficiently.

I personally doubt that the general audience is well aware of how to interpret boxplots or how they can be misleading. And even within your institution or university: give it a try, go around (once the lockdown is over or remotly) and ask your colleagues what the thick line in the middle of a boxplot represents. *Go back* ↑

^{2} Why? Because the boxplot itself can be reduced easily in width without you getting into trouble. At the same time, I believe that these “half–boxplots” have an uncommon look and thus the potential to confuse readers. *Go back* ↑