Reproducible Blogposts

By Charlie Joey Hadley | February 14, 2018

Writing blogposts that meet all four of these requirements is hard:

  • The post is interesting and has something unique to it
  • The post is easy to understand and both analyses and data visualisations follow sensible best practices
  • The code is fully reproducible with all datasets available programmatically to folks replicating the code
  • The author doesn’t give up writing because one of these can’t easily be met

In an ideal world, datasets are provided with a DOI or are otherwise easily accessible via an API. If we’re really lucky there’ll be a lovely R package that wraps up the API for us, as is the case for the excellent WDI package, which makes World Bank data available to us. Here’s a nice example of how we can reproducibly obtain and tabulate data using WDI and DT:


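library(WDI)    # World Bank data
library(dplyr)  # %>% and mutate()
library(DT)     # datatable() and formatPercentage()

WDI(country = c("DE", "NZ"), indicator = "IT.NET.USER.ZS", start = 2000, end = 2001) %>%
  # Convert percentage points to proportions for formatPercentage()
  mutate(IT.NET.USER.ZS = IT.NET.USER.ZS / 100) %>%
  # Named vector: display name = existing column name
  datatable(colnames = c("Country Short Code" = "iso2c",
                         "Individuals using the Internet (% of population)" = "IT.NET.USER.ZS"),
            options = list(dom = "t"),
            height = '100px') %>%
  formatPercentage(columns = 'IT.NET.USER.ZS')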

Unfortunately, we often can’t access data programmatically. Kaggle datasets are a frustrating example of this. There are some really exciting and vibrant datasets published there, but there is no API for the service and no direct download link for the data files. If I write a post using one of these datasets then, in order to follow along, a reader would have to manually sign up to Kaggle, download the files, put them in the correct place… and that’s just faff. I’ve also run enough training courses to know this goes wrong very easily, and understandably so.

What can I do about this? Well, I’m going to do two things:

  1. Use Gists
  2. Rate the reproducibility of my blogposts

Use Gists and reprex

Twitter has been tremendously useful in my learning of #rstats, and folks in the community are incredibly friendly and helpful. So when I was starting up this blog I asked on Twitter for advice on how to make posts reproducible.

A few folks suggested that I use Gists, a great service provided by GitHub for hosting collections of files and code.

I’ve used Gists before for asking longer-form questions on Twitter. They’re definitely a good stopgap for reproducibility where data files aren’t programmatically available and the license allows me to rehost the data (with citation). So in the future, if I use Kaggle datasets (for example), I will rehost the data on Gist along with the code for analysing the data and a link to the original blogpost.
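
As a rough sketch of what reading rehosted data could look like: files in a Gist are served from gist.githubusercontent.com, so a reader can load them without any manual download. The helper name `gist_raw_url()`, the user name, the Gist id and the file name below are all made up for illustration.

```r
# Hypothetical helper: build the raw-file URL for a file stored in a Gist.
gist_raw_url <- function(user, gist_id, filename) {
  sprintf("https://gist.githubusercontent.com/%s/%s/raw/%s",
          user, gist_id, utils::URLencode(filename, reserved = TRUE))
}

# Placeholder values; substitute a real Gist before running the download.
data_url <- gist_raw_url("some-user", "0123456789abcdef", "kaggle-data.csv")

# read.csv() accepts a URL directly. Not run here because the Gist above
# does not exist:
# kaggle_data <- read.csv(data_url, stringsAsFactors = FALSE)
```

Keeping the URL construction in one place means a post only needs updating in one spot if the data moves.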

Lastly, whenever I do host code on Gist I’ll use the excellent reprex package, which helps guarantee reproducible code samples. reprex is developed by the awesome @jennybryan.
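
For anyone who hasn’t used it, here is a minimal sketch of a reprex call. The snippet being rendered is an arbitrary example rather than code from one of my posts, and it assumes reprex and its rendering dependencies (rmarkdown and pandoc) are installed.

```r
library(reprex)

# reprex() runs the expression in a fresh R session and renders the code
# together with its output. venue = "gh" asks for GitHub-flavoured
# Markdown, which suits Gists and GitHub issues.
out <- reprex({
  x <- c(1, 2, 3, NA)
  mean(x, na.rm = TRUE)
}, venue = "gh")
```

Because the code runs in a clean session, anything that silently depends on my own workspace fails straight away, which is exactly the point.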

Reproducibility Ratings

I wanted to introduce a way to rate my blogposts in terms of their reproducibility, with the following considerations:

  • Is the author citable?
  • Is the data citable?
  • Is the code citable (and replicable)?

I decided the best way to do that was to design an insert for the top and bottom of all my posts that looks like this:

It’s been designed to support multiple post authors, multiple datasets and a link to the .Rmd that generated the blogpost you’re reading. The Citable X parts of the insert have hover text encouraging folks to click on them; each links to a dedicated page explaining the rating system, which I’ll keep updated over time.

At the time of writing this blogpost the rating system works as follows:

Not reproducible!

Complete lack of unique identifier or programmatic access to resource.

Should be reproducible!

A potentially volatile link to the resource has been provided. This includes resources like Gists, where commercial interests may make them unavailable in the future, as well as data accessible via an API or CRAN-hosted R packages.

Reliably reproducible.

A DOI or other permanent identifier has been given to the resource, so future access is highly likely.
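
The three levels above can be sketched as a small decision rule. This is purely illustrative: the function `rate_resource()` and its arguments are hypothetical, and the ratings on this blog are currently assigned by hand.

```r
# Hypothetical helper: map what we know about a resource (author, data or
# code) onto one of the three ratings described above.
rate_resource <- function(has_permanent_id = FALSE, has_link = FALSE) {
  if (has_permanent_id) {
    "Reliably reproducible."      # DOI or other permanent identifier
  } else if (has_link) {
    "Should be reproducible!"     # Gist, API or CRAN-hosted package
  } else {
    "Not reproducible!"           # no identifier, no programmatic access
  }
}

rate_resource(has_permanent_id = TRUE)  # e.g. data with a DOI
rate_resource(has_link = TRUE)          # e.g. data rehosted on a Gist
rate_resource()                         # e.g. a manual download only
```

A permanent identifier takes precedence over a plain link, so a resource with both is rated at the higher level.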

For the time being I’m going to add these components to my posts manually, but I have a placeholder repository where I’ll package together tools I regularly use for the blog. If you do take a look at the .Rmd files including these ratings, you’ll notice I use the following trick from R Markdown to allow for neatly indented HTML that remains unprocessed by knitr:

        <strong>This will render as HTML not preformatted text</strong>

Unfortunately, manually inserting raw HTML into the top of each blogpost made the post summaries on the blog page next to useless. Thankfully, Yihui has a blogpost on modifying Hugo post summaries that was fairly easy to implement. To be explicit, I introduced the following code:

<p class="intro">
{{ with .Description }}
{{ $.Scratch.Set "summary" (markdownify .) }}
{{ else }}
{{ $.Scratch.Set "summary" ((delimit (findRE "(<p.*?>(.|\n)*?</p>\\s*)+" .Content) "[…] ") | plainify | truncate (default 200 .Site.Params.summary_length) (default " …" .Site.Params.text.truncated ) | replaceRE "&amp;" "&" | safeHTML) }}
{{ end }}
{{ $.Scratch.Get "summary" }}
</p>

… into these two locations, in place of <p class="intro">{{ .Summary }}</p>

# layouts/_default/single.html
# layouts/partials/recent_posts.html

This was actually a blessing in disguise, as I had wanted more flexibility over my post summaries. Not only can I now write completely custom click-bait summaries, I can also include emojis 🐰 and raw HTML by adding this to the YAML of a post:

    description: >
      <li>Bunny! 🐰</li>
      <li>Cat! 🐱</li>