Getting more from R-Markdown

Why do people use R-Markdown in the first place?

In brief, R-Markdown makes it much easier to do reproducible science.

If you’re currently conducting research using tools like Excel, SPSS, Matlab, or SAS for your analyses and a word processor (like MS Word) for your writing, then the most important reproducibility change you can make is to move your data cleaning and analysis to a scripted analysis program like R.

Interested in reading about why R is better than point-and-click tools for reproducible analysis? Check out our article on why you should consider scripted analysis.

If you’re already using R, then you might very reasonably wonder what R-Markdown could possibly add. There are two big reasons a lot of people like using R-Markdown for research:

It lets you combine code and writing, which means no copy-pasting from results output to your write-up — you can have the code automatically fill in numbers, tables, and figures in the right places in your document (check out this article to read more about why researchers like literate statistical programming tools like R-Markdown).
It lets you use the same awesome version control tools for writing that you’re already using for your code.
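Here’s a minimal sketch of what that first point looks like in practice. The `scores` vector and the chunk name are made up for illustration; the idea is just that the numbers in the sentence come from inline R code rather than from copy-pasting:

````markdown
```{r descriptives}
# Hypothetical data, standing in for your real dataset
scores <- c(12, 15, 9, 14, 11)
```

The mean score was `r round(mean(scores), 1)` across `r length(scores)` participants.
````

When knit, that sentence reads “The mean score was 12.2 across 5 participants.” If the data change, the sentence updates on the next knit.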

Not using version control for your code? Learn about how to get started with version control here!

Things you can make with an R-Markdown file

Most people start with basic reports in R-Markdown: A series of code chunks to read in and analyze your data, with a little text between the chunks summarizing and explaining what the code is for.

But when they feel they need to do more “serious” writing, a lot of people switch out of R-Markdown and move to something like Word or PowerPoint. There’s no need! You can turn an R-Markdown document into almost anything, and a number of lovely R packages make the process very streamlined. Here’s a quick list (with links!) of some of the more popular types of documents you might want to write in R-Markdown:

A typical research project produces quite a few documents of different types — maybe an initial slide deck to present the study design at a lab meeting, a conference poster, a series of blog posts related to the work, several analysis reports from different stages of the study, a journal article draft, and then a revised article formatted for another journal, and finally a public-facing website to publicize the work.

You can do all of that in R-Markdown, which means you can take advantage of all of the wonderful features of literate statistical programming and forever free yourself from copy-pasting output. It also means you can keep everything for your study in an RStudio Project, a powerful way to organize related files.

Pro tips for writing R-Markdown in RStudio

There is a great guide to getting started with R-Markdown on the RStudio website. If you’re new to RStudio, start there.

The rest of this post assumes some familiarity with R-Markdown and RStudio already; if you’re just getting started, some of these tips will be overkill for you right now, but skim through anyway and just run with whatever bits seem most useful to you where you are now. You can always come back later!

Projects

RStudio offers a feature called “projects”, a powerful way to organize related files. Software Carpentry has a great guide for researchers on getting started with RStudio projects.

With projects, you can have a protected R session for each of the projects you’re working on: creating objects or loading libraries in one project won’t affect the environment in another. This helps keep your project more reproducible, something both your collaborators and Future You will thank you for. Projects also make it easy to use relative paths (paths written relative to the project folder rather than to a location on your particular computer), which makes your code much easier to run on someone else’s machine.
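As a small sketch of the relative-path idea, assuming a project that contains a `data` folder with a file called `survey.csv` (both names are hypothetical):

```r
# "data/survey.csv" is a hypothetical file, used only for illustration.
# A relative path like this resolves against the project folder, so it
# works for anyone who opens the project, on any computer.
survey_path <- file.path("data", "survey.csv")

# An absolute path such as "C:/Users/me/Documents/study/data/survey.csv"
# would break as soon as the project moved to another machine.
if (file.exists(survey_path)) {
  survey <- read.csv(survey_path)
  head(survey)
}
```

One thing to keep in mind: by default, when you knit, paths are resolved relative to the folder that contains the .rmd file, so this works most smoothly when the .rmd lives at the top level of the project.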

Projects also…

encourage good research data management practices by keeping all your project files together in one directory
provide a tidy way to use git for version control
connect seamlessly with GitHub repositories
support package management (another big win for reproducibility)

If you’re not already using projects in RStudio, give it a try!

Visual editor

When most people start writing markdown in RStudio, they write in the default editor pane. This means writing in markdown, which looks a bit different from how your final knit document will look. This is sometimes called writing in “source” since you’re editing the source file that will eventually be knit into your final output document.

If you like writing and reading markdown (I know I do!), then great, no need to change. But if you find yourself missing a WYSIWYG (What You See Is What You Get) editing environment where your formatting shows up in the document as you type, then check out RStudio’s visual editor!

The document you write in will look much more like your final knit version, and you’ll see lots of handy buttons and shortcuts for things like inserting tables, lists, math equations, etc.

One very valuable feature available in the visual editor is citation insertion directly from a linked Zotero library. That means that if you organize your PDFs in Zotero, you can have RStudio peek into your Zotero collection for you and let you drop in any citation you need while you’re writing. For more on the joys of adding citations to your markdown writing, see the Citations section below.

This is only a small taste of what the visual editor does! To learn more, see the RStudio website: https://rstudio.github.io/visual-markdown-editing/#using-the-editor.

Knitting button

When you are working on an R-Markdown file in RStudio, you’ll see a button at the top of the editor pane that says “Knit” and shows a little ball of yarn. If you click this button, RStudio will knit your document to its default output format (you can set the output format in the YAML header). You can also use the keyboard shortcut (SHIFT-⌘-K on a Mac, or SHIFT-CTRL-K on a PC) to do the same thing.

If you click the little dropdown arrow next to the Knit button, you’ll likely see several additional options for different kinds of output (html, pdf, Word, etc.), so if you want to play around with other output options, you can do so right there rather than editing your YAML header.
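For reference, here’s a rough sketch of a YAML header that lists more than one output format. The title is a placeholder. When several formats are listed, the Knit button uses the first one by default, and the others show up in the dropdown:

```yaml
---
title: "Analysis report"
output:
  html_document: default
  pdf_document: default
---
```

Note that knitting to pdf assumes you have a LaTeX installation available.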

This button can do a lot more, though!

In the dropdown arrow menu, you’ll also see an option to knit with parameters. This lets you build reports that can work flexibly for different versions of the same analysis. For example, if you need to generate the same report for each of several different regions, you can just write one R-Markdown document for it and set “region” as a parameter, so that when you knit you choose which region to do. You could also use a parameter for the report audience, if, for example, you wanted to share the same findings with a group of fellow researchers who would want to see all the code and with a set of executives who would appreciate a shorter report.
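A minimal sketch of the region example might look like this. The `sales` data frame, its columns, and the default value “North” are all invented for illustration; the real pieces are the `params` field in the YAML header and the `params$region` lookup in the code:

````markdown
---
title: "Regional report"
output: html_document
params:
  region: "North"
---

```{r filter-region}
# `sales` is a hypothetical data frame with a `region` column
sales <- data.frame(
  region = c("North", "South", "North"),
  revenue = c(100, 250, 175)
)
region_sales <- sales[sales$region == params$region, ]
```

Total revenue for the `r params$region` region: `r sum(region_sales$revenue)`.
````

Knit with the default, the last line reads “Total revenue for the North region: 275.” You can also pick the region from the console with `rmarkdown::render("report.Rmd", params = list(region = "South"))`, where `report.Rmd` is a placeholder file name.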

You can also reprogram the knit button itself to do exactly what you want! Under the hood, the knit button is calling rmarkdown::render(), and when you select other options for it (output format, parameters, etc.) what’s really happening is that you’re changing the arguments to the render function. If you want to have more control over the names of your output files, directories used, the environment that’s being referenced, etc. you can write your own call to render as part of your YAML and then that’s what will execute when you click the knit button.
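Here’s one way that can look, as a sketch rather than a recipe. The `knit` field holds an R function that RStudio calls with the path to your file when you click the button. The `reports` folder and the date-stamped file name are assumptions chosen for this example; `output_dir` and `output_file` are regular arguments to `rmarkdown::render()`:

```yaml
---
title: "My analysis"
output: html_document
knit: (function(input, ...) {
    rmarkdown::render(
      input,
      output_dir = 'reports',
      output_file = paste0(
        tools::file_path_sans_ext(basename(input)), '-', Sys.Date(), '.html'
      )
    )
  })
---
```

With this header, clicking Knit would write something like `my-analysis-2024-01-31.html` into a `reports` folder instead of putting the output next to the .rmd file.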

Templates

If you go to File > New File > R Markdown… and then click “OK”, it will open up not a blank file but a partially populated example file that shows you what a basic R-Markdown document looks like. You then edit that file to put in your own content, and you’re good to go. That’s an example of a template, and many people don’t realize that there are lots of different R-Markdown templates available!

Try going to File > New File > R Markdown… again, but this time instead of just clicking “OK”, select “From Template” in the menu on the left. You’ll see a list of the templates that are already available to you, based on the R packages you have installed.

When you install R packages, some will come with R-Markdown templates. When you have the package installed, its templates will start automatically showing up for you when you create new R-Markdown documents. This can be a huge time saver!

The rticles package is an R package that’s almost all templates. It includes a long list of R-Markdown templates specifically designed to meet the formatting requirements for academic journals. You can select the template you need based on the journal you’ll be submitting to, and you’ll get an R-Markdown template with all the fiddly journal formatting figured out and ready to go — a huge time saver!
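If you prefer the console to the menus, you can also create a document from a template with `rmarkdown::draft()`. This is a sketch that assumes the rticles package is installed; “jss” is used as an example template name and `my-article.Rmd` is a hypothetical file name, so check the rticles documentation for the templates available in your version:

```r
# Assumes the rticles package is installed: install.packages("rticles")
if (requireNamespace("rticles", quietly = TRUE)) {
  # "jss" is an example template name; see the rticles documentation
  # for the journal templates shipped with your installed version.
  article_path <- rmarkdown::draft(
    "my-article.Rmd",   # hypothetical file name for the new manuscript
    template = "jss",
    package = "rticles",
    edit = FALSE        # create the files without opening them
  )
}
```

Some templates create a whole folder of supporting files (style files, logos, and so on) alongside the new .Rmd, so it’s worth running this inside your project folder.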

You can also write your own templates! This is great if you or your team frequently need to create documents that match a certain style or format. Templates can also be a good teaching tool; you could use templates to onboard new team members by having them start with a template of your standard analysis steps when they work on a new analysis.

Document outline

At the top of the editor pane in RStudio, you’ll see a few buttons in the upper right corner. One looks like a small outline, and if you click it, it will show you your document outline. The keyboard shortcut is SHIFT-⌘-O on a Mac, or SHIFT-CTRL-O on a PC.

It’s always a good idea to use sensible header structures in your documents to improve their accessibility. Another bonus to keeping a clear document structure in your headings is that you can automatically generate a useful outline for navigating your document.

There’s another way to navigate your document: at the bottom of the editor pane, you’ll see a little box that gives the name of whatever section or code chunk your cursor is currently in. If you click that box, it expands to a full list of all of your headers and code chunks! You can select any of them to jump to that point in the document.

Citations

If you’ve ever answered the question “What are you doing this afternoon?” with “Reformatting my bibliography,” then I have great news for you.

With a citation manager, you never, ever have to format your own citations, either in text or in the bibliography at the end. You just drop in a unique tag for the works you want to cite wherever you want to cite them, and you get an automatically generated list of works cited at the end, as well as properly formatted inline citations throughout. Need to change citation style? Just tweak a single field in the YAML header of your file and re-knit, and everything is in the new style. It truly is a game changer for people who do academic writing.

If you’re using RStudio, then I recommend picking Zotero as your citation manager because it can integrate directly with RStudio. It’s also free and open source.

Note that if you have another citation manager you prefer, you can still use those citations in your R-Markdown document as long as you can export them as a .bib file from your citation management software.
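If you go the .bib route, a YAML header along these lines tells pandoc where to find your references and which style to use. Both file names are placeholders: `references.bib` stands for whatever you exported from your citation manager, and `apa.csl` for a citation style file you’ve downloaded.

```yaml
---
title: "Replication study"
output: html_document
bibliography: references.bib   # hypothetical .bib file exported from your citation manager
csl: apa.csl                   # hypothetical CSL style file; swap it to change citation style
---
```

Changing citation style is then a matter of pointing `csl` at a different style file and re-knitting.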

Once you have a library of PDFs available in Zotero, you can link it directly to RStudio. Then, to include citations in your writing, use pandoc citation syntax, which looks like this:

Following work by @Mendoza2018, this study seeks to replicate the original effect.
We used the instruments from the classical paradigm [@Jackson1998; @Jackson2001],
as well as the updated short form [@Mendoza2016] to examine possible differences
in participants' responses.

In this example, “Mendoza2016”, “Jackson1998”, etc. are the unique identifiers (citation keys) for articles in my Zotero library. The first citation will appear as part of the sentence since it isn’t surrounded by brackets (“Following work by Mendoza, Kene, and Strause (2018), this study…”), and the bracketed citations that follow will appear as regular parenthetical citations.

RStudio provides a quick way to search for the unique identifiers for your articles, so you don’t have to remember them. In the visual editor, you’ll see a small @ icon at the top of the editor pane. Click that to open the citation insertion window. You can search for your articles by author, keyword, title, tag, etc. and RStudio will pull up the unique identifier for you and insert the citation into your writing.

You’ll also notice RStudio offers autocomplete suggestions for you as you type if you start putting in a citation manually. For example, if you start typing @Mend, you should see a little popup of suggestions of all of your citations that are similar to what you’ve started typing.

For more on inserting citations in RStudio, see this tutorial: https://rstudio.github.io/visual-markdown-editing/citations.html#inserting-citations.

Cheatsheets

Did you know there are cheatsheets available right in RStudio?

Go to the Help menu at the top of the screen, and then select Cheatsheets. You’ll see several great topics available, including one on R-Markdown!

Common sources of confusion

In most cases, the .rmd file itself is not your end goal – you want an article or a dashboard or something to be rendered (aka “knit”) from that R-Markdown file. A lot of the time, you can blissfully ignore the details of the process that takes you from .rmd to final document, but as you start using R-Markdown for more things, eventually you’ll run into issues here.

If you’re not running into problems knitting your .rmd files, then feel free to ignore this section!

There are different flavors of markdown

Markdown is really popular, and it’s now being used for a ton of different applications.

The basics of markdown (e.g. headers, ordered and unordered lists, bold, italics, hyperlinks) work pretty much the same in every markdown flavor. This is good because it means that chances are you can parse your markdown with pretty much any parser you like and get fine results.

It’s also confusing, though, because it makes it very difficult to tell from looking at a .md or .rmd document what flavor of markdown it was written for, since they all look so similar. It also means that if you search online for how to do things in markdown, you may get results for a parser that’s different from the one you’ll be using. Because they’re all so similar, the code you find may still work, but there’s a chance it won’t and that can be very frustrating.

Here are a few of the flavors of markdown you’re most likely to encounter:

Most (but not all!) knitting done via RStudio uses pandoc markdown.
Another very widespread flavor of markdown is GitHub Flavored Markdown (often abbreviated GFM); this is the markdown parser that runs on github.com and renders your plain README.md into a nicely formatted document.
The original version of Markdown, with no extensions, is often called “strict” markdown.

Most flavors of markdown are very, very similar to each other. For example, if you copy-paste some markdown text from an R-Markdown file into a README.md file you’re writing, it will likely work exactly the same in RStudio (where it’s rendered with pandoc) and on github.com (where it’s rendered with GFM). There’s a chance it won’t, though, in which case you’ll be left with some unexpected formatting in your rendered file.
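If you need your knit output in a specific flavor, you can ask for it in the YAML header. As a sketch (assuming a reasonably recent version of pandoc, where `gfm` is the name for GitHub flavored markdown), this header knits to a markdown file in the GitHub flavor rather than to html:

```yaml
---
title: "README"
output:
  md_document:
    variant: gfm
---
```

The `variant` option of `md_document` accepts other flavor names as well, such as `markdown_strict`.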

Every once in a while, something won’t translate from one markdown flavor to another, which is why it’s important to be aware of this as a possible issue.

Naming your code chunks

You don’t have to provide names for any of your code chunks (knitr will automatically give them uninspired names like “unnamed-chunk-1” and “unnamed-chunk-2” when it knits), but you may want to.

Naming your code chunks has a couple of advantages:

It can make it easier for you to navigate a long R-Markdown document in RStudio (see Pro tips for writing R-Markdown in RStudio)
Files created during that code chunk (e.g. images saved from plots) will automatically be named based on the name of the chunk, so an informative chunk name results in informative file names.

There are a few things you have to watch out for when you name code chunks, though:

Each chunk must have a unique name – you can’t have two chunks with the same name in one .rmd document or it won’t be able to knit.
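Depending on the markdown parser you’re using, certain kinds of chunk names may result in errors during knitting. To be safe, stick to letters, numbers, and dashes, and avoid spaces, underscores, and periods, as the knitr documentation recommends.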

The issue of unique chunk names frequently comes up when you have a document where the same overall structure is repeated two or more times. For example, in an article where you present the results of three experiments, you may have a header for Study 1 with subheaders Data, Model, and Results, and then Study 2 with the same three subheaders, etc. If you have code chunks in each of those sections, your first instinct may be to name them corresponding to their subheaders (e.g. “data”, “model”, “results”), but you’ll get an error when you try to knit. Instead, you need to differentiate the chunks from each study, so something like “data-study1” is better, so that every chunk in the document has a unique name.
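Sketched out, that structure might look like this. The file paths, object names, and model formula are hypothetical; the point is only that each chunk label includes the study it belongs to:

````markdown
## Study 1

### Data

```{r data-study1}
# Hypothetical file path, for illustration only
study1 <- read.csv("data/study1.csv")
```

### Model

```{r model-study1}
model1 <- lm(score ~ condition, data = study1)
```

## Study 2

### Data

```{r data-study2}
study2 <- read.csv("data/study2.csv")
```
````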

Cache

There are many options you can add to code chunks in R-Markdown files. See the full list of chunk options here.

One very useful-but-dangerous option is cache.

By default, no code chunks are cached, which means whenever you knit your document all of the code gets run fresh (unless you’ve turned off code evaluation). If you have a chunk that is very time consuming to run (e.g. maybe the first chunk of code downloads a large dataset), you may not wish to execute that code again every time you want to knit. You can set cache=TRUE for that chunk, and then it will run the first time you knit and “cache” the result to use for any future knits; knitr only re-runs a cached chunk when that chunk’s own code or options change. Then when you knit again, it will run much faster since it doesn’t waste time re-running the cached chunk.
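In the chunk header, that looks like this (the URL and chunk name are placeholders):

````markdown
```{r download-data, cache=TRUE}
# Hypothetical URL; with cache=TRUE the download is skipped on later knits
# unless this chunk's code or options change
raw <- read.csv("https://example.org/large-dataset.csv")
```
````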

It can be easy to forget you’ve cached a chunk though!

For example, maybe that large data file you were downloading gets updated every once in a while — if you’ve cached the code to download it, then you won’t be getting that updated data unless you turn caching back off.

And if you’ve cached a chunk that’s part way through a process (e.g. step B of A > B > C), it can create really confusing results when you update other chunks and re-knit only to find your results aren’t changing as expected.

For example, let’s say you have a chunk A that reads in your data and does a little basic cleaning, then chunk B that runs a really resource-intensive analysis (maybe some cross-validated models), and finally chunk C that generates plots of your model results. You cache chunk B since it takes so long to run. After a few edits, you decide to make some changes in your data cleaning (chunk A) and then re-knit. But none of the plots reflect the new data cleaning you did! That’s because chunk B isn’t being re-run on the updated output from A; knitr is just re-using the results B produced the first time you knit and passing them on to chunk C.
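One way to guard against this is knitr’s `dependson` chunk option, which tells a cached chunk to re-run whenever another cached chunk changes. Here’s a rough sketch with chunks A and B; the object names and model are hypothetical, and note that `dependson` only tracks chunks that are themselves cached, so chunk A is cached here too:

````markdown
```{r clean-data, cache=TRUE}
cleaned <- na.omit(raw_data)   # `raw_data` is a hypothetical data frame
```

```{r fit-models, cache=TRUE, dependson="clean-data"}
# Re-run whenever the cached `clean-data` chunk is invalidated
fit <- lm(outcome ~ predictor, data = cleaned)
```
````

This doesn’t catch everything (for example, changes to a data file on disk), so it’s a complement to, not a replacement for, occasionally re-knitting with caching off.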

Being able to cache a chunk is great and can save you tons of time, but use it carefully. If you are caching some of your chunks, it’s a good idea to periodically turn the caching off and re-knit just to make sure there are no surprises.
