Statistical Programming Languages

This article doesn’t address commercial software such as MATLAB, SPSS, or SAS. These are all fantastic data and statistical software packages, though often quite expensive, and you may already use them in your research. Why should you consider adding an open-source statistical programming language such as R or Python to your toolkit?

  • R and Python are free. No expense for your lab, and no need to tailor scripts or datasets to meet the needs of collaborators – they too can download these for free. If you’ve ever had to deal with SAS code when all you have is SPSS, or vice versa, you know what a boon this is.
  • R and Python get innovative packages more quickly than commercial software. Whether you’re doing novel tasks (like data mining or network analysis) or novel statistical methods, it is likely that R and Python will be able to perform these months or years before commercial programs will.
  • Data visualizations in R and Python are highly customizable. Need some unusual graphs, or need to annotate them in a specific way? It’s doable in these programming languages, but less so in SPSS, SAS, etc.
  • Building custom pipelines that connect your data processing directly to your analysis is straightforward in statistical programming languages.
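
As a rough illustration of the last point, here is a minimal sketch in Python of a pipeline that loads data, cleans it, and summarizes it in one script. The data, column names, and function names are all made up for this example, and it uses only Python's standard library; a real project would more likely read from a file and use a data analysis package.

```python
import csv
import io
import statistics

# Hypothetical raw data; in practice this would come from a file on disk.
RAW_CSV = """subject_id,group,score
1,control,10
2,control,14
3,treatment,18
4,treatment,
5,treatment,22
"""

def load_rows(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))

def clean_rows(rows):
    """Drop rows with a missing score and convert the rest to floats."""
    return [
        {**row, "score": float(row["score"])}
        for row in rows
        if row["score"].strip()
    ]

def mean_by_group(rows):
    """Compute the mean score for each group."""
    groups = {}
    for row in rows:
        groups.setdefault(row["group"], []).append(row["score"])
    return {name: statistics.mean(scores) for name, scores in groups.items()}

# Processing and analysis are chained together in a single step.
summary = mean_by_group(clean_rows(load_rows(RAW_CSV)))
print(summary)
```

Because each stage is an ordinary function, a cleaning rule can be changed and the whole analysis rerun with one command, rather than repeating manual steps in a menu-driven program.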

Is there a learning curve with R or Python? Certainly. Is the reward worth the effort? Absolutely. There is a lot of support out there from other users of R and Python, and this Arcus Education site is dedicated to helping you learn as well.

Let’s consider some of the benefits of each of these languages:

Python

  • Is generally more memory-efficient than R and better suited to very large datasets.
  • May seem more familiar if you have a background in programming.
  • May be better in your pipeline, if you have to do a lot of file manipulation.
  • Is multi-purpose, not just a statistical tool. You can build desktop apps, websites, and more using Python. It easily supports the development of software (not just scripts).
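
To give a sense of the file manipulation mentioned above, here is a small sketch that sorts the files in a folder into subfolders by file extension, using Python's standard pathlib module. The function name and the example file names are hypothetical, and the demonstration runs in a temporary directory so it does not touch any real files.

```python
import tempfile
from pathlib import Path

def sort_by_extension(folder):
    """Move each file in `folder` into a subfolder named after its extension."""
    moved = {}
    # sorted() takes a snapshot of the listing before any subfolders are created.
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        ext = path.suffix.lstrip(".").lower() or "no_extension"
        target_dir = Path(folder) / ext
        target_dir.mkdir(exist_ok=True)
        path.rename(target_dir / path.name)
        moved.setdefault(ext, []).append(path.name)
    return moved

# Demonstrate on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["visit1.csv", "visit2.csv", "notes.txt"]:
        (Path(tmp) / name).write_text("placeholder")
    result = sort_by_extension(tmp)
    print(result)
```

This sketch does not handle name collisions (for example, an extensionless file that happens to be called "csv"), so treat it as a starting point rather than a finished tool.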

R

  • Is generally easier to install and get running than Python.
  • Has a gentler learning curve than Python for new programmers.
  • Has a very good IDE, RStudio, that is the industry standard.
  • Is focused on statistics and data analysis – you won’t go down any internet rabbit holes learning about things you aren’t interested in, like website development, as you might with Python.
  • Tends to have more consensus about the “right way” to do things, as compared to the wild wild west of Python.

Both

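  • Support literate statistical programming, in which narrative text, code, and results live together in one document (for example, R Markdown or Jupyter notebooks).
  • Support REPLs (read-eval-print loops) that let you run code interactively, one line at a time, instead of writing a whole script and hoping it works.
  • Have a strong support community of academic researchers online and at CHOP.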

In short:

If you have smallish data (files of a few hundred thousand rows or fewer, under about 500 MB) and you’re new to programming, try R first. Familiar with programming or have big data? Go for Python.
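
If you are not sure which side of that line your data falls on, a quick check like the sketch below can help. It counts the data rows in a text file and measures its size on disk. The function names are hypothetical, and the cutoffs are assumptions based on the rule of thumb above (the exact row limit in particular is one possible reading), so adjust them to your situation.

```python
import os
import tempfile

# Assumed cutoffs, loosely based on the rule of thumb in the text.
MAX_ROWS = 100_000
MAX_MEGABYTES = 500

def describe_file(path):
    """Return (data_rows, size_in_megabytes) for a text file with one header line."""
    with open(path, "r", encoding="utf-8") as handle:
        line_count = sum(1 for _ in handle)
    size_mb = os.path.getsize(path) / 1_000_000
    return max(line_count - 1, 0), size_mb

def is_smallish(path):
    """Check the file against the assumed row and size cutoffs."""
    rows, size_mb = describe_file(path)
    return rows <= MAX_ROWS and size_mb < MAX_MEGABYTES

# Demonstrate on a tiny made-up CSV file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("id,value\n1,3.2\n2,4.8\n3,5.1\n")
    demo_rows, demo_mb = describe_file(demo_path)
    demo_is_small = is_smallish(demo_path)

print(demo_rows, demo_is_small)
```

The file is read one line at a time, so the check itself should work even on files too large to open comfortably in a spreadsheet.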