Articles

July 20, 2022

Using the REDCap API

Tired of working with downloaded .csv files? Want to reach into your REDCap database in a scripted way that always gets the freshest copy of data? Read on!

March 1, 2022

There’s an equation in this paper! Now what?

Anyone reading scientific papers will eventually, if not frequently, come to a line of mathematical symbols that do not make sense. This article is about how to keep reading!

December 27, 2021

Version control your writing

You’ve heard of using version control for code, but have you thought about what version control could do for your writing?

December 22, 2021

You may have heard that R Markdown files are great for reproducible research, but what do they actually do? This post provides an overview of what you can do with R Markdown (including properly formatted journal articles, slide decks, websites, dashboards, Word documents, and more) and links to resources to get you started.

November 22, 2021

How do I collapse data from several columns into one?

Learn how to take similar data stored across several different columns and combine it into a single column in R.

November 2, 2021

Getting Started with lasso regression in R

Links to great resources to learn how to conduct lasso regression in R, and other related techniques.

July 30, 2021

R 4 Beginners Chapter 7 - Reading Tabular Data

Learn to import tabular data with the readr package.

July 8, 2021

ANOVA tables in R

This post shows how to generate an ANOVA table from your R model output that you can then use directly in your manuscript draft.

June 21, 2021

Linear Regression in R: Annotated Output

This post shows how to run a linear regression model in R, and how to interpret the output produced, line by line.

June 14, 2021

Understanding Interactions in Linear Models

This post will walk you through how adding an interaction term between a continuous and a categorical predictor changes your model, both statistically and in terms of your real-world interpretation of results.

March 29, 2021

Arcus Data Repository: A Fast Track to Research

Why has Arcus created a new clinical data source? Find out more about the Arcus Data Repository.

February 11, 2021

R 4 Beginners Chapter 6 - Reproducible Programming

Learn the tools for coding reproducibly with R scripts and R Markdown.

January 19, 2021

R 4 Beginners Chapter 5 - Data Transformation

Learn how to get your data in the form you need using dplyr.

January 14, 2021

Josh Taban’s CHOP Internship

How an internship at CHOP led to a love of data analysis with R.

November 20, 2020

R 4 Beginners Chapter 4 - Data Visualization with ggplot2, Part II

Continue learning data visualization with ggplot2, using statistical transformations, plot labels, position adjustments, and coordinate systems.

November 3, 2020

R 4 Beginners Chapter 3 - Data Visualization with ggplot2

Getting started using ggplot2 for data visualization.

November 3, 2020

R 4 Beginners Chapter 2 - Coding Basics

Walking through some basics of writing code for the first time.

November 3, 2020

R 4 Beginners Chapter 1 - Introduction and Installation

Interested in learning R from the very beginning? This is a great place to start!

June 19, 2020

The UMLS Metathesaurus

Confused by all the potential ways to code or describe medical terms? ICD, OMOP, RxNorm, SNOMED – what are all these acronyms? Learn more about an exceptionally helpful tool in this post by librarian and data guru Hannah Calkins.

January 15, 2020

What Type of Machine Learning Should I Use?

Let’s demystify machine learning! In this article, you’ll learn 1) the fundamentals of machine learning and 2) how to translate your research questions into well-scoped ML tasks.

January 2, 2020

The REDCap API and Windows

Have you tried our hints and tips for accessing your REDCap data via the REDCap API but have run into strange ‘TLS’ or ‘SSL’ errors? Do you use Windows? You’ll want to read this piece!

January 2, 2020

The U.S. Census Bureau and Child Health

What does the U.S. Census Bureau have to do with child health? How can you discover and use Census data in your research? Find out more in this first of a series of articles on Census data.

September 30, 2019

Data Sharing and Privacy: A Very Cursory Overview

Can you see or use a given set of data? What are the rules around privacy? When do you need IRB approval? Find out more in this flowchart-driven article!

July 30, 2019

User Groups at CHOP

Are you an R or Python user? Are you interested in either language? You might find CHOP’s User Groups interesting and useful!

July 15, 2019

Cloud Tools for the Unconvinced

Curious about R and Python, but not sure where to start? Not ready to commit to a download? Read on!

April 10, 2019

The Spreadsheet Betrayal

Somewhere along the line, the trust you had fostered soured.

March 29, 2019

Arcus Annotations and cTAKES

A given cTAKES configuration, called a pipeline, has multiple stages (annotators) that perform specific tasks like part-of-speech tagging or detecting term negation. The steps are run in sequence.

March 29, 2019

Arcus Annotations and RDoC

Rather than classifying symptoms into disorders as in the DSM system, any aspect of human experience can be measured, multidimensionally, by scoring it within a carefully designed system of domains, constructs within domains, and sub-constructs on the one hand and units of analysis on the other. These factors are examined ‘in a context emphasizing developmental trajectories and the individual’s interactions with his or her environment.’

March 29, 2019

Arcus Annotations: Harvesting Data from Text Notes

Investigators at CHOP have not had a method for gleaning large amounts of note data for their studies. But they soon will.

March 18, 2019

Feasibility Analysis Using Arcus Cohort Discovery

How can you decide, more quickly, whether your idea for a study is feasible, given CHOP’s patient population? Arcus has a new solution that gives you more autonomy. Learn more about how you can discover cohorts using Arcus Cohort Discovery!

March 15, 2019

Meet the Arcus Library Science Team

Meet the information scientists who are helping Arcus create exciting new solutions for the organization, preservation, and interconnection of CHOP’s research efforts: our Library Science team!

March 11, 2019

Why Archivists and Librarians?

Why does Arcus have a team of highly trained archivists and librarians as part of our efforts to create a more nimble, innovative, and interconnected research environment at CHOP? Learn about how the original information science, library science, is revolutionizing the way we work with data at CHOP.

February 14, 2019

The Argument Against Aggregation

Not everybody thought taking the average was a good idea.

February 14, 2019

Swirl: Learn R in R

Detailed instructions on how to start learning R using R’s swirl package.

February 14, 2019

Statistics Chapter 1: Measures of Central Tendency and Dispersion

Start here to learn in depth about the theory and practice of statistics.

February 13, 2019

Variable Types

Variable types commonly used in statistical analysis and their basic properties.

February 13, 2019

Do Patterns in Missing Data Matter?

Do missing data represent information from which we might draw conclusions?

February 13, 2019

Tiny Munge

An example of a typical small data munging problem with its solution.

February 6, 2019

Descriptive Statistics: The Bullet

Follow these (very) basic rules about displaying variable types.

October 26, 2018

Date Pairing in R

Do you have data that includes two or more repeated measurements? Need to figure out which ones to use? It gets complicated! Find out how to avoid multiple rows in your merge and choose the right pair of measures for your research.

October 20, 2018

Comparing Parts of Speech with NLTK

In this lab, we compare parts of speech using natural language processing (NLP) via NLTK. Do presidents differ by party in their State of the Union language? Let’s find out!

September 25, 2018

Data Preparation

How do you get from raw data to something you can do statistical analysis on? And who’s responsible for cleaning your data? Read more about it in this post.

September 24, 2018

Clinical Data in R

Need to work with clinical data? Not sure how to start with the big, messy, raw data you were given? Give this article a read!

September 5, 2018

Best Practices for REDCap Variables and Instruments

Making sure your fields and instruments are named well and properly set up in REDCap will ensure that your data collection goes smoothly. Find out more in this article!

September 5, 2018

Collecting Sex and Gender Data

Making sure your fields and instruments are named well and properly set up in REDCap will ensure that your data collection goes smoothly. Find out more in this article!

September 5, 2018

REDCap Race and Ethnicity Data Collection

The collection of race and ethnicity can be surprisingly inconsistent across studies. Find out the right way to collect this data according to federal standards.

September 5, 2018

REDCap: PHI and Permissions

How can you include PHI in your REDCap database in a safe way? Read this article to find out.

September 5, 2018

REDCap Data Collection Overview

Do you use REDCap to collect data, or to track subjects? You’ll want to read about some common pitfalls and how to avoid them in this series of articles!

September 5, 2018

REDCap Free Text Collection

Allowing free text fields may seem attractive and simple, but this data collection strategy comes at a cost. Find out more about REDCap’s free text fields in this article.

September 5, 2018

REDCap Field Types

Getting field types wrong can mean data ends up difficult or impossible to use at analysis time. Find out how to ensure your REDCap field types are correct in this article.

September 5, 2018

REDCap Free Text Collection

Are you inadvertently combining two or more data points into one field? This is hard to tease out at the end of data collection, and will complicate your research. Find out what we mean by data combining in this article, one of a series of articles to improve your experience of REDCap.

August 28, 2018

My File is Over There: File Paths for Data Scientists

Let’s say that you have some .csv files locally on your computer, and you want to load them into R or Python. You’re working in RStudio or a Jupyter notebook, and you’re not sure how to point to the file you want to bring in. This can be considerably painful if you are new to the concept of file paths. If you’re new to writing code, or you’ve encountered problems with this, read on!

August 14, 2018

FIPs and the Belmont Report: Divergence

The FIPs and the Belmont Report treat bias and discrimination differently. Want to learn more? Read this third of a series of three articles on FIPs and the Belmont Report.

August 13, 2018

Ordinary Linear Regression in R

Want to learn how to do ordinary linear regression in R? Read on!

August 9, 2018

Null Hypothesis Statistical Testing (NHST)

If it’s been a while since you had statistics, or you’re brand new to research, you might need to brush up on some basic topics. In this article, we’ll take on hypothesis testing.

August 7, 2018

Python Lab for Beginners

In this lab, we’ll walk you through what to do when you get a .csv – how to bring it into Python, do some data cleaning, gather summary statistics for reporting, and do some initial data visualizations. This is a great place to start if you’re brand new to Python!

August 7, 2018

FIPs and the Belmont Report: Similarities

Both the FIPs and the Belmont Report emphasize the importance of obtaining a subject’s proper consent, although each accomplishes this through different means. Want to learn more? Read this second of a series of three articles on FIPs and the Belmont Report.

August 6, 2018

Linear Algebra, a Geometric Approach

Linear Algebra, what even is it? If you’re baffled by linear algebra notation or the role of linear algebra in statistical metrics, or you just want to improve your basic grasp of linear algebra, check out this article.

August 2, 2018

The p Value Controversy

Why are p values under fire? In this article, we try to explain a bit of why many statisticians and researchers consider the p value insufficient evidence.

August 1, 2018

Customizing ggplot2 Visualizations With ggThemeAssist

Ever struggle with getting your ggplot2 visualization to meet all of your needs? Tired of having to go to Stack Overflow every time you prepare a graph for publication? Read this piece.

July 31, 2018

FIPs and the Belmont Report: Principles

The Fair Information Practices (FIPs) and the Belmont Report are two key sets of principles that have been incorporated into laws both nationally and internationally since their respective publications in 1973 and 1979. These principles have helped shape the regulations that guide researchers. Want to learn more? Read this first of a series of three articles on FIPs and the Belmont Report.

June 18, 2018

Social Justice and Data Science

Researchers have a special duty to ensure that their research is just. Data scientists also need to keep social justice in mind when analyzing and preparing data.

June 14, 2018

Intro to the Linux Command Line

Need to work with the Linux command line, and you’ve never done it? Not sure where to start? Start here!

June 8, 2018

Mapping Environmental Exposures

Ever want more detail about the environmental context of your subjects? Want to map outcomes, disparities, or healthcare realities? This hands-on code lab shows you how to map Philadelphia’s recent shooting history alongside your research data.

June 4, 2018

What is an API?

Why is Arcus interested in creating APIs for researcher use? What even is an API, and why would you use it?

May 25, 2018

What is Metadata?

The Arcus Data Catalog will use metadata to facilitate data discovery. Why is metadata so important in research?

May 25, 2018

Arcus Clinical Cohorts

Defining a clinical cohort can be challenging, and can represent many hours of work. How can we make this effort more productive and benefit other researchers? Find out how Arcus is working to improve clinical cohort definition in this article.

May 18, 2018

Arcus Data Catalog

Ever resort to Google to find CHOP researchers? Wonder what else is going on in the Research Institute, but not sure how to find out? An important tool Arcus will bring to the research process at CHOP is a Data Catalog that helps you find research that’s pertinent to you.

April 1, 2018

Code Readability

How can I make my code more readable? A few tips and tricks for those who want to make sure their code is understandable.

March 21, 2018

Getting to one row

Do you have a pile of data about patients or research subjects, and you want one row per person, but can’t seem to get there easily? Find out why and how to solve this conundrum!

March 20, 2018

What is SQL?

Do you have data that’s held in a SQL database? Find out more about this important database technology and what you really need to know to get started.

March 14, 2018

Data Combining in R

Do you need to combine a few different datasets into one? R excels at this. Find out about merging, column binding, and row binding here!

March 13, 2018

ggplot overview

Do you want to make useful, attractive data visualizations in R? ggplot is the visualization solution you’re looking for!

March 11, 2018

Intro to Machine Learning: Trees

What is predictive, supervised machine learning? Can you do it in R? Find out more by examining one machine learning algorithm here!

March 7, 2018

Natural Language Processing with NLTK

Natural language processing (NLP) will come in handy if you analyze things like physician notes or language samples from research subjects. It allows you to examine language in various ways. Try a brief lab in working with a language sample!

March 6, 2018

Excel, if you must…

The use of Excel in science is a hotly debated issue that stirs strong feelings. If you choose to use Excel, you should understand why it’s so controversial and how to use it as safely as possible!

March 5, 2018

Regex 101

Regex is a way to find strings that match a pattern you’re looking for. It’s handy in data processing, as well as in writing scripts. Read more about this skill (some would say arcane art) in this post.

March 5, 2018

High Performance Computing

What is high-performance computing? What resources does CHOP have available for your computing needs, when your laptop just won’t cut it? Find out more in this article.

February 28, 2018

Arcus’s Virtual Biobank

What is a virtual biobank? How does Arcus support genome-phenome studies?

February 28, 2018

Understanding Pearson’s r

Ever feel like you don’t have an intuitive grasp of what Pearson’s r correlation score is? You might know that scores with an absolute value close to 1 are useful, but why? And what’s the relationship between correlation and a linear model? Find out more here!

February 28, 2018

Statistical Programming Languages

What statistical programming language should you use?

February 28, 2018

Intro to NetworkX

NetworkX is a Python package you can use to do graph analysis or construct network diagrams. Read more and run code in Python 3 to see how this module works!

February 22, 2018

Why Use Literate Statistical Programming?

Why does literate statistical programming matter to a biomedical researcher? Learn how this programming paradigm can increase your productivity and scientific rigor.

February 22, 2018

R Markdown 101

In this video lab, you’ll make your first R Markdown document. This is an application of literate statistical programming.

February 22, 2018

Clinical Data at CHOP

How is clinical data at CHOP stored? How can it be accessed by researchers?

February 20, 2018

Recording Consent

It’s important to track your consent in a digital way, just as you record your research data in a digital way. Check out some suggestions on how to work with consent data in this article.

February 20, 2018

Distributed Humaning

When should you use human effort for things like coding data, versus developing automated solutions? A few thoughts.

February 15, 2018

Privacy Risks

What legal and financial risks do you undertake when you work with data? Learn more about the consequences of data carelessness in this article by attorney and privacy expert Dianna Reuter!

February 8, 2018

Interrogating the Data Until it Confesses

A big part of reproducible research is the responsible, rigorous conduct of research with regard to statistical methods. There are a number of ways to get a publishable manuscript that do not actually give reproducible results. Learn more about p-hacking, HARKing, and other shady practices here!

February 7, 2018

Statistical Intervals and Visualizations: Difference Between Means

One criticism that seems to have gained traction in the quest for reproducible research is that traditional methods rely too much on point estimates instead of interval estimates. In a 2014 article by Geoff Cumming, a number of visualizations are offered as improvements in demonstrating interval estimates. Learn how to reproduce those graphics in ggplot2. This article shows how to make a difference-between-means graph.

February 4, 2018

Data Dictionaries

What’s a data dictionary? What do you put in it, and why does it matter?

February 2, 2018

R Lab for Beginners

In this lab, we’ll walk you through what to do when you get a .csv – how to bring it into R, do some data cleaning, gather summary statistics for reporting, and do some initial data visualizations. This is a great place to start if you’re brand new to R!

January 31, 2018

Scripted Analysis for Reproducibility

Lots of digital ink has been spilled over the topic of reproducibility in science. This article addresses the technology that supports reproducible analysis of data, using scripted analysis.

January 31, 2018

Base R Plotting

Base R plotting is great for fast data exploration. While these graphics aren’t likely to be attractive enough for publication, they’re perfect for checking out hypotheses and understanding your data more easily.

January 31, 2018

Flat File Data Storage

Flat file data storage includes file types like .csv, .json, and .xml. Learn how these differ from one another and their relative advantages and disadvantages.

January 30, 2018

Cartesian Result Sets

Why does my result set after a data pull have more than one row per subject? If you’ve ever wondered why a simple data query turns so complicated, with multiple rows per person, this article is for you!

January 24, 2018

Literate Statistical Programming

What is literate statistical programming? How is it different from just commenting some code well, or documenting all the changes you make to a dataset in a separate file? Read more here about how literate statistical programming can streamline your scientific production.

January 19, 2018

Sparklines in ggplot2

Sparklines are a great way to show trends over time. In this article we gradually build up a ggplot2 sparkline visualization.

January 18, 2018

Welcome to the Tidyverse!

What is tidy data? This post explores the set of R tools called ‘the Tidyverse’ as well as explaining a bit about tidy data generally.

January 18, 2018

Jupyter 101

If you’re new to Python, working in an interactive environment like a Jupyter notebook might help you hone your code more easily.

January 18, 2018

Git 102

Learn how to use CHOP GitHub by creating your first repository.

January 18, 2018

Git 101

Explaining what version control is and beginning an exploration of Git / GitHub.

January 18, 2018

Writing Functions in R

If you’re not an experienced programmer, understanding why writing functions is important may seem very abstract. This article gives you some code examples to explain why writing functions makes your code stronger and easier to use.

January 18, 2018

When R Gets Too Helpful

Sometimes R oversteps its bounds by assuming you want something you really don’t. Learn about a few cases where this might happen, and how to avoid it!

March 15, 2017

Version Control Curriculum

Introduction to Version Control using Git and GitHub

March 15, 2017

Glossary of Terms
