Data preparation is the sometimes complicated task of getting raw data (from a SQL database, a REDCap project, a .csv or .json file, a spreadsheet, or any other source) into a form that is ready for statistical methods to be applied, whether to test hypotheses or to describe patterns in the data.

Data preparation is an often overlooked and under-budgeted part of a research plan. In an ideal world, data collection is carefully planned and conducted with the final analysis in mind. Data quality checks are used at every phase, including double entry for every field. Participant consent and assent are recorded electronically, possibly in the same REDCap project as the data itself. Units are consistent. Variable names are self-documenting. No subject repetition occurs – one row per subject, always complete, always accurate. The research data is tidy and ready to drop into statistical analysis without any additional work.

Unfortunately, that’s hardly ever what happens. You may have changed your analysis goals, or your data collection may never have aligned with them. You might have repeated measures to account for. Maybe two values, like systolic and diastolic blood pressure, were included in a single field, and you realize now you have to separate them in order to do math. There’s almost certainly missing data, with research participants neglecting to answer a question or complete a task. Data collection may be inconsistent, with some research assistants or coordinators entering data one way, while others did it a different way. You may encounter unrealistic outliers, or realize too late that you should have asked a question in a different way. The data you’re trying to analyze may not come from your lab at all – it could be clinical data from the EHR, research data collected by a collaborator, or public data from a government or non-profit agency. Perhaps the variables are named things that are hard to understand, and the data dictionary (if one exists) isn’t terribly helpful. The list of real-life complications goes on.

The tasks of data preparation include things like:

  • Checking for data-entry issues (like “3y5m” in an age field that should be numeric)
  • Analyzing missingness and formulating a strategy (Do you impute values, and if so, how? Exclude the subject from study? Exclude the variable from analysis?)
  • Data harmonization (e.g. changing lbs to kg to accommodate collaborator requirements)
  • Data splitting (for example, splitting blood pressure into systolic and diastolic variables)
  • Aggregating repeated measures (We have between 5 and 13 oxygen sat levels for each patient – how do we get this to a single value? Mean? Median? Most recent? Earliest? Closest to the date of surgery?)
  • Combining data (Both tables have the same subjects; make one table out of them)
  • Checking for outliers and deciding what to do with them
  • Finding logical impossibilities and resolving them (This person has a problem list including both testicular cancer and PCOS)
  • Dealing with data dimensionality (such as removing or combining columns, performing PCA)
  • Deidentification (as needed)
  • Making decisions about tie-breaking (Both of my sources include race data on my participants, but for 5 participants, the race is inconsistent. Which source do I trust more?)
  • Closely examining the data to ensure inclusion/exclusion criteria are met (I’m studying the effect of appendectomy on colon motility, but realize that one of my appendectomies was due not to appendicitis, but cancer, and I realize now I only wanted appendicitis-related appendectomies.)
  • Closely examining the data to see if there are any data collection design problems (I included a question on handedness, but made it checkboxes, not radio buttons, so 3 of my subjects clicked both “exclusively right-handed” and “exclusively left-handed”. Next time I’ll make sure that can’t happen.)
  • Shaping the data for analysis (I am doing group comparisons, so I need to make sure the group identifier is in the table, and I want to ensure that each subject appears in one row and one row only.)
  • Altering data types (some of my character-type variables are free text, but some are categorical, with one of my categorical variables being ordered)
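To make the first of these tasks concrete, here’s a minimal sketch in Python with pandas (using a made-up data frame and column names) of catching a data-entry error like “3y5m” in an age field that should be numeric:

```python
import pandas as pd

# Hypothetical raw data: an age field that should be numeric,
# but contains a data-entry error like "3y5m".
raw = pd.DataFrame({
    "subject_id": ["A01", "A02", "A03"],
    "age": ["34", "3y5m", "41"],
})

# Coerce to numeric; entries that can't be parsed become NaN,
# which makes the problem rows easy to find and review by hand.
raw["age_numeric"] = pd.to_numeric(raw["age"], errors="coerce")
bad_rows = raw[raw["age_numeric"].isna()]
```

Here `bad_rows` contains only subject A02, whose age needs to be corrected (or queried back to the source) before analysis.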
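Data splitting can be sketched the same way. Assuming blood pressure was entered as a single “120/80” style text field, one pandas approach is:

```python
import pandas as pd

# Hypothetical data: blood pressure recorded as one "systolic/diastolic" string.
bp = pd.DataFrame({
    "subject_id": ["A01", "A02"],
    "bp": ["120/80", "135/90"],
})

# Split on "/" into two numeric columns so each value can be used in math.
bp[["systolic", "diastolic"]] = (
    bp["bp"].str.split("/", expand=True).astype(int)
)
```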
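Aggregating repeated measures always involves a judgment call. As one illustration (with invented oxygen saturation readings), here is the “most recent value per patient” strategy; mean, median, or earliest would each be a one-line change:

```python
import pandas as pd

# Hypothetical repeated measures: several O2 saturation readings per patient.
sats = pd.DataFrame({
    "subject_id": ["A01", "A01", "A01", "A02", "A02"],
    "measured_on": pd.to_datetime(
        ["2024-01-02", "2024-01-05", "2024-01-09", "2024-01-03", "2024-01-04"]
    ),
    "o2_sat": [95, 97, 96, 99, 98],
})

# One strategy among several: sort by date, then keep the last
# (most recent) reading for each patient -- one row per subject.
most_recent = (
    sats.sort_values("measured_on")
        .groupby("subject_id", as_index=False)
        .last()
)
```

Which strategy is right depends on your research question – exactly the kind of decision that requires the investigator’s subject matter expertise.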
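Combining tables that cover the same subjects is typically a join on the shared identifier. A sketch, again with invented tables, that also guards against accidental subject repetition:

```python
import pandas as pd

# Hypothetical tables from two sources describing the same subjects.
demographics = pd.DataFrame({
    "subject_id": ["A01", "A02", "A03"],
    "sex": ["F", "M", "F"],
})
labs = pd.DataFrame({
    "subject_id": ["A01", "A02", "A03"],
    "hemoglobin": [13.2, 14.8, 12.9],
})

# Merge on the shared identifier to make one table of the two;
# validate="one_to_one" raises an error if any subject repeats.
combined = demographics.merge(labs, on="subject_id", validate="one_to_one")
```

The `validate` check is cheap insurance: it turns a silent duplication problem into a loud error at merge time.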

It’s important to realize that responsibility for data preparation falls on the shoulders of the principal investigator. Yes, you may (and probably should) delegate these tasks to someone on your team, but data preparation must be informed by your subject matter expertise and research question. Data preparation belongs predominantly with the research team, not the data provider. Simply demanding that an external data reporter flatten, simplify, or clean the data to meet your needs is generally not a good idea. They may be experts in finding and delivering data, but they’re probably not experts in your field of study.

The work of preparing your data is not trivial – many sources suggest that data preparation constitutes about 80% of the work involved in data-driven research and insights. It’s tempting to cut corners on data preparation to avoid this expenditure, but good data preparation has a significant return on investment: a better understanding of the strengths and weaknesses of the dataset, some insight into what kinds of statistical testing or modeling will be most effective, and a better idea of how to structure experiments and data collection in the future.

If you want to get started working with data preparation, check out the R Lab for Beginners or Python Lab for Beginners, which show you the first steps of ingesting data and starting to work with it.

If you want to understand a bit more about why data from the clinical record is messier than you would like it to be, check out Getting to One Row. To learn how to work with clinical data in R, consider using our Swirl course.