What is a data dictionary, and why does it matter? Do you really need one, if your project is small?

A data dictionary can take many forms, from a simple .csv you create in Excel to a complex, nested set of JSON definitions. It captures your metadata: the data about the data you collect. Consider what you would want to know about a dataset that a collaborator from another institution gave you. Some of these metadata need to be recorded as part of data collection, since they change with each subject; others are static across the study and belong in a data dictionary.

  • Units – is age in months, whole years, years:months, decimal years, categories like “under 12”?
  • Source – is IQ from the WISC-IV? The DAS-II? Which IQ score – verbal, non-verbal, full-scale?
  • Coding mechanism – was the observation double-coded by two raters? How was rater reliability assessed?
  • Reasonable values – should you expect only integers? Decimals? Positive numbers only?
  • Text of the question – were parents asked “Does your child sleep through the night?” or “Has your child slept through the night more than 90% of the time for the past 6 months?”
  • Equipment – what kind of EEG machine created these measurements?
  • Laboratory methods – how was microbial contamination controlled? What protocol was used to count cells?
  • Date of collection – how old was the subject? What time of year was it?
  • Time of collection – what time of day?
  • Temperature, humidity, other climate effects – was the skin swab collected on a hot, humid day in mid-summer or in an air-conditioned hospital on inpatients?
  • Who did the collection – did a psychiatric nurse administer the screening? A research assistant?
  • Missing data information – why are some elements missing?
  • Data interpolation – was any interpolation of missing data carried out? Was proxy data included?
  • Reporter – who filled out the phenotypic questionnaire? The subject? Their parents? A teacher? A spouse?
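
To make a few of these concrete, here is a minimal sketch of what one richer entry might look like if you went the nested JSON route. The field names (units, valid_range, and so on) and the 0 to 216 month bounds are assumptions made up for illustration, not a standard schema, and the check_value helper is hypothetical.

```python
import json

# Hypothetical entry: the keys and the range bounds are illustrative
# assumptions, not part of any standard data dictionary format.
AGE_ENTRY = {
    "name": "age",
    "description": "subject age at time of data collection",
    "units": "whole months",
    "data_type": "integer",
    "valid_range": {"min": 0, "max": 216},  # assumed: birth through 18 years
}


def check_value(entry, value):
    """Return a list of problems found for one recorded value."""
    problems = []
    if entry["data_type"] == "integer":
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            problems.append(f"{entry['name']}: expected an integer, got {value!r}")
            return problems
    valid = entry.get("valid_range")
    if valid is not None and not (valid["min"] <= value <= valid["max"]):
        problems.append(
            f"{entry['name']}: {value!r} is outside "
            f"{valid['min']}-{valid['max']} {entry['units']}"
        )
    return problems


# The same entry serialized as JSON, ready to save alongside the data.
entry_as_json = json.dumps(AGE_ENTRY, indent=2)
```

Calling check_value(AGE_ENTRY, 30) returns an empty list, while a negative or decimal age returns a one-item list describing the problem. Recording units and reasonable values in a machine-readable form lets you catch data entry errors as they happen rather than at analysis time.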

At the very least, a data dictionary should include the variable name, what it measures (as precisely as possible), and the data type. You can do this in Excel and save it as a .csv:

Variable Name | Description | Data Type
age | subject age at time of data collection, in whole months | integer
IQ | Full-scale IQ standard score as measured by WISC-IV, administered by clinical psychologist | integer
hemoglobin | g/dl | decimal
diabetes_dx | Current type 2 diabetes dx in medical record | Categorical: True / False
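
If you would rather script it than use Excel, the same minimal dictionary can be written with Python's built-in csv module. This is only a sketch: the rows are copied from the example above, and the write_dictionary and read_dictionary helpers are hypothetical names. It writes to an in-memory buffer so it runs without touching your disk.

```python
import csv
import io

FIELDS = ["Variable Name", "Description", "Data Type"]

# Rows copied from the example table above.
ROWS = [
    {"Variable Name": "age",
     "Description": "subject age at time of data collection, in whole months",
     "Data Type": "integer"},
    {"Variable Name": "IQ",
     "Description": "Full-scale IQ standard score as measured by WISC-IV, "
                    "administered by clinical psychologist",
     "Data Type": "integer"},
    {"Variable Name": "hemoglobin",
     "Description": "g/dl",
     "Data Type": "decimal"},
    {"Variable Name": "diabetes_dx",
     "Description": "Current type 2 diabetes dx in medical record",
     "Data Type": "Categorical: True / False"},
]


def write_dictionary(rows, handle):
    """Write the data dictionary rows as CSV to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


def read_dictionary(handle):
    """Read the data dictionary back as a list of dicts."""
    return [dict(row) for row in csv.DictReader(handle)]


# Round trip through an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_dictionary(ROWS, buffer)
buffer.seek(0)
round_trip = read_dictionary(buffer)
```

To save a real file, pass a handle from open("data_dictionary.csv", "w", newline="") instead of the buffer (the filename is just a placeholder). The csv module quotes descriptions that contain commas, so they survive the round trip intact.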

You could also include notes fields in the table, and/or supplement your data dictionary proper with a text document briefly outlining collection, coding, and data entry methods.

This all seems like a lot of work, but in a busy lab, turnover isn’t unusual: research assistants and postdocs come through, do work for a period of time, then leave without having documented their assumptions or methods. It’s disheartening to find data that is poorly annotated and requires hours or days of work to fully grasp. Creating a robust data dictionary at the start of your work also lets you make sure you’re collecting everything you need and gives you a jump start on the methods section of any publications. It might also make you aware of data collection issues. Realizing that certain data points might not be available for certain subjects, or narrowing down to only certain data sources, may mean you need to recruit more subjects or recruit more broadly.