Metadata is, in its simplest definition, data about data. Metadata answers questions such as:
- Who collected the data?
- When was the data collected?
- What units does the data use?
- What kind of thing is the data measuring?
Metadata is critical to data discovery, that is, finding the data you’re interested in. For example, given a list of available datasets that include weights and temperatures, you might want only the data collected from human subjects, not from animal models. Or, when you’re looking at IQ scores, you may want to limit yourself to IQs that were assessed by clinical psychologists, not by research assistants.
Metadata can be found in many places. Sometimes it’s in variable names, like “weight_kg”, which discloses units. Often it’s in data dictionaries or codebooks, where a variable in a dataset is described more completely. Sometimes you might find metadata in an abstract or in the methods section of a paper, or in some descriptive text that accompanies a data download.
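As a rough illustration of what a data dictionary captures, here is a minimal sketch in Python. The variable names, descriptions, and the `describe` helper are all hypothetical and chosen only for this example; real codebooks vary widely in format and level of detail.

```python
# Hypothetical data dictionary (codebook) for an illustrative dataset.
# Each key is a variable name; each value holds metadata about that variable.
data_dictionary = {
    "weight_kg": {
        "description": "Body weight of the participant",
        "units": "kilograms",
    },
    "temp_c": {
        "description": "Body temperature at time of visit",
        "units": "degrees Celsius",
    },
    "iq_score": {
        "description": "Full-scale IQ",
        "units": None,  # IQ scores have no physical unit
        "assessed_by": "clinical psychologist",
    },
}


def describe(variable):
    """Return a one-line, human-readable summary of a variable's metadata."""
    entry = data_dictionary.get(variable)
    if entry is None:
        return f"{variable}: no metadata recorded"
    units = entry["units"] or "unitless"
    return f"{variable}: {entry['description']} ({units})"


print(describe("weight_kg"))
print(describe("height"))
```

Note that the name “weight_kg” already hints at the units, but the dictionary entry states them explicitly and leaves room for details a variable name cannot carry, such as who performed an assessment.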
Being aware of the need to capture metadata is a crucial step in reproducible science. It’s important to record metadata because it may help explain or contextualize your findings (e.g. when you notice that the time of day of a blood draw affects your lab results). It also helps others understand whether your data applies to their research interests. Finally, it helps when you have turnover in your lab: well-defined metadata lets new staff members conduct research more easily and provides research continuity.
Metadata is also critical to supporting discovery and access over time. Metadata about rights and ownership, technical requirements, and privacy provides the information necessary for the management, re-use, and preservation of the products of research.
It’s also important to realize that metadata can exist at various levels of a project. Some metadata is overarching and describes an entire project (e.g. the institution that oversaw the data collection). Some metadata catalogs and classifies the tools or products of research, like datasets, code files, and auxiliary files. Other metadata is attached to a specific data field (e.g. the make and model of the medical device that produced a certain measurement).
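One way to picture these levels is as nested records. The sketch below is only an illustration under assumed names: the classes `ProjectMetadata`, `FileMetadata`, and `FieldMetadata`, along with the sample institution, filenames, and device, are hypothetical and do not describe any particular catalog's schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FieldMetadata:
    """Metadata attached to one data field (hypothetical structure)."""
    name: str
    units: str
    device: str = ""  # e.g. make and model of the measuring device


@dataclass
class FileMetadata:
    """Metadata describing one product of research."""
    filename: str
    kind: str  # "dataset", "code", or "auxiliary"
    fields: List[FieldMetadata] = field(default_factory=list)


@dataclass
class ProjectMetadata:
    """Overarching metadata describing the whole project."""
    title: str
    institution: str
    files: List[FileMetadata] = field(default_factory=list)


def files_of_kind(project, kind):
    """Return the project's files whose `kind` matches."""
    return [f for f in project.files if f.kind == kind]


# Illustrative values only; none of these come from a real project.
project = ProjectMetadata(
    title="Example vitals study",
    institution="Example Research Institute",
    files=[
        FileMetadata(
            filename="vitals.csv",
            kind="dataset",
            fields=[
                FieldMetadata("weight_kg", "kilograms", "ExampleCo Scale X"),
                FieldMetadata("temp_c", "degrees Celsius"),
            ],
        ),
        FileMetadata(filename="clean_vitals.py", kind="code"),
    ],
)

for f in files_of_kind(project, "dataset"):
    print(f.filename, [fm.name for fm in f.fields])
```

Keeping the levels separate means project-wide facts are recorded once, while field-level details such as units or device stay next to the measurement they describe.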
Arcus is developing a data catalog that will use metadata to capture important information about research datasets collected at the Research Institute. Because our research interests and methods are so diverse, we’re taking a general approach: we ask researchers to provide high-level metadata that supports managing Arcus content and makes discovery across research projects possible, such as the topic being studied, the IRB protocol number, and the name of the PI collecting the data.