Spencer Lamm

Arcus Data Catalog

Why is Arcus interested in building a data catalog that archives research from across CHOP and makes it findable? Because our investigations have shown that opportunities for new or expanded research at CHOP are being missed: data discovery has been difficult and manual, and has depended on luck and on meeting the right people at the right time. The Arcus data catalog allows researchers to tell others at CHOP what they’re working on, invite collaboration, find interesting patterns of subject overlap, and more. We also know that researchers with overlapping interests often have opportunities to collaborate, and we want to make those opportunities more apparent.

Subject populations provide a prime example. We want to bring together disparate data about our research subjects in order to have a richer view of their clinical profiles, their research phenotypes, their genetic information, and any biosamples they might have. Instead of relying just on the data you can collect in a research lab or in an Epic chart review, you could potentially find out whether biological specimens are available for your subject, whether any genetic sequencing exists, and whether datasets from other labs or public data sources include your subject. This allows you to multiply your efforts and get a richer picture of your subject’s health, environmental exposures, socioeconomic status, and genotype.

Additionally, imagine finding researchers who are interested in your population, either because of similar research interests or because of common comorbidities. Your research data could be reused for new research, allowing you to be cited or giving you authorship opportunities you weren’t aware of previously.

One important aspect of the Arcus data catalog is that we know that you are the experts in your own data. We don’t expect you to make the data you collect conform to a pre-defined shape, with special variable names that don’t apply to your data. Participating in the data catalog is not like submitting your data to a registry and struggling to make it fit a schema that’s not quite right. The data and data dictionary that you submit are tailored to the research you conduct, and that’s what we want. We’ll help describe your data using metadata, so that we and your fellow researchers know things like the purpose of your study, its criteria, the format of your data, your role in collecting it, the dates of collection, and so on. These metadata descriptions form the Arcus data catalog, a place where all CHOP researchers can search for datasets they might find helpful or interesting in their own scholarship, as well as research products like SQL code, cohort definitions, and auxiliary files like REDCap data dictionaries and instrument lists.
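
To make the idea of a metadata description concrete, here is a minimal sketch in Python of what one catalog record and a simple keyword search could look like. Everything in it is an assumption for illustration: the `CatalogEntry` class, its field names, the `search` function, and the two sample records are hypothetical and made up, not the actual Arcus schema or search interface.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class CatalogEntry:
    """Hypothetical metadata record; fields mirror the items named in the text."""
    title: str
    purpose: str            # purpose of the study
    criteria: str           # study criteria
    data_format: str        # format of the data
    contributor_role: str   # the submitter's role in collecting the data
    collection_start: date  # dates of collection
    collection_end: date


def search(entries, keyword):
    """Return entries whose title, purpose, or criteria mention the keyword."""
    needle = keyword.lower()
    return [
        e for e in entries
        if needle in f"{e.title} {e.purpose} {e.criteria}".lower()
    ]


# Made-up sample records, for illustration only.
catalog = [
    CatalogEntry(
        title="Example asthma cohort",
        purpose="Study inhaler adherence",
        criteria="Ages 6 to 12 with an asthma diagnosis",
        data_format="CSV export with a REDCap data dictionary",
        contributor_role="Principal investigator",
        collection_start=date(2015, 1, 1),
        collection_end=date(2016, 12, 31),
    ),
    CatalogEntry(
        title="Example sleep study",
        purpose="Describe sleep patterns in adolescents",
        criteria="Ages 13 to 17",
        data_format="SQL tables",
        contributor_role="Research coordinator",
        collection_start=date(2017, 3, 1),
        collection_end=date(2018, 2, 28),
    ),
]

matches = search(catalog, "asthma")
```

The point of the sketch is that the record describes the dataset without dictating its shape: the data itself keeps whatever variables and structure the study needs, and only the description is standardized enough to be searched.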

Cataloging our data well, and making sure we can preserve it for the long term, has a number of benefits not only for collaboration but also for your own research. If you’re concerned about the long-term survival of your data, and want to ensure that it’s stored in a safe place where it won’t get lost or forgotten, this is a great opportunity for you. Whether you’re a fellow who doesn’t have the time or resources to gather your own data and is looking to take advantage of well-curated datasets, or a precision medicine innovator who wants rich phenotypes from across multiple CHOP sources in order to make new scientific discoveries, we think that the Arcus Data Catalog will be helpful to you.