CHOP already has Epic and a large, data-rich data warehouse (the CDW). Why is Arcus creating a new repository of clinical data? In this article, let’s take a look at a few key advantages the Arcus Data Repository (ADR) will offer researchers when it launches in the coming year.

De-identified Data

An initial strength of the ADR is in the privacy scope of its data. Researchers who use the ADR get access to exclusively de-identified data. Because of this, research conducted on data from the ADR is not considered human subjects research. This means that IRB oversight is not required, reducing administrative burden and speeding time to science. Retrospective studies of de-identified data can be carried out rapidly, which is a boon to researchers with limited time, lab resources, and funding.

Simplified Schema

Working with clinical data is notoriously messy. Like most databases, Clarity (the reporting database for Epic) and the CDW (CHOP’s Clinical Data Warehouse), which both hold not just Epic data but also data from related and legacy systems, are highly “normalized,” a technical way of describing how data is split across many tables to avoid needlessly repeating the same information. This means there are thousands of tables, linked by keys that can be used to reassemble the data through table joins. Normalization is important for large databases that need to accommodate large amounts of data that may stream in at high velocity. Very strict normalization, however, makes data hard for humans to understand and assemble. For example, to get from a specific patient to a medication name, you might have to match an MRN to all of its associated encounters, take those encounter IDs and look up all the associated medication orders, trace each order ID to the medication codes included in that order, and then match each medication code to a medication name in a lookup table. And keep in mind that there may be dozens of medication-related tables to sort through before you figure out which one is right for you.
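To make that chain of lookups concrete, here is a minimal sketch of the patient-to-medication example. It uses an in-memory SQLite database as a stand-in, and every table and column name (encounter, med_order, order_med, medication) is a hypothetical illustration, not the actual Clarity or CDW schema. The data is invented for the demo.

```python
import sqlite3


def build_demo_db():
    """Create a tiny, made-up normalized schema in memory (names are hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE encounter (encounter_id INTEGER PRIMARY KEY, mrn TEXT);
        CREATE TABLE med_order (order_id INTEGER PRIMARY KEY, encounter_id INTEGER);
        CREATE TABLE order_med (order_id INTEGER, med_code TEXT);
        CREATE TABLE medication (med_code TEXT PRIMARY KEY, med_name TEXT);
        INSERT INTO encounter VALUES (1, 'MRN001'), (2, 'MRN001'), (3, 'MRN002');
        INSERT INTO med_order VALUES (10, 1), (11, 2), (12, 3);
        INSERT INTO order_med VALUES (10, 'A1'), (11, 'B2'), (12, 'C3');
        INSERT INTO medication VALUES
            ('A1', 'amoxicillin'), ('B2', 'albuterol'), ('C3', 'ibuprofen');
    """)
    return conn


def medication_names_for(conn, mrn):
    """Walk MRN -> encounters -> orders -> medication codes -> medication names."""
    rows = conn.execute(
        """
        SELECT DISTINCT m.med_name
        FROM encounter e
        JOIN med_order o   ON o.encounter_id = e.encounter_id
        JOIN order_med om  ON om.order_id = o.order_id
        JOIN medication m  ON m.med_code = om.med_code
        WHERE e.mrn = ?
        ORDER BY m.med_name
        """,
        (mrn,),
    ).fetchall()
    return [row[0] for row in rows]


conn = build_demo_db()
print(medication_names_for(conn, "MRN001"))
```

Even in this toy version, one simple question takes three joins across four tables. A real normalized warehouse adds many more tables to choose from at each step.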

Currently, several groups across CHOP have years of experience with the complex interrelationships among these tables. These data analysts know how to navigate potentially conflicting or erroneous information and how to write complex, hundred-line-long SQL statements to extract just the data a researcher needs. But for simpler data requests, Arcus offers a simplified, easy-to-navigate alternative: the Arcus Data Repository. The ADR still has several keys that link related data, but some denormalization has been done to make tables easier to understand in context. There are fewer tables to look through, and fewer table joins are required to get the data you need. That means Arcus can put data at your fingertips using a web interface that will allow you to select the data you want from a reasonable number of tables (not hundreds or thousands!) with just a few clicks. No SQL experience is required: our web tool will write the SQL code for you. The selected data will be available for you to work with in a computational environment provided by Arcus, so that you can analyze, reshape, and aggregate data as needed. Then you can export aggregate data, visualizations, or tables that meet your analytic needs.

The Goldilocks Principle

We aim to make the ADR “just big enough” – with enough fields to be useful for most researchers, but without the thousands of details that are only interesting for very specific use cases. Those very specific use cases can still be handled by groups like DBHi’s Clinical Reporting Unit (CRU), while researchers with simpler requests (“I’d like to see sex, age, and length of stay for all patients admitted for RSV over the last four years”) may be able to do their work in a self-service way, using the ADR.
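As a rough illustration of what a self-service request like the RSV example might reduce to, here is a minimal sketch against a single denormalized table. It uses in-memory SQLite as a local stand-in. The table name (admission), its columns, and the data are all hypothetical and are not the actual ADR schema.

```python
import sqlite3


def build_demo_adr():
    """Create one made-up, denormalized admissions table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE admission (
            sex TEXT,
            age_years INTEGER,
            length_of_stay_days INTEGER,
            diagnosis TEXT,
            admit_year INTEGER
        );
        INSERT INTO admission VALUES
            ('F', 1, 3, 'RSV', 2021),
            ('M', 2, 5, 'RSV', 2016),
            ('M', 4, 2, 'asthma', 2022),
            ('F', 0, 4, 'RSV', 2023);
    """)
    return conn


def rsv_admissions(conn, first_year):
    """Return (sex, age, length of stay) for RSV admissions from first_year onward."""
    return conn.execute(
        "SELECT sex, age_years, length_of_stay_days "
        "FROM admission "
        "WHERE diagnosis = 'RSV' AND admit_year >= ? "
        "ORDER BY admit_year",
        (first_year,),
    ).fetchall()


conn = build_demo_adr()
for row in rsv_admissions(conn, 2020):
    print(row)
```

The point of the sketch is the shape of the query: one table, one filter, no joins. In practice the web tool described above is meant to generate the SQL for you.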

Accuracy and Speed

The ADR has been thoroughly tested and vetted, and provides data provenance information on every field, so that you can understand what system the data comes from. It incorporates standard enterprise and research definitions, taking the guesswork out of figuring out things like patient status (inpatient, outpatient, ED, and urgent care). Alongside the data repository itself, there will be an easy-to-use data navigator that describes in detail what each table contains and what every field name means.

Additionally, the ADR is cloud-supported, running on a high-speed, easy-to-use SQL service offered by Google Cloud Platform. You will get results quickly in a system that is designed to provide fast throughput for multiple users.

Not a Replacement

The Arcus Data Repository isn’t intended to replace the CDW, Clarity, or Epic. It’s an intentionally simplified set of data that exists to give many researchers access to a self-service database. In the ADR, scientists can look at the data themselves, without waiting days or weeks for expert data reporting teams to return data to them. Complex cases or local data downloads will still require the assistance of groups like the Clinical Reporting Unit.

Tell Me More!

Are you interested in being a beta tester for the Arcus Data Repository? Want to be the first on your team to work with and influence the ADR? Reach out to Marianne Chilutti. We plan to begin working with researchers using just six basic tables that contain the fields researchers most commonly need. As we gain experience and learn from our beta testers, we will expand the ADR as needed to strike the right balance: big enough to be useful, but compact enough to be understandable and intuitive.