Arcus Annotations: Harvesting Data from Text Notes

Why use Notes?

Notes fields in the electronic health record (EHR) are a gold mine of information. They are expert evaluations of health status expressed naturally rather than as checkbox answers to standardized questions. There are over 100 million notes in the CHOP system: they include everything from providers’ notes to phone calls.

The Benefits—and the Challenges

Notes express important information about progress, well-being, and the overall gestalt of a patient’s health in a way that discrete fields cannot. They can capture aspects of a patient’s well-being in more detail and nuance than CPT codes or ICD diagnoses can. Yet because they are unstructured and written in natural language, this valuable information is hard to extract. If you’re working with only a handful of notes, it’s feasible to train students or research assistants to code the notes field by hand: for instance, looking for subtle signs of depression, such as “pt seems down today, not participating in child life activities.” But once you get past a few dozen patient records, it’s obvious that automation is, if not necessary, extremely helpful.

For these reasons and others varying from technical (exactly how do you search for all the children who had headaches in 2018?) to ethical (how do you anonymize notes for use in a study when doctors have freely mentioned home locations, age, or dates of birth in them?), investigators at CHOP have not had a method for gleaning large amounts of note data for their studies.

But they soon will. Arcus is preparing to launch tools for investigators to use natural language processing techniques to garner anonymized information from millions of notes.

From Notes to Annotations at CHOP

Automating the subtle human understanding of language is difficult. That’s where natural language processing comes in. Natural language processing (commonly “NLP”) is a way to detect things like sentiment, syntax, topics, instances of specific entities (such as people or places), and much more from natural language. NLP is complex, computationally taxing, and time-consuming, but it can yield useful results such as outcome predictions, diagnoses, binning or clustering, and more. Because NLP is so useful, and because the notes fields in medical records are often untapped, a number of tools have emerged that apply NLP to medical notes to detect medical topics. These tools allow scientists who are not NLP experts to mine notes fields for medical terms that correspond to a given ontology.

Applied Ontologies

An ontology (Greek: ontos—being, logos—word, or “words about the existence of things”) is a description of what exists and of how what exists interrelates. In our field, an ontology is a hierarchy of medical terms that are related to one another and describe a health condition, symptom, or a diagnosis. You may be familiar with SNOMED or ICD, for example, which are widely used ontologies. Another ontology is the Human Phenotype Ontology, or HPO.
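
To make the idea of a hierarchy concrete, here is a minimal sketch in Python. The term IDs and labels are invented for illustration and are not real SNOMED, ICD, or HPO entries. The sketch also assumes each term has exactly one parent, which keeps the code short; real ontologies such as HPO allow a term to have several parents.

```python
# Toy ontology: each term maps to its parent (None marks the root).
# IDs and labels are made up for illustration; they are not real ontology entries.
PARENT = {
    "TOY:0001": None,        # Abnormality (root)
    "TOY:0002": "TOY:0001",  # Abnormality of the head or neck
    "TOY:0003": "TOY:0002",  # Neck symptom
    "TOY:0004": "TOY:0003",  # Stiff neck
    "TOY:0005": "TOY:0002",  # Headache
}


def ancestors(term_id: str) -> list[str]:
    """Walk parent links from a term up to the root, nearest ancestor first."""
    chain = []
    parent = PARENT[term_id]
    while parent is not None:
        chain.append(parent)
        parent = PARENT[parent]
    return chain


def is_a(term_id: str, candidate: str) -> bool:
    """True if term_id is the candidate term or one of its descendants."""
    return term_id == candidate or candidate in ancestors(term_id)
```

With a structure like this, a question such as “which patients have any head or neck abnormality?” becomes a check of `is_a(term, "TOY:0002")` for each annotated term, rather than a search for every possible wording.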

You can transform the various, sometimes verbose ways that clinicians might describe a condition, symptom, anatomical location, or procedure into a single label that belongs to a system used around the world. For example, you might have two notes like “pt is a 48m female presenting with neck stiffness and pain” and “patient complains that stiff neck continues”, and both notes could resolve to HPO term HP:0025258, “Stiff neck”. A stiff neck may be such a minor complaint that it doesn’t make it into the problem list. Nevertheless, it represents an interesting symptom that you’d like to extract from the notes field.
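
The stiff neck example can be sketched in a few lines of Python. This is only an illustration of the idea of resolving different wordings to one label. The pattern list and the `annotate` function are hypothetical and are not how cTAKES or the Arcus tools work internally. A real NLP pipeline also handles things this sketch ignores, such as negation (“denies stiff neck”), misspellings, and abbreviations.

```python
import re

# Hypothetical hand-written patterns for one term. A real system would rely on
# an NLP pipeline and the full ontology rather than a small lookup like this.
TERM_PATTERNS = {
    ("HP:0025258", "Stiff neck"): [
        r"\bneck stiffness\b",
        r"\bstiff neck\b",
    ],
}


def annotate(note: str) -> list[tuple[str, str]]:
    """Return the (term_id, label) pairs whose patterns appear in the note."""
    found = []
    for term, patterns in TERM_PATTERNS.items():
        if any(re.search(p, note, flags=re.IGNORECASE) for p in patterns):
            found.append(term)
    return found


notes = [
    "pt is a 48m female presenting with neck stiffness and pain",
    "patient complains that stiff neck continues",
]

# Both notes use different wording but produce the same annotation.
annotations = [annotate(note) for note in notes]
```

Here both entries in `annotations` contain the single pair `("HP:0025258", "Stiff neck")`, while a note that mentions neither phrase produces an empty list.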

Raw Text Data Contextualized, Anonymized, and Refined as Annotations

So that’s the basic process: We start with millions of raw text notes and use NLP tools to turn those notes into anonymized, meaningful annotations you can use in your research as you draw upon a data source that hasn’t been available on such a large scale at CHOP until now.

To understand in detail how Arcus annotations work, please see the articles about the two tools we will be using to present them to you: RDoC and cTAKES. RDoC (Research Domain Criteria) is a framework that the NIMH has developed over the past ten years to expedite research. It is a “living” framework in that it continues to change as research is updated. It has also recently started to gain traction as a framework for analyzing raw note data related to mental health. cTAKES (clinical Text Analysis and Knowledge Extraction System) is another tool that has gained a lot of attention in the medical data realm. It takes raw note data and transforms it into annotations that are easier to study than natural language and that are anonymous, which protects our patients.

Arcus Annotations will be the result of using code to map notes onto RDoC and to classify and evaluate notes according to ontologies (first the Human Phenotype Ontology, and later, others). Annotations are just the beginning. Your research can draw on notes from many medical fields, or you can add to the functionality of the tools Arcus is providing at the outset. We are writing the first chapter in a journey that will take us who knows where. The next chapter is yours to write.