Arcus Annotations and cTAKES


One tool that has emerged as an industry leader in turning notes into searchable, meaningful data is Apache’s cTAKES™. To find out more about how cTAKES™ works, see the Apache cTAKES™ Wiki. cTAKES has the distinct and important ability to detect negation (such as “Patient does not have…”) in text analysis. The RDoC parsing code we are using cannot do this: it treats any mention at all, whether negative or positive, as worthy of note.

This video is an excellent and relatively short introduction to cTAKES:

You can play with the web interface of cTAKES here. Please note that the cTAKES web interface is for demonstration purposes only and is not meant to be run on real clinical data.

Technical Details about cTAKES

If you watched the above video you are well on your way to understanding what Arcus Annotations will provide by way of rich clinical data. However, when you write for publication, you may have to provide greater detail. The following sections are for your use when you need to explain to someone exactly how cTAKES works with our note data.

cTAKES Terminology

  • Pipeline. A complete sequence of cTAKES annotators performing a comprehensive NLP task.
  • Analysis engine. A single cTAKES annotator used in a larger pipeline. A complete pipeline is sometimes referred to as an Aggregate Analysis Engine.
  • Piper file. A plaintext file describing a pipeline.
  • CAS. “Common Analysis Structure.” This is the data structure through which the annotators in a pipeline communicate. The JCAS is the specific Java implementation of this data structure that is used in cTAKES.

Many of the terms you will encounter when reading about cTAKES come from the Apache UIMA project. UIMA is a larger Apache project for building systems to process unstructured text. To learn more, see the Apache UIMA introduction.

The cTAKES Pipeline as Implemented at CHOP

A given cTAKES configuration, called a pipeline, has multiple stages (annotators) that perform specific tasks like part-of-speech tagging or detecting term negation. The steps are run in sequence, and the annotations created by one annotator can be used by an annotator downstream of it in the pipeline.
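To make the idea of sequential annotators concrete, here is a minimal sketch in Python. It is not cTAKES code (cTAKES is written in Java and built on UIMA). The function names, the dictionary standing in for the CAS, and the keyword-based negation check are all assumptions made for illustration. The real negation annotator is a trained model, not a keyword list.

```python
import re
from typing import Any, Callable, Dict, List

# Toy stand-in for the CAS: the note text plus named lists of annotations.
Cas = Dict[str, Any]


def sentence_detector(cas: Cas) -> None:
    """Split the text into sentences and store them in the shared structure."""
    parts = re.split(r"(?<=[.!?])\s+", cas["text"])
    cas["sentences"] = [p.strip() for p in parts if p.strip()]


def tokenizer(cas: Cas) -> None:
    """Tokenize each sentence produced by the upstream sentence detector."""
    cas["tokens"] = [re.findall(r"\w+", s.lower()) for s in cas["sentences"]]


def negation_flagger(cas: Cas) -> None:
    """Flag sentences containing a simple negation cue (illustrative only)."""
    cues = {"no", "not", "denies", "without"}
    cas["negated"] = [any(tok in cues for tok in toks) for toks in cas["tokens"]]


def run_pipeline(text: str, annotators: List[Callable[[Cas], None]]) -> Cas:
    """Run the annotators in order; each one reads and extends the same structure."""
    cas: Cas = {"text": text}
    for annotate in annotators:
        annotate(cas)
    return cas


cas = run_pipeline(
    "Patient does not have fever. Patient reports cough.",
    [sentence_detector, tokenizer, negation_flagger],
)
print(cas["sentences"])
print(cas["negated"])
```

Note that the order matters: the tokenizer fails if no sentences have been stored yet, in the same way that a downstream cTAKES annotator depends on the annotations written by the ones before it.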

You can choose which annotators run in a pipeline, and you can configure each annotator individually. Most notably, the named entity recognition (NER) annotator, which tags text with mentions from ontologies (e.g. SNOMED CT and ICD-10), can be configured with different ontology dictionaries.

cTAKES (version 4.0.0) provides a pipeline, called the Default Clinical Pipeline, as a suggested starting point for processing clinical text data. From the cTAKES website:

The Default Clinical Pipeline produces the most commonly desired output from cTAKES. This includes annotations for Anatomical sites, Signs/Symptoms, Procedures, Diseases/Disorders and Medications. For each annotation there are normalized UMLS CUIs, plus values for negation, uncertainty and subject.
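As a rough picture of what one such annotation carries, the sketch below models a mention as a small Python record and keeps only the mentions that are affirmed, certain, and about the patient. The field names, the subject values, and the CUI strings are hypothetical placeholders. They are not the Arcus schema and not real UMLS identifiers.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Mention:
    text: str            # the span of note text that was annotated
    semantic_group: str  # e.g. "Sign/Symptom" or "Disease/Disorder"
    cui: str             # normalized UMLS CUI (placeholder values below)
    negated: bool        # True for "Patient does not have ..."
    uncertain: bool      # True for hedged statements such as "possible ..."
    subject: str         # who the mention is about, e.g. "patient"


def affirmed_patient_mentions(mentions: List[Mention]) -> List[Mention]:
    """Keep mentions that are not negated, not uncertain, and about the patient."""
    return [
        m for m in mentions
        if not m.negated and not m.uncertain and m.subject == "patient"
    ]


mentions = [
    Mention("fever", "Sign/Symptom", "C0000001", True, False, "patient"),
    Mention("cough", "Sign/Symptom", "C0000002", False, False, "patient"),
    Mention("asthma", "Disease/Disorder", "C0000003", False, False, "family_member"),
]

kept = affirmed_patient_mentions(mentions)
print([m.text for m in kept])
```

A parser that counts every mention would report all three terms for this patient. Using the negation and subject values narrows that to the one finding the note actually asserts.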

To create a set of annotations that will be useful to the widest audience of Arcus users, we have implemented the Default Clinical Pipeline to annotate raw note data at scale at CHOP (remember those hundreds of millions of notes? That’s the scale). We chose to use the recommended Default Clinical Pipeline for three main reasons:

  1. These annotations can be used as a baseline for groups doing their own NLP research.
  2. Ease in citation: If you use these annotations in your own research, it will be straightforward to cite the specific configuration of cTAKES that was used.
  3. Reproducibility and collaboration: It’s easy for people to share their findings and explain how they arrived at them because the model is commonly understood by many people doing research using raw text data.

The pipeline contains a series of annotators (I personally cannot shake the image of annotating monkeys sitting Dr. Seuss-like at desks in a pipe, one above the other). We do not save the direct output of each annotator in Arcus, only the annotations referenced in your BigQuery schema definition. For example, we do not save part-of-speech or sentence boundary annotations. Nevertheless, these components run because their output is used by downstream annotators such as the named entity recognition annotator.

The Default Clinical Pipeline contains these components:

  1. SimpleSegmentAnnotator (the first monkey)
  2. SentenceDetector (the second monkey)
  3. TokenizerAnnotatorPTB (etc.)
  4. ContextDependentTokenizerAnnotator
  5. POSTagger
  6. Chunker
  7. DefaultJCasTermAnnotator
  8. ClearNLPDependencyParserAE
  9. ClearNLPSemanticRoleLabelerAE
  10. PolarityCleartkAnalysisEngine - This is a support vector machine model trained on data from the SHARPn project, which contains notes from a variety of specialties at the Mayo Clinic.
  11. UncertaintyCleartkAnalysisEngine - This is a machine learning model.
  12. HistoryCleartkAnalysisEngine - This is a machine learning model.
  13. ConditionalCleartkAnalysisEngine
  14. GenericCleartkAnalysisEngine
  15. SubjectCleartkAnalysisEngine

The only modification we have made to the Default Clinical Pipeline is the ontology dictionaries used by the Named Entity Recognition annotator. We will use these:

  • Human Phenotype Ontology (HPO, available in the alpha release)
  • SNOMED CT (coming soon)
  • RxNorm (coming soon)

Each of these ontologies is the version included in the 2018AB release of UMLS.

How to Access cTAKES Annotations: BigQuery

As notes are imported into Arcus, the cTAKES pipeline is run and the resulting annotations are inserted into a table in Google BigQuery. Early adopters of the Arcus platform will have access to this table via Google’s BigQuery UI and command line tools. You can see the schema in YAML form here. This will soon be available in the Arcus Metadata Browser.
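If you prefer to query from code rather than the BigQuery UI, the sketch below shows one way to do it with Google's `google-cloud-bigquery` Python client. The table name and the column names (`note_id`, `covered_text`, `cui`, `negated`) are assumptions made for illustration. Check the actual schema before adapting it.

```python
def build_annotation_query(table: str, max_rows: int = 100) -> str:
    """Build a parameterized query for affirmed mentions of a single concept.

    The column names are hypothetical; replace them with the real schema.
    """
    return (
        "SELECT note_id, covered_text, cui, negated\n"
        f"FROM `{table}`\n"
        "WHERE cui = @cui AND negated = FALSE\n"
        f"LIMIT {int(max_rows)}"
    )


def fetch_affirmed_mentions(table: str, cui: str, max_rows: int = 100):
    """Run the query and return the rows as a list of dictionaries."""
    # Imported here so the query builder works without the client library installed.
    from google.cloud import bigquery

    client = bigquery.Client()  # uses your default Google Cloud credentials
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("cui", "STRING", cui)]
    )
    sql = build_annotation_query(table, max_rows)
    rows = client.query(sql, job_config=job_config).result()
    return [dict(row.items()) for row in rows]


# Hypothetical table name, shown only to illustrate the generated SQL.
print(build_annotation_query("my-project.my_dataset.ctakes_annotations", 10))
```

Passing the CUI as a query parameter, rather than pasting it into the SQL string, avoids quoting mistakes when the value comes from user input or another data source.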

Here is an introductory video about BigQuery:

Follow this link to learn more about BigQuery, or see this list of tutorials.

Useful Technical Stuff

This list contains useful references.

Attribute Annotations

cTAKES Documentation

Fast Lookup Annotator (Named Entity Recognition Component)

Generic Annotator

Machine Learning Library

ClearTK is the machine learning library used within several of the cTAKES annotators in the default clinical pipeline.

Negation Detection

Pipeline Files

For reference, the main file that describes the pipeline can be found at DefaultFastPipeline.piper. It imports the following files:


You can’t yet run Arcus Annotations on your own text, but we will be supporting this functionality in the future. For now, we are running the pipeline only on notes as they are imported into Arcus. That saves you time when you want to access note data: the notes have already been annotated, likely while you were asleep.