Algorithm 1: a process or set of rules to be followed in calculations or other problem-solving operations.

Anaconda : a Python distribution that includes just about everything you need to begin using Python for data analysis. It’s freely available at the Anaconda distribution website.

ANOVA : Stands for ANalysis Of VAriance. ANOVA is an omnibus test used when you have more than two groups and want to see if there is a statistically significant difference in the means. For example, is there a difference in the means of LDL cholesterol in subjects that adhere to four diet types (vegan, paleo, low-carb, and Mediterranean)?
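
To make this concrete, here is a minimal sketch of a one-way ANOVA in Python using scipy; the diet groups match the example above and the LDL values are invented purely for illustration.

```python
# Minimal sketch: one-way ANOVA comparing LDL cholesterol across four diets.
# The LDL values below are invented for illustration only.
from scipy import stats

vegan = [96, 102, 110, 99, 105]
paleo = [118, 125, 121, 130, 116]
low_carb = [112, 108, 119, 115, 121]
mediterranean = [101, 97, 104, 109, 100]

# f_oneway tests the omnibus null hypothesis that all group means are equal
f_stat, p_value = stats.f_oneway(vegan, paleo, low_carb, mediterranean)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```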

Application programming interface (API) : programmatic communication between different software components, allowing information to flow between two different systems (say, REDCap and an R script) as needed, without human interaction.
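
As a rough sketch of what that programmatic communication can look like, the Python snippet below posts a request to a hypothetical REDCap-style export endpoint and gets data back as JSON; the URL and token are placeholders, not a real system.

```python
# Minimal sketch of calling an API from a script: no human interaction needed.
# The URL and token are placeholders (hypothetical), not a real endpoint.
import requests

response = requests.post(
    "https://redcap.example.org/api/",          # hypothetical endpoint
    data={"token": "YOUR_API_TOKEN",            # credentials identify the caller
          "content": "record", "format": "json"},
)
records = response.json()                       # data arrives as JSON, ready for analysis
print(len(records), "records exported")
```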

Artificial intelligence 1: the power of a machine to copy intelligent human behavior. At times this term gets conflated with machine learning.

Blocking or Block design: Controlling variability in an outcome by creating groups based on characteristics you can’t control, like sex.

Cloud computing : an off-site (“somewhere in the cloud of the internet”) shared system of resources (like servers, applications, storage) usable by many individuals and groups. Common cloud computing providers include Amazon Web Services (AWS), Google Cloud Platform (GCP), Rackspace, Microsoft Azure, etc. Cloud computing can be cost effective and robust to changes in needs (say, if your application sometimes spikes in its resource needs).

Common data element (CDE) 1: piece of data common to multiple datasets across different studies (may be universal or domain-specific).

Crowdsourcing 1: a distributed model in which individuals or groups obtain services, ideas, or content from a large, relatively open and often rapidly-evolving group of internet users.

Database/Data repository 1: a system that stores, organizes, validates, and makes accessible core data related to a particular system or systems. Generally speaking, most databases and data repositories rely on a “relational” database model (queried with some flavor of SQL), which consists of interrelated tables, each with rows and columns. There are other types of databases that do not have this kind of tabular format; while very different from each other, these non-relational database types are often grouped under the single term NoSQL.
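
To make the relational idea concrete, here is a minimal sketch using Python’s built-in sqlite3 module: one table with rows and columns, queried with SQL. The table and column names are invented for the example.

```python
# Minimal sketch of a relational table: rows and columns, queried with SQL.
# Uses an in-memory SQLite database; names are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subjects (id INTEGER PRIMARY KEY, diet TEXT, ldl REAL)")
conn.executemany(
    "INSERT INTO subjects (diet, ldl) VALUES (?, ?)",
    [("vegan", 96.0), ("paleo", 118.0), ("low-carb", 112.0)],
)
for row in conn.execute("SELECT diet, ldl FROM subjects WHERE ldl > 100"):
    print(row)
conn.close()
```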

DataCamp is a data education website with free and paid courses on R, Python, and SQL. Benefits of using DataCamp include a browser-only interface (nothing to download) and structured courses that walk you through the basics, step by step. If you’ve never used R, Python, or SQL, it might be worth trying one of their free intro classes to get you over the initial inertia of not knowing how to start!

Data commons 1: a shared virtual space in which scientists can work with the digital objects of biomedical research such as data and analytical tools.

Data ecosystem 1: a distributed, adaptive, open system with properties of self-organization, scalability, and sustainability inspired by natural ecosystems.

Data integrity 1: the accuracy and consistency of data stored in a database/data repository, data warehouse, data mart or other construct.

Data science 1: interdisciplinary field of inquiry in which quantitative and analytical approaches, processes, and systems are developed and used to extract knowledge and insights from increasingly large and/or complex sets of data.

Dataset 1: collection of related sets of information composed of separate elements that can be manipulated computationally as a unit.

Data visualization 1: effort to help people understand the significance of data by placing it in a visual context. Patterns, trends and correlations that might go undetected can be exposed and recognized more easily with data-visualization techniques.

Deep learning 1: type of machine learning in which each successive layer uses output from the previous layer as input; similar to communication patterns in a biological nervous system.

Domain-specific 1: for biomedical data, designed and intended for use in studies of a particular topic, disease or condition, or body system (compare to universal).

Effect Size: a standardized measure of how large the difference detected by a statistical test is; it’s the size of the difference discovered relative to the dispersion of the samples. To make this more concrete, imagine discovering a statistically significant difference in sleep amounts between insomniac patients taking a placebo and insomniac patients taking a new sleep medication. You could have a statistically significant difference at p = .03 where the difference in means is 12 minutes, or a statistically significant difference at p = .03 where the difference in means is 2 hours. Clearly, one finding is more interesting. Effect size allows us to go beyond statistical significance and get some idea of how important the discovered difference is. It is important to realize that it is easier to detect a large effect size than a small one, and that knowing the effect size you’re interested in will affect your power and sample size.
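
One common standardized effect size for comparing two means is Cohen’s d (the difference in means divided by a pooled standard deviation). A minimal Python sketch is below; the sleep values are invented for illustration.

```python
# Minimal sketch of Cohen's d: difference in means over a pooled standard deviation.
# The sleep durations (in minutes) are invented for illustration only.
import math
import statistics

def cohens_d(a, b):
    """Standardized difference between the means of two samples."""
    n1, n2 = len(a), len(b)
    s1, s2 = statistics.stdev(a), statistics.stdev(b)
    pooled_sd = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (statistics.mean(b) - statistics.mean(a)) / pooled_sd

placebo = [360, 345, 372, 358, 366]
medication = [412, 398, 430, 405, 420]
print(f"Cohen's d = {cohens_d(placebo, medication):.2f}")
```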

Elastic describes a service that can get bigger as needed. This is an advantage of cloud computing platforms, which can be set up to use more servers (or other resources) as needed. For example, a website might be set up to add servers when traffic exceeds 70% of maximum, so that no one gets an “unable to connect” error when going to the website.

Electronic medical record (EMR) or Electronic health record (EHR) 1: digital version of a patient’s paper chart. EHRs are real-time, patient-centered records that make information available instantly and securely to authorized users. At CHOP, our EHR system is Epic.

Enterprise system 1: computer hardware and software used to satisfy the needs of an organization rather than individual users. “Enterprise-grade” usually signifies high capacity, customer support, and robust applications that support business functions that cannot fail.

Exascale computing 1: a computer system capable of performing one quintillion (10^18) calculations per second.

Excel is spreadsheet software that many people use for research. This is a bad idea for multiple reasons: it’s not reproducible (no syntax file or scripted play-back available of the manipulations you do, unless you use VBA), it makes assumptions about data that are often wrong (for example, genomics research is rife with gene name errors caused by Excel), and it’s not rigorous with data (for example, you can accidentally sort just one column instead of a whole sheet and end up with erroneous data). If you typically use Excel to work with data, consider learning R.
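
For contrast, a scripted workflow does the same kind of cleaning as replayable code. Below is a minimal Python/pandas sketch (R would work equally well); the file name and column names are hypothetical.

```python
# Minimal sketch of scripted, replayable data cleaning (instead of hand-editing a spreadsheet).
# File name and column names are hypothetical.
import pandas as pd

df = pd.read_excel("study_data.xlsx")              # the original file is never modified
df = df[df["ldl"].notna()]                         # exclusions are explicit and documented
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2  # derived variables live in code, not cell formulas
df.to_csv("study_data_clean.csv", index=False)     # cleaned copy written alongside the original
```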

Extramural 1: research or other activities supported by NIH and conducted by external organizations, and funded by grants, contracts, or cooperative agreements from NIH.

Genotype 1: the genetic make-up of an individual organism.

git is a version control program that allows you to track changes to your code (or other text-based files), understand who changed a file and why, manage multiple canonical versions (say, if you have one version for use with PHI and one with de-identified data), and go back to older versions as required.

GitHub is a website that uses git and provides a nice user interface for working with it. GitHub also offers a graphical software package called GitHub Desktop, which lets you work with git without having to know how to use the command line. It’s great for git beginners.

Hardening 1: process of optimizing a tool or algorithm to industry standards to ensure efficiency, ease of use, security, and utility.

Hardware 1: collection of physical parts of a computer system.

Indexing 1: methods that allow data to be found and retrieved.

Interoperability 1: in computer systems, the ability to exchange and make use of information from various sources and of different types.

Intramural 1: research or other activities conducted by, or in support of, NIH employees on its Bethesda, Maryland, campus or at one of the other NIH satellite campuses across the country.

Jupyter Notebook (formerly IPython Notebook): This is a literate statistical programming approach for Python. A Jupyter Notebook is a “REPL” (read-eval-print loop) which allows you to work interactively with Python, executing code bit by bit, changing it as you go, and adding markdown. This means you don’t have to write and execute a whole program to check that your code works. This is great for exploratory work as well as for showing output to other users of your code.

Knowledgebase 1: virtual resource that accumulates, organizes, and links growing bodies of information related to core datasets.

Machine Learning is a broad category of heavily computational analysis and prediction methods that create models of data. These models depend on finding patterns that organize data (e.g. unsupervised learning like cluster analysis and feature detection and extraction) or that predict outcomes (e.g. supervised learning like classification or regression predictions). Machine learning can be relatively simple and very reminiscent of traditional mathematical approaches like linear regression, or quite complex, as when multiple models are combined (“ensembled”) to improve predictive performance. Machine learning can result in surprisingly accurate predictions, but it can also suffer from significant problems, like models that are not easily actionable, or models that are too narrowly fitted to a small set of data (overfitting).
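
As a small sketch of the supervised case, the example below trains a classifier with scikit-learn on the library’s bundled iris dataset (not clinical data) and checks its accuracy on held-out examples.

```python
# Minimal sketch of supervised learning: fit a classifier, then test it on held-out data.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)                      # learn patterns from labeled examples
print("held-out accuracy:", model.score(X_test, y_test))
```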

Metadata 1: data that describe other data. Examples include title, abstract, author, and keywords (publications); organization and relationships of digital materials; and file types or modification dates.

Multiple Factor or Multifactorial: Describes an experiment or observation in which an outcome (or dependent) variable is measured in relationship to more than one explanatory (or independent) variable. For example, the study of adult BMI as a function of birth weight and Adverse Childhood Events (ACE) score.

NIH IC 1: NIH Institute or Center

*omics 1: collective characterization and measurement of pools of biological molecules that translate into the structure, function, and dynamics of an organism or organisms. Examples include genomics, proteomics, metabolomics, and others.

Petascale computing 1: a computer system capable of performing one quadrillion (10^15) calculations per second. It is currently used in weather and climate simulation, nuclear simulations, cosmology, quantum chemistry, lower-level organism brain simulation, and fusion science.

Phenotype 1: the set of observable characteristics of an individual resulting from the interaction of its genotype with the environment.

Platform 1: group of technologies (software and hardware) upon which other applications, processes, or technologies are developed.

Platform as a Service (PaaS) 1: a type of cloud computing that allows users to develop, run, and manage applications without the complexity of building and maintaining the underlying infrastructure.

Power: The probability that a statistical test will correctly reject the null hypothesis when the alternative hypothesis is true. Power is the ability to avoid false negatives, and is closely related to effect size and sample size. Often we choose a power of .8, or 80%, which means that we expect our test to detect true positives (where the alternative hypothesis is true) 80% of the time.
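
As a sketch, the power of a two-sample t test can be computed with the statsmodels package; the effect size here is in Cohen’s d units and the numbers are illustrative.

```python
# Minimal sketch: power of a two-sample t test for a given effect size and group size.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
power = analysis.power(effect_size=0.5, nobs1=64, alpha=0.05)
print(f"power = {power:.2f}")   # about 0.80 with 64 subjects per group and a medium effect
```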

Provenance 1: timeline of ownership, location, and modification.

Python is a programming language that has multiple uses. It can build desktop applications and websites, perform image manipulation and analysis, and run advanced statistical and data-centric workflows. Unlike R, Python is multi-purpose. Like R, it’s a true programming language which supports script-based reproducible research. If you have experience programming in other languages like Java or C++, you might like Python more than R, especially if you are doing natural language processing or web scraping, or need to do tasks beyond data analysis, such as building a website or creating an application. On the other hand, if you’ve mostly used Excel or a commercial product like SPSS to do your research and don’t have experience programming, R makes a better starting point.

R is a statistical programming language which is open source – it’s free and anyone can contribute to its development, or create packages that extend its usefulness. It is different from Python because it’s not a multi-purpose language used for all sorts of things – it’s built for statistical analysis. R is also different from commercial statistical software like SAS and SPSS in that it is not a point-and-click solution. R is widely used in academia because it can be tailored to very specialized analyses, supports true reproducible research, is free, and gets innovative analytics support earlier than commercial software. R is best used within the RStudio development environment (also free!).

Reproducibility: There are periodic kerfuffles about whether the right term is reproducibility or replicability, but here we mean an approach to research that makes it easy to re-run an analysis on a new (or updated) dataset, with a minimum of problems. An example of research that’s not reproducible would be making a lot of manual choices in Excel, like cleaning data, removing outliers, and using formulas to create new variables and analyses. This is very difficult to reproduce (as you know, if you’ve ever tried to replicate a finding based on a methods section!). An example of reproducible research would be a sample file that has publicly-sharable data, or fabricated example data, and an R script that takes a file with that format, conducts analyses, and creates statistical output like p values. That would be fairly easy for a different researcher to pick up and work with.
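
A minimal sketch of such a script appears below, written in Python rather than the R mentioned above (the idea is identical); the file and column names are hypothetical. Anyone with a file in the same format could rerun the entire analysis unchanged.

```python
# Minimal sketch of a reproducible analysis script: read a shared-format file,
# run the analysis, print the results. File and column names are hypothetical.
import pandas as pd
from scipy import stats

df = pd.read_csv("deidentified_sleep_data.csv")
placebo = df.loc[df["group"] == "placebo", "sleep_minutes"]
treated = df.loc[df["group"] == "medication", "sleep_minutes"]

t_stat, p_value = stats.ttest_ind(treated, placebo)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```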

Retirement (of data) 1: the practice of shutting down redundant or obsolete business applications while retaining access to the historical data.

Sample Size: the number of observed entities (in our case, often animal or human subjects) used in a research project. The larger the sample size, the more precise the estimates and the more power the statistical tests have. When research aims to find a small effect size, a large sample size is required to maintain the same power.
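
As a sketch of that trade-off, statsmodels can solve for the per-group sample size needed to reach 80% power in a two-sample t test; the effect sizes are in Cohen’s d units.

```python
# Minimal sketch: per-group n needed for 80% power at alpha = .05 (two-sample t test).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_large = analysis.solve_power(effect_size=0.8, power=0.8, alpha=0.05)
n_small = analysis.solve_power(effect_size=0.2, power=0.8, alpha=0.05)
print(f"per-group n for a large effect (d = 0.8): {n_large:.0f}")   # roughly 26
print(f"per-group n for a small effect (d = 0.2): {n_small:.0f}")   # roughly 394
```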

Single Factor : Describes an experiment or observation in which the outcome (or dependent) variable is measured in relationship to only one explanatory (or independent) variable. An example would be analyzing adult BMI as a function of birth weight.

Software 1: programs and other operating information used by a computer.

Software as a Service (SaaS) 1: software licensing and delivery model in which software is licensed on a subscription basis and is centrally hosted.

System integrator/system engineer 1: individual who refines and hardens tools from academia to improve user design, authentication and testing, and optimize productivity, efficiency, and outcomes/performance.

Unique identifiers 1: an alphanumeric string (such as 1a2b3c) used to uniquely identify an object or entity on the internet.
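
As a small sketch, Python’s standard uuid module generates identifiers of this kind; the values are random, so collisions are vanishingly unlikely.

```python
# Minimal sketch: generate a unique identifier with Python's built-in uuid module.
import uuid

record_id = uuid.uuid4()        # random 128-bit identifier
print(record_id)                # a new value each run
```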

Universal 1: for biomedical data, usable in research studies regardless of the specific disease or condition of interest (compare to domain-specific).

Version Control: Version control is a system that helps track various versions of files (usually code). To grasp the concept, think about “track changes” in Microsoft Word or editing a document in Google Docs, where you can see who made what changes and why, and go back to any version you need to. The most popular version control system is git, which is a distributed version control system. Many people use git within GitHub, which offers not only version control, but also file hosting and a simple user interface that may be easier for beginners to use than command-line git.

Wearables 1: devices that can be worn by a consumer that collect data to track health.

Workflow 1: defined series of tasks for processing data.

  1. Definition taken from or adapted from NIH Strategic Plan for Data Science.