Data Methods

To ensure high-quality data, the All of Us Research Hub employs a comprehensive data methodology to curate data for registered researchers.

Data Curation

[Figure: Data Sources → Data Harmonization → Data Refinements → Curated Data Repository + Data Dictionary]

Data Sources

The All of Us Research Program collects data from a wide variety of sources, including surveys, electronic health records (EHRs), biosamples, physical measurements, and wearables like Fitbit.

EHR Data Harmonization

The All of Us Research Program uses the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM) to standardize EHR data for all researchers.

Data Refinements

After harmonizing the EHR data to meet the specifications of the OMOP CDM, we process the data to ensure participant privacy is protected. We also take steps to conform and clean the data to deliver high-quality data.

Curated Datasets

All of Us Research Program data in their final format, after harmonization and refinement, are referred to as a curated dataset. Three different levels of information are available:

The Public Tier dataset displays high-level summaries of the data available for research. Through the Data Browser, one can explore anonymized, aggregated participant data and summary statistics.

The Registered Tier dataset may be accessed by registered researchers in the Researcher Workbench. Registered Tier data include individual-level data from surveys, physical measurements, longitudinal EHRs, and wearables like Fitbit. This individual-level data must be analyzed within the secure Researcher Workbench.

The Controlled Tier dataset may be accessed in the Researcher Workbench by registered researchers who complete all necessary requirements. In addition to all of the data in the Registered Tier, the Controlled Tier data include genomic data and expanded demographic, EHR, and survey data. Genomic data include short-read whole genome sequences (WGS), long-read WGS, structural variants, and genotyping arrays.

Visit the Data Access Tiers page to learn more about our tiered-data access model.

Data Dictionary

The All of Us Data Dictionary documents what data are available from participants and what modifications the program makes to protect participant privacy. It provides a description for each data field, noting whether it is a standard OMOP field or a custom field created to help capture data unique to the program. The Data Dictionary also provides information on whether the data in each field come from participant health records or from information the participants provide themselves, like survey data. The Data Dictionary details some ways we clean the data to improve data quality, as well as many of the program custom concept IDs for easy reference. This resource includes versioning data so you can see what has been changed, added, or removed since the previous curated dataset.

Explore the Registered Tier and Controlled Tier Data Dictionaries.

GENOMIC DATA CURATION

Individual-level genomic data from short-read whole genome sequencing (srWGS), long-read whole genome sequencing (lrWGS), and genome-wide genotyping arrays are available within the Researcher Workbench’s Controlled Tier.

DATA SOURCE

Most All of Us participants contribute biosamples such as blood and/or saliva. DNA from these samples is extracted and sent to genome centers for genomic analysis, including whole genome sequencing (WGS) and genome-wide genotyping.

FILE TYPES & FORMAT

The All of Us genomic dataset contains the following variant call files, raw genomic data, and annotated genomic data:

  • Arrays: Variant Call Format (VCF), Hail MatrixTable (MT), PLINK 1.9, IDAT
  • srWGS (whole genome) single nucleotide polymorphisms (SNPs) and insertions/deletions (indels): Hail 0.2 Variant Dataset (VDS)
  • srWGS compressed sequence alignment (CRAM)
  • srWGS SNP & Indel (exome only): VCF, PLINK, Hail MT, and BGEN
  • srWGS SNP & Indel (common variants): VCF, PLINK, Hail MT, and BGEN
  • srWGS SNP & Indel (clinically relevant variants): VCF, PLINK, Hail MT, and BGEN
  • srWGS SNP & Indel (annotated variants): Variant Annotation Table
  • srWGS genomic metrics, srWGS genetic ancestry, srWGS admixture estimation, and srWGS pharmacogenomics haplotype calls and predicted phenotypes: TSV files
  • srWGS structural variants (SV): VCF
  • lrWGS SNP & Indel and SV: VCF and Genomic VCF (GVCF)
  • lrWGS SNP & Indel: Hail MT
  • lrWGS binary alignment map (BAM), Graphical Fragment Assembly (GFA), and FASTA
  • lrWGS sample metrics: TSV

Researchers can access auxiliary information, such as computed ancestry and quality reports, in the User Support Hub.

GENOMIC DATA QUALITY CONTROL

The All of Us Research Program performs stringent quality control (QC) procedures to ensure that we provide researchers with high-quality genomic data. Our QC methods confirm sample quality and genetic variants within DNA sequences. Short-read and long-read WGS samples are joint called, which combines evidence from multiple samples to filter out systematic biases. Samples that do not pass QC thresholds are not released.

  • Array QC

    All array samples undergo QC processes to determine any issues with sample swapping, contamination, or preparation. All genome centers follow identical pipelines to generate array VCFs. After sequencing, QC processes include sex concordance, call rate, and cross-individual contamination rate.

  • srWGS QC

    srWGS QC is performed using the same protocol and software at each genome center. Each sample is checked individually to determine if a swap, contamination, or preparation issue has occurred. We verify genotype fingerprints, identify appropriate sex chromosomes, and check the sequencing coverage. We also run QC processes on the SNP/indel joint callset and SV joint callset. With a joint callset, our analysis uses information across all samples to help us detect noisy samples, remove artifacts, and ensure sequencing quality meets genomic data standards.

  • srWGS SV QC

    Additional QC is performed on the srWGS dataset to support SV calling. Each sample is checked individually for contamination, and sequencing metrics are evaluated to check for outliers. We also run QC processes on the SV joint callset to refine the variant calls and remove variants that are not supported by high-quality sequencing evidence.

  • lrWGS QC

    lrWGS individual samples are checked with genotype fingerprinting and sequencing metrics to determine if there are any data issues. We perform QC checks for individual samples to determine if the data matches what we expect, including checking the sex chromosomes and any sample contamination. The joint long-read callset uses information across samples to identify variants.
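As a rough illustration of one of the QC metrics named above, call rate is commonly defined as the fraction of assayed sites at which a genotype was successfully called. The sketch below assumes that definition. The function names, the use of None to mark a no-call, and the 0.98 threshold are hypothetical placeholders and are not the program's actual pipeline or thresholds.

```python
def call_rate(genotypes):
    """Return the fraction of sites with a called genotype.

    `genotypes` is a list with one entry per assayed site; None marks a
    no-call (an assumption made for this sketch).
    """
    if not genotypes:
        raise ValueError("no genotypes supplied")
    called = sum(1 for g in genotypes if g is not None)
    return called / len(genotypes)


def passes_call_rate(genotypes, threshold=0.98):
    """Check a sample against a placeholder threshold (hypothetical value)."""
    return call_rate(genotypes) >= threshold


if __name__ == "__main__":
    sample = ["0/1", "1/1", None, "0/0"]  # toy data: three calls, one no-call
    print(call_rate(sample), passes_call_rate(sample))
```

A sample with a low call rate would typically be flagged for review; the actual cutoffs used by the genome centers are documented in the program's genomic quality reports, not here.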

DOWNSTREAM VALIDITY

All of Us runs analyses to demonstrate specific capabilities of the data and to communicate caveats in the data to researchers. We do this through Genome-Wide Association Studies (GWAS). GWAS help us validate the All of Us genomic data by replicating previously published results.

Observational Medical Outcomes Partnership (OMOP) Common Data Model

The Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) is maintained by an international collaborative called the Observational Health Data Sciences and Informatics (OHDSI) program. The All of Us Data and Research Center leverages the OMOP CDM to empower researchers by using existing, standardized vocabularies and a harmonized data representation. These factors enable connection to other ontologies, datasets, and tools that use the same codes or data model. Learn more about OHDSI’s OMOP CDM initiative.

As a researcher, here’s what you should know about OMOP:

OMOP is a relational database. A relational database is a set of formally described tables with defined relationships allowing data to be accessed in many different ways. For researchers, it may be helpful to get familiar with the curated dataset’s OMOP Tables.

OMOP is standardized. Standard vocabularies mean that, despite differences in how each data element may be captured (e.g., variation among the many electronic health records), all of the data are represented consistently in the data model. For each broad category of data, or domain, OMOP incorporates important existing vocabularies so that everyone using the data can speak the same language.

OMOP is where metadata rules. The use of these vocabularies and concept IDs allows flexibility in extracting data. Instead of just the source data, which are often highly specific to individual institutions, OMOP provides concept IDs. This ensures that the data are represented in a standardized way, are common across many institutions, and are easily retrievable using standardized search methodologies. The vocabulary tables are available to provide the names and relationships among these different representations.
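To make the role of concept IDs concrete, here is a minimal sketch that resolves a standard concept ID to its name by joining a clinical table to the vocabulary's concept table. It uses an in-memory SQLite database with toy rows rather than the Researcher Workbench; the table and column names follow the OMOP CDM, while the person IDs and source codes are made up for illustration.

```python
import sqlite3


def build_toy_omop():
    """Create an in-memory database with two tiny OMOP-style tables (toy data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE concept (
            concept_id    INTEGER PRIMARY KEY,
            concept_name  TEXT,
            vocabulary_id TEXT
        );
        CREATE TABLE condition_occurrence (
            condition_occurrence_id INTEGER PRIMARY KEY,
            person_id               INTEGER,
            condition_concept_id    INTEGER,
            condition_source_value  TEXT
        );
        -- Illustrative rows only; not real participant data.
        INSERT INTO concept VALUES (201826, 'Type 2 diabetes mellitus', 'SNOMED');
        INSERT INTO condition_occurrence VALUES
            (1, 1001, 201826, 'E11.9'),
            (2, 1002, 201826, '250.00');
        """
    )
    return conn


def conditions_with_names(conn):
    """Join each condition record to the name of its standard concept."""
    query = """
        SELECT co.person_id, co.condition_source_value, c.concept_name
        FROM condition_occurrence AS co
        JOIN concept AS c ON c.concept_id = co.condition_concept_id
        ORDER BY co.person_id
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for row in conditions_with_names(build_toy_omop()):
        print(row)
```

In this toy example, two records with different institution-specific source codes share one standard concept ID, so a single query on that ID retrieves both.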

Resources can help. Don’t know what the standardized vocabulary is for your search term? Check out Athena, a platform that maps OMOP standardized vocabularies to other nonstandard vocabularies. Want to take a deep dive into OMOP? Discover more on the GitHub Wiki.

Which OMOP Tables Does All of Us Use?

The All of Us dataset includes EHR data found in the following OMOP tables:

Person: Contains basic demographic information describing a person including sex assigned at birth, birth date, race, and ethnicity. Although it is common to get this information from EHR data, the All of Us Program uses data provided directly by participants through surveys when this information is available.
Visit_occurrence: Visits capture encounters with health care providers or similar events. Contains the type of visit a person has (outpatient care, inpatient confinement, emergency room, or long-term care), as well as date and duration information. Rows in other tables can reference this table, e.g., condition occurrences related to a specific visit.
Condition_occurrence: Conditions are records of a person indicating the presence of a disease or medical condition stated as a diagnosis, a sign, or a symptom, which is either observed by a provider or reported by the patient.
Drug_exposure: Captures records about the utilization of a medication. Drug exposures include prescription and over-the-counter medicines, vaccines, and large-molecule biologic therapies. Radiological devices ingested or applied locally do not count as drugs. Drug exposure is inferred from clinical events associated with orders, prescriptions written, pharmacy dispensings, procedural administrations, and other patient-reported information.
Measurement: Contains both orders and results of a systematic and standardized examination or testing of a person or person’s sample, including laboratory tests, vital signs, and quantitative findings from pathology reports. Physical measurements collected by All of Us are also stored in this table.
Procedure_occurrence: Contains records of activities or processes ordered or carried out by a health care provider for a diagnostic or therapeutic purpose.
Observation: Captures clinical facts about a person obtained in the context of examination, questioning, or a procedure. Any data that cannot be represented by any other domains, such as social and lifestyle facts, medical history, family history, etc. are recorded here. Survey information is also located in this table.
Device_exposure: Captures information about a person’s exposure to a foreign physical object or instrument used for diagnostic or therapeutic purposes. Devices include implantable objects (e.g., pacemakers, stents, artificial joints), blood transfusions, medical equipment and supplies (e.g., bandages, crutches, syringes), other instruments used in medical procedures (e.g., sutures, defibrillators), and material used in clinical care (e.g., adhesives, body material, dental material, surgical material).
Death: Contains the clinical events surrounding how and when a person dies.
Fact_relationship: Contains records about the relationships between facts stored as records in any table of the CDM. Relationships can be defined between facts from the same domain or different domains. Examples of fact relationships include person relationships (parent–child), care site relationships (hierarchical organizational structure of facilities within a health system), etc.
Specimen: Contains the records identifying biological samples from a person.
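The Visit_occurrence description above notes that rows in other tables can reference a visit. The following sketch shows one way that link might be used: counting the condition records attached to each visit and deriving the visit duration from its start and end dates. It runs against an in-memory SQLite database with invented rows; the column names follow the OMOP CDM, and only a subset of each table's columns is modeled.

```python
import sqlite3


def build_toy_visits():
    """Create toy visit_occurrence and condition_occurrence tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE visit_occurrence (
            visit_occurrence_id INTEGER PRIMARY KEY,
            person_id           INTEGER,
            visit_start_date    TEXT,
            visit_end_date      TEXT
        );
        CREATE TABLE condition_occurrence (
            condition_occurrence_id INTEGER PRIMARY KEY,
            person_id               INTEGER,
            condition_source_value  TEXT,
            visit_occurrence_id     INTEGER
                REFERENCES visit_occurrence (visit_occurrence_id)
        );
        -- Illustrative rows only; not real participant data.
        INSERT INTO visit_occurrence VALUES
            (10, 1001, '2021-03-01', '2021-03-04'),
            (11, 1001, '2021-06-15', '2021-06-15');
        INSERT INTO condition_occurrence VALUES
            (1, 1001, 'pneumonia', 10),
            (2, 1001, 'cough', 10),
            (3, 1001, 'sprained ankle', 11);
        """
    )
    return conn


def conditions_by_visit(conn):
    """Return (visit id, duration in days, number of linked conditions)."""
    query = """
        SELECT v.visit_occurrence_id,
               CAST(julianday(v.visit_end_date)
                    - julianday(v.visit_start_date) AS INTEGER) AS duration_days,
               COUNT(co.condition_occurrence_id) AS n_conditions
        FROM visit_occurrence AS v
        LEFT JOIN condition_occurrence AS co
               ON co.visit_occurrence_id = v.visit_occurrence_id
        GROUP BY v.visit_occurrence_id
        ORDER BY v.visit_occurrence_id
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for row in conditions_by_visit(build_toy_visits()):
        print(row)
```

The LEFT JOIN keeps visits that have no linked condition records, which would otherwise drop out of the result.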

Wearables Data Model

Mobile Health

As an important first step in integrating wearables for data collection, All of Us participants with any Fitbit device who wish to share Fitbit data with the program may do so. Participants may choose what type of data to share and may stop sharing at any time.

A Fitbit is an activity tracker, usually worn on the wrist, which can track the distance you walk, run, swim, or cycle, as well as the number of calories you burn and take in. Some also monitor your heart rate and sleep quality.

In the All of Us Research Program, this type of participant data is known as “wearables” data. Fitbit data are the first of this data type to be included in the Researcher Workbench.

Fitbit data are available in a series of tables within the Researcher Workbench. Please note that these tables can be linked to the data from OMOP-formatted tables by person_id.
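As a small sketch of linking by person_id, the example below attaches each participant's mean daily step count to a row from the OMOP Person table. The rows are invented, plain Python dictionaries stand in for query results, and the variable and function names are hypothetical; only person_id, date, steps, and year_of_birth correspond to real field names.

```python
from collections import defaultdict

# Toy stand-ins for rows from the Fitbit daily activity table and the
# OMOP Person table (invented values, not participant data).
activity_summary = [
    {"person_id": 1001, "date": "2022-01-01", "steps": 8200},
    {"person_id": 1001, "date": "2022-01-02", "steps": 10400},
    {"person_id": 1002, "date": "2022-01-01", "steps": 3100},
]
person = [
    {"person_id": 1001, "year_of_birth": 1980},
    {"person_id": 1002, "year_of_birth": 1955},
    {"person_id": 1003, "year_of_birth": 1990},  # has no Fitbit data
]


def mean_daily_steps_by_person(activity_rows):
    """Average the daily step counts for each person_id."""
    steps = defaultdict(list)
    for row in activity_rows:
        steps[row["person_id"]].append(row["steps"])
    return {pid: sum(values) / len(values) for pid, values in steps.items()}


def link_steps_to_person(person_rows, activity_rows):
    """Inner-join person rows to their mean daily steps on person_id."""
    means = mean_daily_steps_by_person(activity_rows)
    return [
        {**p, "mean_daily_steps": means[p["person_id"]]}
        for p in person_rows
        if p["person_id"] in means
    ]


if __name__ == "__main__":
    for row in link_steps_to_person(person, activity_summary):
        print(row)
```

Because not every participant shares Fitbit data, an inner join like this one silently drops people without wearable records; whether that is appropriate depends on the analysis.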

Below are all the currently available tables with the data format for each field along with some notes to consider when using Fitbit data.

Heart Rate (By Zone Summary)

person_id integer
datetime datetime
zone_name string
min_heart_rate integer
max_heart_rate integer
minutes_in_zone integer
calories_out float

Heart Rate (Minute-Level)

person_id integer
datetime datetime
heart_rate_value integer

Activity (Daily Summary)

person_id integer
date date
activity_calories float
calories_bmr float
calories_out float
elevation float
fairly_active_minutes float
floors integer
lightly_active_minutes float
marginal_calories float
sedentary_minutes float
steps integer
very_active_minutes float

Activity Intraday Steps (Minute-Level)

person_id integer
datetime datetime
steps numeric
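The minute-level tables can be rolled up to coarser intervals when needed. The sketch below sums minute-level steps into daily totals per participant, using the three fields listed above (person_id, datetime, steps). The rows are invented, the timestamp string format is an assumption, and the function name is hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def daily_step_totals(minute_rows):
    """Sum minute-level steps into one total per (person_id, date).

    Each row is a (person_id, datetime string, steps) tuple; the
    'YYYY-MM-DD HH:MM:SS' format is assumed for this sketch.
    """
    totals = defaultdict(int)
    for person_id, timestamp, steps in minute_rows:
        day = datetime.fromisoformat(timestamp).date().isoformat()
        totals[(person_id, day)] += steps
    return dict(totals)


if __name__ == "__main__":
    rows = [  # toy data, not participant records
        (1001, "2022-01-01 08:00:00", 20),
        (1001, "2022-01-01 08:01:00", 35),
        (1001, "2022-01-02 09:30:00", 12),
    ]
    print(daily_step_totals(rows))
```

Totals derived this way can be compared against the steps field in the daily summary table as a consistency check, though the two may differ if minute-level records are incomplete.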