To ensure the Research Hub collects the highest quality data possible, the All of Us Research Program employs a comprehensive data methodology to curate data for registered researchers.
The All of Us Research Program collects data from a wide variety of sources, including surveys, electronic health records (EHRs), biosamples, physical measurements, and wearables like Fitbit.
EHR Data Harmonization
The All of Us Research Program uses the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM) to standardize EHR data for all researchers.
After harmonizing the EHR data to meet the specifications of the OMOP CDM, we process the data to protect participant privacy. We also conform and clean the data to improve their quality.
All of Us Research Program data in its final format, after harmonization and refinement, are referred to as a curated dataset. Three different levels of information are available:
The Public Tier dataset displays high-level summaries of the data available for research. Through the Data Browser, one can explore anonymized, aggregated participant data and summary statistics.
The Registered Tier curated dataset may be accessed by registered researchers in the Researcher Workbench. Registered Tier data include individual-level data from surveys, physical measurements taken at the time of participant enrollment, longitudinal EHRs, and wearables like Fitbit. These individual-level data must be analyzed within the secure Researcher Workbench.
The Controlled Tier curated dataset may be accessed in the Researcher Workbench by registered researchers who complete all necessary requirements. In addition to all of the data in the Registered Tier, the Controlled Tier data include genomic data and more detailed demographic, EHR, and survey data than the Registered Tier.
Visit the Data Access Tiers page to learn more about our tiered-data access model.
The All of Us Data Dictionary documents what data are available from participants and what modifications the program makes to protect participant privacy. It provides a description for each data field, noting whether it is a standard OMOP field or a custom field created to help capture data unique to the program. The Data Dictionary also provides information on whether the data in each field come from participant health records or from information the participants provide themselves, like survey data. The Data Dictionary details some ways we clean the data to improve data quality, as well as many of the program custom concept IDs for easy reference. This resource includes versioning data so you can see what has been changed, added, or removed since the previous curated dataset.
Genomic Data Curation
Individual-level genomic data from whole genome sequencing (WGS) and genome-wide genotyping arrays are available within the Researcher Workbench.
Registered researchers with Controlled Tier access can use WGS and genotyping array variant data in multiple formats, such as VCF, PLINK, and Hail MatrixTable, as well as variant annotations (e.g., their genetic and clinical significance).
Controlled Tier researchers also have access to auxiliary information, such as computed ancestry and quality reports. Quality control (QC) methods confirm genetic variants within DNA sequences. We use a method known as joint calling, which improves the QC of these data by combining evidence from multiple samples to filter out systematic biases.
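For readers new to the VCF format mentioned above, the following minimal sketch shows how one VCF data line is laid out and how an alternate allele frequency can be derived from its genotype calls. The record is hand-written for illustration and is not All of Us data; the helper names `parse_vcf_record` and `alt_allele_frequency` are hypothetical, and real analyses would normally rely on tools such as Hail or PLINK rather than hand parsing.

```python
def parse_vcf_record(line):
    """Split one tab-delimited VCF data line into fixed fields and per-sample genotypes."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt = fields[:5]
    # Column 9 (index 8) names the per-sample keys; GT holds the genotype call.
    format_keys = fields[8].split(":")
    gt_index = format_keys.index("GT")
    genotypes = [sample.split(":")[gt_index] for sample in fields[9:]]
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": var_id,
        "ref": ref,
        "alt": alt.split(","),
        "genotypes": genotypes,
    }


def alt_allele_frequency(genotypes):
    """Fraction of called alleles that are non-reference; missing calls ('.') are skipped."""
    alleles = []
    for gt in genotypes:
        # Treat phased ('|') and unphased ('/') separators the same way.
        alleles.extend(a for a in gt.replace("|", "/").split("/") if a != ".")
    if not alleles:
        return None
    return sum(a != "0" for a in alleles) / len(alleles)


# Hand-written example record with four samples (one has a missing call).
record_line = "chr1\t12345\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:30\t1/1:28\t0/0:31\t./.:0"
record = parse_vcf_record(record_line)
frequency = alt_allele_frequency(record["genotypes"])
print(record["chrom"], record["pos"], record["ref"], record["alt"], frequency)
```

The frequency is computed only over called alleles, so samples with missing genotypes do not dilute the estimate.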
Whole Genome Sequencing Quality Control
The All of Us Research Program performs stringent quality control (QC) procedures to ensure that we provide researchers with genomic data of the highest quality.
1. Single Sample QC
We run QC processes on each participant sample. These processes help us detect if samples have been swapped, contaminated, or prepared incorrectly. They include verifying genotype fingerprints, identifying appropriate sex chromosomes, and other checks. Cross-contamination rates are less than 3% for All of Us genomic data.
2. Joint Callset QC (WGS only)
We also run QC processes on the joint callset, which uses information across samples to flag samples and identify variants. These QC steps help us detect noisy samples, remove artifacts, and ensure sequencing quality meets genomic data standards.
3. Downstream Validity
All of Us runs analyses to demonstrate specific capabilities of the data and to communicate caveats in the data to researchers. We do this through Genome-Wide Association Studies (GWAS). GWAS help us validate the All of Us genomic data by replicating existing results. In the future, we will release reports describing variant data across populations that will let researchers account for genetic allele frequencies in their work. This information will improve the quality of associations they might find. These reports will be available in the Researcher Workbench User Support Hub to all registered users of the Researcher Workbench.
Genomic Data Quality Reports
Genomic Research Data Quality Report
Phenotype-Genotype Association Replication using the Whole Genome Sequencing Dataset
LDL Cholesterol GWAS Association Replication using the Whole Genome Sequencing dataset
Observational Medical Outcomes Partnership (OMOP) Common Data Model
The Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) is maintained by an international collaborative called the Observational Health Data Sciences and Informatics (OHDSI) program. The All of Us Data and Research Center leverages the OMOP CDM to empower researchers by using existing, standardized vocabularies and a harmonized data representation. These factors enable connection to other ontologies, datasets, and tools that use the same codes or data model. Learn more about OHDSI’s OMOP CDM initiative.
As a researcher, here’s what you should know about OMOP:
OMOP is a relational database. A relational database is a set of formally described tables with defined relationships allowing data to be accessed in many different ways. For researchers, it may be helpful to get familiar with the curated dataset’s OMOP Tables.
OMOP is standardized. Standard vocabularies mean that, despite differences in how each data element may be captured (e.g., variation among the many electronic health records), all of the data are represented consistently in the data model. For each broad category of data, or domain, OMOP incorporates important existing vocabularies so that everyone using the data can speak the same language.
OMOP is where metadata rules. The use of these vocabularies and concept IDs allows flexibility in extracting data. Instead of just the source data, which are often highly specific to individual institutions, OMOP provides concept IDs. This ensures that the data are represented in a standardized way, are common across many institutions, and are easily retrievable using standardized search methodologies. The vocabulary tables are available to provide the names and relationships among these different representations.
Resources can help. Don’t know what the standardized vocabulary is for your search term? Check out Athena, a platform that maps OMOP standardized vocabularies to other nonstandard vocabularies. Want to take a deep dive into OMOP? Discover more on the GitHub Wiki.
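To make the role of concept IDs concrete, here is a minimal sketch using Python's built-in sqlite3 module and a toy in-memory database. The table and column names follow the OMOP CDM, but the rows are invented for illustration, the pairing of source codes with the standard concept is only an example, and the helper `persons_with_concept` is hypothetical. It is not how data are queried in the Researcher Workbench.

```python
import sqlite3

# Toy in-memory database; every row below is made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (
    concept_id INTEGER PRIMARY KEY,
    concept_name TEXT,
    vocabulary_id TEXT
);
CREATE TABLE condition_occurrence (
    condition_occurrence_id INTEGER PRIMARY KEY,
    person_id INTEGER,
    condition_concept_id INTEGER,
    condition_source_value TEXT
);
INSERT INTO concept VALUES (201826, 'Type 2 diabetes mellitus', 'SNOMED');
-- Two institutions record the diagnosis with different source codes,
-- but both rows carry the same standard concept ID.
INSERT INTO condition_occurrence VALUES (1, 101, 201826, 'E11.9');
INSERT INTO condition_occurrence VALUES (2, 102, 201826, '250.00');
""")


def persons_with_concept(connection, concept_id):
    """Return (person_id, concept_name, source_value) rows for one standard concept."""
    query = """
        SELECT co.person_id, c.concept_name, co.condition_source_value
        FROM condition_occurrence AS co
        JOIN concept AS c ON c.concept_id = co.condition_concept_id
        WHERE co.condition_concept_id = ?
        ORDER BY co.person_id
    """
    return connection.execute(query, (concept_id,)).fetchall()


rows = persons_with_concept(conn, 201826)
for row in rows:
    print(row)
```

A single filter on the standard concept ID retrieves both records even though their source values differ, which is the flexibility the vocabulary tables are meant to provide.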
Which OMOP Tables Does All of Us Use?
The All of Us dataset includes EHR data found in the following OMOP tables:
| Table | Description |
|---|---|
| Person | Contains basic demographic information describing a person including sex assigned at birth, birth date, race, and ethnicity. Although it is common to get this information from EHR data, the All of Us Program uses data provided directly by participants through surveys when this information is available. |
| Visit_occurrence | Visits capture encounters with health care providers or similar events. Contains the type of visit a person has (outpatient care, inpatient confinement, emergency room, or long-term care), as well as date and duration information. Rows in other tables can reference this table, e.g., condition occurrences related to a specific visit. |
| Condition_occurrence | Conditions are records of a person indicating the presence of a disease or medical condition stated as a diagnosis, a sign, or a symptom, which is either observed by a provider or reported by the patient. |
| Drug_exposure | Captures records about the utilization of a medication. Drug exposures include prescription and over-the-counter medicines, vaccines, and large-molecule biologic therapies. Radiological devices ingested or applied locally do not count as drugs. Drug exposure is inferred from clinical events associated with orders, prescriptions written, pharmacy dispensings, procedural administrations, and other patient-reported information. |
| Measurement | Contains both orders and results of a systematic and standardized examination or testing of a person or person’s sample, including laboratory tests, vital signs, and quantitative findings from pathology reports. Physical measurements collected by All of Us are also stored in this table. |
| Procedure_occurrence | Contains records of activities or processes ordered or carried out by a health care provider for a diagnostic or therapeutic purpose. |
| Observation | Captures clinical facts about a person obtained in the context of examination, questioning, or a procedure. Any data that cannot be represented by any other domains, such as social and lifestyle facts, medical history, family history, etc. are recorded here. Survey information is also located in this table. |
| Device_exposure | Captures information about a person’s exposure to a foreign physical object or instrument used for diagnostic or therapeutic purposes. Devices include implantable objects (e.g., pacemakers, stents, artificial joints), blood transfusions, medical equipment and supplies (e.g., bandages, crutches, syringes), other instruments used in medical procedures (e.g., sutures, defibrillators), and material used in clinical care (e.g., adhesives, body material, dental material, surgical material). |
| Death | Contains the clinical events surrounding how and when a person dies. |
| Fact_relationship | Contains records about the relationships between facts stored as records in any table of the CDM. Relationships can be defined between facts from the same domain or different domains. Examples of fact relationships include person relationships (parent–child), care site relationships (hierarchical organizational structure of facilities within a health system), etc. |
| Specimen | Contains the records identifying biological samples from a person. |
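As the table notes, rows in one OMOP table can reference rows in another. The sketch below, again using Python's sqlite3 module with a toy in-memory database, joins condition occurrences to the visit in which they were recorded and to the person they belong to. Column names follow the OMOP CDM, but only a few columns are included and all rows are invented for illustration. The visit concept IDs used (9202 for an outpatient visit, 9201 for an inpatient visit) are standard OMOP concepts.

```python
import sqlite3

# Toy in-memory database with a small subset of OMOP columns; rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,
    year_of_birth INTEGER
);
CREATE TABLE visit_occurrence (
    visit_occurrence_id INTEGER PRIMARY KEY,
    person_id INTEGER,
    visit_concept_id INTEGER,
    visit_start_date TEXT
);
CREATE TABLE condition_occurrence (
    condition_occurrence_id INTEGER PRIMARY KEY,
    person_id INTEGER,
    condition_concept_id INTEGER,
    visit_occurrence_id INTEGER
);
INSERT INTO person VALUES (101, 1970);
INSERT INTO person VALUES (102, 1985);
INSERT INTO visit_occurrence VALUES (5001, 101, 9202, '2020-03-01');
INSERT INTO visit_occurrence VALUES (5002, 102, 9201, '2021-07-15');
INSERT INTO condition_occurrence VALUES (1, 101, 201826, 5001);
INSERT INTO condition_occurrence VALUES (2, 102, 201826, 5002);
""")

# Each condition row points at its visit via visit_occurrence_id
# and at its participant via person_id.
query = """
    SELECT p.person_id, p.year_of_birth, v.visit_concept_id, v.visit_start_date
    FROM condition_occurrence AS co
    JOIN visit_occurrence AS v ON v.visit_occurrence_id = co.visit_occurrence_id
    JOIN person AS p ON p.person_id = co.person_id
    WHERE co.condition_concept_id = ?
    ORDER BY p.person_id
"""
condition_visits = conn.execute(query, (201826,)).fetchall()
for row in condition_visits:
    print(row)
```

Keeping visits, conditions, and demographics in separate tables linked by identifiers is what lets the same data be accessed in many different ways.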
Wearables Data Model
As an important first step in integrating wearables for data collection, All of Us participants with any Fitbit device who wish to share Fitbit data with the program may do so. Participants may choose what type of data to share and may stop sharing at any time.
A Fitbit is an activity tracker, usually worn on the wrist, which can track the distance you walk, run, swim, or cycle, as well as the number of calories you burn and take in. Some also monitor your heart rate and sleep quality.
In the All of Us Research Program, this type of participant data is known as “wearables” data. Fitbit data are the first of this data type to be included in the Researcher Workbench.
Fitbit data are available in a series of tables within the Researcher Workbench. Please note that these tables can be linked to the data from OMOP-formatted tables by person_id.
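The person_id link can be sketched as follows, using Python's sqlite3 module and a toy in-memory database. The table `fitbit_daily_steps` and its columns are hypothetical stand-ins chosen for illustration; they are not the actual Fitbit table definitions, and all rows are invented.

```python
import sqlite3

# Toy in-memory database; fitbit_daily_steps is a hypothetical table name.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,
    year_of_birth INTEGER
);
CREATE TABLE fitbit_daily_steps (
    person_id INTEGER,
    date TEXT,
    steps INTEGER
);
INSERT INTO person VALUES (101, 1970);
INSERT INTO person VALUES (102, 1985);
INSERT INTO person VALUES (103, 1990);
INSERT INTO fitbit_daily_steps VALUES (101, '2021-01-01', 4000);
INSERT INTO fitbit_daily_steps VALUES (101, '2021-01-02', 6000);
INSERT INTO fitbit_daily_steps VALUES (102, '2021-01-01', 9000);
""")

# Join wearable records to the OMOP person table on person_id and
# average the daily step counts per participant.
query = """
    SELECT p.person_id, p.year_of_birth, AVG(f.steps) AS mean_steps
    FROM person AS p
    JOIN fitbit_daily_steps AS f ON f.person_id = p.person_id
    GROUP BY p.person_id, p.year_of_birth
    ORDER BY p.person_id
"""
mean_steps = conn.execute(query).fetchall()
for row in mean_steps:
    print(row)
```

Because this is an inner join, participant 103, who has no wearable rows in the toy data, does not appear in the result. That mirrors the fact that only participants who choose to share Fitbit data contribute to these tables.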
Below are all the currently available tables with the data format for each field along with some notes to consider when using Fitbit data.