To ensure the Research Hub collects the highest quality data possible, the All of Us Research Program employs a comprehensive data methodology to curate data for registered researchers. Read more about the program protocol.
Data Curation Process
The All of Us Research Program collects data from a wide variety of sources, including surveys, electronic health records (EHRs), biosamples, physical measurements, and wearables like Fitbit.
The All of Us Research Program uses the OMOP CDM to ensure EHR data are standardized for all researchers.
After harmonizing the data to meet the specifications of the OMOP CDM, we process the data to ensure participant privacy is protected. We also take steps to conform and clean the data so that the result is of high quality. See the All of Us Data Dictionary for more information.
All of Us Research Program data in its final format, after harmonization and refinement, are referred to as a curated dataset. Two separate datasets are available:
The Public Tier curated dataset may be accessed through the interactive Data Browser application.
The Registered Tier curated dataset may be accessed by approved researchers in the Researcher Workbench. Registered Tier data include survey answers, physical measurements taken at the time of participant enrollment, longitudinal EHRs, and data from wearables like Fitbit. These individual-level data can be analyzed within the Researcher Workbench. To explore the privacy methodology for the Registered Tier in detail, visit the Data Access & Use page.
The All of Us Data Dictionary documents what data are available from participants and what modifications the program makes to protect participant privacy. It provides a description for each data field, noting whether it is a standard OMOP field or a custom field created to help capture data unique to the program. The Data Dictionary also provides information on whether the data in each field come from participant health records or from information the participants provide themselves, like survey data. The Data Dictionary details some ways we clean the data to improve data quality, as well as many of the program custom concept IDs for easy reference. This resource includes versioning data, so you can see what has been changed, added, or removed since the previous curated dataset.
Check out the Data Dictionary.
Genomic Data Curation
Genomic data are now available in the All of Us dataset in the form of whole genome sequencing (WGS). The All of Us genomic data available in the Controlled Tier are unique not only because they are contributed by a highly diverse group of research participants, but also because the dataset will eventually be scaled up to include data from 1 million or more people. Providing researchers with a large, diverse genomic dataset will promote research in previously unreported areas and in groups historically underrepresented in biomedical research.
Researchers can combine these data with other data provided by participants, such as EHR data, survey data, physical measurements, and data from wearables like Fitbit. Researchers who are interested in using genomic data can do so through the custom point-and-click tools unique to the Researcher Workbench. When creating a cohort, researchers can select an option to include only participants who have WGS variant data. Users can then choose to include WGS variant data as part of their dataset. Once selected, WGS variant data are extracted from the genomic dataset and saved as VCF (Variant Call Format) files for export to a Jupyter Notebook, where they can be analyzed using Hail, PLINK, or another analysis tool of the researcher's choosing. Learn more about the tools available in the Researcher Workbench.
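For readers unfamiliar with VCF, the layout of an exported file can be illustrated with a short sketch. The example below uses only the Python standard library and a tiny made-up VCF snippet; the sample names, positions, and genotypes are invented for illustration and are not All of Us data. In practice, a tool such as Hail or PLINK would do this work at scale.

```python
# Fixed columns defined by the VCF specification, in order.
VCF_FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_text(text):
    """Yield one dict per variant record found in VCF-formatted text."""
    sample_names = []
    for line in text.splitlines():
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # Header line: sample names follow the FORMAT column.
            sample_names = fields[9:]
            continue
        record = dict(zip(VCF_FIXED_COLUMNS, fields[:8]))
        record["POS"] = int(record["POS"])
        record["ALT"] = record["ALT"].split(",")  # may list several alleles
        if len(fields) > 9:
            keys = fields[8].split(":")  # FORMAT column, e.g. "GT"
            record["samples"] = {
                name: dict(zip(keys, value.split(":")))
                for name, value in zip(sample_names, fields[9:])
            }
        yield record


# Made-up example content (NOT real participant data).
EXAMPLE_VCF = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample_1\tsample_2",
    "chr1\t12345\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t1/1",
    "chr2\t67890\t.\tC\tT,G\t99\tPASS\t.\tGT\t0/0\t1/2",
])

records = list(parse_vcf_text(EXAMPLE_VCF))
for r in records:
    print(r["CHROM"], r["POS"], r["REF"], r["ALT"], r["samples"])
```

Each record carries the variant position, the reference and alternate alleles, and one genotype entry per sample, which is the information downstream analysis tools operate on.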
Observational Medical Outcomes Partnership (OMOP) Common Data Model
The OMOP CDM is maintained by an international collaborative called the Observational Health Data Sciences and Informatics (OHDSI) program. The All of Us Data and Research Center leverages the OMOP CDM to empower researchers by using existing, standardized vocabularies and a harmonized data representation. These factors enable connection to other ontologies, datasets, and tools that use the same codes or data model. Learn more about OHDSI’s OMOP CDM initiative.
As a researcher, here’s what you should know about OMOP:
OMOP is a relational database. A relational database is a set of formally described tables with defined relationships allowing data to be accessed in many different ways. For researchers, it may be helpful to get familiar with the curated dataset’s OMOP Tables.
OMOP is standardized. Standard vocabularies mean that, despite differences in how each data element may be captured (e.g., variation among the many electronic health records), all of the data are represented consistently in the data model. For each broad category of data, or “domain,” OMOP incorporates important existing vocabularies so that everyone using the data can speak the same language.
OMOP is where metadata rules. The use of these vocabularies and “concept IDs” allows flexibility in extracting data. Instead of only the source data, which are often highly specific to individual institutions, OMOP provides concept IDs. This ensures that the data are represented in a standardized way, are common across many institutions, and are easily retrievable using standardized search methodologies. The vocabulary tables provide the names of these different representations and the relationships among them.
Resources can help. Don’t know what the standardized vocabulary is for your search term? Check out Athena, a platform that maps OMOP standardized vocabularies to other nonstandard vocabularies. Want to take a deep dive into OMOP? Discover more on the GitHub Wiki.
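The points above can be sketched with a small in-memory database. This is a minimal illustration, assuming a heavily simplified subset of the OMOP person, concept, and condition_occurrence tables. The concept ID 9001 is a made-up placeholder (look up real concept IDs in Athena), and the rows are invented. The two source values are an ICD-10-CM code and an ICD-9-CM code for type 2 diabetes, standing in for two institutions that record the same condition differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,
    year_of_birth INTEGER
);
CREATE TABLE concept (
    concept_id INTEGER PRIMARY KEY,
    concept_name TEXT,
    vocabulary_id TEXT
);
CREATE TABLE condition_occurrence (
    condition_occurrence_id INTEGER PRIMARY KEY,
    person_id INTEGER REFERENCES person(person_id),
    condition_concept_id INTEGER REFERENCES concept(concept_id),
    condition_source_value TEXT
);

-- Invented rows; 9001 is a placeholder concept ID, not a real one.
INSERT INTO person VALUES (1, 1970), (2, 1985);
INSERT INTO concept VALUES (9001, 'Type 2 diabetes mellitus', 'SNOMED');
INSERT INTO condition_occurrence VALUES
    (10, 1, 9001, 'E11.9'),   -- source recorded with an ICD-10-CM code
    (11, 2, 9001, '250.00');  -- source recorded with an ICD-9-CM code
""")

# One standard concept ID retrieves both records, whatever the source coding.
rows = conn.execute(
    """
    SELECT p.person_id, c.concept_name, co.condition_source_value
    FROM condition_occurrence AS co
    JOIN person AS p ON p.person_id = co.person_id
    JOIN concept AS c ON c.concept_id = co.condition_concept_id
    WHERE co.condition_concept_id = ?
    ORDER BY p.person_id
    """,
    (9001,),
).fetchall()

for row in rows:
    print(row)
```

The query filters on the standardized concept ID rather than on institution-specific source codes, and the join to the concept table supplies the human-readable name.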
Which OMOP Tables Does All of Us Use?
The All of Us dataset includes EHR data found in the following OMOP tables:
|Person||Contains basic demographic information describing a person including biological sex, birth date, race, and ethnicity. Although it is common to get this information from EHR data, the All of Us Program uses data provided directly by participants through surveys when this information is available.|
|Visit_occurrence||Visits capture encounters with health care providers or similar events. Contains the type of visit a person has (outpatient care, inpatient confinement, emergency room, or long-term care), as well as date and duration information. Rows in other tables can reference this table, e.g., condition occurrences related to a specific visit.|
|Condition_occurrence||Conditions are records indicating that a person has a disease or medical condition, stated as a diagnosis, a sign, or a symptom, which is either observed by a provider or reported by the patient.|
|Drug_exposure||Captures records about the utilization of a medication. Drug exposures include prescription and over-the-counter medicines, vaccines, and large-molecule biologic therapies. Radiological devices ingested or applied locally do not count as drugs. Drug exposure is inferred from clinical events associated with orders, prescriptions written, pharmacy dispensings, procedural administrations, and other patient-reported information.|
|Measurement||Contains both orders and results of a systematic and standardized examination or testing of a person or person’s sample, including laboratory tests, vital signs, and quantitative findings from pathology reports. Physical measurements collected by All of Us are also stored in this table.|
|Procedure_occurrence||Contains records of activities or processes ordered or carried out by a health care provider for a diagnostic or therapeutic purpose.|
|Observation||Captures clinical facts about a person obtained in the context of examination, questioning, or a procedure. Any data that cannot be represented in another domain, such as social and lifestyle facts, medical history, and family history, are recorded here. Survey information is also located in this table.|
|Device_exposure||Captures information about a person’s exposure to a foreign physical object or instrument used for diagnostic or therapeutic purposes. Devices include implantable objects (e.g., pacemakers, stents, artificial joints), blood transfusions, medical equipment and supplies (e.g., bandages, crutches, syringes), other instruments used in medical procedures (e.g., sutures, defibrillators), and material used in clinical care (e.g., adhesives, body material, dental material, surgical material).|
|Death||Contains the clinical events surrounding how and when a person dies.|
|Fact_relationship||Contains records about the relationships between facts stored as records in any table of the CDM. Relationships can be defined between facts from the same domain or different domains. Examples of fact relationships include person relationships (parent–child), care site relationships (hierarchical organizational structure of facilities within a health system), etc.|
|Specimen||Contains the records identifying biological samples from a person.|
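As noted for Visit_occurrence, rows in other tables can reference a visit. The sketch below illustrates that link with invented rows and a simplified subset of columns from visit_occurrence and condition_occurrence. It computes each visit's duration and counts the condition records attached to it. The source values are placeholders, not real codes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE visit_occurrence (
    visit_occurrence_id INTEGER PRIMARY KEY,
    person_id INTEGER,
    visit_start_date TEXT,
    visit_end_date TEXT
);
CREATE TABLE condition_occurrence (
    condition_occurrence_id INTEGER PRIMARY KEY,
    person_id INTEGER,
    visit_occurrence_id INTEGER
        REFERENCES visit_occurrence(visit_occurrence_id),
    condition_source_value TEXT
);

-- Invented rows for illustration only.
INSERT INTO visit_occurrence VALUES
    (100, 1, '2020-03-01', '2020-03-04'),
    (101, 1, '2021-06-15', '2021-06-15');
INSERT INTO condition_occurrence VALUES
    (10, 1, 100, 'placeholder code A'),
    (11, 1, 100, 'placeholder code B'),
    (12, 1, 101, 'placeholder code A');
""")

# Duration in days and number of linked condition records per visit.
visit_summary = conn.execute(
    """
    SELECT v.visit_occurrence_id,
           CAST(julianday(v.visit_end_date)
                - julianday(v.visit_start_date) AS INTEGER) AS duration_days,
           COUNT(co.condition_occurrence_id) AS n_conditions
    FROM visit_occurrence AS v
    LEFT JOIN condition_occurrence AS co
           ON co.visit_occurrence_id = v.visit_occurrence_id
    GROUP BY v.visit_occurrence_id
    ORDER BY v.visit_occurrence_id
    """
).fetchall()

for row in visit_summary:
    print(row)
```

A LEFT JOIN is used so that visits with no linked condition records would still appear, with a count of zero.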
Wearables Data Model
As an important first step in integrating wearables for data collection, All of Us participants with any Fitbit device who wish to share Fitbit data with the program may do so. Participants may choose what type of data to share and may stop sharing at any time.
A Fitbit is an activity tracker, usually worn on the wrist, which can track the distance you walk, run, swim, or cycle, as well as the number of calories you burn and take in. Some also monitor your heart rate and sleep quality.
In the All of Us Research Program, this type of participant data is known as a “wearable”, and Fitbit data are the first of this data type to be included in the Researcher Workbench.
Fitbit data are available in a series of tables within the Researcher Workbench. Please note that these tables can be linked to the data from OMOP formatted tables by person_id.
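Linking by person_id can be sketched as follows. This is a minimal illustration in plain Python: the daily steps table, its column names, and all values are hypothetical stand-ins chosen for the example, not the actual Researcher Workbench schema.

```python
from collections import defaultdict

# Stand-in for rows from the OMOP person table (invented values).
person_rows = [
    {"person_id": 1, "year_of_birth": 1970},
    {"person_id": 2, "year_of_birth": 1985},
]

# Stand-in for rows from a hypothetical daily Fitbit steps table.
fitbit_rows = [
    {"person_id": 1, "date": "2021-01-01", "steps": 4000},
    {"person_id": 1, "date": "2021-01-02", "steps": 6000},
]

# Group Fitbit step counts by the shared person_id key.
steps_by_person = defaultdict(list)
for row in fitbit_rows:
    steps_by_person[row["person_id"]].append(row["steps"])

# Attach a per-person Fitbit summary to each person record.
linked = []
for person in person_rows:
    steps = steps_by_person.get(person["person_id"], [])
    linked.append({
        **person,
        "days_with_fitbit_data": len(steps),
        "mean_daily_steps": sum(steps) / len(steps) if steps else None,
    })

for row in linked:
    print(row)
```

Participants who have not shared Fitbit data remain in the result with no summary value, which reflects that sharing wearable data is optional.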
Below are all the currently available tables with the data format for each field along with some notes to consider when using Fitbit data.