Accelerating Patient-Centered Outcomes Research through Synthetic Health Data Generation

Stephanie Garcia and Carmen Smiley | September 19, 2022

Real-world health data are critical for Patient-Centered Outcomes Research (PCOR). However, it’s often difficult, expensive, and time-consuming for researchers to access real-world clinical health data because of privacy concerns, security restrictions, and usage issues. PCOR researchers, health information technology developers, and informaticists often depend on anonymized or de-identified clinical health data for testing theories, data models, algorithms, and prototype innovations, but re-identification of anonymized data remains a possible security risk. Synthetic health data can provide a no-risk data source to complement research and support testing needs until real clinical health data are available.

In 2019, ONC launched the Synthetic Health Data Project (Project), enabling Synthea™, a synthetic health data generation engine created by MITRE Corporation, to produce high-quality synthetic health data in specific areas, as well as more varied and plentiful synthetic health records. The Project focused on creating synthetic data to support PCOR on complex care needs, opioid use, and pediatric populations. Synthea was chosen because it is free, open-source, community-driven, shareable, and reliable. It can also generate realistic health data for fictitious patients, and because it generates data randomly and independently from publicly available datasets, its output is free of protected health information (PHI) and personally identifiable information (PII).

Enhancing Data Infrastructure for PCOR and Health IT Development

Synthea modules, which drive the creation of the synthetic health data, are built using clinical care guidelines and standards of care and are informed by input from clinical experts and published studies. The modules capture the parameters for conditions, encounters, and laboratory tests, as well as other clinical and demographic information. The Project built and validated five new modules: Cerebral Palsy, Prescribing Opioids for Chronic Pain and Treatment of Opioid Use Disorder, Sepsis, Spina Bifida, and Acute Myeloid Leukemia. These modules are now part of the gallery of Synthea modules that are freely available for use. Accompanying Companion Guides were created to provide researchers and developers with essential technical information about each module.
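To make the idea of a module driving record generation concrete, here is a minimal sketch in Python. It is not Synthea's code or its module format: the `TOY_MODULE` dictionary, the `run_module` and `generate_patients` helpers, the condition name, and the 30% onset probability are all hypothetical and chosen only for illustration. It assumes a module can be thought of as a small set of states (condition onset, encounter, laboratory observation) that a fictitious patient moves through at random.

```python
import random

# Hypothetical, simplified module: a tiny state machine. The layout, names,
# and probabilities are invented for illustration and are NOT Synthea's format.
TOY_MODULE = {
    "name": "Toy Condition",
    "states": {
        "Initial": {"type": "Initial",
                    "next": [("Onset", 0.3), ("Terminal", 0.7)]},
        "Onset": {"type": "ConditionOnset", "record": "toy condition",
                  "next": [("Visit", 1.0)]},
        "Visit": {"type": "Encounter", "record": "outpatient visit",
                  "next": [("Lab", 1.0)]},
        "Lab": {"type": "Observation", "record": "toy lab test",
                "next": [("Terminal", 1.0)]},
        "Terminal": {"type": "Terminal", "next": []},
    },
}


def run_module(module, rng):
    """Walk the state machine once; return the events for one fictitious patient."""
    events = []
    state_name = "Initial"
    while True:
        state = module["states"][state_name]
        # States that carry a "record" entry add an event to the patient's history.
        if "record" in state:
            events.append((state["type"], state["record"]))
        # A state with no outgoing transitions ends the walk.
        if not state["next"]:
            return events
        names = [name for name, _ in state["next"]]
        weights = [weight for _, weight in state["next"]]
        state_name = rng.choices(names, weights=weights, k=1)[0]


def generate_patients(module, count, seed=0):
    """Generate `count` independent synthetic histories from a seeded generator."""
    rng = random.Random(seed)
    return [run_module(module, rng) for _ in range(count)]


if __name__ == "__main__":
    for index, events in enumerate(generate_patients(TOY_MODULE, 5)):
        print(index, events)
```

Because every event comes from the module's rules and a random number generator rather than from any real patient's chart, the resulting histories contain no real patient information. A fixed seed makes a run reproducible, which is convenient for testing.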

Demonstrating Expanded Potential Uses of Synthea

The Project included a demonstration evaluating the application of Synthea using simulation studies. The demonstration successfully replicated a published simulation study, which suggests a range of expanded potential research uses for Synthea. Read more about the demonstration study in this recently published article.

Engaging the Community

Through the Synthetic Health Data Challenge, the Project sought to broaden community awareness of Synthea and its capabilities. The Challenge inspired innovators, researchers, and technology developers to demonstrate novel uses of Synthea and to validate the realism of the synthetic health records generated. A total of $100,000 in prizes was awarded to the top six competitors, and winners presented their innovative solutions during the Winning Solutions Webinar.

Resources for the Community

Analysis of the Project’s successes and challenges informed enhancements to Synthea software, guidance, and documentation that will help make Synthea software more capable and user-friendly for researchers and developers, while providing additional support to encourage novice users. Visit the Project homepage to access the Synthetic Health Data Generation to Accelerate Patient-Centered Outcomes Research Final Report, modules, companion guides, technical guidance and tips for using Synthea, and other project materials. Synthea relies on community participation, so efforts to enhance and maintain community-wide stewardship for this open-source and free software are essential to its overall success and continuing usefulness to PCOR stakeholders.

The Synthetic Health Data Project was funded by the Patient-Centered Outcomes Research Trust Fund (PCORTF) and administered by the Assistant Secretary for Planning and Evaluation (ASPE) at the U.S. Department of Health and Human Services, which supports the development of data capacity and infrastructure that can engage patients in health care decision-making and incorporate their responses into research.