Training Data for Machine Learning to Enhance Patient-Centered Outcomes Research Data Infrastructure

Project Background

Artificial intelligence (AI)/machine learning (ML) applications can use large amounts of real-world clinical data in varied and complex formats to rapidly identify effective treatments, potentially accelerating clinical innovation and supporting evidence-based decisions in clinical settings. However, the lack of high-quality training data for developing clinically usable ML models has been a significant and ongoing impediment. A foundation of high-quality training data is critical to developing robust ML models. Training datasets are essential to train prediction models that use ML algorithms, to extract the features most relevant to specified research goals, and to reveal meaningful associations.

Through this project, ONC, in partnership with the National Institutes of Health (NIH) National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), advanced the application of AI/ML in patient-centered outcomes research (PCOR) by generating high-quality training datasets for a chronic kidney disease (CKD) use case: predicting mortality within the first 90 days of dialysis.

Project Dates

This project began in 2019 and ended in September 2021.

Project Goal

The goal of this project was to conduct foundational work to support future applications of AI/ML to improve PCOR, health and health care delivery.

This project achieved its goal by:

  • Establishing a kidney disease-related use case that benefits from ML applications using data from the United States Renal Data System (USRDS)
  • Identifying and convening a multidisciplinary expert panel to provide input into the criteria for the development of high-quality training datasets and provisionally testing these datasets using ML algorithms
  • Developing machine-learning models and using conventional metrics to evaluate their performance
  • Capturing lessons learned from the process of developing the training datasets and ML models
  • Disseminating project tools, activities, and points to consider/lessons learned to encourage future applications of these methods by PCOR researchers
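
As an illustration of what "conventional metrics" can mean for a binary outcome such as death within the first 90 days of dialysis, the following minimal sketch computes sensitivity, specificity, and the area under the ROC curve (AUC). The labels and scores are made-up toy values, not USRDS data or project results, and the function names are hypothetical; the project's actual code is the version published on ONC GitHub.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def sensitivity_specificity(
    y_true: Sequence[int], y_score: Sequence[float], threshold: float = 0.5
) -> Tuple[float, float]:
    """Threshold the predicted risks, then compute sensitivity and specificity."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUC as the probability that a randomly chosen positive case is scored
    higher than a randomly chosen negative case (ties count as one half)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (assumed values): 1 = died within 90 days, 0 = survived.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_score = [0.9, 0.2, 0.4, 0.6, 0.1, 0.7, 0.3, 0.5]  # hypothetical predicted risks

sens, spec = sensitivity_specificity(y_true, y_score)
auc = roc_auc(y_true, y_score)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} auc={auc:.3f}")
```

The pairwise AUC calculation is quadratic in the number of cases, so it suits small illustrations only; a rank-based implementation from an established ML library would be the usual choice on real data.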

Learn More

Final Report

To learn more about how this Project advanced the application of machine learning in patient-centered outcomes research (PCOR), read the Final Report [PDF - 1.9 MB].

Implementation Guide and Code

Access the Implementation Guide for this Project.

The code used to develop the training datasets and ML models is available on ONC GitHub.

Project Infographic

The infographic [PDF - 1.1 MB] provides a visual overview of the project and includes the goal and objectives, data sources and use case selected, and methodology and results from building the training dataset and ML models for the kidney disease use case.


Read this blog to understand how machine learning was applied to address a kidney disease use case. 

Project Webinar

Please contact with questions about this project.