Evaluating ML/AI Models in Clinical Research

The number of machine learning (ML) and artificial intelligence (AI) models published in clinical research is increasing every year. Whether clinicians choose to dive deep into the mathematical and computer science underpinnings of these algorithms or simply want to be conscientious consumers of new research relevant to their line of work, it is important to become familiar with reading the literature in this field.

To that end, Quer et al. recently wrote a State-of-the-Art Review in the Journal of the American College of Cardiology detailing the research landscape for ML and AI within cardiology, including concrete tips on how a non-ML expert can interpret these studies. At its core, ML is about prediction: models are created to make accurate predictions on new or unseen data. Inspired by their work and incorporating many of their recommendations, below is a list of considerations for critically evaluating an ML/AI model in clinical research:

  1. What question is addressed, and what problem is tackled? How important is it? Regardless of a model’s performance or accuracy, its usefulness is determined by its clinical application. Everything must go back to the patient.
  2. How does the ML/AI model compare to traditional models for the given task? Many studies have shown little additional benefit of ML/AI models over standard statistical approaches, such as logistic regression, for clinical questions that have been extensively researched and for which the key predictors of the outcome of interest have already been identified. The promise of ML/AI really lies in incorporating novel data sources and data structures: time-series information and continuous input from wearable sensors, raw images and signals from common studies such as echocardiograms and ECGs, and the harmonization of different data types.
  3. Into which broad category does the model fall? Most machine learning models fall into one of three buckets: supervised learning, unsupervised learning, or reinforcement learning. Each approach is different, with a unique end product. Supervised learning algorithms learn patterns in the data that allow them to predict whether a specific observation falls within a specific class or category, for example determining whether a photo shows a cat or a dog. This requires labeled data for the algorithm to learn from, i.e., someone or something has provided data correctly tagged as a dog or a cat. Unsupervised learning does not require labeled observations but instead combs through the observations to look for those that are similar to each other. Reinforcement learning is a separate task in which an agent is trained to optimize its choices to attain a stated goal. All of these have been used clinically in recent literature.
  4. How were the data and labels generated? Garbage in = garbage out. Your model is only as good as the data it was trained on and the accuracy of the labels. It’s important to know where this information came from.
  5. How was the model trained and validated, and does it generalize? A common approach to training models is to split the data into a training set, with separate observations held out as a test set to validate the model. It is critical to train and test on different data with no overlap. Model performance is tracked with metrics similar to those used to evaluate clinical models, including sensitivity, specificity, positive predictive value, negative predictive value, and area under the curve (AUC), although the names associated with those measures may differ (for example, sensitivity is often called recall, and positive predictive value is often called precision). Additional measures such as an F-score may be used. Arguably more important, however, is generalizability: how well the model performs in an entirely separate cohort, often from another center. Many of the currently published studies do not include this step.
  6. How clinically useful are these findings, and is the model interpretable? Basically, is the juice worth the squeeze? And can a human understand why the model reached its conclusion? A common knock against deep learning neural networks, for example, is that although they are incredibly skilled at learning from data and making accurate predictions on new data, how they do so is a “black box,” although newer ML/AI methods have started to address this.
  7. How reproducible are the results? Did the authors share their code or dataset? If they used an EHR phenotype to generate their cohort, can you do the same thing at your institution?
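
To make point 5 a little more concrete, here is a minimal Python sketch of two of the ideas above: splitting data so that no patient appears in both the training and test sets, and computing the familiar diagnostic metrics from a model's binary predictions. It is only an illustration under stated assumptions and not the method used by Quer et al.; the function names, patient IDs, labels, and predictions are all made up for the example.

```python
import random


def split_by_patient(patient_ids, test_fraction=0.2, seed=0):
    """Split unique patient IDs into disjoint training and test sets.

    Splitting on patients (rather than on individual rows) keeps repeat
    studies from the same patient out of both sets at once.
    """
    unique_ids = sorted(set(patient_ids))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(unique_ids)
    n_test = max(1, round(len(unique_ids) * test_fraction))
    test_ids = set(unique_ids[:n_test])
    train_ids = set(unique_ids[n_test:])
    return train_ids, test_ids


def classification_metrics(y_true, y_pred):
    """Compute diagnostic metrics from binary labels (1 = disease present)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(numerator, denominator):
        # Avoid dividing by zero when a class never occurs.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),   # called "recall" in ML papers
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),           # called "precision" in ML papers
        "npv": ratio(tn, tn + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Hypothetical cohort: some patients contribute more than one record.
example_ids = ["p1", "p1", "p2", "p3", "p4", "p5", "p5"]
train_ids, test_ids = split_by_patient(example_ids)
print("Overlap between train and test:", train_ids & test_ids)

# Hypothetical test-set labels and model predictions.
example_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
example_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
for name, value in classification_metrics(example_true, example_pred).items():
    print(f"{name}: {value:.2f}")
```

When reading a paper, the equivalent checks are whether the authors state how the split was made (by patient, by study, or by site) and whether they report more than a single headline metric.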

These points are meant to summarize and build on some important aspects of this recently published article. It is an excellent read, and I encourage everyone to review it in its entirety.


Quer, G., et al. (2021). “Machine Learning and the Future of Cardiovascular Care.” Journal of the American College of Cardiology 77(3): 300-313.

