A Rubric for Evaluating Healthcare AI Training Data
Mar 18, 2025
In today’s rapidly changing AI landscape, a major trend is emerging: simply scaling the volume of data used for model training will not be enough to create the next generation of models. Adding ever-greater volumes of data to large, dense models yields diminishing returns; instead, AI development is headed towards a landscape of many models, each a finer variation well-suited to a specific task. Curating an ideal training dataset therefore becomes a multi-constraint optimization problem, in which the developer must weigh inclusion and exclusion tradeoffs across multiple dimensions of the data to maximize the volume and diversity of information provided to the model. This poses the question of how to evaluate whether a training dataset is appropriate for such specific model development.
Based on some of our key learnings, this post serves as a set of guidelines for AI developers evaluating a training data asset. The past year has reinforced the paradox that, while data is abundant, it is often sparse in its most meaningful dimensions.
Traditional datasets are not meeting today’s AI needs
Most existing datasets are not purpose-built for AI's unstructured, multimodal data needs; rather, they consist of records aggregated from multiple sources that span various lengths of time. One limitation of conventional healthcare data is that it often captures only one or two modalities at a time and lacks longitudinal depth. Even for aggregators that combine data across several health systems, we have seen instances where 95% of patients are followed for less than 5 years. When building training data cohorts that also require patients with specific diseases, stages, and treatment protocols, an entire geography can contain fewer than 1,000 patients.
In addition to limited longitudinality, many traditional healthcare datasets consist of only structured data from EHRs and insurance claims. Furthermore, many of these datasets are curated from large academic health systems that skew towards certain types of clinical conditions, serve patients who are more likely to participate in clinical trials, and are concentrated in urban geographies. Capturing both breadth and depth of data, through longitudinal patient journeys and all of the modalities used in real-world clinical decision-making, while maintaining patient diversity is nearly impossible to achieve. Following the example above, if a training dataset also requires the inclusion of more than two modalities of data, availability can drop to below 100 patients. This is not nearly enough for training a generalizable AI model for healthcare applications, and developers are then tasked with the tremendous challenge of sourcing and curating a dataset across many different sources.
A Rubric for Evaluating Training Data
There is no standard formula for creating the optimal training dataset for AI development. The model’s intended use, deployment settings, and risk level are all critical pieces of information that must be weighed in the data curation process. We have compiled a set of recommendations for how organizations that are developing AI for healthcare should evaluate training data.

1. Connectedness of patient journeys
Curating long, comprehensive patient journeys is critical for predictive AI tasks, but it also serves a more basic purpose for training a broader set of models. Due to the fragmentation of the healthcare system, longitudinal patient data from a single source may still be filled with blind spots when care was delivered at other systems and facilities.
When evaluating how well-connected a training dataset is, some key questions to consider are:
Do the patient journeys in the dataset cover a long enough time horizon for the model to learn how a disease progresses (e.g. is it important for the model to learn from pre-diagnosis data)?
For all models, does the data contain gaps in care that may lead to blind spots for the model (e.g. are there components of the patient journey that are unobserved due to data capture from inaccessible sources)?
Is data longitudinality equal across different subpopulations within the training set, or is it concentrated on data from specific sources?
An example of where patient journey connectedness is critical to consider is when an AI developer is focused on certain inpatient procedures, and data is being curated from the facilities where these procedures are conducted. More likely than not, many patients within this cohort receive routine outpatient care at other facilities that may not be connected to the inpatient facility. Their medical records from these outpatient facilities are likely full of information relevant to AI model development; if the point-in-time inpatient data are the only training data leveraged, the model will be left with major blind spots.
A strategy here is to link together multiple large-scale datasets that themselves integrate various sources, providing a straight shot at constructing comprehensive patient journeys, reducing unobservables, and capturing nuanced healthcare interactions. When visualizing patient follow-up periods from individual datasets, we observe that the vast majority of patients have less than 2 years of follow-up data; when common patients are combined across three sources, follow-up duration increases significantly, albeit at the cost of diversity and external validity (selection bias in which patients appear in all sources). The only way to resolve this tradeoff is to expand aggregation further, ensuring both comprehensive longitudinal tracking and a more representative patient population.
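To make this kind of follow-up analysis concrete, below is a minimal sketch, assuming each source is available as a pandas DataFrame with hypothetical patient_id and event_date columns and that patients can be matched on a shared identifier; the column names and the linkage shortcut are illustrative assumptions rather than a description of a production linkage pipeline.

```python
# A minimal sketch of comparing follow-up duration within individual sources
# versus patients common to all sources after linkage. Column names and the
# shared-identifier linkage are illustrative assumptions.
import pandas as pd

def followup_years(events: pd.DataFrame) -> pd.Series:
    """Max follow-up per patient: years between first and last observed event."""
    span = events.groupby("patient_id")["event_date"].agg(["min", "max"])
    return (span["max"] - span["min"]).dt.days / 365.25

def combined_followup_years(sources: list[pd.DataFrame]) -> pd.Series:
    """Follow-up for patients present in every source, pooling their events."""
    common_ids = set.intersection(*(set(s["patient_id"]) for s in sources))
    pooled = pd.concat(sources, ignore_index=True)
    return followup_years(pooled[pooled["patient_id"].isin(common_ids)])

# Example usage with three aggregator extracts (ehr_a, ehr_b, ehr_c):
# per_source = [followup_years(s) for s in (ehr_a, ehr_b, ehr_c)]
# combined = combined_followup_years([ehr_a, ehr_b, ehr_c])
```

Comparing these distributions surfaces the pattern shown in Figure 1 below: follow-up is longer among the common patients, but over a smaller, self-selected population.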

Figure 1: Synthetic data illustrating patterns observed in traditional healthcare datasets, demonstrating heterogeneity in patient follow-up and the benefits of source aggregation. Percent of patients (y-axis) by maximum years of follow-up (x-axis), averaged across patients enrolled in three EHR aggregators, compared with the same distribution among only the common patients combined across the three sources.
2. Representation of subpopulations
It is of the utmost importance that diverse data are captured when training AI models, because anything that goes unobserved by the model during training will impact both internal validity (inference accuracy) and external validity (performance outside the training population). A key consideration when evaluating a training dataset is how well each relevant subpopulation is represented in the dataset. Covering a mode (distinct from the statistical definition of “mode”) refers to ensuring that every relevant subgroup within a dataset is sufficiently represented, even when some groups are naturally underrepresented. In large-scale AI training, models often perform well on majority populations but struggle with smaller, yet clinically significant, subgroups.
When evaluating how well-represented your subpopulations are in a training dataset, some key questions to consider are:
Along which axes (and combinations of axes) should subpopulations be considered distinct and meaningful (e.g. age bracket, race, gender, geography, disease stage)?
How well-represented is each subpopulation in the training dataset, when stratified across the axes above (e.g. white women between the ages of 30-39 from the southwest)? (A sketch of such a stratified audit follows this list.)
If a subpopulation is underrepresented, can this group be upsampled from another existing data source?
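To make the stratification question above concrete, here is a minimal audit sketch, assuming the cohort is a pandas DataFrame with hypothetical demographic and clinical columns; the axis names are illustrative only.

```python
# A minimal sketch of a representation audit: count patients in every combination
# of the chosen stratification axes. Column names are illustrative assumptions.
import pandas as pd

def representation_audit(cohort: pd.DataFrame, axes: list[str]) -> pd.DataFrame:
    """Patient count and share for each combination of the chosen axes."""
    counts = (
        cohort.groupby(axes, observed=True)
        .size()
        .rename("n_patients")
        .reset_index()
    )
    counts["share"] = counts["n_patients"] / len(cohort)
    return counts.sort_values("n_patients")

# Example usage: the smallest cells are the subpopulations most at risk of being
# effectively invisible to the model during training.
# audit = representation_audit(cohort, ["race", "gender", "age_bracket", "region"])
# thin_cells = audit[audit["n_patients"] < 100]
```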
For example, in a pretraining run selecting 20 million patients from a total corpus of 100 million, around 500,000 patients might come from smaller, often rural, healthcare communities. These patients are underrepresented, comprising just 0.5% of the total dataset. At a more granular level, in any given county or locality, this could translate to fewer than 1,000 patients. A general rule of thumb is that if fewer than 100 patients from a given parcel enter the training data, the model is effectively underpowered in that region.
One strategy to ensure representation of relevant subpopulations is to apply stratified sampling, by which small parcel representation can be adjusted via oversampling or upweighting. If all 500,000 patients in our example entered the training data, their representation would increase from 0.5% of the population to 2.5% of the sample. The key is to improve the model’s ability to generalize across different population densities without overcorrecting in a way that distorts real-world distributions. It's worth pointing out that a known alternative is to synthetically upsample the minority groups (synthetic generation of new records). While this technique does improve internal validity (model performance on smaller subpopulations in training), it may not improve external validity (generalizability to subpopulations outside the training data that were never sampled).
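As a rough sketch of the oversampling arithmetic described above, the snippet below assumes the corpus is a pandas DataFrame with a hypothetical stratum column (e.g. a rural/non-rural label) and draws a training sample toward target shares; the targets, column names, and two-way split are illustrative assumptions.

```python
# A minimal sketch of stratified sampling toward target shares; column names and
# targets are illustrative assumptions, not a recommended configuration.
import pandas as pd

def stratified_sample(corpus: pd.DataFrame,
                      sample_size: int,
                      target_shares: dict[str, float],
                      seed: int = 0) -> pd.DataFrame:
    """Draw a sample whose strata approximately match the requested target shares."""
    parts = []
    for stratum, share in target_shares.items():
        pool = corpus[corpus["stratum"] == stratum]
        n = min(int(round(share * sample_size)), len(pool))  # cap at available patients
        parts.append(pool.sample(n=n, random_state=seed))
    return pd.concat(parts, ignore_index=True)

# Mirroring the example above: lift rural patients from 0.5% of the corpus to
# 2.5% of a 20M-patient sample, leaving the remainder to the other stratum.
# sample = stratified_sample(corpus, 20_000_000,
#                            {"rural": 0.025, "non_rural": 0.975})
```

The same targets can instead be expressed as per-stratum weights applied during training, which upweights small strata without duplicating records.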
3. Richness of positive data points
Not all data are created equal, and depending on the health system the data were collected in, the clinical area of focus, and the modality of data under consideration, there can be tremendous nuance and variation that will affect model performance. Optimizing small aspects of a training dataset – such as increasing high-quality data from minority populations or supplementing radiology reads with second opinion studies – can drive meaningful increases in model performance, sometimes on the order of single percentage points. When it comes to reasoning models (as opposed to pre-training), the richness of the positive cases (e.g. images with remarkable findings) becomes critical, as the model needs to see as many different variations of what a positive case can look like as possible. Attention to these small details during data curation will compound, leading to significant improvements in model performance.
When evaluating the richness of a training dataset, some key questions to consider are:
How much does human error factor into my training data? Can this be better accounted for? (e.g. is inter-reader variance known to be high? If so, include multiple reads in the training set.)
Are positive cases sufficiently varied for reasoning models?
Is the data richer across some subpopulations over others? (e.g. in oncology, minoritized populations often have sparser EMR records, which can introduce omitted variable biases.)
A real-world example of this phenomenon can be seen when comparing two models that determine the BI-RADS score of a mammography screen. Both models are trained on 10,000 mammography images with radiology reports. Model A is trained on 10,000 unique patients, while Model B is trained on 5,000 unique patients, each with a second opinion radiology report. Given the high rates of inter-rater variability in mammography reading, Model B will be able to observe the differences between high-accuracy and low-accuracy reads and therefore perform with higher accuracy, while this feature of mammography screening will remain unobservable to Model A.
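As a rough illustration of how second opinion reads can be preserved rather than collapsed away (a hypothetical structure, not the setup used for Model B above), the sketch below assumes a pandas DataFrame of reads with hypothetical study_id, reader_id, and birads columns.

```python
# A minimal sketch of keeping multiple reads per study and flagging discordant
# ones as a label-quality signal. Column names are illustrative assumptions.
import pandas as pd

def summarize_reads(reads: pd.DataFrame) -> pd.DataFrame:
    """Collapse reads per study while preserving disagreement information."""
    summary = reads.groupby("study_id")["birads"].agg(
        n_reads="count",
        birads_min="min",
        birads_max="max",
    )
    # Rather than dropping studies where readers disagree, keeping the disagreement
    # visible lets curation distinguish high-confidence labels from ambiguous ones.
    summary["discordant"] = summary["birads_min"] != summary["birads_max"]
    return summary.reset_index()

# Example usage:
# labels = summarize_reads(reads)
# labels["discordant"].mean()  # a rough estimate of inter-reader variability
```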
Making small optimizations throughout the data curation process can help improve model performance. While stratified sampling and upweighting address broad representation issues, continued finer adjustments, such as balancing unremarkable and abnormal radiology reports, can further refine training data.
4. Quantification of dissimilarity across multiple dimensions
Diversifying data across as many dimensions of a dataset as possible is crucial for model generalizability. The more parameters there are over which to optimize diversity, the harder it is to rank observations as more or less diverse. To rank observations on several features of diversity, one needs to treat this as an ordinal problem (not a cardinal one). Actually quantifying the diversity of a training dataset enables AI developers to maximize diversity across a given dimension.
When evaluating multidimensional dissimilarity in a training dataset, some key questions to consider are:
Along what dimensions can the data show meaningful variation? And across what dimensions is diversity useful? (e.g. thoroughness of physician note-taking, providers’ represented panel sizes)
What tradeoffs exist between various indexes (e.g. the dissimilarity index vs the entropy index)?
Are there any axes along which the training data are very homogeneous, and what are the selection bias consequences of this homogeneity?
In practice, an AI developer may have a large dataset of patients in a particular geographical region that contains many different physicians. In order to avoid curating a training dataset that is overly concentrated on particular care delivery patterns, the diversity of patient populations among all physicians should be maximized, in addition to maximizing diversity of other physician features such as thoroughness of note-taking.
A useful metric that can be employed here is the Dissimilarity Index (DI). Traditionally used in demographic research to assess segregation, DI quantifies how unevenly two groups are distributed across subcategories. It ranges from 0 (complete integration) to 1 (complete segregation). Conceptually, DI can be interpreted as the fraction of one group that would need to be redistributed across categories to achieve an even distribution between the two populations.
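For concreteness, DI for two groups with counts a_i and b_i in category i (and group totals A and B) is DI = 0.5 × Σ_i |a_i/A − b_i/B|. Below is a minimal computation sketch; the physician-panel framing in the usage comment is a hypothetical application to the scenario above.

```python
# A minimal sketch of the Dissimilarity Index: DI = 0.5 * sum_i |a_i/A - b_i/B|,
# where a_i and b_i are counts of two groups in category i and A, B are totals.
def dissimilarity_index(group_a_counts: dict[str, int],
                        group_b_counts: dict[str, int]) -> float:
    """DI between two groups distributed across the same categories (0 to 1)."""
    categories = set(group_a_counts) | set(group_b_counts)
    total_a = sum(group_a_counts.values())
    total_b = sum(group_b_counts.values())
    return 0.5 * sum(
        abs(group_a_counts.get(c, 0) / total_a - group_b_counts.get(c, 0) / total_b)
        for c in categories
    )

# Hypothetical usage: how unevenly are patients with vs. without a given condition
# distributed across physician panels? 0 = identical distributions, 1 = fully separated.
# di = dissimilarity_index(
#     {"dr_1": 120, "dr_2": 40, "dr_3": 40},    # patients with the condition
#     {"dr_1": 400, "dr_2": 900, "dr_3": 700},  # patients without
# )
```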
5. Capture of all relevant modalities
Most real-world clinical decision-making leverages multiple sources of information, yet today’s healthcare AI training datasets rarely go beyond a combination of EHR records, medical claims, and/or imaging data. Pushing the envelope to include data that are captured outside of these traditional sources, and accounting for variations within individual data modalities, is necessary for developing next-generation AI models.
When evaluating the inclusion of relevant modalities in a training dataset, some key questions to consider are:
What data elements are used to make similar decisions in a real-world setting?
For a given modality, what are the distinct features that need to be accounted for in the training data? (e.g. how many different CT scanner types exist, and does the training dataset include data from all of them?)
For a given modality, are there distinct ways the data can be represented that must be considered? (e.g. an ECG can be represented as a PDF waveform image, a medical report, or a raw waveform signal file) A minimal sketch of tracking this kind of modality coverage follows this list.
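As one way of tracking this kind of coverage at the patient level, here is a minimal, hypothetical record sketch; the field names and modality/representation labels are assumptions rather than any standard schema.

```python
# A minimal, hypothetical sketch of tracking which modalities and representations
# exist for each patient. Field names and labels are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class PatientRecord:
    patient_id: str
    # Each modality maps to the representations available for this patient,
    # e.g. "ecg" -> {"waveform_signal", "pdf_waveform", "report"}.
    modalities: dict[str, set[str]] = field(default_factory=dict)

    def has(self, modality: str, representation: str | None = None) -> bool:
        reps = self.modalities.get(modality, set())
        return bool(reps) if representation is None else representation in reps

def modality_coverage(records: list[PatientRecord], modality: str) -> float:
    """Fraction of patients with at least one representation of a modality."""
    return sum(r.has(modality) for r in records) / len(records) if records else 0.0

# Example usage:
# p = PatientRecord("pt-001", {"ecg": {"waveform_signal", "report"}, "ehr": {"structured"}})
# modality_coverage([p], "imaging")  # -> 0.0, a breadth gap to flag during curation
```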

Many of today’s traditional healthcare datasets reflect Scenario A or B above, requiring an AI developer to make a tradeoff between breadth and depth of data. As models become more specific to certain tasks, emphasis on both breadth and depth will be required, and multimodality of training data will become a requirement.
Going even further beyond the aforementioned data types – which are collected within the four walls of healthcare – there is an abundance of data generated outside of traditional healthcare settings, such as frequency and activity patterns from devices, that can be leveraged for better clinical decision-making and should be considered for inclusion in AI training data.
Acknowledgements: this piece is a collaboration between Protege’s co-founder and Principal Scientist, Engy Zeidan, PhD, and Emily Lindemer, PhD, with contributions from across the Protege team.
—————
View this article and subscribe to future updates on our substack here: https://withprotege.substack.com/p/a-rubric-for-evaluating-healthcare