Aggregate accuracy hides heterogeneity. A model that achieves strong performance on a held-out test set may do so by performing very well for the majority group in that set while performing much worse for underrepresented groups. When only the aggregate number is reported, that disparity is invisible until the model is deployed in a context where the underrepresented group is larger or where the gap has real consequences.
Health applications have a higher obligation here than most. Health disparities already exist in care access, research representation, and outcomes. An AI system that performs worse for people who are already underserved by the health system can deepen those disparities rather than reduce them. Equity-aware validation is the practice of making subgroup performance an explicit part of the evaluation protocol, not a supplemental analysis conducted after the primary results are established.
What equity-aware evaluation requires
It requires having enough data from diverse populations to evaluate subgroup performance with statistical power. That is a data collection and partnership challenge, not only a modeling challenge. It requires defining which subgroups are relevant before analysis and reporting on all of them, including cases where the model performs poorly. It requires that subgroup results are part of the model card and visible to anyone considering using the model.
It also requires intellectual honesty about what we do not yet know. When a population is underrepresented in our data, we need to say so clearly rather than remaining silent about it. A gap in coverage is a known limitation that should be documented, not an omission that can be addressed later when it becomes inconvenient.
Open notebook
Our validation protocols and equity evaluation practices are still being developed. This page is part of our open notebook and will be updated as we formalize our standards and report results.