AI allows educational measurement design and development to move at unprecedented speeds, which raises the question of seatbelts. During a recent panel, "Upgrading the Quality and Explainability of Educational Measurement in the AI Era," the panelists argued that as assessments shift to continuous, multimodal data streams, we must upgrade our definitions of scientific soundness and our understanding of credibility.
Efficacy, Validity, and Fairness by Design
As the Senior Vice President of Research at ETS, Kadriye Ercikan is working to advance the soundness of approaches to real-time scoring and feedback, personalization and interactivity, and adaptivity. To ensure quality, Ercikan advocates for responsible AI built on three non-negotiable pillars: Efficacy (does it meet its targeted goals?), Validity (does evidence support the intended use?), and Fairness (are goals consistently met across subgroups?).
Ercikan is leading teams designing the “seatbelts” for the AI era—building the rigorous, scientific infrastructures that allow us to move fast without doing harm. Ercikan argues that fairness cannot be a “post-audit” activity. It must be “fairness by design,” baked into the architecture from the first sketch.
Ercikan envisions shifting from assessments as thermometers (which merely describe status) to thermostats (which measure temperature and activate systems to change conditions).
Efficient, Engaging, and Useful
Angela Bahng (Senior Program Officer, Gates Foundation) framed the challenge with a stark statistic: students can spend up to 100 hours annually on testing. This burden often falls heaviest on students of color and those behind grade level, creating an urgent need for systems that are not just technically sound, but more “efficient, engaging, and useful.”
To drive this shift, Bahng is leading work on a user-facing “product quality framework” designed to help district leaders cut through the noise. By defining quality through the lens of utility—asking if a tool is user-friendly, reliable, and meaningful for instruction—her work aims to empower schools to select tools that serve learning rather than merely auditing it.
Bahng highlighted AI "bright spots" demonstrating this potential, such as "behind-the-scenes" assessment using speech recognition, embedded video modules for teacher support, and AI reading coaches that provide real-time feedback.
Bahng projects that the field is on track to see rigorous evidence of these transformations within the next two to three years. Her vision is clear: by prioritizing student-centered design, we can transform assessment from a compliance burden into a tool that actively supports learning trajectories.
Why Measuring Failure Hasn’t Fueled Learning
Michelle Odemwingie, CEO of Achievement Network (ANet), offers a perspective that shifts the responsibility of assessment design: “Validity lives or dies at the moment of interpretation.” She argues that if an insight fails to inform an educator’s next move, it is not valid, regardless of its psychometric properties.
Currently, the field faces what Odemwingie calls "information obesity." With schools utilizing over 2,700 EdTech tools, educators are overwhelmed by fragmented data and competing sources of truth. This incoherence makes it harder, not easier, to identify what matters and act on it.
Odemwingie also warns against "reasonable nonsense"—instances where AI projects confidence while delivering information that sounds credible but is inaccurate. When mapping a student's learning journey, she insists, "good enough" isn't.
Ultimately, Odemwingie reframes the challenge from a technical one to a relational one. "We don't actually have a measurement problem in education. We have a respect problem," she asserts. Until assessment systems respect teachers' judgment, time, and expertise (helping them reason with data rather than merely comply with it), technical sophistication won't yield lasting value.
“Human Variability is Signal, Not Noise”
Gabriela “Gaby” López (co-author of this piece) argues that as we integrate AI into measurement, we must resist the urge to optimize solely for speed and prediction. Instead, she envisions systems designed for “human flourishing”—prioritizing growth, agency, and opportunity rather than defining student worth.
López challenges the field to remember that "people are not averages" and that "human variability is signal, not noise." Designing for a narrow definition of "typical" makes results less accurate and less trustworthy.
Furthermore, López offers a critical redefinition of transparency for the AI era. “Explainability isn’t about exposing the code,” she clarifies. “It’s about helping people understand what results mean, what they don’t mean, and how they should be used.”
Ultimately, López reminds us that trust is not a technical achievement, but a relational one. It is not merely about accuracy, but about "whether people feel respected by a system." She closes with a powerful reminder of the stakes involved: "Behind every data point, there is a person with a history, a community, and a future they're trying to shape."
Why AI Needs to Show Its Work
AI too often demands trust without showing its work. As we integrate multimodal AI, we must build upon bedrock principles of quality. An assessment's value lies not just in its accuracy, but in its utility for learners. The story of AI in education must be one of openness, scientific rigor, and earned trust.
