Educational assessment has long privileged the needs of policymakers. The era of multimodal artificial intelligence (AI) forces an overdue reckoning: For whom do we measure?
If, through neglect or expediency, we inadvertently allow advancing algorithms to automate and amplify the multiple-choice paradigm, we will squander a generational opportunity. By grounding technological leaps in the sciences of learning, measurement, and improvement, we have the potential to leverage AI to rebalance educational assessment design, driving breakthroughs in efficiency, usefulness, and usability.
Linda Darling-Hammond, Elizabeth Mokyr Horner, Paul LeMahieu, and John Hattie recently joined a Study Group webinar with a charge: How might we shift the balance from extracting lagging indicators to designing dynamic “systems of use” that genuinely center the needs of learners, educators, and families?
Measuring for the Classroom, Not the Capitol
Linda Darling-Hammond, from the Learning Policy Institute, urges the field to flip the script and design assessments primarily for students and teachers, rather than for external audiences.
Shifting the audience entails rethinking assessment design. The most promising approaches are not interruptions to instruction; they are “assessments that are themselves learning experiences.”
Historically, the potential of rich, open-ended performance assessments—such as digital portfolios, badges, and capstone projects—was, as Darling-Hammond puts it, stymied by time and logistical constraints. AI may make it possible for educators to scale such high-agency tasks by standardizing evaluation criteria and rubrics while personalizing tasks and feedback cycles.
When it comes to AI, Darling-Hammond advises moving past the panic over cheating, and instead positioning AI as capable of generating a foundation of information that learners are challenged to “critique, evaluate, use, and then transform.”
Educators must remain deeply involved, she argues. The future of assessment is a partnership in which teachers stay part of the scoring process, because we learn by seeing student work, rather than one in which evaluation is handed over entirely to AI.
Minimizing Burden, Maximizing Usefulness
Elizabeth Mokyr Horner, from the Gates Foundation, challenges us to rethink how we capture assessment data, warning against deploying AI without explicit design intent.
Classroom needs must dictate technology’s affordances. Mokyr Horner warns that simply asking AI to automate what we already do merely builds a “faster horse” and risks amplifying past ineffectiveness.
Current AI models are largely built for commercial use, not for generating pedagogically relevant insights. To be effective, especially in early learning, assessments must move beyond traditional testing formats and become seamlessly embedded in developmentally appropriate activities.
More broadly, by utilizing voice, listening, drawing, and physical-digital interfaces, AI can gather data seamlessly. Mokyr Horner points out that “there are also now wands that can be used with real books or augmented reality opportunities.”
Efficacy means catching learning barriers early by embedding measurement into everyday routines without depleting instructional time. While “automatic speech recognition is improving dramatically,” Mokyr Horner explains, work remains. To avoid exacerbating biases, models must be trained on a wide variety of young voices rather than on adult speakers of dominant dialects.
Practical Measurement for Improvement
Paul LeMahieu, from the Carnegie Foundation for the Advancement of Teaching, champions the framework of practical measurement for improvement, shifting the emphasis to practical utility and validity-in-use. A perfectly reliable test that serves only as a lagging indicator is a failed measurement for educators, he argues. AI’s value lies in capturing leading indicators that make complex data relevant and actionable.
Rather than obsessing over average scores, practical measurement views “variability in performance as the problem to solve.” By leveraging AI to analyze such variance, educators can continuously answer the defining questions of improvement science: What works, for whom, and under what conditions?
Above all, leaders must architect intentional “systems of use.” Without the infrastructure and “routines for collaborative sensemaking,” even the most advanced AI tools will merely provoke surface-level compliance rather than genuine improvement.
Cultivating Assessment-Capable Learners
John Hattie, of the University of Melbourne, argues that when students own their own data, they transition from passive subjects to active agents. To achieve this, we must build “assessment capability.” We must “teach the students how to interpret the results” so that they manage improvement, empowering them to actively answer the critical learning question: “Where to next?”
In the multimodal AI era, assessment capability is inextricably linked to AI literacy. Teaching students to interpret their test results draws on the same evaluative thinking required to critique and transform AI-generated outputs. Pointing to 12,000 hours of classroom transcripts that contained zero teacher invitations for students to “think aloud,” Hattie argues AI can capture critical reasoning, but only if classrooms cultivate a high-trust climate where failure is embraced as a learner’s best friend.
Students must learn the art of asking probing questions because, as Hattie warns about prompting AI, “If you ask the wrong question, you get the wrong result.”
Conclusion
Multimodal AI and the science of learning offer a rare window to upgrade educational measurement. Instead of simply automating existing tests, we must rethink the what, how, and for whom of assessment so that it is useful by design. By leveraging ambient data capture and designing for validity-in-use, we can maintain and enhance achievement while transforming measurement into a catalyst for human flourishing.
