Measuring AI's Impact on Student Learning: Open Questions in Education Assessment
While AI integration in higher education is expanding rapidly, critical gaps remain in measuring pedagogical impacts—from domain-specific cognitive outcomes to academic integrity safeguards.
Introduction
Artificial intelligence is reshaping how students learn. Universities are integrating AI-powered tools into curricula faster than researchers can measure their actual impact on learning outcomes. This creates a paradox: institutions adopt these systems with enthusiasm while fundamental questions about pedagogical effectiveness—and potential risks—remain unanswered.
The core tension is methodological. We know AI changes how students engage with material, but we lack rigorous frameworks to measure whether it deepens understanding or merely automates cognition away. This gap sits at the intersection of education research, cognitive science, and AI ethics, affecting millions of students globally.
Which Pedagogical Domains Are Most Vulnerable to Algorithmic Dependency?
AI tools affect different subjects differently. A mathematics student using an AI algebra tutor faces different cognitive risks than a literature student using AI for essay feedback—yet most educational research treats AI integration as domain-agnostic.
Toshboboyev (2026) identifies this explicitly, noting that while algorithmic determinism risks in education systems are documented, "specific methodologies for measuring and mitigating these challenges are not detailed." The paper further emphasizes that pedagogical vulnerability is likely domain-dependent: mathematics, language, critical thinking, and creative disciplines presumably face different risks from algorithmic dependency, yet no comparative studies establish which are most vulnerable or why.
The challenge is designing controlled studies that isolate AI's cognitive effects in each domain. In mathematics, for example, does repeated AI-assisted problem solving strengthen pattern recognition or weaken derivation fluency? In the humanities, does AI feedback on writing degrade the development of authorial voice? Pantskhava, Jishkariani, and Gvilava (2026) acknowledge these concerns in their overview of AI benefits in higher education, but note that "specific methodologies for measuring and mitigating these challenges are not detailed."
A related open question: how do pedagogical outcomes vary across institutional contexts? Docktor (2024) illustrates this gap in physics education, asking pointedly, "How do I convince my colleagues down the hall to care more about their students' experiences in physics?" This reflects a broader fragmentation in educational research: without systematic measurement frameworks, each institution improvises its own assessment approach, preventing cross-institutional synthesis.
What Assessment Protocols Can Detect When AI Enhances Versus Degrades Learning?
The field lacks standardized metrics distinguishing genuine learning gains from false positives. A student using AI essay revision tools might score higher on superficial measures (grammar, readability) while losing depth in argumentation. Measuring this distinction requires multi-dimensional assessment rubrics—a methodological gap that remains unresolved.
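As one illustration of what such a rubric might look like in practice, the sketch below keeps surface scores and depth scores separate, so an AI-inflated gain in grammar or readability cannot mask flat or declining argumentation. The dimension names, scales, and example scores are hypothetical, not an established instrument.

```python
# A minimal sketch of a multi-dimensional rubric that reports surface and depth
# dimensions separately instead of one composite grade. All dimension names,
# weights, and scores below are hypothetical illustrations.
from dataclasses import dataclass

# Hypothetical groupings: surface measures are the ones AI revision tools tend
# to improve automatically; depth measures require the student's own reasoning.
SURFACE_DIMENSIONS = ("grammar", "readability", "formatting")
DEPTH_DIMENSIONS = ("claim_specificity", "evidence_use", "counterargument")

@dataclass
class RubricScores:
    """Per-dimension scores on a 0-4 scale, as a human rater would assign them."""
    scores: dict[str, float]

    def surface_mean(self) -> float:
        return sum(self.scores[d] for d in SURFACE_DIMENSIONS) / len(SURFACE_DIMENSIONS)

    def depth_mean(self) -> float:
        return sum(self.scores[d] for d in DEPTH_DIMENSIONS) / len(DEPTH_DIMENSIONS)

def compare_drafts(before: RubricScores, after: RubricScores) -> dict[str, float]:
    """Report surface and depth changes separately rather than one composite."""
    return {
        "surface_change": after.surface_mean() - before.surface_mean(),
        "depth_change": after.depth_mean() - before.depth_mean(),
    }

# Example: a draft revised with an AI tool gains on surface measures while
# depth stays flat -- exactly the pattern a single composite score would hide.
before = RubricScores({"grammar": 2, "readability": 2, "formatting": 3,
                       "claim_specificity": 3, "evidence_use": 2, "counterargument": 2})
after = RubricScores({"grammar": 4, "readability": 4, "formatting": 4,
                      "claim_specificity": 3, "evidence_use": 2, "counterargument": 2})
print(compare_drafts(before, after))  # surface_change ~ +1.67, depth_change = 0.0
```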
Toshboboyev (2026) emphasizes this directly: pedagogical domains must be assessed for "when AI integration degrades natural intellect's independence versus enhancing learning outcomes." The paper calls for "concrete assessment protocols to detect" this boundary, yet provides no ready solution.
This connects to cognitive science literature on "cognitive offloading"—the phenomenon where external tools reduce mental effort but may also reduce skill development. Applied to AI in education: does an AI tutor reducing working memory load accelerate learning or impede automaticity development? The distinction matters but isn't systematically measured in current AI education research.
Syas (2026) demonstrates the challenge empirically. In a study on digital project-based learning for pre-service educators, the author "tested a model of digital project-based learning aimed at developing foreign-language professional skills," finding that "students demonstrated a surge in linguistic, academic, professional, and supra-professional skills." Yet the author acknowledges a critical limitation: "The study was conducted on a small sample size (12 students per group), which limits the generalizability of findings." Even with promising results, the sample size prevents confidently attributing gains to the AI-enhanced pedagogy versus random variation or observer effects.
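To make the limitation concrete, a back-of-the-envelope power simulation shows that with 12 students per group, a two-sample comparison detects even a large standardized effect less than half the time. This is a sketch under assumed effect sizes, not a reanalysis of the study's data.

```python
# Monte Carlo estimate of the power of a two-sample t-test with n = 12 per
# group, under assumed (hypothetical) standardized effect sizes.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
N_PER_GROUP = 12          # sample size reported in the study
N_SIMULATIONS = 5000
ALPHA = 0.05

def estimated_power(effect_size: float) -> float:
    """Fraction of simulated studies where the group difference reaches p < .05."""
    hits = 0
    for _ in range(N_SIMULATIONS):
        control = rng.normal(0.0, 1.0, N_PER_GROUP)
        treated = rng.normal(effect_size, 1.0, N_PER_GROUP)  # effect in SD units
        _, p = ttest_ind(treated, control)
        if p < ALPHA:
            hits += 1
    return hits / N_SIMULATIONS

for d in (0.2, 0.5, 0.8):  # conventional small / medium / large effects
    print(f"Cohen's d = {d}: power ≈ {estimated_power(d):.2f}")
# Typical output: d=0.2 -> ~0.07, d=0.5 -> ~0.21, d=0.8 -> ~0.46
```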
How Do We Measure Learning Motivation and Behavioral Change?
Beyond cognitive outcomes, AI in education alters student motivation—sometimes positively (personalized pacing, reduced anxiety), sometimes detrimentally (reduced persistence, learned helplessness from over-reliance). Yet motivation measurement remains largely anecdotal in AI education literature.
Konokotin and colleagues (2026) studied subjective positioning and learning motivation among 1,913 primary school students, finding that while "students demonstrate a fairly high level of formation of a subjective position, i.e., activity, initiative, inquisitiveness, and independence in learning," there are "problems with maintaining academic motivation, which tends to gradually decrease by the end of the primary school age." Their data reveal a crucial finding: motivation declines over time independent of AI integration. The question is whether AI tools accelerate or decelerate this decline—a comparison study that hasn't been conducted.
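One way to frame the missing study: a longitudinal design in which the quantity of interest is not motivation itself but the interaction between time and AI exposure. The sketch below illustrates that framing with synthetic data and a standard mixed model; the variable names, the 0/1 condition flag, and the declining slopes are assumptions for illustration, not Konokotin and colleagues' design.

```python
# Sketch of the missing comparison: does motivation decline faster or slower
# under AI-supported instruction? The wave:uses_ai interaction is the estimate
# of interest. Data here are synthetic placeholders.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_students, n_waves = 200, 4

rows = []
for student in range(n_students):
    uses_ai = student % 2                      # hypothetical 0/1 condition flag
    baseline = rng.normal(4.0, 0.5)            # motivation on an assumed 1-5 scale
    for wave in range(n_waves):
        # Both groups decline; the AI group is assumed to decline slightly faster.
        slope = -0.15 - 0.05 * uses_ai
        rows.append({
            "student_id": student,
            "wave": wave,
            "uses_ai": uses_ai,
            "motivation": baseline + slope * wave + rng.normal(0, 0.3),
        })
data = pd.DataFrame(rows)

# Random intercept per student; the wave:uses_ai coefficient estimates whether
# AI exposure accelerates or decelerates the decline present in both groups.
model = smf.mixedlm("motivation ~ wave * uses_ai", data, groups=data["student_id"])
print(model.fit().summary())
```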
Huang (2025) points to another motivation gap: "how to change the passive learning habits formed by students during high school stage" remains an unresolved problem requiring further investigation. This suggests that evaluating AI tools requires measuring not only learning outcomes but also whether the tools catalyze behavioral shifts toward active learning. Yet no empirical protocol exists for assessing whether AI does this reliably across different student populations.
How Should Assessment Frameworks Account for Curriculum Heterogeneity?
Educational systems differ dramatically across institutions. What works in a top-tier university may fail at a community college; curricula vary by discipline, institution size, and student demographics. Yet most AI education research treats these variables as confounds rather than studying them directly.
Lopes and colleagues (2026) conducted a study on history of mathematics pedagogy with secondary teachers in Brazil, finding that "teachers recognized the relevance of the History of Mathematics in teaching," yet many of those same teachers did not use it in their own practice ("boa parte dos professores não a utilizavam em sua prática docente"). They also found that "the textbook used needs improvement in its approach regarding historical contextualization." The lesson: pedagogical approaches do not transfer automatically; implementation depends on teacher training, textbook quality, and institutional support, none of which are controlled in AI education studies.
Drašar and Šárka (2026) document a similar integration challenge at a specialized institution. In their account of teaching starch chemistry and technology at the University of Chemistry and Technology in Prague, "the curriculum evolution or comparative analysis of how starch production technology instruction has been integrated across different department programs (chemistry, food science, biotechnology)" remains undocumented, and "the pedagogical outcomes and competencies students acquire from these courses remain undocumented." The gap applies directly to AI: if specialized domains lack even baseline competency documentation, how can we measure AI's impact?
What Safeguards Prevent Academic Integrity Erosion?
AI's capacity to generate text and solve problems introduces unprecedented academic integrity risks. Yet literature on these risks vastly outpaces literature on mitigation.
Pantskhava, Jishkariani, and Gvilava (2026) highlight the core concern: "Concerns about students using artificial intelligence to complete assignments and violations of academic integrity are mentioned but not thoroughly investigated or addressed with solutions." The gap is not awareness but action—educators recognize the risk but lack assessment protocols to detect misuse or design assignments resilient to AI assistance.
A related gap concerns how to measure optimal AI implementation. The same authors note: "The optimal use of artificial intelligence in higher education is mentioned as important, but frameworks or guidelines for implementation across different institutional contexts remain unclear." This suggests a continuum, from assignments designed to teach with AI to assignments designed despite AI's existence, yet institutions have no map for navigating this space.
Conclusion
The research landscape reveals a consistent pattern: while anecdotes and small-scale studies demonstrate AI's potential in education, the measurement infrastructure to evaluate pedagogical impact systematically remains underdeveloped. Open questions cluster around five dimensions:
- Domain specificity: which pedagogical domains benefit or suffer from AI integration?
- Cognitive measurement: what protocols distinguish learning enhancement from surface gains?
- Motivation and behavior: how does AI affect long-term learning persistence and self-directed learning?
- Implementation context: what factors determine whether AI pedagogies transfer across institutions?
- Integrity and governance: how do we design assessment structures robust to AI-assisted academic dishonesty?
Addressing these questions requires cross-disciplinary collaboration: education researchers designing rigorous measurement, cognitive scientists characterizing learning mechanisms, AI researchers improving model explainability, and institutions systematizing their own evidence collection. Until then, AI in education remains a high-stakes experiment conducted largely without empirical guardrails.
References
Docktor, J. (2024). "Physics education research at primarily undergraduate institutions." Physics Education Research Conference Proceedings. https://doi.org/10.1119/perc.2024.plenary.docktor
Drašar, P., & Šárka, E. (2026). "History of Starch, Especially in the Czech Lands, and Teaching of Its Production Technology at the University of Chemistry and Technology, Prague." Chemické listy. https://doi.org/10.54779/chl20260123
Huang, L. (2025). "Some Thoughts on the Connection Between College Physics and High School Physics Teaching." Physics and Engineering. https://doi.org/10.26599/phys.2025.9320244
Konokotin, A. V., Lobanova, A. V., Radchikova, N. P., & Sanina, S. P. (2026). "Specificity of subjective position and learning motivation among primary school students." Psychological-Educational Studies. https://doi.org/10.17759/psyedu.202600001
Lopes, M. d. S. R., Sousa, A. F. d., Sousa, D. P. d., Nolêto, S. B., Saraiva, G. N., Sousa, R. N. d., Silva, E. d. M. d. S., Rocha, D. C. d. C., Costa, J. d. S., & Silva, E. B. D. (2026). "História da Matemática na Prática Docente: um estudo com professores de Matemática do ensino médio." REMUNOM. https://doi.org/10.66104/djb2ds60
Pantskhava, E., Jishkariani, M., & Gvilava, T. (2026). "The Benefits of Artificial Intelligence in Higher Education." Georgian Scientists. https://doi.org/10.52340/gs.2026.08.01.25
Syas, Y. (2026). "Project-Based Learning: Cultivating Specialized Language Skills in Pre-Service Educators." Bulletin of Kemerovo State University. Series: Humanities and Social Sciences. https://doi.org/10.21603/2542-1840-2026-10-1-124-136
Toshboboyev, M. (2026). "Sunʼiy va Tabiiy Intellekt Dialektikasining Global Xavfsizlik, Mehnat Bozori, Taʼlim va Madaniyat Sohalarida Mohiyati." Actual Problems of Humanities and Social Sciences. https://doi.org/10.47390/spr1342v6si3y2026n07