
Measuring What AI Actually Does to Learning: Six Open Questions

As AI tools flood classrooms from primary school to postgraduate research, the field lacks standardized protocols to detect whether algorithmic assistance builds or displaces genuine understanding. We map six research gaps drawn from 2024–2026 primary literature.

By Science AI Journal Editorial

Artificial intelligence has entered education faster than educational research can track it. Within the span of a few years, generative AI tools have moved from banned novelties to officially sanctioned learning aids at institutions ranging from primary schools to doctoral programs. Policy has followed adoption rather than preceded it, and the empirical literature has largely done the same: researchers are publishing findings about AI-assisted learning as it exists today, but the foundational measurement questions — does it actually work, for whom, over what timescale, and at what hidden cost — remain largely unanswered.

The problem is not that researchers are unaware of the gaps. A recurring feature of the 2025–2026 literature on AI in education is explicit acknowledgment of what was not studied. Assessment protocols for cognitive impact are deferred to future work. Longitudinal designs that would reveal whether early AI adoption builds or erodes independent capability are absent. Institutional frameworks for responsible implementation are described as urgently needed while remaining absent from the papers that call for them.

This post draws on six specific research gaps identified in recently published primary literature to map where the measurement deficit is most acute. The questions are not rhetorical. Each has a concrete answer that would meaningfully change how educators, institutions, and policymakers should behave — and none of them currently has that answer.

Which pedagogical domains are most vulnerable to algorithmic dependency?

The concern that AI integration could produce what some researchers call "algorithmic determinism" — a learned helplessness in which students route cognitive work through AI rather than developing autonomous reasoning — is increasingly present in the literature. Toshboboyev (2026) examines the dialectics of artificial and natural intelligence across education, labor markets, and culture, raising the risk that AI becomes embedded in educational systems in ways that degrade rather than enhance independent intellect (10.47390/spr1342v6si3y2026n07).

What that analysis does not provide is domain specificity. The question of whether algorithmic dependency risks are uniform across subjects, or whether certain domains are structurally more vulnerable, has not been empirically investigated. There are reasons to expect variation. Mathematics, where AI can produce fully correct step-by-step solutions, presents a different dependency risk than studio art, where AI output is recognizably stylistic rather than procedurally correct. Language learning sits somewhere between: AI can produce fluent text, but fluency is precisely what language instruction aims to develop as an internal capacity.

The open question is which domains produce the conditions — AI capability high, detection difficulty high, cognitive shortcut tempting — that most reliably generate dependency rather than augmentation. Without domain-specific vulnerability maps, blanket AI policies will continue to over-restrict some contexts while under-protecting others.
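One way to make the needed vulnerability map concrete is as a scoring rubric over the three conditions named above. The sketch below is purely illustrative: the `DomainProfile` structure, the 0–1 scores, and the equal-weight aggregation are assumptions that a real study would need to replace with validated measures.

```python
# A minimal sketch of a domain vulnerability rubric, assuming the three
# conditions named above can each be scored on a 0-1 scale. All scores
# and the equal-weight aggregation are illustrative placeholders, not
# validated measurements.
from dataclasses import dataclass

@dataclass
class DomainProfile:
    domain: str
    ai_capability: float         # how well AI output substitutes for student work
    detection_difficulty: float  # how hard AI-assisted work is to distinguish
    shortcut_temptation: float   # how strongly the task rewards routing effort to AI

    def dependency_risk(self) -> float:
        # Equal weighting is an assumption; an empirical study would
        # estimate these weights from observed dependency outcomes.
        return (self.ai_capability + self.detection_difficulty
                + self.shortcut_temptation) / 3

profiles = [
    DomainProfile("mathematics", 0.9, 0.8, 0.9),      # hypothetical scores
    DomainProfile("studio art", 0.5, 0.3, 0.4),
    DomainProfile("language learning", 0.8, 0.6, 0.7),
]

for p in sorted(profiles, key=lambda x: x.dependency_risk(), reverse=True):
    print(f"{p.domain:18s} risk={p.dependency_risk():.2f}")
```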

Does AI assistance deepen or displace cognitive development?

Closely related to domain vulnerability is the question of what AI does to cognition over time. The concern, flagged by Toshboboyev (2026) citing Carr's work on internet-mediated cognition, is that AI assistance may weaken the depth of reflective thinking — not by making students less intelligent, but by reducing the effortful processing that consolidates knowledge and builds metacognitive awareness (10.47390/spr1342v6si3y2026n07).

This is a testable hypothesis that has not been tested. What the literature lacks is comparative longitudinal studies tracking cognitive outcomes — comprehension depth, metacognitive awareness, critical reasoning performance — between cohorts that learn with and without AI assistance across different educational levels. The methodological requirements are demanding: matched groups, validated cognitive assessments at multiple time points, and a study duration long enough to observe divergence. None of the AI-in-education papers in the current literature meets all three criteria simultaneously.
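To make the design concrete: the decisive statistical quantity in such a study is the cohort-by-time interaction in a mixed-effects model fit to repeated assessments. The sketch below runs that analysis on simulated data; the six assessment waves, the group slopes, and every number in it are illustrative assumptions, not findings.

```python
# A sketch of the analysis a comparative longitudinal study would require:
# a cohort-by-time interaction in a mixed-effects model, here run on
# simulated data. All parameters are assumptions for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_per_group, waves = 200, 6  # hypothetical: six semesters of assessment
rows = []
for group, gain_per_wave in [("ai_assisted", 0.5), ("control", 1.25)]:  # assumed
    for student in range(n_per_group):
        baseline = rng.normal(50, 8)  # groups matched at baseline
        for t in range(waves):
            score = baseline + gain_per_wave * t + rng.normal(0, 4)
            rows.append({"student": f"{group}_{student}", "group": group,
                         "time": t, "score": score})
df = pd.DataFrame(rows)

# Random intercept per student; the time:group term tests whether the
# two cohorts' trajectories diverge, which is the quantity of interest.
model = smf.mixedlm("score ~ time * group", df, groups=df["student"]).fit()
print(model.summary())
```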

The absence is consequential. If AI assistance accelerates surface learning while degrading deep processing, the effect may be invisible for months or years — student performance on low-level assessments may look fine while the capacity for independent analysis quietly atrophies. By the time the damage is detectable in outcomes like graduate-level research performance or professional judgment, the cohort has already moved through the system.

How do institutions actually verify academic integrity when AI completes assignments?

Academic integrity in the AI era is the most loudly discussed and least operationally resolved issue in educational technology. Pantskhava, Jishkariani, and Gvilava (2026) document the concern that students are using AI to complete assignments and note that academic integrity violations represent a significant unresolved challenge for institutions integrating AI into higher education (10.52340/gs.2026.08.01.25).

What the paper does not provide — and what the broader literature fails to supply — is an empirically validated detection or prevention framework. AI text detectors are commercially available but have documented false-positive rates that disproportionately flag non-native English speakers and produce false confidence when students use AI for ideation rather than verbatim generation. Process-based verification (requiring drafts, sources, and reasoning traces) shifts assessment design substantially without resolving whether the submitted process evidence is itself AI-assisted.

The core measurement problem is that academic integrity violations are, by design, concealed. Survey-based prevalence estimates are unreliable, controlled experiments face ecological validity problems, and institutional detection data is underreported. A field-deployable protocol that institutions of varying resource levels could implement consistently — with known sensitivity, specificity, and false-positive rates — does not exist. Building one is harder than calling for one, which is perhaps why the calls continue to accumulate without producing the protocol.
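The arithmetic behind this requirement is worth making explicit. Under Bayes' rule, a detector's practical reliability depends on the base rate of violations, not just its sensitivity and specificity. The figures below are illustrative assumptions, not measured properties of any real detector:

```python
# Worked example of the base-rate problem for AI-text detectors.
# Sensitivity, specificity, and prevalence values are illustrative
# assumptions, not measurements of any real tool.
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(actually AI-assisted | flagged), by Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

for prevalence in (0.05, 0.20, 0.50):
    ppv = positive_predictive_value(sensitivity=0.90,
                                    specificity=0.95,
                                    prevalence=prevalence)
    print(f"prevalence {prevalence:.0%}: "
          f"{ppv:.0%} of flagged submissions are true positives")

# At 5% prevalence, roughly half of all flags would be false accusations,
# which is why sensitivity and specificity must be reported together with
# the deployment base rate.
```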

What implementation frameworks can actually transfer across institutional contexts?

A related gap in the Pantskhava et al. (2026) analysis is the absence of transferable implementation guidance. The paper identifies optimal AI use in higher education as important and the absence of cross-institutional frameworks as a limitation, but the field broadly has not converged on any validated framework (10.52340/gs.2026.08.01.25).

The implementation challenge is heterogeneous. A research-intensive university with strong digital infrastructure and graduate-heavy enrollment faces entirely different integration questions than a regional undergraduate institution with limited IT support, high proportions of first-generation students, and significant variation in faculty digital literacy. Frameworks developed in one context and transplanted to another without validation have a poor track record across educational technology waves.

What is needed is a comparative implementation study that follows institutions with different profiles through AI integration decisions over multiple academic years, measuring not just whether faculty adopted AI tools but what they actually did with them, how student learning outcomes changed, and which institutional factors predicted successful versus unsuccessful integration. The research would be expensive and slow, which is precisely why industry-adjacent rapid-deployment reports have filled the vacuum with less rigorous evidence.

Can the effects of digital project-based learning be trusted at scale?

Syas (2026) reports a carefully designed intervention using digital project-based learning to develop specialized language skills in pre-service educators, finding positive results for foreign-language competency — but the study was conducted on twelve students per group, and the paper explicitly acknowledges that the small sample size limits the generalizability of its findings (10.21603/2542-1840-2026-10-1-124-136).

This limitation is not unique to that paper. Small-sample, single-institution studies dominate the educational technology literature for structural reasons: they are faster and cheaper to conduct, ethical approval is more straightforward, and researchers have access to their own classrooms. The result is a literature rich in promising signals that cannot be relied upon because replication at scale has not been attempted.
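A standard power calculation makes the cost of twelve-per-group designs concrete. Assuming a two-sided independent-samples t-test at alpha = 0.05 (an assumption for illustration, not a detail from the Syas paper):

```python
# What can a two-group study with 12 students per arm actually detect?
# Illustrative power calculation assuming a two-sided independent-samples
# t-test at alpha = 0.05.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Smallest standardized effect detectable with 80% power at n = 12/group:
d_detectable = analysis.solve_power(nobs1=12, alpha=0.05, power=0.80)
print(f"Detectable effect size: d = {d_detectable:.2f}")  # ~1.2, very large

# Power to detect a 'medium' effect (d = 0.5) with the same design:
power = analysis.solve_power(effect_size=0.5, nobs1=12, alpha=0.05)
print(f"Power for d = 0.5: {power:.0%}")                  # ~21%
```

In other words, a study of that size can only reliably detect effects larger than a full standard deviation, which is why positive small-sample results function as signals for replication rather than evidence of efficacy.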

The replication problem is compounded when AI is involved, because AI tool capabilities change on timescales shorter than traditional research cycles. A positive finding about GPT-4 in writing instruction published in early 2025 may not generalize to a classroom in 2026 using substantially different tools. Establishing what pedagogical mechanisms, rather than specific AI capabilities, drive positive outcomes would require a research design deliberately abstracted from particular tools — and such designs are rare.

Konokotin, Lobanova, Radchikova, and Sanina (2026) encountered a parallel generalizability constraint in motivational research on primary school students: their study of subjective position and learning motivation explicitly notes that findings require verification within other didactic systems beyond the traditional educational framework in which the study was conducted (10.17759/psyedu.202600001). The structural problem of findings that are locally valid but not yet generalized is endemic across educational research, and AI integration intensifies it.

How do passive learning habits formed before AI adoption interact with AI dependency?

A less-discussed thread in the educational literature concerns the passive learning habits students bring with them before any AI integration. Huang (2025) identifies the persistence of passive habits formed during high school as an unresolved problem in college physics instruction — students who learned to receive information rather than construct understanding do not automatically become active learners when they change institutions (10.26599/phys.2025.9320244).

The interaction between pre-existing passivity and AI availability has not been studied, but the logic is suggestive. A student who already defaults to receiving answers rather than generating them encounters in generative AI a tool that rewards precisely that disposition. The question is whether AI in education functions as a compensatory scaffold that helps passive learners access content they would otherwise miss, or as an accelerant that deepens passivity by removing the friction that would otherwise force engagement.

Docktor (2024), surveying physics education research at primarily undergraduate institutions, raises the harder version of this question: how can faculty who are skeptical of pedagogical innovation be moved toward practices the evidence supports? The answer, she acknowledges, is not settled (10.1119/perc.2024.plenary.docktor). If the baseline pedagogical environment remains one in which passive reception is the norm, then measuring AI's incremental effect requires first characterizing the baseline, and most AI-in-education studies do not include a validated baseline assessment of prior learning habits.


The six questions above share a common structure: they are empirically tractable, they have concrete answers that would change practice, and none of them currently has those answers. That is the definition of a productive research gap, and it suggests that the most valuable contribution the next generation of educational AI research could make is less "AI was beneficial in context X" and more "here is a validated protocol for measuring whether AI is beneficial in any context." Without that measurement infrastructure, educators are navigating consequential decisions about technology adoption in the dark.

#artificial-intelligence #education #pedagogy #learning-outcomes #research-gaps
