2 min read · research

Finding 17,000 research gaps across 250 million papers

The data pipeline behind our gap finder — from OpenAlex ingestion to citation-network deltas, and why most "AI gap finders" hallucinate.

By Science AI Journal Editorial

Every PhD student eventually stares at the same wall: what's actually missing in my field? The honest answer requires reading five years of abstracts across 10+ sub-disciplines, tracing citation thickets, and spotting where the network gets thin. That's a six-month task, and nobody does it properly.

Our research gap finder pulls that six months down to a 30-second query across 17,000+ gaps derived from the 250-million-paper OpenAlex corpus.

What "gap" means here

We use the term precisely: a gap is a question or methodology cluster where the citation network suggests unmet demand. Three signals combine:

  1. Topic demand — how often a concept is cited relative to how often it's written about. A high cite/write ratio marks a load-bearing concept starved of fresh work.
  2. Author migration — do senior researchers in adjacent fields keep citing this topic without publishing on it? That's latent attention waiting for an infrastructure paper.
  3. Methodology drift — has the dominant method in a subfield shifted (e.g., random-effects → hierarchical Bayesian) without the older literature being re-analysed under the new method? Each missing re-analysis is a gap.
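The three signals above can be sketched as plain ratio computations over the citation graph. This is a minimal illustration, not our production code: the dict schemas (`cites`, `publishes`, `methods`) and thresholds are hypothetical, and the real pipeline runs over OpenAlex at scale.

```python
def topic_demand(citations_to_topic: int, papers_on_topic: int) -> float:
    """Signal 1: cite/write ratio. Values well above 1 suggest a
    load-bearing concept that few people are actively writing about."""
    return citations_to_topic / max(papers_on_topic, 1)

def author_migration(adjacent_authors: list, topic: str) -> float:
    """Signal 2: fraction of adjacent-field authors who cite the
    topic but have never published on it (latent attention)."""
    citing_only = [a for a in adjacent_authors
                   if topic in a["cites"] and topic not in a["publishes"]]
    return len(citing_only) / max(len(adjacent_authors), 1)

def methodology_drift(subfield_papers: list, old_method: str,
                      new_method: str) -> float:
    """Signal 3: share of papers analysed only under the old method
    after the subfield's dominant method has shifted."""
    stale = [p for p in subfield_papers
             if old_method in p["methods"] and new_method not in p["methods"]]
    return len(stale) / max(len(subfield_papers), 1)
```

A gap candidate is simply a topic where all three ratios clear a threshold — no generative step involved.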

We don't ask an LLM to identify gaps. LLMs hallucinate gaps the same way they hallucinate citations — confidently and wrong. Instead, the gap detection runs as a deterministic citation-graph computation. The LLM's only job is to write a human-readable summary of each gap after the fact, and those summaries are grounded by quoting the three most-cited papers that surround it.

Why most "AI gap finders" are snake oil

If a tool answers "what gaps exist in [field]" by generating prose from nothing but the field name, it is manufacturing confident fiction. The test is simple: ask the same tool the same question twice. If the gaps drift, it's a hallucination engine. If the gaps are identical and cite specific DOIs you can verify, it's doing graph work underneath.

Our output is reproducible because it's derived from the graph — same corpus, same gaps, same order.
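The reproducibility test described above is easy to mechanise: fingerprint the ordered gap list and compare across runs. A minimal sketch, assuming a hypothetical gap schema with `doi` and `score` keys:

```python
import hashlib
import json

def gap_fingerprint(gaps: list) -> str:
    """Hash the ordered (DOI, score) pairs so two runs over the same
    corpus can be compared byte-for-byte. A drifting hash means the
    tool is generating, not computing."""
    canonical = json.dumps([(g["doi"], g["score"]) for g in gaps])
    return hashlib.sha256(canonical.encode()).hexdigest()

run_a = [{"doi": "10.1234/abcd", "score": 0.91}]
run_b = [{"doi": "10.1234/abcd", "score": 0.91}]
assert gap_fingerprint(run_a) == gap_fingerprint(run_b)  # same corpus, same gaps
```

A graph-derived tool passes this check trivially; a prose generator never will.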

A taste of what the graph found

  • Preclinical-to-clinical methodology gap in spinal cord injury regeneration — 400+ rodent studies, 11 completed Phase II trials, zero published translation methodology papers.
  • Replication of mask-study statistical methods under Omicron variants — 80% of the cited literature predates BA.5.
  • Systematic reviews of retracted papers in oncology — the field's retraction rate has tripled since 2018, but the corresponding meta-analyses were never rerun.

Each of these corresponds to a canonical URL on Science AI Journal with the full methodology write-up, top adjacent papers, and a suggested study design. Browse them at /research-gaps.

What this isn't

We won't tell you a gap is worth chasing. Novelty is necessary but not sufficient for good research. We just point at where the network is thin — your judgement is what matters after that.

Browse research gaps · Submit a manuscript addressing a gap

#research-gaps #openalex #methodology

