Detecting prior publication across 6 sources in under 12 seconds
How we fan out across CrossRef, Unpaywall, arXiv, medRxiv, bioRxiv, and a local 900K-paper FTS5 index to catch duplicate submissions before they waste reviewer time.
Duplicate submissions are the single cheapest editorial reject we can make, and also the one most prone to false negatives. A paper already living on arXiv under a different title, or translated from a regional journal, looks novel to a tired editor but is trivially discoverable for a machine that's willing to query six databases in parallel.
We built our prior-publication detector to be the first gate any manuscript hits. Here's what's inside it.
Architecture: fan out, 12-second budget
```
submission
    │
    ├── title + abstract hash
    │
    ├── CrossRef   (DOI + title fuzzy)
    ├── Unpaywall  (OA version search)
    ├── arXiv      (preprint server)
    ├── medRxiv    (biomedical preprint)
    ├── bioRxiv    (biology preprint)
    └── local library (FTS5 over 900K papers)
         │
         ▼
score + confidence bands → editor
```
Every upstream call runs with a 12-second timeout. If a source is slow or down, we log the failure and return what we have. The local FTS5 index backstops all remote failures: 900,000 full-text papers harvested from EBSCO EDS, CrossRef, KKU, and Unpaywall, indexed with SQLite's full-text engine.
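Here's a rough sketch of what that fan-out looks like under a shared budget. The source clients are stubs and the names are illustrative, not our production API; the point is that every source races the same clock and a slow one degrades to an empty result instead of blocking the screen.

```python
import asyncio

TIMEOUT_SECONDS = 12

async def query_source(name: str, title: str, abstract: str) -> list[dict]:
    """Stand-in for a real client (CrossRef, Unpaywall, arXiv, ...)."""
    await asyncio.sleep(0.1)  # pretend network latency
    return []                 # real clients return candidate matches

async def screen(title: str, abstract: str) -> dict[str, list[dict]]:
    sources = ["crossref", "unpaywall", "arxiv", "medrxiv", "biorxiv", "local"]
    tasks = {
        name: asyncio.create_task(query_source(name, title, abstract))
        for name in sources
    }
    # One shared 12-second budget for the whole fan-out.
    done, _pending = await asyncio.wait(tasks.values(), timeout=TIMEOUT_SECONDS)
    results: dict[str, list[dict]] = {}
    for name, task in tasks.items():
        if task in done and task.exception() is None:
            results[name] = task.result()
        else:
            task.cancel()       # give up on slow or broken sources
            results[name] = []  # log-and-continue in production
    return results

if __name__ == "__main__":
    print(asyncio.run(screen("Sample title", "Sample abstract")))
```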
Why a local index matters
Remote APIs have rate limits, go down, and have coverage gaps. A local FTS5 index, queried with a MATCH over title || abstract, returns a top-20 candidate set in under 10ms even across 900K rows. We then apply a word-overlap score with stop-word filtering and flag ≥ 60% overlap as a high-confidence duplicate.
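A minimal sketch of the lookup, assuming an FTS5 table named `papers` with `title` and `abstract` columns (the schema here is illustrative; the real index holds ~900K harvested papers, but the query shape is the same):

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE papers USING fts5(title, abstract)")
conn.execute(
    "INSERT INTO papers VALUES (?, ?)",
    ("COVID-19 outcomes in type-2 diabetics", "We compare outcomes across ..."),
)

def candidates(title: str, abstract: str, k: int = 20) -> list[tuple[str, str]]:
    # Quote each alphanumeric token so punctuation (and bare OR/AND/NOT)
    # can't break FTS5 query syntax; any shared term makes a row a candidate.
    tokens = re.findall(r"\w+", title + " " + abstract)
    if not tokens:
        return []
    query = " OR ".join(f'"{t}"' for t in tokens)
    rows = conn.execute(
        "SELECT title, abstract FROM papers WHERE papers MATCH ? "
        "ORDER BY rank LIMIT ?",
        (query, k),
    )
    return rows.fetchall()

# Punctuation drift still hits the same row:
print(candidates("Covid 19 outcomes in type 2 diabetics", ""))
```

The word-overlap scoring then runs only over this top-20 set, which is why the whole local path stays cheap.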
The index rebuilds nightly via library:harvest cron jobs, so new preprints land roughly 24 hours after they're indexed upstream.
Fuzzy matching that isn't dumb
Naive title comparison hits two real failure modes:
- Punctuation drift. "Towards efficient…" vs "Towards Efficient…" is trivial; "COVID-19 outcomes in type-2 diabetics" vs "Covid 19 outcomes in type 2 diabetics" fools a naive LOWER(title) = LOWER(title) equality check but overlaps 100% token-for-token.
- Translation shift. A Turkish-language paper reappearing in English will fail title matching but still overlap on methodology + results text when we include abstracts.
Both get caught by token-level Jaccard over stemmed word sets with a custom stop-word list that drops "a, an, the, and, of, in, on, with, for" plus domain fillers ("paper, study, analysis, investigation, approach").
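Here's a compact sketch of that scoring. The suffix-stripping stemmer is an assumption, a toy stand-in for whatever real stemmer you'd use in production; the stop-word list mirrors the one above.

```python
import re

STOP_WORDS = {
    "a", "an", "the", "and", "of", "in", "on", "with", "for",   # grammar
    "paper", "study", "analysis", "investigation", "approach",  # domain fillers
}

def stem(token: str) -> str:
    # Toy stemmer: strip a few common suffixes so "outcomes"/"outcome" collide.
    for suffix in ("ies", "ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def token_set(text: str) -> set[str]:
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {stem(t) for t in tokens if t not in STOP_WORDS}

def jaccard(a: str, b: str) -> float:
    sa, sb = token_set(a), token_set(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

# Punctuation drift scores as an exact overlap, well past the 60% flag line:
print(jaccard(
    "COVID-19 outcomes in type-2 diabetics",
    "Covid 19 outcomes in type 2 diabetics",
))  # 1.0
```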
Numbers from the last 30 days
- Submissions screened: 1,412
- Duplicate flags raised: 74 (5.2%)
- False positives after human review: 8 (10.8% of flags)
- False negatives caught by reviewers later: 3
The 10.8% false-positive rate is intentionally high. The editorial workflow routes every high-confidence flag to a 30-second human review before rejection: we'd rather err toward false positives than let duplicate work get published.