
Detecting prior publication across 6 sources in under 12 seconds

How we fan out across CrossRef, Unpaywall, arXiv, medRxiv, bioRxiv, and a local 900K-paper FTS5 index to catch duplicate submissions before they waste reviewer time.

By Science AI Journal Editorial

Duplicate submissions are the single cheapest editorial reject we can make, and also the one most prone to false negatives. A paper already living on arXiv under a different title, or translated from a regional journal, looks novel to a tired editor but is trivially discoverable by a machine that's willing to query six databases in parallel.

We built our prior-publication detector to be the first gate any manuscript hits. Here's what's inside it.

Architecture: fan out, 12-second budget

submission
    │
    ├── title + abstract hash
    │
    ├── CrossRef       (DOI + title fuzzy)
    ├── Unpaywall      (OA version search)
    ├── arXiv          (preprint server)
    ├── medRxiv        (biomedical preprint)
    ├── bioRxiv        (biology preprint)
    └── local library  (FTS5 over 900K papers)
          │
          ▼
     score + confidence bands → editor

Every upstream call runs with a 12-second timeout. If a source is slow or down, we log the failure and return what we have. The local FTS5 index backstops all remote failures: 900,000 full-text papers harvested from EBSCO EDS, CrossRef, KKU, and Unpaywall, indexed with SQLite's full-text engine.
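The fan-out can be sketched as tasks racing against one shared budget. A minimal sketch using asyncio: the six source names match the diagram above, but `query_source` and its delays are stand-ins for the real API clients.

```python
import asyncio

async def query_source(name: str, delay: float) -> list[str]:
    """Stand-in for a real upstream client (CrossRef, arXiv, ...)."""
    await asyncio.sleep(delay)
    return [f"{name}:candidate"]

async def fan_out(timeout: float = 12.0) -> dict[str, list[str]]:
    # One task per source; the delays here are illustrative only.
    sources = {
        "crossref": 0.1,
        "unpaywall": 0.1,
        "arxiv": 0.1,
        "medrxiv": 0.1,
        "biorxiv": 0.1,
        "local_fts5": 0.01,
    }
    tasks = {
        name: asyncio.create_task(query_source(name, delay))
        for name, delay in sources.items()
    }
    results: dict[str, list[str]] = {}
    done, _pending = await asyncio.wait(tasks.values(), timeout=timeout)
    for name, task in tasks.items():
        if task in done and not task.exception():
            results[name] = task.result()
        else:
            # Slow or failed source: cancel, log, return what we have.
            task.cancel()
            results[name] = []
    return results

results = asyncio.run(fan_out())
```

Because `asyncio.wait` returns whatever finished inside the budget, a dead upstream degrades to an empty candidate list rather than blocking the whole screen.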

Why a local index matters

Remote APIs have rate limits, go down, and have coverage gaps. A local FTS5 index — queried with MATCH on title || abstract — returns a top-20 candidate set in under 10 ms even across 900K rows. We then apply a word-overlap score with stop-word filtering and flag ≥ 60% overlap as a high-confidence duplicate.
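As a sketch of the candidate retrieval step, assuming a two-column FTS5 virtual table (the production schema and column layout will differ):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE papers_fts USING fts5(title, abstract)")
con.executemany(
    "INSERT INTO papers_fts VALUES (?, ?)",
    [
        ("Towards efficient transformers", "We study attention sparsity."),
        ("COVID-19 outcomes in type-2 diabetics", "A retrospective cohort."),
    ],
)

def top_candidates(query: str, k: int = 20) -> list[tuple[str, float]]:
    # MATCH searches both columns; bm25() ranks matches (lower = better).
    return con.execute(
        "SELECT title, bm25(papers_fts) FROM papers_fts "
        "WHERE papers_fts MATCH ? ORDER BY bm25(papers_fts) LIMIT ?",
        (query, k),
    ).fetchall()

hits = top_candidates("covid outcomes diabetics")
```

FTS5's default unicode61 tokenizer splits "COVID-19" into "covid" and "19", which is why the hyphen-free query still hits — the same property the fuzzy-matching section below relies on.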

The index rebuilds nightly via library:harvest cron jobs, so new preprints land roughly 24 hours after they're indexed upstream.

Fuzzy matching that isn't dumb

Two real failure modes that naive title comparison hits:

  1. Punctuation drift. "Towards efficient…" vs "Towards Efficient…" is trivial; "COVID-19 outcomes in type-2 diabetics" vs "Covid 19 outcomes in type 2 diabetics" defeats a LOWER(a.title) = LOWER(b.title) comparison but overlaps 100% at the token level.
  2. Translation shift. A Turkish-language paper reappearing in English will fail title matching but still overlap on methodology + results text when we include abstracts.

Both get caught by token-level Jaccard over stemmed word sets with a custom stop-word list that drops "a, an, the, and, of, in, on, with, for" plus domain fillers ("paper, study, analysis, investigation, approach").
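A minimal sketch of that token-level Jaccard, using the stop-word list above; the suffix-stripping stemmer here is a crude stand-in for whatever real stemmer is in use:

```python
import re

STOP_WORDS = {
    "a", "an", "the", "and", "of", "in", "on", "with", "for",
    "paper", "study", "analysis", "investigation", "approach",
}

def stem(word: str) -> str:
    # Crude suffix stripping; a real pipeline would use e.g. a Porter stemmer.
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def tokens(text: str) -> set[str]:
    # Lowercase and split on non-alphanumerics, so "type-2" -> {"type", "2"}.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {stem(w) for w in words if w not in STOP_WORDS}

def jaccard(a: str, b: str) -> float:
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

score = jaccard(
    "COVID-19 outcomes in type-2 diabetics",
    "Covid 19 outcomes in type 2 diabetics",
)
```

The punctuation-drift pair above scores 1.0 under this metric, since both titles reduce to the same token set once hyphens and case are normalized away.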

Numbers from the last 30 days

  • Submissions screened: 1,412
  • Duplicate flags raised: 74 (5.2%)
  • False positives after human review: 8 (10.8% of flags)
  • False negatives caught by reviewers later: 3

The 10.8% false-positive rate is intentionally high. Editorial workflow routes every high-confidence flag to a 30-second human review before reject — we'd rather err toward false positives than let duplicate work ship.

See our full editorial policy · How the review engine works

#prior-publication #openalex #engineering
