Dataset card

Science AI Journal Training Corpus

23,000 real peer reviews from 15+ academic platforms. The calibration basis for 8 specialised AI reviewers. CC BY 4.0.

23,000 peer reviews · 15+ source platforms · 10+ disciplines · CC BY 4.0 licence

Sources

The corpus is aggregated from public open-review pages across 15+ scholarly platforms. Approximate counts — exact numbers shift as scrapers catch up.

| Platform                   | Reviews | Focus                                       |
| -------------------------- | ------- | ------------------------------------------- |
| OpenReview                 | ~7,400  | ML, CS, AI conferences (NeurIPS, ICLR, ACL) |
| eLife                      | ~2,100  | Life sciences, full-length open review      |
| SciPost                    | ~1,900  | Physics, CS, mathematics                    |
| PLOS ONE                   | ~2,300  | Multidisciplinary                           |
| BMJ Open                   | ~1,600  | Medicine, public health                     |
| Nature Communications      | ~1,100  | Multidisciplinary (transparent review)      |
| F1000Research              | ~1,100  | Life sciences + replication                 |
| Copernicus journals        | ~900    | Earth & environmental sciences              |
| EMBO Press                 | ~700    | Molecular biology, genetics                 |
| MDPI transparent journals  | ~1,200  | Multidisciplinary open review               |
| Royal Society Open Science | ~800    | Multidisciplinary                           |
| Other open-review venues   | ~1,900  | 15+ smaller journals with public review     |

How the corpus maps onto the 8 agents

Reviews are sliced by section (methodology, figures, literature, etc.) and indexed with SQLite FTS5. At review time, each agent pulls 8-40 nearest-match examples as retrieval-augmented generation (RAG) context. This is what makes a "methodology" agent behave like a methodology reviewer rather than a generalist.
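A minimal sketch of what that retrieval step could look like, assuming a single FTS5 table keyed by review section; the schema, table name, and query wording are illustrative, only SQLite FTS5 and the 8-40 example window come from the description above.

```python
import sqlite3

# Illustrative schema: one row per sliced review section, full-text indexed with FTS5.
conn = sqlite3.connect("corpus.db")
conn.execute("""
    CREATE VIRTUAL TABLE IF NOT EXISTS review_sections USING fts5(
        section,      -- e.g. 'methodology', 'figures', 'literature'
        discipline,   -- e.g. 'medicine', 'computer science'
        body          -- the review text itself
    )
""")

def retrieve_examples(agent_section: str, query_text: str, k: int = 40) -> list[str]:
    """Pull up to k nearest-match review snippets for one agent (8-40 in practice)."""
    rows = conn.execute(
        "SELECT body FROM review_sections "
        "WHERE review_sections MATCH ? AND section = ? "
        "ORDER BY rank LIMIT ?",
        (query_text, agent_section, k),
    ).fetchall()
    return [r[0] for r in rows]

# e.g. the methodology agent gathering context for a clinical-trial manuscript
examples = retrieve_examples("methodology", "randomised controlled trial power analysis", k=8)
```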

Methodology
Audits study design, statistical power, and analytical choices against field-specific rigour standards (CONSORT, STROBE, PRISMA).
Formulas & Equations
Verifies mathematical derivations, checks dimensional analysis, and flags algebraic errors.
Originality
Surfaces overlap with prior work across CrossRef, arXiv, medRxiv, bioRxiv, Unpaywall, and an institutional library of 900,000+ papers.
Literature Coverage
Evaluates citation completeness against OpenAlex's 250M+ scholarly works.
Reproducibility
Inspects code availability, dataset accessibility, and sufficiency of methods detail for independent replication.
Clarity & Language
Assesses readability, structural flow, and adherence to scholarly writing norms.
Figures & Tables
Checks figure quality, caption completeness, and appropriateness of visual encodings.
Prior Publication
Fans out in parallel to six external sources to detect duplicate submissions and predatory overlap.
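The parallel fan-out described for the Prior Publication agent could be sketched like this; the placeholder URLs, response shape, and field names are assumptions, not the pipeline's real integration code.

```python
import asyncio
import aiohttp

# Placeholder search endpoints; the six real sources (CrossRef, arXiv, medRxiv,
# bioRxiv, Unpaywall, institutional library) each have their own APIs and query formats.
SOURCES = {
    "crossref": "https://example.org/crossref-search",
    "arxiv": "https://example.org/arxiv-search",
    "medrxiv": "https://example.org/medrxiv-search",
    "biorxiv": "https://example.org/biorxiv-search",
    "unpaywall": "https://example.org/unpaywall-search",
    "library": "https://example.org/library-search",
}

async def query_source(session: aiohttp.ClientSession, name: str, url: str, title: str):
    """Query one source for records that overlap with the manuscript title."""
    async with session.get(url, params={"q": title}, timeout=aiohttp.ClientTimeout(total=15)) as resp:
        data = await resp.json()          # response shape is assumed here
        return name, data.get("results", [])

async def fan_out(title: str) -> dict:
    """Hit all six sources concurrently; a failed source does not block the others."""
    async with aiohttp.ClientSession() as session:
        tasks = [query_source(session, name, url, title) for name, url in SOURCES.items()]
        results = await asyncio.gather(*tasks, return_exceptions=True)
    hits = {}
    for item in results:
        if isinstance(item, Exception):
            continue
        name, records = item
        hits[name] = records
    return hits

# overlaps = asyncio.run(fan_out("Title of the submitted manuscript"))
```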

Discipline coverage

Engineering · Life Sciences · Physical Sciences · Computer Science · Medicine · Mathematics · Environmental Science · Social Science · Humanities · Economics

Calibration method

On a held-out set of 1,000 papers where the human editorial decision is known, the 8-agent pipeline matches the human decision 83% of the time. A single-prompt monolithic baseline matches 57%. The corpus is the reason for the difference — agents reason against domain-matched examples, not against a generic prior.

We re-run the held-out benchmark monthly and adjust the RAG retrieval mix per agent when drift exceeds 5%. Benchmark runs are logged; reach out if you want the raw numbers for a research paper.
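A minimal sketch of how that agreement figure could be computed on the held-out set; the file name and record fields are hypothetical, only the two percentages and the matching criterion come from the text above.

```python
import json

def agreement_rate(records: list[dict], decision_key: str) -> float:
    """Share of held-out papers where a system's decision matches the human editorial decision."""
    matches = sum(1 for r in records if r[decision_key] == r["human_decision"])
    return matches / len(records)

# heldout.jsonl is a hypothetical benchmark log: one record per paper, holding the
# human decision plus the 8-agent pipeline's and the monolithic baseline's decisions.
with open("heldout.jsonl") as fh:
    records = [json.loads(line) for line in fh]

print(f"8-agent pipeline:    {agreement_rate(records, 'pipeline_decision'):.0%}")   # ~83% per the card
print(f"monolithic baseline: {agreement_rate(records, 'baseline_decision'):.0%}")   # ~57% per the card
```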

Frequently asked questions

Is the dataset downloadable?
A summary aggregate is available at /api/dataset/summary as JSON. The full reviews are derived from public open-review pages; we publish parsed JSONL tranches on request for academic collaborators, under the terms of each source platform's licence.
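A quick way to pull that aggregate, assuming the endpoint sits on the site's own domain; the base URL and the field names in the print statement are placeholders, only the /api/dataset/summary path comes from the answer above.

```python
import requests

# Base URL is a placeholder for wherever this dataset card is hosted.
resp = requests.get("https://example.org/api/dataset/summary", timeout=10)
resp.raise_for_status()
summary = resp.json()

# Field names are illustrative; the actual JSON schema is not documented here.
print(summary.get("total_reviews"), summary.get("platforms"), summary.get("disciplines"))
```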
How is the data used?
Exclusively for retrieval-augmented generation (RAG) calibration of the 8-agent review pipeline. We prepend 8-40 real peer-review examples, matched by discipline and review type, to every agent prompt so the agent's rubric matches human reviewer expectations.
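A rough sketch of that prompt assembly, assuming a simple text-concatenation approach; the prompt wording and the helper name are hypothetical.

```python
def build_agent_prompt(agent_name: str, manuscript_section: str, examples: list[str]) -> str:
    """Prepend up to 40 discipline-matched review excerpts to the agent's instruction."""
    context = "\n\n---\n\n".join(examples[:40])
    return (
        f"You are the {agent_name} reviewer. Calibrate your tone, depth and rubric "
        f"to the real peer-review excerpts below.\n\n"
        f"{context}\n\n"
        f"Now review this section of the submitted manuscript:\n\n{manuscript_section}"
    )

# 'retrieve_examples' is the hypothetical FTS5 lookup sketched earlier on this card.
# prompt = build_agent_prompt("methodology", section_text, retrieve_examples("methodology", query, 8))
```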
Do you re-host reviewer comments that were posted under pseudonyms?
Only where the source platform's licence and policy explicitly allow it. OpenReview review text is CC BY; eLife and PLOS transparent reviews are CC BY; other sources are handled case-by-case. Where we cannot republish, we extract rubric patterns (not verbatim text) into the calibration corpus.
Does the dataset include the papers themselves?
No — the dataset is the reviews, not the manuscripts. For manuscript context we query the relevant open-access paper at runtime from the platform's API.
How do you keep calibration fresh?
Scrapers re-run on rolling schedules; a monthly calibration job spot-checks agent outputs against a held-out set of 2,000 reviews and re-balances the RAG mix per agent if drift exceeds 5%.
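A small sketch of the drift check that could gate re-balancing; the per-agent numbers are made up, only the 5% threshold is from the answer above.

```python
DRIFT_THRESHOLD = 0.05  # 5%, as stated above

def needs_rebalance(current_agreement: float, baseline_agreement: float) -> bool:
    """Flag an agent's RAG mix when monthly agreement drifts past the threshold."""
    return abs(current_agreement - baseline_agreement) > DRIFT_THRESHOLD

# Hypothetical per-agent agreement scores from one monthly spot-check.
monthly  = {"methodology": 0.81, "figures": 0.75, "originality": 0.84}
baseline = {"methodology": 0.83, "figures": 0.82, "originality": 0.83}

to_rebalance = [agent for agent in monthly if needs_rebalance(monthly[agent], baseline[agent])]
print(to_rebalance)  # ['figures'] in this made-up example
```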
Can my institution contribute a tranche?
Yes — if your journal or conference has transparent review records you want included, email [email protected]. We credit source and preserve licence terms.
