# Science AI Journal Training Corpus
23,000 real peer reviews from 25+ academic platforms: the calibration basis for 8 specialised AI reviewers. CC BY 4.0.
## Sources
The corpus is aggregated from public open-review pages across 25+ scholarly platforms. Counts below are approximate; exact numbers shift as the scrapers catch up.
| Platform | Reviews | Focus |
|---|---|---|
| OpenReview | ~7,400 | ML, CS, AI conferences (NeurIPS, ICLR, ACL) |
| eLife | ~2,100 | Life sciences — full-length open review |
| SciPost | ~1,900 | Physics, CS, mathematics |
| PLOS ONE | ~2,300 | Multidisciplinary |
| BMJ Open | ~1,600 | Medicine, public health |
| Nature Communications | ~1,100 | Multidisciplinary (transparent review) |
| F1000Research | ~1,100 | Life sciences + replication |
| Copernicus journals | ~900 | Earth & environmental sciences |
| EMBO Press | ~700 | Molecular biology, genetics |
| MDPI transparent journals | ~1,200 | Multidisciplinary open review |
| Royal Society Open Science | ~800 | Multidisciplinary |
| Other open-review venues | ~1,900 | 15+ smaller journals with public review |
## How the corpus maps onto the 8 agents
Reviews are sliced by section (methodology, figures, literature, etc.) and indexed with SQLite FTS5. At review time, each agent pulls 8-40 nearest-match examples as retrieval-augmented generation (RAG) context. This is what makes the "methodology" agent behave like a methodology reviewer rather than a generalist.
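Below is a minimal sketch of that retrieve-and-prepend step, assuming an FTS5 schema like `CREATE VIRTUAL TABLE reviews USING fts5(section, body)`. The table layout, column names, and example rows are illustrative assumptions; the production schema is not published here.

```python
import sqlite3

# Assumed layout: one FTS5 row per review slice, tagged with its section.
con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE reviews USING fts5(section, body)")
con.executemany(
    "INSERT INTO reviews (section, body) VALUES (?, ?)",
    [
        ("methodology", "The sample size is not justified by a power analysis."),
        ("figures", "Figure 2 lacks error bars and the axis units are missing."),
    ],
)

def fetch_examples(section: str, query: str, k: int = 8) -> list[str]:
    """Return up to k best-matching review excerpts for one agent's section."""
    rows = con.execute(
        """
        SELECT body FROM reviews
        WHERE reviews MATCH ?     -- FTS5 full-text match on the query terms
          AND section = ?         -- restrict to this agent's slice
        ORDER BY bm25(reviews)    -- lower bm25 score = better match
        LIMIT ?
        """,
        (query, section, k),
    ).fetchall()
    return [body for (body,) in rows]

# Prepend the retrieved examples to the agent prompt as RAG context.
context = "\n\n".join(fetch_examples("methodology", "sample size power"))
prompt = (
    "Example reviewer comments on methodology:\n\n"
    f"{context}\n\n"
    "Review the methodology section of the manuscript below in the same register."
)
```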
## Calibration method
On a held-out set of 1,000 papers where the human editorial decision is known, the 8-agent pipeline matches the human decision 83% of the time; a single-prompt monolithic baseline matches 57%. The corpus accounts for the difference: agents reason against domain-matched examples rather than a generic prior.
We re-run the held-out benchmark monthly and adjust the RAG retrieval mix per agent when drift exceeds 5%. Benchmark runs are logged; reach out if you want the raw numbers for a research paper.
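In outline, the drift check works like the sketch below. The function shapes are assumptions for illustration; only the 83% baseline agreement and the 5% threshold come from this card.

```python
# Hedged sketch of the monthly drift check; data shapes are assumed.
BASELINE_AGREEMENT = 0.83   # pipeline-vs-editor agreement at last calibration
DRIFT_THRESHOLD = 0.05      # re-balance the RAG mix past this drift

def agreement_rate(pipeline: list[str], human: list[str]) -> float:
    """Fraction of held-out papers where the pipeline matches the editor."""
    return sum(p == h for p, h in zip(pipeline, human)) / len(human)

def needs_rebalance(current_agreement: float) -> bool:
    """True when this month's benchmark drifts more than 5% from baseline."""
    return abs(current_agreement - BASELINE_AGREEMENT) > DRIFT_THRESHOLD
```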
## Frequently asked questions
- Is the dataset downloadable?
- A summary aggregate is available as JSON at /api/dataset/summary (see the fetch sketch after this list). The full reviews are derived from public open-review pages; we publish parsed JSONL tranches on request to academic collaborators, under the terms of each source platform's licence.
- How is the data used?
- Exclusively for retrieval-augmented generation (RAG) calibration of the 8-agent review pipeline. We prepend 8-40 real peer-review examples, matched by discipline and review type, to every agent prompt so that each agent's rubric matches human reviewer expectations.
- Do you re-host reviewer comments that were posted under pseudonyms?
- Only where the source platform's licence and policy explicitly allow it. OpenReview review text is CC BY; eLife and PLOS transparent reviews are CC BY; other sources are handled case-by-case. Where we cannot republish, we extract rubric patterns (not verbatim text) into the calibration corpus.
- Does the dataset include the papers themselves?
- No — the dataset is the reviews, not the manuscripts. For manuscript context we query the relevant open-access paper at runtime from the platform's API.
- How do you keep calibration fresh?
- Scrapers re-run on rolling schedules; a monthly calibration job spot-checks agent outputs against a held-out set of 2,000 reviews and re-balances the RAG mix per agent if drift exceeds 5%.
- Can my institution contribute a tranche?
- Yes — if your journal or conference has transparent review records you want included, email [email protected]. We credit source and preserve licence terms.
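As referenced in the first FAQ answer, the summary aggregate can be fetched with nothing but the standard library. The base URL below is a placeholder, since the dataset's host is not stated on this card, and the response shape is an assumption.

```python
import json
import urllib.request

BASE_URL = "https://example.org"  # placeholder: substitute the dataset's actual host

# Fetch the public JSON summary aggregate mentioned in the FAQ.
with urllib.request.urlopen(f"{BASE_URL}/api/dataset/summary") as resp:
    summary = json.load(resp)

# Assumed shape: per-platform counts, licence metadata, last-updated timestamps.
print(sorted(summary))
```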