An in-depth analysis of frontier CLI agents — Claude Code (Opus 4.6), Codex (GPT-5.4), Kimi Code (K2.5) — conducting end-to-end research across diverse fields and compute resources.
- 117 agent-generated papers — 39 per agent (Claude Code, Codex, Kimi Code), 3 trials × 13 seeds, spanning both GPU and CPU domains
- 351 code-aware peer reviews (3 CLI-agent reviewers per paper) + 117 Stanford Agentic Reviewer scores
- Human analysis of every paper, its artifacts, and agentic reviews
➡️ Read the full write-up for scores, per-domain breakdowns, case studies, and the human-inspection findings.
Given a seed field (e.g., "computer vision", "compiler optimization"), each CLI agent follows a standardized pipeline:
- Ideation — Generate a research idea and experiment plan; self-review for up to 3 iterations.
- Experiments — Write and execute code, collect results; self-review for up to 3 iterations.
- Paper Writing — Produce a paper; self-review for up to 3 iterations.
- Review — Evaluate via Stanford Agentic Reviewer and triple peer review (all three agents review each paper alongside its code).
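The staged loop above can be sketched as follows. This is a minimal illustration with hypothetical `run_stage` / `self_review_passes` helpers — the real harness lives under `researcharena/stages/` and does not expose this API.

```python
# Illustrative sketch of the per-seed pipeline (hypothetical helper names;
# see researcharena/stages/ for the actual implementation).
MAX_SELF_REVIEWS = 3
STAGES = ["ideation", "experiments", "paper"]

def run_stage(stage, seed, artifact=None):
    # Placeholder: invoke the CLI agent for this stage.
    return {"stage": stage, "seed": seed, "input": artifact}

def self_review_passes(artifact):
    # Placeholder: the agent scores its own output against a threshold.
    return True

def run_pipeline(seed):
    artifact = None
    for stage in STAGES:
        for _attempt in range(MAX_SELF_REVIEWS):
            artifact = run_stage(stage, seed, artifact)
            if self_review_passes(artifact):
                break
    # The final paper is then handed to the Review stage (SAR + triple peer review).
    return artifact
```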
Seeds span multiple CS conferences and two compute platforms. Hardware: 1× RTX A6000 (48GB), 4 CPUs, 60GB RAM (main experiments); H100 (80GB) re-run for GPU seeds.
| Platform | Seeds | Target conferences |
|---|---|---|
| GPU (8) | AI for Biology, Computer Vision, Datasets & Benchmarks, Generative Models, Interpretability, NLP, Privacy in ML, Supervised Representation Learning | ICLR, NeurIPS, ICML, CVPR, ACL, EMNLP |
| CPU (5) | Causal Learning, Compiler Optimization, Data Integration & Cleaning, Operating System Design, Probabilistic Methods | OSDI, SOSP, SIGMOD, VLDB, PLDI, POPL |
ResearchArena/
├── papers/ # 117 agent-generated papers
│ └── {claude,codex,kimi}/{seed}_trial{N}/
│ ├── paper.pdf, paper.tex, references.bib
│ ├── idea.json, plan.json, proposal.md
│ ├── reviews.json # 3 peer reviews
│ ├── stanford_review.json # SAR review
│ └── exp/ # experiment code (.py/.sh)
├── researcharena/ # the benchmark harness
│ ├── stages/ # ideation / experiment / paper / review
│ ├── templates/ # domain guidelines (ml/systems/databases/pl/theory/security)
│ └── utils/ # agent_runner, tracker, checkpoint, …
└── Dockerfile[.cpu] # agent containers
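Given the layout above, the 117 paper directories can be enumerated programmatically. A sketch, assuming only the directory naming shown in the tree:

```python
# Sketch: walk papers/{claude,codex,kimi}/{seed}_trial{N}/ per the layout above.
from pathlib import Path

def paper_dirs(root: Path) -> list[Path]:
    """Return every agent-generated paper directory under root/papers/."""
    return sorted(
        p
        for agent in ("claude", "codex", "kimi")
        for p in (root / "papers" / agent).glob("*_trial*")
        if p.is_dir()
    )
```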
pip install -e .
# Containers (Docker or Podman)
# GPU image: PyTorch 2.6 + CUDA 12.4 + transformers/datasets/accelerate/…
docker build -t researcharena/agent:latest .
# CPU image: Python 3.11 + scipy/sklearn/networkx/sympy/z3-solver/…
docker build -f Dockerfile.cpu -t researcharena/agent-cpu:latest .
# For rootless podman, add --userns=host:
# podman build --userns=host -t researcharena/agent:latest .
# podman build --userns=host -f Dockerfile.cpu -t researcharena/agent-cpu:latest .
# Install the CLI agents (claude, codex, kimi) on the host — they are NOT
# baked into the image. agent_runner.py mounts each binary + its auth
# (~/.claude, ~/.codex, ~/.kimi) into the container at runtime, so log in
# once on the host with `claude login` / `codex login` / `kimi login` and
# you're done.
# Optional: API keys / tokens (forwarded into the container if set)
export ANTHROPIC_API_KEY=sk-ant-... # if not using `claude login`
export OPENAI_API_KEY=sk-... # if not using `codex login`
export MOONSHOT_API_KEY=sk-... # if not using `kimi login`
export HF_TOKEN=hf_... # needed for gated HuggingFace models
export WANDB_API_KEY=...           # optional, for experiment logging

researcharena run --seed "computer vision" --agent claude --platform gpu

That's it — the pipeline handles ideation, experiments, paper writing, and review end-to-end. Swap --agent for codex or kimi, and --platform for cpu to pick a different configuration.
Everything is configurable — swap agents, change self-review intensity, give the agent more ideas to try, or raise the acceptance bar. The main knobs in configs/*.yaml:
| Knob | What it does |
|---|---|
| agent.type / agent.model | Which CLI agent runs the research (claude / codex / kimi / minimax) and which model it uses. |
| agent.max_turns, ideation_timeout, paper_timeout | Per-stage turn and wall-clock budgets for the researcher. |
| self_review.max_retries_per_gate | How many times each gate (idea / experiment / paper) can send itself back for revision. |
| self_review.thresholds.{idea,experiment,paper} | The score each self-review must clear to pass (default: idea 8, experiment 6, paper 8). |
| experiment.max_experiment_retries_per_idea | How many times the agent can retry failed experiments before abandoning an idea. |
| pipeline.max_ideas_per_seed | How many fresh research ideas to try on one seed before giving up. |
| review.agents | Which CLI agents act as peer reviewers, with optional per-reviewer model/timeout. |
| review.accept_threshold | Cutoff for accept vs. revise vs. reject after peer review. |
See configs/8xa6000.yaml for a full annotated example.
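As a rough orientation, a config using these knobs might look like the fragment below. This is a hypothetical excerpt mirroring the knob names in the table — the actual structure and values are defined in configs/8xa6000.yaml, which is authoritative.

```yaml
# Hypothetical excerpt; see configs/8xa6000.yaml for the real annotated file.
agent:
  type: claude          # claude / codex / kimi / minimax
self_review:
  max_retries_per_gate: 3
  thresholds: {idea: 8, experiment: 6, paper: 8}   # documented defaults
review:
  agents: [claude, codex, kimi]
```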
We employ a peer-review protocol (PR) where all three CLI agents review every paper alongside its code, logs, and results.json — enabling systematic checks for fabricated or unsupported results. We additionally use the Stanford Agentic Reviewer (SAR) for external, PDF-only validation.
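With three reviews per paper stored in reviews.json, per-paper scores can be aggregated in a few lines. Note that the reviews.json schema assumed here — a JSON list of objects with a numeric "score" field — is an illustration, not the documented format:

```python
# Sketch: average the three peer-review scores for one paper directory.
# NOTE: the assumed schema (list of {"reviewer": ..., "score": ...}) is
# illustrative; check an actual reviews.json for the real field names.
import json
from pathlib import Path
from statistics import mean

def mean_review_score(paper_dir: Path) -> float:
    reviews = json.loads((paper_dir / "reviews.json").read_text())
    return mean(r["score"] for r in reviews)
```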
If you find our project useful, please cite:
@misc{researcharena2026,
title = {How Far Are We From True Auto Research?},
author = {Zhang, Zhengxin and Wang, Ning and Galhotra, Sainyam and Cardie, Claire},
year = {2026},
note = {Cornell University. \url{https://youarespecialtome.github.io/ResearchArena/}}
}
Released under the MIT License — see LICENSE.