An in-depth analysis of frontier CLI agents — Claude Code (Opus 4.6), Codex (GPT-5.4), Kimi Code (K2.5) — conducting end-to-end research across diverse fields and compute resources.
- 117 agent-generated papers — 39 per agent (Claude Code, Codex, Kimi Code), 3 trials × 13 seeds, spanning both GPU and CPU domains
- 351 code-aware peer reviews (3 CLI-agent reviewers per paper) + 117 Stanford Agentic Reviewer scores
- Human analysis of every paper, its artifacts, and agentic reviews
➡️ Read the full write-up for scores, per-domain breakdowns, case studies, and the human-inspection findings.
Given a seed field (e.g., "computer vision", "compiler optimization"), each CLI agent follows a standardized pipeline:
- Ideation — Generate a research idea and experiment plan; self-review for up to 3 iterations.
- Experiments — Write and execute code, collect results; self-review for up to 3 iterations.
- Paper Writing — Produce a paper; self-review for up to 3 iterations.
- Review — Evaluate via Stanford Agentic Reviewer and triple peer review (all three agents review each paper alongside its code).
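The staged loop above can be sketched as follows. This is a minimal illustration with hypothetical `run_stage` / `self_review_passes` helpers — the real harness lives under `researcharena/stages/` and does not expose this API.

```python
# Illustrative sketch of the per-seed pipeline (hypothetical helper names;
# see researcharena/stages/ for the actual implementation).
MAX_SELF_REVIEWS = 3
STAGES = ["ideation", "experiments", "paper"]

def run_stage(stage, seed, artifact=None):
    # Placeholder: invoke the CLI agent for this stage.
    return {"stage": stage, "seed": seed, "input": artifact}

def self_review_passes(artifact):
    # Placeholder: the agent scores its own output against a threshold.
    return True

def run_pipeline(seed):
    artifact = None
    for stage in STAGES:
        for _attempt in range(MAX_SELF_REVIEWS):
            artifact = run_stage(stage, seed, artifact)
            if self_review_passes(artifact):
                break
    # The final paper is then handed to the Review stage (SAR + triple peer review).
    return artifact
```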
Seeds span multiple CS conferences and two compute platforms. Hardware: 1× RTX A6000 (48GB), 4 CPUs, 60GB RAM (main experiments); H100 (80GB) re-run for GPU seeds.
| Platform | Seeds | Target conferences |
|---|---|---|
| GPU (8) | AI for Biology, Computer Vision, Datasets & Benchmarks, Generative Models, Interpretability, NLP, Privacy in ML, Supervised Representation Learning | ICLR, NeurIPS, ICML, CVPR, ACL, EMNLP |
| CPU (5) | Causal Learning, Compiler Optimization, Data Integration & Cleaning, Operating System Design, Probabilistic Methods | OSDI, SOSP, SIGMOD, VLDB, PLDI, POPL |
ResearchArena/
├── papers/ # 117 agent-generated papers
│ └── {claude,codex,kimi}/{seed}_trial{N}/
│ ├── paper.pdf, paper.tex, references.bib
│ ├── idea.json, plan.json, proposal.md
│ ├── reviews.json # 3 peer reviews
│ ├── stanford_review.json # SAR review
│ └── exp/ # experiment code (.py/.sh)
├── researcharena/ # the benchmark harness
│ ├── stages/ # ideation / experiment / paper / review
│ ├── templates/ # domain guidelines (ml/systems/databases/pl/theory/security)
│ └── utils/ # agent_runner, tracker, checkpoint, …
└── Dockerfile[.cpu] # agent containers
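Given the layout above, the 117 paper directories can be enumerated programmatically. A sketch, assuming only the directory naming shown in the tree:

```python
# Sketch: walk papers/{claude,codex,kimi}/{seed}_trial{N}/ per the layout above.
from pathlib import Path

def paper_dirs(root: Path) -> list[Path]:
    """Return every agent-generated paper directory under root/papers/."""
    return sorted(
        p
        for agent in ("claude", "codex", "kimi")
        for p in (root / "papers" / agent).glob("*_trial*")
        if p.is_dir()
    )
```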
pip install -e .
# Containers (Docker or Podman)
# GPU image: PyTorch 2.6 + CUDA 12.4 + transformers/datasets/accelerate/…
docker build -t researcharena/agent:latest .
# CPU image: Python 3.11 + scipy/sklearn/networkx/sympy/z3-solver/…
docker build -f Dockerfile.cpu -t researcharena/agent-cpu:latest .
# For rootless podman, add --userns=host:
# podman build --userns=host -t researcharena/agent:latest .
# podman build --userns=host -f Dockerfile.cpu -t researcharena/agent-cpu:latest .
# Install the CLI agents (claude, codex, kimi) on the host — they are NOT
# baked into the image. agent_runner.py mounts each binary + its auth
# (~/.claude, ~/.codex, ~/.kimi) into the container at runtime, so log in
# once on the host with `claude login` / `codex login` / `kimi login` and
# you're done.
# Optional: API keys / tokens (forwarded into the container if set)
export ANTHROPIC_API_KEY=sk-ant-... # if not using `claude login`
export OPENAI_API_KEY=sk-... # if not using `codex login`
export MOONSHOT_API_KEY=sk-... # if not using `kimi login`
export HF_TOKEN=hf_... # needed for gated HuggingFace models
export WANDB_API_KEY=...           # optional, for experiment logging

researcharena run --seed "computer vision" --agent claude --platform gpu

That's it — the pipeline handles ideation, experiments, paper writing, and review end-to-end. Swap --agent for codex or kimi, and --platform for cpu to pick a different configuration.
Everything is configurable — swap agents, change self-review intensity, give the agent more ideas to try, or raise the acceptance bar. The main knobs in configs/*.yaml:
| Knob | What it does |
|---|---|
| agent.type / agent.model | Which CLI agent runs the research (claude / codex / kimi / minimax) and which model it uses. |
| agent.max_turns, ideation_timeout, paper_timeout | Per-stage turn and wall-clock budgets for the researcher. |
| self_review.max_retries_per_gate | How many times each gate (idea / experiment / paper) can send itself back for revision. |
| self_review.thresholds.{idea,experiment,paper} | The score each self-review must clear to pass (default: idea 8, experiment 6, paper 8). |
| experiment.max_experiment_retries_per_idea | How many times the agent can retry failed experiments before abandoning an idea. |
| pipeline.max_ideas_per_seed | How many fresh research ideas to try on one seed before giving up. |
| review.agents | Which CLI agents act as peer reviewers, with optional per-reviewer model/timeout. |
| review.accept_threshold | Cutoff for accept vs. revise vs. reject after peer review. |
See configs/8xa6000.yaml for a full annotated example.
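As a rough orientation, a config using these knobs might look like the fragment below. This is a hypothetical excerpt mirroring the knob names in the table — the actual structure and values are defined in configs/8xa6000.yaml, which is authoritative.

```yaml
# Hypothetical excerpt; see configs/8xa6000.yaml for the real annotated file.
agent:
  type: claude          # claude / codex / kimi / minimax
self_review:
  max_retries_per_gate: 3
  thresholds: {idea: 8, experiment: 6, paper: 8}   # documented defaults
review:
  agents: [claude, codex, kimi]
```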
We employ a peer-review protocol (PR) where all three CLI agents review every paper alongside its code, logs, and results.json — enabling systematic checks for fabricated or unsupported results. We additionally use the Stanford Agentic Reviewer (SAR) for external, PDF-only validation.
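With three reviews per paper stored in reviews.json, per-paper scores can be aggregated in a few lines. Note that the reviews.json schema assumed here — a JSON list of objects with a numeric "score" field — is an illustration, not the documented format:

```python
# Sketch: average the three peer-review scores for one paper directory.
# NOTE: the assumed schema (list of {"reviewer": ..., "score": ...}) is
# illustrative; check an actual reviews.json for the real field names.
import json
from pathlib import Path
from statistics import mean

def mean_review_score(paper_dir: Path) -> float:
    reviews = json.loads((paper_dir / "reviews.json").read_text())
    return mean(r["score"] for r in reviews)
```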
If you find our project useful, please cite:
@misc{researcharena2026,
title = {How Far Are We From True Auto Research?},
author = {Zhang, Zhengxin and Wang, Ning and Galhotra, Sainyam and Cardie, Claire},
year = {2026},
note = {Cornell University. \url{https://youarespecialtome.github.io/ResearchArena/}}
}
Released under the MIT License — see LICENSE.