Towards end-to-end automation of AI research

Context

This note summarizes how far AI agents can automate the machine learning research lifecycle end to end.
The main focus is an approach that connects idea generation, experimentation, paper writing, and review into one pipeline, rather than automating only isolated stages.

The paper argues that The AI Scientist can run an end-to-end research pipeline spanning idea generation, experiment implementation and execution, result analysis and visualization, paper writing, and review.
The key contribution is not an isolated assistant but an agentic system that connects these stages and produces concrete research artifacts.
The paper reports that one fully AI-generated paper passed first-round peer review at an ICLR workshop, and that both newer base models and more test-time compute improve output quality.

Most AI-for-science systems have automated only individual stages such as literature search, coding, writing, or reviewing, so they fall short of replacing or compressing the full research workflow.
The paper therefore makes the case that AI research automation should be evaluated as a connected research process rather than as a collection of partial tools.

The full pipeline has four stages: ideation, experimentation, write-up, and review.
In the ideation stage, the system builds an archive of research ideas and uses literature/API search to filter them for novelty.
The experimentation stage supports both template-based and template-free modes; the latter uses staged agentic tree search to write, modify, and extend code.
Experiment management separates preliminary investigation, hyperparameter tuning, research agenda execution, and ablation, using parallel tree search over debug, hyperparameter, ablation, replication, and aggregation nodes.
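
The staged search described above can be sketched roughly as follows. This is a minimal, sequential illustration under stated assumptions, not the paper's implementation: `Node`, `best_first_stage`, `run_stages`, and `toy_expand` are hypothetical names, the numeric score stands in for whatever metric the system uses to rank candidates, and the sketch simply skips buggy nodes rather than debugging them or expanding nodes in parallel.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Stage names follow the note; everything else here is an illustrative assumption.
STAGES = [
    "preliminary investigation",
    "hyperparameter tuning",
    "research agenda execution",
    "ablation",
]


@dataclass(eq=False)  # eq=False: nodes are compared by identity, not by field values
class Node:
    description: str
    score: float  # hypothetical "higher is better" metric
    buggy: bool = False
    children: List["Node"] = field(default_factory=list)


def best_first_stage(root: Node,
                     expand: Callable[[Node], List[Node]],
                     budget: int) -> Node:
    """Expand the best non-buggy frontier node `budget` times; return the best node seen."""
    frontier = [root]
    best = root
    for _ in range(budget):
        candidates = [n for n in frontier if not n.buggy]
        if not candidates:
            break
        node = max(candidates, key=lambda n: n.score)
        frontier.remove(node)
        for child in expand(node):
            node.children.append(child)
            frontier.append(child)
            if not child.buggy and child.score > best.score:
                best = child
    return best


def run_stages(root: Node,
               expand_for_stage: Callable[[str, Node], List[Node]],
               budget_per_stage: int) -> Node:
    """Run the stages in order; each stage starts from the previous stage's best node."""
    best = root
    for stage in STAGES:
        best = best_first_stage(
            best, lambda n, s=stage: expand_for_stage(s, n), budget_per_stage
        )
    return best


def toy_expand(stage: str, node: Node) -> List[Node]:
    """Stand-in for an LLM proposing code edits: one improved child, one crashed child."""
    return [
        Node(f"{node.description} -> {stage}", node.score + 1.0),
        Node(f"{node.description} -> {stage} (crashed)", 0.0, buggy=True),
    ]


if __name__ == "__main__":
    result = run_stages(Node("baseline", 0.0), toy_expand, budget_per_stage=2)
    print(result.score, result.description)
```

In a real system the expansion step would generate and execute code, and the ranking signal would come from experiment results; the toy expansion function only makes the control flow visible.
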
The pipeline then uses a VLM critic for figure-caption alignment and an Automated Reviewer with a NeurIPS-style five-review ensemble plus meta-review.
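
As a rough illustration of the five-review ensemble plus meta-review, the sketch below collects several independent reviews and aggregates them. The names `Review`, `ensemble_review`, `meta_review`, and `stub_reviewer` are hypothetical, the 1-10 score scale and the acceptance threshold are assumptions, and the simple averaging rule only stands in for the meta-review step; the note does not specify how the Automated Reviewer actually aggregates.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List


@dataclass
class Review:
    score: int    # overall rating; the 1-10 scale is an assumption
    summary: str


# A reviewer takes the paper text and a seed and returns one review.
Reviewer = Callable[[str, int], Review]


def ensemble_review(paper: str, reviewer: Reviewer, n_reviews: int = 5) -> List[Review]:
    """Collect n independent reviews by varying the seed passed to the reviewer."""
    return [reviewer(paper, seed) for seed in range(n_reviews)]


def meta_review(reviews: List[Review], accept_threshold: float = 6.0) -> Dict[str, object]:
    """Aggregate the ensemble; averaging is an assumed stand-in for the meta-review."""
    avg = mean(r.score for r in reviews)
    return {
        "n_reviews": len(reviews),
        "mean_score": avg,
        "accept": avg >= accept_threshold,
    }


def stub_reviewer(paper: str, seed: int) -> Review:
    """Deterministic placeholder for a model-backed reviewer."""
    return Review(score=5 + seed % 3, summary=f"stub review {seed}")


if __name__ == "__main__":
    reviews = ensemble_review("draft paper text", stub_reviewer)
    print(meta_review(reviews))
```

Averaging several reviews is one common way to reduce the variance of any single review; whether the decision comes from a numeric rule or from a further model call is a design choice this sketch leaves open.
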

As the paper acknowledges, the system has not yet reached top-tier conference quality, and even workshop-level quality is not consistent.
In the reported examples, only one of three papers passed the workshop bar; common failure modes include naive ideas, implementation bugs, weak methodological rigor, duplicated figures, and hallucinated citations.
The submission process included some human filtering of promising outputs, and the current system applies only to computational experiments.
Automated reviewing may suffer from training-data contamination and from degraded performance on work that postdates the model's knowledge cutoff, so it should not be overtrusted.
The paper explicitly names ethical risks such as review overload, credential inflation, idea misattribution, job displacement, and unsafe experiments.

It would be useful to identify which stage most weakens The AI Scientist's end-to-end claim and to compare it against follow-up autonomous research systems.
More case comparisons are needed to understand which stage most limits progress from workshop acceptance toward top-tier conference quality.