Self-Improvement of Large Language Models: A Technical Overview and Future Outlook

A self-improvement system covering the full lifecycle of autonomous model development — from data acquisition to evaluation.

Haoyan Yang · Mario Xerri · Solha Park · Huajian Zhang · Yiyang Feng · Sai Akhil Kogilathota · Jiawei Zhou

Zesearch NLP Lab · Stony Brook University

Self-Improvement System
A system-level framework covering the full lifecycle of autonomous model development, organized into five key components
01

Data Acquisition

The model autonomously collects or generates raw materials for its own evolution — from static curation to environment interaction to synthetic generation.

02

Data Selection

The model independently evaluates and filters which data points are of higher quality and better suited for its own learning.

03

Model Optimization

The core training stage, where the model autonomously converts acquired and selected data into enhanced capabilities — centered on the Generation-Reward-Optimization (GRO) framework and related methods.

04

Inference Refinement

Improving output quality during inference without permanently updating parameters — spanning decoding strategies, structured reasoning, agentic systems, and test-time adaptation.

05

Autonomous Evaluation

Dynamic benchmarking and interactive environment evaluation enabling self-assessment without human intervention.

[Interactive diagram: the five components — Data Acquisition, Data Selection, Model Optimization, Inference Refinement, and Autonomous Evaluation — cycling around the LLM, with an evaluation radar over Accuracy, Stability, Growth, and Safety.]
Paper List
A comprehensive collection of papers organized by the five framework components
Data Acquisition (50 papers)

Synthetic Generation

The model uses its intrinsic capabilities to produce entirely new training data — instructions, reasoning chains, or dialogues — through prompting, transformation, or multi-model interaction.

Data Selection (49 papers)

Metric-Guided Scoring

Applies predefined scoring metrics derived from model signals (perplexity, influence scores, reward model outputs) to rank and filter data.
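As a concrete illustration of metric-guided scoring, the sketch below filters training examples by perplexity computed from per-token log-probabilities. This is a minimal, self-contained sketch: the `perplexity` and `filter_by_perplexity` helpers and the toy corpus are illustrative, and in practice the log-probabilities would come from a forward pass of the model itself.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token log-probabilities (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def filter_by_perplexity(examples, threshold):
    """Keep examples whose perplexity under the model falls below threshold.

    `examples` maps text -> list of per-token log-probs; here the log-probs
    are hard-coded, but they would normally be produced by the model
    being trained scoring its own candidate data.
    """
    return [text for text, logprobs in examples.items()
            if perplexity(logprobs) < threshold]

corpus = {
    "the cat sat": [-0.2, -0.3, -0.4],   # fluent text: low perplexity
    "zxq qqv blk": [-4.0, -5.0, -4.5],   # garbled text: high perplexity
}
kept = filter_by_perplexity(corpus, threshold=5.0)
```

The same skeleton extends to other model-derived signals (influence scores, reward-model outputs) by swapping the scoring function while keeping the rank-and-filter step.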

Model Optimization (60 papers)

Generation-Reward-Optimization (GRO) Framework

The central paradigm: the model generates candidate outputs (Generation), evaluates them using self-derived or external signals (Reward), and updates its policy accordingly (Optimization).

Within the GRO framework, many methods share common structural patterns in how they organize generation, reward, and optimization. The survey identifies three representative paradigms that capture these recurring ideas, and marks papers explicitly discussed as exemplars of each:

Iterative Rejection Sampling: The model generates diverse candidates, filters them via ground truth or majority vote, and fine-tunes on the best outputs.
Self-Verification & Refinement: The model actively evaluates, scores, or refines its own outputs using self-generated reward signals, acting as its own judge.
Self-Play: The model improves through dynamic interaction between multiple roles, providing an evolving curriculum of challenges.
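The iterative rejection sampling paradigm, as one instance of the GRO loop, can be sketched in a few lines. Everything here is a toy illustration: `toy_model`, `generate_candidates`, and `majority_vote` are hypothetical stand-ins for real sampling, consensus filtering, and dataset construction; a real pipeline would then fine-tune on the collected prompt-answer pairs.

```python
import random
from collections import Counter

def generate_candidates(model, prompt, n=8, temperature=1.0):
    """Generation step: sample n diverse candidate answers (stub)."""
    return [model(prompt, temperature) for _ in range(n)]

def majority_vote(candidates):
    """Reward step: keep only candidates agreeing with the consensus answer."""
    winner, _ = Counter(candidates).most_common(1)[0]
    return [c for c in candidates if c == winner]

def rejection_sampling_round(model, prompts):
    """One round: generate, filter by consensus, collect fine-tuning pairs.

    The returned (prompt, answer) pairs would feed the Optimization step,
    i.e. supervised fine-tuning on the surviving outputs.
    """
    dataset = []
    for prompt in prompts:
        kept = majority_vote(generate_candidates(model, prompt))
        dataset.extend((prompt, answer) for answer in kept)
    return dataset

# Toy model: answers "4" most of the time, occasionally something else.
rng = random.Random(0)
def toy_model(prompt, temperature):
    return "4" if rng.random() < 0.7 else rng.choice(["3", "5"])

data = rejection_sampling_round(toy_model, ["What is 2+2?"])
```

Majority vote is only one filter; when ground-truth answers or a verifier are available, the reward step checks candidates against those instead.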

Theoretical Analysis

Formal theoretical foundations for the GRO loop, including the sharpening mechanism, the generation-verification gap, and convergence conditions.

Inference Refinement (170 papers)

Decoding Strategies

Explicitly guides output generation at the token or sequence level to steer the model toward higher-quality outputs.
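A minimal sketch of token-level steering, assuming only a vector of logits: temperature rescales the distribution and top-k masks out unlikely tokens before sampling. The `sample_token` helper is illustrative, not any particular library's API.

```python
import math
import random

def sample_token(logits, temperature=1.0, top_k=None, rng=random):
    """Sample a token index from raw logits with temperature and top-k.

    Lower temperature concentrates probability mass on high-logit tokens;
    top_k discards all but the k highest-scoring tokens before sampling.
    """
    if top_k is not None:
        cutoff = sorted(logits, reverse=True)[top_k - 1]
        logits = [l if l >= cutoff else float("-inf") for l in logits]
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]  # numerically stable softmax
    total = sum(weights)
    r, acc = rng.random(), 0.0
    for i, w in enumerate(weights):
        acc += w / total
        if r < acc:
            return i
    return len(weights) - 1  # guard against floating-point round-off

logits = [2.0, 1.0, 0.1, -3.0]
# top_k=1 reduces sampling to greedy decoding: always the argmax token.
greedy = sample_token(logits, temperature=0.5, top_k=1)
```

Sequence-level strategies (best-of-n, self-consistency) wrap a sampler like this one, generating several full outputs and selecting among them.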

Reasoning-Based Improvement

Structured reasoning processes including feedback-based reasoning, planning-based reasoning, and collaborative reasoning across agent ensembles.

Agentic System-Based Improvement

Extends inference-time refinement to the system level by dynamically adapting prompts, memory, tool libraries, and workflows.

Autonomous Evaluation (25 papers)
Challenges & Limitations
Key challenges and open problems constraining self-improving language model systems

Data Autophagy

Models training on their own outputs risk progressive quality degradation, mode collapse, and catastrophic forgetting.

Flawed Feedback Signals

Self-generated reward signals can reinforce errors, amplify biases, and introduce systematic evaluation inconsistencies.

Evaluation Bottlenecks

Static benchmarks are insufficient for measuring iterative improvement — contamination, saturation, and lack of dynamic evaluation remain open problems.

Applications
Self-improvement mechanisms applied across diverse domains, enabling specialized agents to iteratively refine expertise
Future Outlook
Our vision for building scalable and autonomous self-improving systems
01

End-to-End Self-Improving Systems

Moving beyond isolated components toward fully automated loops that continuously generate, evaluate, and refine.

02

Specialized & Application-Centric Models

Domain-specific self-improvement in coding, science, finance, and healthcare for expert-level autonomy.

03

Unified Benchmarks

Standardized evaluation designed to measure iterative improvement, stability, and long-term capability growth.

04

Automation & Human Oversight

Balancing autonomous improvement with human supervision for scalability, safety, and alignment.

Collaboration Welcome

We welcome collaborations and contributions. If you have suggestions, missing papers, or feedback — reach out!