A self-improvement system covering the full lifecycle of autonomous model development — from data acquisition to evaluation.
Zesearch NLP Lab · Stony Brook University
The model autonomously collects or generates raw materials for its own evolution — from static curation to environment interaction to synthetic generation.
The model independently evaluates and filters which data points are of higher quality and better suited for its own learning.
The core training stage where the model autonomously converts acquired and selected data into enhanced capabilities — centered on the Generation-Reward-Optimization (GRO) framework, with extensions beyond it.
Improving output quality during inference without permanently updating parameters — spanning decoding strategies, structured reasoning, agentic systems, and test-time adaptation.
Dynamic benchmarking and interactive environment evaluation enabling self-assessment without human intervention.
The model acquires raw data from fixed, externally hosted sources (web, code, books), acting as an autonomous data-collecting agent.
The model acquires data by actively interacting with external environments — browsing websites, calling APIs, executing code, or operating within simulators.
The model uses its intrinsic capabilities to produce entirely new training data — instructions, reasoning chains, or dialogues — through prompting, transformation, or multi-model interaction.
Applies predefined scoring metrics derived from model signals (perplexity, influence scores, reward model outputs) to rank and filter data.
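As a concrete illustration of metric-based filtering, the sketch below ranks candidate examples by perplexity and keeps the lowest-scoring fraction. The unigram language model is a toy stand-in for the actual scoring model, and all function names are illustrative assumptions, not an API from any surveyed method:

```python
import math
from collections import Counter

def unigram_perplexity(text: str, counts: Counter, total: int) -> float:
    """Perplexity of `text` under an add-one-smoothed unigram model."""
    tokens = text.split()
    vocab = len(counts)
    log_prob = sum(math.log((counts[t] + 1) / (total + vocab)) for t in tokens)
    return math.exp(-log_prob / max(len(tokens), 1))

def filter_by_perplexity(corpus, candidates, keep_ratio=0.5):
    """Keep the `keep_ratio` fraction of candidates with lowest perplexity,
    i.e., those the (toy) model finds most predictable."""
    counts = Counter(tok for doc in corpus for tok in doc.split())
    total = sum(counts.values())
    scored = sorted(candidates, key=lambda c: unigram_perplexity(c, counts, total))
    k = max(1, int(len(scored) * keep_ratio))
    return scored[:k]
```

In practice the scoring signal would come from the model itself (token log-likelihoods, influence scores, or a reward model) rather than unigram counts; the ranking-and-thresholding structure is the same.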
Introduces a learnable selector that dynamically chooses training data based on the model's evolving state, co-evolving the selection policy alongside the model.
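A minimal sketch of such a co-evolving selector, reduced to a bandit-style policy over data sources: a weight per source is raised when training on that source yields higher measured gain. `gain_fn` is a hypothetical stand-in for observed learning progress, and the EXP3-like update is an illustrative choice, not a specific surveyed method:

```python
import math
import random

def coevolving_selector(sources, gain_fn, rounds=50, lr=0.5, seed=0):
    """Toy learnable data selector: maintains one logit per data source
    and upweights sources whose samples yield higher learning gain."""
    rng = random.Random(seed)
    logits = {s: 0.0 for s in sources}
    for _ in range(rounds):
        # Sample a source from the current selection policy (softmax over logits).
        z = sum(math.exp(v) for v in logits.values())
        probs = {s: math.exp(v) / z for s, v in logits.items()}
        s = rng.choices(list(probs), weights=list(probs.values()))[0]
        # Move the policy toward sources with higher observed gain,
        # so selection co-evolves with the (simulated) model state.
        logits[s] += lr * gain_fn(s)
    return max(logits, key=logits.get)
```

A real selector would condition on the model's evolving state (e.g., per-example loss trajectories) rather than a fixed `gain_fn`, but the feedback loop between selection policy and training signal is the core idea.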
The central paradigm: the model generates candidate outputs (Generation), evaluates them using self-derived or external signals (Reward), and updates its policy accordingly (Optimization).
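The three stages of one GRO iteration can be sketched as follows. Here a dict of answer weights stands in for an LLM policy, and the crude reinforcement-style update stands in for the optimization step; all names and the update rule are illustrative assumptions, not a particular surveyed algorithm:

```python
import random

def gro_step(policy, prompt, reward_fn, lr=0.1, n_candidates=4):
    """One Generation-Reward-Optimization iteration (toy sketch).
    `policy` maps answers to sampling weights, standing in for an LLM."""
    answers = list(policy)
    weights = [policy[a] for a in answers]
    # Generation: sample candidate outputs from the current policy.
    candidates = random.choices(answers, weights=weights, k=n_candidates)
    # Reward: score each distinct candidate with a self-derived or external signal.
    rewards = {c: reward_fn(prompt, c) for c in set(candidates)}
    # Optimization: upweight candidates that beat the mean reward (a
    # crude baseline-subtracted policy update; weights stay positive).
    baseline = sum(rewards.values()) / len(rewards)
    for c, r in rewards.items():
        policy[c] = max(1e-6, policy[c] + lr * (r - baseline))
    return policy
```

Repeated over many prompts, mass shifts toward answers the reward signal favors — the same loop structure that GRO methods instantiate with real models, verifiers, and gradient-based optimizers.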
Within the GRO framework, many methods share common structural patterns in how they organize generation, reward, and optimization. The paper identifies three representative paradigms that capture these recurring ideas, and papers explicitly discussed as exemplars of each paradigm are marked:
Formal theoretical foundations for the GRO loop, including the sharpening mechanism, the generation-verification gap, and convergence conditions.
Model optimization pathways outside the standard GRO framework, including self-referential architectures, agentic self-learning, and open-ended evolutionary approaches.
Explicitly steers output generation at the token or sequence level toward higher-quality outputs.
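One of the simplest sequence-level steering strategies is best-of-N sampling: draw several candidates and return the highest-scoring one. In this sketch, `generate` and `score` are hypothetical stand-ins for a model's sampler and a quality signal such as a reward model or verifier:

```python
def best_of_n(generate, score, prompt, n=8):
    """Sample n candidate outputs for `prompt` and return the one the
    scoring function rates highest (sequence-level steering, no
    parameter updates)."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)
```

Token-level variants apply the same idea inside decoding (e.g., reranking partial continuations), trading more inference compute for output quality while leaving the model's weights untouched.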
Structured reasoning processes including feedback-based reasoning, planning-based reasoning, and collaborative reasoning across agent ensembles.
Extends inference-time refinement to the system level by dynamically adapting prompts, memory, tool libraries, and workflows.
Adapts model parameters at inference time through self-supervised fine-tuning or reinforcement learning on test inputs.
Continuously updated evaluation that combats data contamination and measures evolving model capabilities over time.
Evaluates models through real-time interaction in complex environments — web, code execution, games, and multi-app ecosystems.
Models training on their own outputs risk progressive quality degradation, mode collapse, and catastrophic forgetting.
Self-generated reward signals can reinforce errors, amplify biases, and introduce systematic evaluation inconsistencies.
Reward hacking, overfitting to proxy objectives, deceptive alignment, and emergent misalignment from self-evolution.
Without external grounding, iterative refinement may fail to converge due to the generation-verification gap and self-bias amplification.
Static benchmarks are insufficient for measuring iterative improvement — contamination, saturation, and lack of dynamic evaluation remain open problems.
Human supervision quality degrades as models scale, and even accurate supervision signals may be ineffective due to alignment faking and controllability limitations.
Moving beyond isolated components toward fully automated loops that continuously generate, evaluate, and refine.
Domain-specific self-improvement in coding, science, finance, and healthcare for expert-level autonomy.
Standardized evaluation designed to measure iterative improvement, stability, and long-term capability growth.
Balancing autonomous improvement with human supervision for scalability, safety, and alignment.
We welcome collaborations and contributions. If you have suggestions, missing papers, or feedback — reach out!