A self-improvement system covering the full lifecycle of autonomous model development — from data acquisition to evaluation.
Zesearch NLP Lab · Stony Brook University
The model autonomously collects or generates raw materials for its own evolution — from static curation to environment interaction to synthetic generation.
The model independently evaluates and filters which data points are of higher quality and better suited for its own learning.
The core training stage where the model autonomously converts acquired and selected data into enhanced capabilities — centered on the Generation-Reward-Optimization (GRO) framework, with extensions beyond it.
Improving output quality during inference without permanently updating parameters — spanning decoding strategies, structured reasoning, agentic systems, and test-time adaptation.
Dynamic benchmarking and interactive environment evaluation enabling self-assessment without human intervention.
The model acquires raw data from fixed, externally hosted sources (web, code, books), acting as an autonomous data-collecting agent.
The model acquires data by actively interacting with external environments — browsing websites, calling APIs, executing code, or operating within simulators.
The model uses its intrinsic capabilities to produce entirely new training data — instructions, reasoning chains, or dialogues — through prompting, transformation, or multi-model interaction.
Applies predefined scoring metrics derived from model signals (perplexity, influence scores, reward model outputs) to rank and filter data.
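As a concrete illustration of metric-based filtering, the sketch below ranks candidate examples by perplexity and keeps the lowest-scoring fraction. The unigram language model is a toy stand-in for the actual scoring model, and all function names are illustrative assumptions, not an API from any surveyed method:

```python
import math
from collections import Counter

def unigram_perplexity(text: str, counts: Counter, total: int) -> float:
    """Perplexity of `text` under an add-one-smoothed unigram model."""
    tokens = text.split()
    vocab = len(counts)
    log_prob = sum(math.log((counts[t] + 1) / (total + vocab)) for t in tokens)
    return math.exp(-log_prob / max(len(tokens), 1))

def filter_by_perplexity(corpus, candidates, keep_ratio=0.5):
    """Keep the `keep_ratio` fraction of candidates with lowest perplexity,
    i.e., those the (toy) model finds most predictable."""
    counts = Counter(tok for doc in corpus for tok in doc.split())
    total = sum(counts.values())
    scored = sorted(candidates, key=lambda c: unigram_perplexity(c, counts, total))
    k = max(1, int(len(scored) * keep_ratio))
    return scored[:k]
```

In practice the scoring signal would come from the model itself (token log-likelihoods, influence scores, or a reward model) rather than unigram counts; the ranking-and-thresholding structure is the same.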
Introduces a learnable selector that dynamically chooses training data based on the model's evolving state, co-evolving the selection policy alongside the model.
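A minimal sketch of such a co-evolving selector, reduced to a bandit-style policy over data sources: a weight per source is raised when training on that source yields higher measured gain. `gain_fn` is a hypothetical stand-in for observed learning progress, and the EXP3-like update is an illustrative choice, not a specific surveyed method:

```python
import math
import random

def coevolving_selector(sources, gain_fn, rounds=50, lr=0.5, seed=0):
    """Toy learnable data selector: maintains one logit per data source
    and upweights sources whose samples yield higher learning gain."""
    rng = random.Random(seed)
    logits = {s: 0.0 for s in sources}
    for _ in range(rounds):
        # Sample a source from the current selection policy (softmax over logits).
        z = sum(math.exp(v) for v in logits.values())
        probs = {s: math.exp(v) / z for s, v in logits.items()}
        s = rng.choices(list(probs), weights=list(probs.values()))[0]
        # Move the policy toward sources with higher observed gain,
        # so selection co-evolves with the (simulated) model state.
        logits[s] += lr * gain_fn(s)
    return max(logits, key=logits.get)
```

A real selector would condition on the model's evolving state (e.g., per-example loss trajectories) rather than a fixed `gain_fn`, but the feedback loop between selection policy and training signal is the core idea.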
The central paradigm: the model generates candidate outputs (Generation), evaluates them using self-derived or external signals (Reward), and updates its policy accordingly (Optimization).
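The three stages of one GRO iteration can be sketched as follows. Here a dict of answer weights stands in for an LLM policy, and the crude reinforcement-style update stands in for the optimization step; all names and the update rule are illustrative assumptions, not a particular surveyed algorithm:

```python
import random

def gro_step(policy, prompt, reward_fn, lr=0.1, n_candidates=4):
    """One Generation-Reward-Optimization iteration (toy sketch).
    `policy` maps answers to sampling weights, standing in for an LLM."""
    answers = list(policy)
    weights = [policy[a] for a in answers]
    # Generation: sample candidate outputs from the current policy.
    candidates = random.choices(answers, weights=weights, k=n_candidates)
    # Reward: score each distinct candidate with a self-derived or external signal.
    rewards = {c: reward_fn(prompt, c) for c in set(candidates)}
    # Optimization: upweight candidates that beat the mean reward (a
    # crude baseline-subtracted policy update; weights stay positive).
    baseline = sum(rewards.values()) / len(rewards)
    for c, r in rewards.items():
        policy[c] = max(1e-6, policy[c] + lr * (r - baseline))
    return policy
```

Repeated over many prompts, mass shifts toward answers the reward signal favors — the same loop structure that GRO methods instantiate with real models, verifiers, and gradient-based optimizers.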
Within the GRO framework, many methods share common structural patterns in how they organize generation, reward, and optimization. The paper identifies three representative paradigms that capture these recurring ideas, and papers explicitly discussed as exemplars of each paradigm are marked:
Formal theoretical foundations for the GRO loop, including the sharpening mechanism, the generation-verification gap, and convergence conditions.
Model optimization pathways outside the standard GRO framework, including self-referential architectures, agentic self-learning, and open-ended evolutionary approaches.
Explicitly steers output generation at the token or sequence level toward higher-quality outputs.
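One of the simplest sequence-level steering strategies is best-of-N sampling: draw several candidates and return the highest-scoring one. In this sketch, `generate` and `score` are hypothetical stand-ins for a model's sampler and a quality signal such as a reward model or verifier:

```python
def best_of_n(generate, score, prompt, n=8):
    """Sample n candidate outputs for `prompt` and return the one the
    scoring function rates highest (sequence-level steering, no
    parameter updates)."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)
```

Token-level variants apply the same idea inside decoding (e.g., reranking partial continuations), trading more inference compute for output quality while leaving the model's weights untouched.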
Structured reasoning processes including feedback-based reasoning, planning-based reasoning, and collaborative reasoning across agent ensembles.
Extends inference-time refinement to the system level by dynamically adapting prompts, memory, tool libraries, and workflows.
Adapts model parameters at inference time through self-supervised fine-tuning or reinforcement learning on test inputs.
Continuously updated evaluation that combats data contamination and measures evolving model capabilities over time.
Evaluates models through real-time interaction in complex environments — web, code execution, games, and multi-app ecosystems.
Models training on their own outputs risk progressive quality degradation, mode collapse, and catastrophic forgetting.
Self-generated reward signals can reinforce errors, amplify biases, and introduce systematic evaluation inconsistencies.
Reward hacking, overfitting to proxy objectives, deceptive alignment, and emergent misalignment from self-evolution.
Without external grounding, iterative refinement may fail to converge due to the generation-verification gap and self-bias amplification.
Static benchmarks are insufficient for measuring iterative improvement — contamination, saturation, and lack of dynamic evaluation remain open problems.
Human supervision quality degrades as models scale, and even accurate supervision signals may be ineffective due to alignment faking and controllability limitations.
Moving beyond isolated components toward fully automated loops that continuously generate, evaluate, and refine.
Domain-specific self-improvement in coding, science, finance, and healthcare for expert-level autonomy.
Standardized evaluation designed to measure iterative improvement, stability, and long-term capability growth.
Balancing autonomous improvement with human supervision for scalability, safety, and alignment.
We welcome collaborations and contributions. If you have suggestions, missing papers, or feedback — reach out!