Allen AI — korshunov.ai

STARE addresses policy entropy collapse in GRPO-based reinforcement learning by identifying entropy-critical token subsets via surprisal quantiles and reweighting their advantages. It keeps policy entropy stable across model scales and tasks, outperforms DAPO and other baselines by 4%-8% on AIME24 and AIME25, and keeps exploration and exploitation in balance.
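
The mechanism named in this summary, selecting tokens by surprisal quantile and rescaling their advantages, can be illustrated with a rough sketch. Everything specific here is an assumption rather than the paper's rule: the function name `reweight_advantages`, the quantile cutoffs, the weight values, and even the direction of the reweighting are hypothetical.

```python
import numpy as np

def reweight_advantages(logprobs, advantages,
                        q_low=0.2, q_high=0.8,   # hypothetical quantile cutoffs
                        w_low=0.5, w_high=1.5):  # hypothetical weights
    """Scale per-token advantages by surprisal quantile (illustrative only)."""
    logprobs = np.asarray(logprobs, dtype=float)
    advantages = np.asarray(advantages, dtype=float)

    # Surprisal of each sampled token under the current policy.
    surprisal = -logprobs

    # Quantile thresholds computed over the tokens in this batch.
    lo, hi = np.quantile(surprisal, [q_low, q_high])

    weights = np.ones_like(surprisal)
    weights[surprisal <= lo] = w_low   # assumption: damp low-surprisal tokens
    weights[surprisal >= hi] = w_high  # assumption: boost high-surprisal tokens
    return weights * advantages
```

In a GRPO-style loop, a function like this would sit between advantage estimation and the policy-gradient loss; the actual subsets and weights STARE uses are defined in the paper.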

arXiv cs.LG · 7d ago

OneCanvas: 3D Scene Understanding via Panoramic Reprojection

OneCanvas enables 3D scene understanding in Vision-Language Models by aggregating patch features onto a single panoramic canvas using 3D world coordinates. It achieves state-of-the-art performance on SQA3D and VSI-Bench, and generalizes to out-of-distribution data on SPBench, using significantly less training compute than existing methods.

arXiv cs.AI · 7d ago

STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability

arXiv cs.AI · 7d ago

Rubric-Conditioned Self-Distillation Framework

Rubric-Conditioned Self-Distillation introduces a framework that uses structured rubrics to provide fine-grained, token-level feedback during self-distillation of reasoning language models. By conditioning teacher models on rubric-level criteria, it enables more precise credit assignment than scalar rewards, outperforming GRPO and OPSD by 1.0 and 0.9 points on average across science reasoning benchmarks.

arXiv cs.LG · 9d ago

ROVE: Reinforcement Learning with Human Interventions for Humanoid Manipulation

ROVE enables humanoid Vision-Language-Action models to learn effective manipulation behaviors using imperfect human interventions. It combines a human-in-the-loop data collection pipeline with Optimistic Value Estimation and cross-embodiment supervision to prioritize high-value actions and improve robustness. ROVE outperforms baseline methods on real-world, contact-rich manipulation tasks through iterative rollout and intervention cycles.

arXiv cs.LG · 9d ago

HABC Improves RL Fine-Tuning of VLAs with Sparse Outcomes

Hierarchical Advantage-Weighted Behavior Cloning (HABC) enhances online RL fine-tuning of vision-language-action (VLA) models by using separate critic heads for viability and efficiency. It combines their outputs via a state-adaptive gate and applies per-transition weights, while intervention-aware credit assignment prevents supervision leakage. In real-robot experiments, HABC raises success rates to 92%, 88%, and 38% on three bimanual tasks, compared with SFT baselines of 36%, 44%, and 12%.
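
One way to picture the gated, per-transition weighting described above is the sketch below. It is an assumption-heavy illustration, not HABC's formulation: the convex gate, the exponential advantage weighting (borrowed from standard advantage-weighted regression), the clipping, and all names (`gated_bc_weights`, `temperature`, `max_weight`) are hypothetical.

```python
import numpy as np

def gated_bc_weights(adv_viability, adv_efficiency, gate,
                     temperature=1.0, max_weight=10.0):
    """Per-transition behavior-cloning weights from two critic heads (illustrative)."""
    adv_v = np.asarray(adv_viability, dtype=float)
    adv_e = np.asarray(adv_efficiency, dtype=float)

    # Assumption: the state-adaptive gate is a value in [0, 1] per transition.
    g = np.clip(np.asarray(gate, dtype=float), 0.0, 1.0)

    # Assumption: the heads are mixed as a convex combination.
    combined = g * adv_v + (1.0 - g) * adv_e

    # Exponential advantage weighting, clipped for numerical stability.
    return np.minimum(np.exp(combined / temperature), max_weight)
```

The resulting weights would multiply the per-transition imitation loss, so transitions that the critics rate highly contribute more to the update.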

arXiv cs.AI · 6d ago

RACL: Reasoning-Agent Control Layer for Metaheuristic Learning

RACL introduces a reasoning agent that controls metaheuristic search behavior without replacing optimizers or altering constraints. In vehicle routing experiments it matches or beats the key baseline policies, reducing average cost by 8.337% versus the Fixed policy and 1.605% versus the Stagnation-Triggered policy, with no significant computational overhead.

arXiv cs.AI · 6d ago

Hybrid ANN-SNN Pipeline with Local Plasticity

A hybrid ANN-SNN pipeline uses pretrained EfficientNet encoders and converts their activations to spike trains via rate-coding. The system trains a CoLaNET spiking classifier with local plasticity rules, achieving 99.09% accuracy on ImageNet's 64-class benchmark, matching conventional deep networks.
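
Rate coding, the conversion step mentioned above, can be sketched generically: each activation becomes a firing probability per time step. This is a minimal illustration, not the paper's pipeline; the function name `rate_code`, the max-normalization, the Bernoulli sampling, and the step count are assumptions.

```python
import numpy as np

def rate_code(activations, n_steps=100, seed=0):
    """Convert non-negative activations to Bernoulli spike trains (illustrative)."""
    a = np.clip(np.asarray(activations, dtype=float), 0.0, None)

    # Assumption: normalize by the largest activation so rates lie in [0, 1].
    peak = a.max()
    rates = a / peak if peak > 0 else np.zeros_like(a)

    # One independent draw per time step and unit; a spike fires with
    # probability equal to the unit's normalized rate.
    rng = np.random.default_rng(seed)
    return (rng.random((n_steps,) + a.shape) < rates).astype(np.uint8)
```

The output has shape `(n_steps, *activations.shape)`; averaging over the first axis approximately recovers the normalized activations.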

arXiv cs.LG · 6d ago

Training LLMs for Long-Lifecycle Agents via Cross-Domain Generalization

A new framework enables large language models to develop 'Connect the Dots' capability, allowing long-lifecycle agents to learn from experiences and iteratively update their environment context. The framework uses reinforcement learning with long rollout sequences and custom tasks to promote cross-domain generalization, showing effective out-of-distribution performance in both domains and transition settings.

arXiv cs.LG · 6d ago

Information-Theoretic Analysis of Effective Supervision in Latent Chain-of-Thought

This paper identifies a dual collapse in latent reasoning: gradient attenuation and representational drift. It proposes Trajectory and Space Supervision, showing that generative reconstruction preserves information capacity better than geometric compression. The Unified Latent Probe measures mutual information between latent trajectories and reasoning steps, revealing an information-performance binding in reasoning accuracy.

arXiv cs.CL · 6d ago

Semantic Clusters Pre-Train Tsetlin Machine for Interpretability

A new framework pre-trains the Tsetlin Machine using semantic clusters from language models, avoiding embeddings. The method groups text samples into coherent clusters via K-means or Top2Vec, then uses cluster-sample pairs to train a non-negated TM with Type I feedback. Results show superior performance across five datasets, matching BERT-level accuracy while maintaining full interpretability.

arXiv cs.CL · 6d ago

Training LLMs for Long-Lifecycle Agents via Cross-Domain Generalization

A new framework enables large language models to learn 'Connect the Dots' by using reinforcement learning with long rollout sequences. The method includes tailored tasks and environments to foster meta-capability development, showing strong cross-domain generalization and performance in out-of-distribution settings. Implementations are available at https://github.com/agentscope-ai/Trinity-RFT/tree/research/cod/examples/research_cod.

arXiv cs.LG · 7d ago

Act2Answer Evaluates Knowledge Retention in Vision-Language-Action Models

Act2Answer introduces a lightweight protocol to assess commonsense and world knowledge retention in VLA models by requiring agents to answer questions through object placement actions. A large-scale study of 7 VLA models and 9 VLM baselines reveals that VLAs perform well on simple concepts but show larger gaps on rich semantic categories compared to their source VLMs. VQA co-training improves knowledge retention, and answer-relevant signals peak in the middle VLA layers.

arXiv cs.AI · 7d ago

Reinforcement Learning Foundation Models Should Already Be A Thing

Reinforcement learning still lacks foundation models, even though training on synthetic MDPs is feasible. A proof-of-concept shows that a single model trained on synthetic MDPs solves tabular benchmarks without tuning, outperforming existing methods in online settings and matching them offline.

arXiv cs.AI · 7d ago

Xcientist: Externalizing Research Synthesis and Validation in AI Scientists

Xcientist introduces a research harness that externalizes and makes visible the reasoning processes in AI scientists. It preserves traceable, contract-governed research artifacts from problem formulation to validation, addressing claim drift and ensuring scientific accountability.

arXiv cs.AI · 7d ago

Technical Taxonomy of LLM Agent Communication Protocols

A new taxonomy classifies LLM agent communication protocols across five dimensions: counterparty, payload, interaction state, discovery mechanism, and schema flexibility. Analysis shows hybrid payloads, session-state persistence, and runtime schema negotiation are common, with decentralized discovery remaining rare. The study predicts short-term convergence toward unified agent-to-agent and agent-to-context protocols, and long-term evolution toward a federated, layered protocol stack.
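
The five dimensions listed above map naturally onto a small record type. The sketch below is only an illustration of that structure: the class `ProtocolProfile`, the helper `share`, the field value strings, and the example entries are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProtocolProfile:
    """One protocol described along the taxonomy's five dimensions."""
    name: str
    counterparty: str        # e.g. "agent" or "context" (illustrative values)
    payload: str             # e.g. "structured", "natural-language", "hybrid"
    interaction_state: str   # e.g. "stateless", "session"
    discovery: str           # e.g. "centralized", "decentralized"
    schema_flexibility: str  # e.g. "fixed", "runtime-negotiated"

def share(profiles, dimension, value):
    """Fraction of profiles whose given dimension equals the given value."""
    if not profiles:
        return 0.0
    hits = sum(1 for p in profiles if getattr(p, dimension) == value)
    return hits / len(profiles)

# Made-up entries, purely to show how the record would be used.
profiles = [
    ProtocolProfile("proto-a", "agent", "hybrid", "session",
                    "centralized", "runtime-negotiated"),
    ProtocolProfile("proto-b", "context", "structured", "stateless",
                    "decentralized", "fixed"),
]
```

With real survey data, calls such as `share(profiles, "discovery", "decentralized")` would quantify claims like the rarity of decentralized discovery.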

arXiv cs.LG · 8d ago

Deep Reinforcement Learning for Minimum Zero-Forcing Sets

This paper proposes SD-ZFS, a deep reinforcement learning framework adapted from S2V-DQN, to solve the NP-hard minimum zero-forcing set problem on undirected graphs. The framework demonstrates strong performance compared to optimal solutions and greedy heuristics, showing effective generalization, scalability, and transfer across diverse graph structures.
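
For readers unfamiliar with the underlying problem, the standard zero-forcing color-change rule can be checked in a few lines: a colored vertex with exactly one uncolored neighbor forces that neighbor to become colored, and a starting set is a zero-forcing set if repeated forcing colors the whole graph. The sketch below only verifies a candidate set; it is not SD-ZFS, and the function names and adjacency-dict representation are assumptions.

```python
def forcing_closure(adj, start):
    """Apply the color-change rule until no vertex can force another."""
    colored = set(start)
    changed = True
    while changed:
        changed = False
        for v in list(colored):
            uncolored = [u for u in adj[v] if u not in colored]
            # A colored vertex with exactly one uncolored neighbor forces it.
            if len(uncolored) == 1:
                colored.add(uncolored[0])
                changed = True
    return colored

def is_zero_forcing_set(adj, start):
    """True if forcing from `start` eventually colors every vertex."""
    return forcing_closure(adj, start) == set(adj)

# Path on 4 vertices and a star with center 0 and leaves 1, 2, 3.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
```

On the path, a single endpoint suffices; on the star, two leaves are needed, because the center can only force once it has a single uncolored neighbor left.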