Topic · Reasoning models
arxiv arXiv cs.AI · 6d ago

CRAX: Fast Safe Reinforcement Learning Benchmarking

CRAX introduces a high-fidelity, accelerated safety benchmark for reinforcement learning using MuJoCo XLA. It achieves up to 100x speedups over CPU-based benchmarks via vectorization and hardware acceleration, featuring six environment suites and three agent-specific tasks across three difficulty levels. Evaluation of six safe RL methods shows no single approach dominates, highlighting trade-offs between performance and safety, with curriculum learning and safety transfer improving results.

arxiv arXiv cs.LG · 6d ago

CRAX: Fast Safe Reinforcement Learning Benchmarking

CRAX introduces a high-fidelity, fast safety benchmark for reinforcement learning using MuJoCo XLA. It achieves up to 100x speedups over CPU-based benchmarks via vectorization and hardware acceleration, featuring six environment suites and three agent-specific tasks across three difficulty levels. Evaluation of six safe RL methods shows no single approach dominates, highlighting trade-offs between performance and safety, with curriculum learning and safety transfer improving results.

arxiv arXiv cs.CL · 6d ago

Selective Verification for Budget-Aware Reasoning

Sevra, a serving-layer controller, selectively verifies answers to improve accuracy and reduce token usage. On \mathfive, it achieves 76.3% accuracy with 26.8% fewer post-generation tokens and halved harmful flips, while on \gsm it verifies only 3.0% of examples, boosting accuracy to 94.5% and cutting verification tokens by 91.2%. The study shows that initial solve length and explicit control needs determine optimal verification strategy.

arxiv arXiv cs.CL · 6d ago

Control-Window Law for Single-Neuron Steering in Language Models

A new framework defines when single-neuron interventions coherently control model behaviors without output collapse. The control window, based on alignment and norm ratios, predicts behavior triggers and collapse ceilings using forward pass data, with high accuracy on held-out neurons. On refusal, control is typed: coherent bypass occurs without actionable content, while genuine actionable reach appears only in specific cases and at later rollout stages.

arxiv arXiv cs.CL · 6d ago

Causal Activation Directions for Mitigating Emergent Misalignment in Language Models

Fine-tuning language models on insecure code causes emergent misalignment. A shared activation direction across four model families achieves 99.6% separation of aligned and misaligned activations, and subtracting it reduces code spillover by 21-51 points. Cross-architecture transfer shows behavioral suppression but lacks specificity, with within-model directions being causally actionable and cross-model directions only causally real.

arxiv arXiv cs.LG · 7d ago

Discriminator-Guided RL Corrects Flow Matching with Data-Aligned Rewards

Discriminator-Guided RL (DRL) uses a pretrained representation space to train a discriminator that separates real data from model-generated samples. Its logit is used as a reward in KL-regularized RL, aligning model outputs with visual and semantic realism without human preferences. DRL improves FID and semantic FD across models like SiT and JiT, and enhances the Pareto frontier between preference and fidelity.

arxiv arXiv cs.LG · 7d ago

MAST Enables Selective Unlearning in RLVR-Induced Reasoning

MAST, a mechanism-guided unlearning method, achieves targeted forgetting of RLVR-induced reasoning with minimal collateral damage. On Qwen2.5-Math-1.5B and Qwen3-1.7B-Base, it significantly reduces MATH performance (45/150 to 37/15-0) while preserving GSM8K accuracy by +0.8 points and maintaining MATH retention at -0.5 points. Results hold across different seeds, objectives, and models, showing superior stability over full-parameter unlearning.

arxiv arXiv cs.LG · 7d ago

Diffusion-Proof: First Framework for Diffusion LLMs in Formal Theorem Proving

Diffusion-Proof is the first framework to train and apply diffusion language models for formal theorem proving. It introduces dLLM-Prover-7B for whole-proof writing with long-range coherence and dLLM-Corrector-7- for local proof correction using bidirectional information. The framework outperforms auto-regressive LLM baselines by 1.61% on ProofNet-Test and 6.14% on MiniF2F-Test, and solves an IMO problem beyond the capability of DeepSeek-Prover-V2-7B.