Meta AI — korshunov.ai

Calibration Without Comprehension in LLM Vulnerability Detection

CWE-Trace evaluates 8 vanilla and 15 LoRA-fine-tuned LLMs on Linux kernel vulnerability detection. Results show that data contamination offers no advantage and that fine-tuning only shifts output thresholds without altering decision policies. Despite improved detection scores, LLMs lack reliable security reasoning, with top-1 CWE accuracy below 1.3% and binary detection performance at 52.1%.

arXiv cs.LG · 6d ago

LLM Alignment Using Implicit User Feedback

A new dataset, IFLLM, collects mouse-trajectory and eye-gaze data from users interacting with LLMs. It shows that implicit feedback significantly improves LLM alignment, raising text-based reward model accuracy from 55% to 64% and nearly tripling response quality improvements after DPO training on eight LLMs.

arXiv cs.CL, cs.AI · 6d ago

MACR: Explicit Conflict Resolution for LLM Inference

MACR introduces a multi-agent reasoning framework to resolve knowledge conflicts in LLM inference by jointly assessing internal and external knowledge. It uses semantic entropy to measure confidence and employs three specialized agents to induce rules, detect conflicts, and resolve inconsistencies across contexts. Empirical results show MACR outperforms state-of-the-art methods and provides interpretable conflict resolutions.
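As a rough illustration of the confidence signal mentioned above, semantic entropy is commonly computed by sampling several answers, grouping those that mean the same thing, and taking the entropy over the groups. The sketch below assumes that general recipe; the function names and the naive string-matching equivalence check are hypothetical stand-ins (an entailment model would normally fill that role), and nothing here is taken from MACR's implementation.

```python
import math
from typing import Callable, List


def semantic_entropy(answers: List[str],
                     same_meaning: Callable[[str, str], bool]) -> float:
    """Entropy over meaning clusters of sampled answers (natural log)."""
    clusters: List[List[str]] = []
    for answer in answers:
        # Greedy clustering: join the first cluster whose representative matches.
        for cluster in clusters:
            if same_meaning(answer, cluster[0]):
                cluster.append(answer)
                break
        else:
            clusters.append([answer])
    total = len(answers)
    # Cluster probability is estimated by its share of the samples.
    return -sum((len(c) / total) * math.log(len(c) / total) for c in clusters)


def naive_same_meaning(a: str, b: str) -> bool:
    """Hypothetical placeholder: case- and punctuation-insensitive equality."""
    return a.strip().lower().rstrip(".") == b.strip().lower().rstrip(".")


if __name__ == "__main__":
    agree = ["Paris", "paris.", "Paris"]
    split = ["Paris", "Lyon"]
    print(semantic_entropy(agree, naive_same_meaning))  # low: one meaning
    print(semantic_entropy(split, naive_same_meaning))  # higher: two meanings
```

Low entropy means the samples agree on one meaning, which is usually read as high confidence; high entropy flags an answer worth checking against external context.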

arXiv cs.LG · 6d ago

VIMPO: Critic-Free Policy Optimization for LLMs

VIMPO introduces a critic-free policy optimization method that derives a policy-implied value function from KL-regularized reinforcement learning. It enables verifiable reward incorporation without training a critic and outperforms GRPO on mathematical benchmarks, especially under noisy rewards.
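For background, the textbook KL-regularized objective in its single-step, response-level form already ties a value to a policy in closed form, which is the usual sense in which a value function can be "implied" by a policy. The derivation below is that standard result, not VIMPO's own formulation, which may differ; the symbols \beta (regularization strength), \pi_{\mathrm{ref}} (reference policy), and Z (partition function) are notation assumed here.

```latex
\begin{align}
  &\max_{\pi}\; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
    - \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big) \\
  &\pi^{*}(y \mid x) = \frac{\pi_{\mathrm{ref}}(y \mid x)\, \exp\!\big( r(x, y) / \beta \big)}{Z(x)},
    \qquad Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\big( r(x, y) / \beta \big) \\
  &V^{*}(x) = \beta \log Z(x)
    = r(x, y) - \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
\end{align}
```

The last line holds for every response y, so the value can be read off from log-probability ratios and a verifiable reward without fitting a separate critic network.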

arXiv cs.LG · 6d ago

LLM-based Hierarchical Control in Multi-Agent Games

A hierarchical system using a pretrained LLM to select RL skill policies outperforms flat RL in a 2v2 King of the Hill environment. It matches hand-crafted behavior tree performance in win rate and is perceived as more human-like by 60% of users, highlighting effective coordination and adaptability without manual rule design.

arXiv cs.CL · 6d ago

Control-Window Law for Single-Neuron Steering in Language Models

A new framework defines when single-neuron interventions coherently control model behaviors without output collapse. The control window, based on alignment and norm ratios, predicts behavior triggers and collapse ceilings using forward pass data, with high accuracy on held-out neurons. On refusal, control is typed: coherent bypass occurs without actionable content, while genuine actionable reach appears only in specific cases and at later rollout stages.

arXiv cs.CL · 6d ago

Causal Activation Directions for Mitigating Emergent Misalignment in Language Models

Fine-tuning language models on insecure code causes emergent misalignment. A shared activation direction across four model families achieves 99.6% separation of aligned and misaligned activations, and subtracting it reduces code spillover by 21-51 points. Cross-architecture transfer suppresses the behavior but lacks specificity: within-model directions are causally actionable, whereas cross-model directions are only causally real.
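One common way to obtain and remove such a direction is a difference of class means followed by a projection. The sketch below assumes that recipe; the summary only says the direction is "subtracted", so the projection step, the function names, and the toy two-dimensional activations are all illustrative assumptions rather than the paper's procedure. Plain lists are used to keep the example dependency-free.

```python
import math
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Column-wise mean of a batch of activation vectors."""
    count = len(rows)
    return [sum(column) / count for column in zip(*rows)]


def mean_difference_direction(aligned: Sequence[Sequence[float]],
                              misaligned: Sequence[Sequence[float]]) -> Vector:
    """Unit vector pointing from the aligned mean to the misaligned mean."""
    mu_aligned = mean_vector(aligned)
    mu_misaligned = mean_vector(misaligned)
    diff = [m - a for m, a in zip(mu_misaligned, mu_aligned)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("class means coincide; no direction to extract")
    return [x / norm for x in diff]


def project_out(activation: Sequence[float], direction: Sequence[float]) -> Vector:
    """Remove the component of an activation along a unit direction."""
    coefficient = sum(a * d for a, d in zip(activation, direction))
    return [a - coefficient * d for a, d in zip(activation, direction)]


if __name__ == "__main__":
    aligned = [[0.0, 1.0], [0.0, 3.0]]      # toy activations, hypothetical
    misaligned = [[2.0, 1.0], [2.0, 3.0]]
    direction = mean_difference_direction(aligned, misaligned)
    print(direction)
    print(project_out([2.0, 5.0], direction))
```

After projection the edited activation has zero component along the direction, while everything orthogonal to it is left untouched.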

arXiv cs.LG · 7d ago

Skill-MAS: Evolving Meta-Skill for Automatic Multi-Agent Systems

Skill-MAS introduces a new approach that decouples experience retention from parametric updates by modeling orchestration as an evolvable Meta-Skill. It uses a closed-loop process involving multi-trajectory rollouts and selective reflection to distill reusable strategy principles, achieving strong performance gains and robust transferability across tasks and LLMs.

arXiv cs.LG, cs.AI · 7d ago

TAPO: Self-Distillation with Micro-Reflective Trajectories

TAPO advances self-distillation by constructing explicit micro-reflective trajectories that retain erroneous reasoning and insert natural-language diagnoses. These trajectories, derived from correct and incorrect model rollouts, provide fine-grained error corrections anchored in the model's own reasoning, improving both first-pass reasoning and error correction compared to GRPO.

arXiv cs.LG · 7d ago

REVES: Augmented Training for Test-Time Scaling

REVES introduces a two-stage iterative framework that enhances LLM reasoning through sequential revision and verification. It achieves +6.5 points over RL baselines and +4.0 points over standard multi-turn training on LiveCodeBench, using a 4B base model with fewer rollouts than large evolutionary systems. The method improves error correction and generalizes to out-of-distribution puzzles like n_queens and mini_sudoku.

arXiv cs.LG, cs.CL · 7d ago

EfficientRollout: System-Aware Self-Speculative Decoding for RL Rollouts

EfficientRollout introduces a self-speculative decoding framework that reduces rollout and end-to-end latency by up to 19.6% and 12.7% respectively, without compromising final model quality. It uses a quantized drafter derived from the target model and integrates a system-aware toggle policy to avoid compute-bound regimes, enabling effective speculation during evolving policy generations.

arXiv cs.CL · 7d ago

Misfired Alignment in LLMs: A Quantitative Study

A new study introduces VETO, a benchmark of 2,032 BBQ-derived contrastive pairs, to quantify misfired alignment in large language models. It defines the Misfired Alignment Rate (MAR) and finds that all benchmarked LLMs exhibit MARs between 4.7% and 18.9%, while human participants achieve 0%. The research shows alignment cues can amplify these failures, with evidence suppression occurring in late layers of models and emerging after instruction training.

arXiv cs.CL · 7d ago

Frustrated Synchronization Network Outperforms Transformers

The Frustrated Synchronization Network (FSN) achieves lower validation loss than a RoPE-SwiGLU transformer at every epoch on character-level text and code tasks. At one million parameters, FSN converges to a validation loss of 1.5953 ± 0.0014, outperforming the transformer's converged loss of 1.611. This advantage persists up to four million parameters, with ongoing evaluations beyond that scale.

arXiv cs.CL · 7d ago

Output Vector Editing Reduces Memorization in LLMs

A new method called output vector editing minimally modifies MLP neurons' output vectors to suppress memorized sequences in large language models, achieving up to 87.9% suppression in OLMo-7B. This approach outperforms zeroing neuron activations by a factor of 2.7 and works across four models from 36-7B parameters, with success rates scaling with model size and showing consistent performance across architectures.

arXiv cs.CL · 7d ago

GraphPO: Graph-based Policy Optimization for Reasoning Models

GraphPO introduces a directed acyclic graph framework to represent reasoning rollouts, merging semantically equivalent paths to reduce redundant exploration. It assigns efficiency and correctness advantages to edges, improving inference efficiency and process supervision while reducing advantage-estimation variance. Experiments show GraphPO outperforms chain- and tree-based methods on three LLMs across reasoning and agentic search tasks under identical token or response budgets.

arXiv cs.AI · 7d ago

Self-Conditioned Credit Assignment for RL with Verifiable Rewards

SC-GRPO uses per-token KL divergence from self-conditioned trajectories to weight gradients in reinforcement learning. It outperforms GRPO by 8.1% and DAPO by 5.9% across math, code, and agentic tasks, with superior out-of-distribution performance and better results than OPD.

arXiv cs.AI · 7d ago

RODS: Reward-Driven Online Data Synthesis for Multi-Turn Tool-Use Agents

RODS addresses sample depletion in multi-turn tool-use RL by using reward variance to detect capability boundaries. It synthesizes new data in real time, matching structural complexity of boundary samples, and maintains a dynamic replay buffer that co-evolves with the policy. RODS achieves performance comparable to a 17K-sample offline pipeline with 20x fewer trajectories.
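To make the reward-variance idea concrete: when every rollout for a task succeeds, or every rollout fails, the rewards carry no contrast, whereas mixed outcomes suggest the task sits near the edge of what the policy can do. The sketch below assumes that reading; the function name, the `min_variance` threshold, and the toy rewards are hypothetical and are not taken from the RODS paper.

```python
from statistics import pvariance
from typing import Dict, List


def boundary_prompts(rewards_by_prompt: Dict[str, List[float]],
                     min_variance: float = 0.1) -> List[str]:
    """Return prompts whose rollout rewards vary enough to be informative."""
    selected = []
    for prompt, rewards in rewards_by_prompt.items():
        # A single rollout has no spread to measure, so it is skipped.
        if len(rewards) > 1 and pvariance(rewards) >= min_variance:
            selected.append(prompt)
    return selected


if __name__ == "__main__":
    rollouts = {
        "easy": [1.0, 1.0, 1.0, 1.0],   # always solved: variance 0
        "hard": [0.0, 0.0, 0.0, 0.0],   # never solved: variance 0
        "edge": [1.0, 0.0, 1.0, 0.0],   # mixed outcomes: variance 0.25
    }
    print(boundary_prompts(rollouts))
```

Prompts selected this way would then serve as templates for synthesizing new tasks of similar structural complexity, which is the step the summary describes next.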