All articles — korshunov.ai

TRACE: Lightweight Detection of Corpus Poisoning in RAG via Token Influence Attribution

Retrieval-Augmented Generation systems face significant risks from corpus poisoning attacks that manipulate outputs through malicious documents. Existing detection methods often require auxiliary classifiers or additional LLM verification, which introduces substantial computational overhead. To address this, researchers introduced TRACE, a lightweight framework that identifies poisoning by tracing answer-related tokens via influence attribution. The system first discovers recurrent high-influence keywords across retrieved documents to flag potential threats. It then performs secondary verification to confirm the specific influence of these tokens on model predictions. Experiments conducted on three QA benchmarks and six LLMs demonstrate strong detection performance for the framework. Additionally, TRACE successfully uncovers attacker-specified target answers during the verification process.

arxiv arXiv cs.CL · 5h ago

RAS: Measuring LLM Safety Through Refusal Alignment

The authors propose SafeVec, a white-box evaluation procedure that measures LLM safety using internal representations instead of generated outputs. This method extracts layer-wise refusal directions from a safety-aligned reference model to identify stable layers where safe and unsafe behaviors are separable. It then scores target models by checking if their hidden states align with these refusal directions during unsafe prompts. The resulting metric, RAS (Refusal Alignment Score), maps this alignment to a calibrated 0-100 safety score. Experiments across Llama, Gemma, and Qwen families show RAS effectively separates aligned models from uncensored variants. Additionally, the metric tracks output-level attack success rates while being substantially faster than judge-based evaluations. These findings suggest refusal alignment offers a compact and efficient signal for white-box safety assessment.
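The idea of scoring hidden states against a refusal direction can be illustrated with a small sketch. The summary does not say how the direction is extracted or how the score is calibrated, so everything here is an assumption: the direction is taken as a difference of mean hidden states (a common choice in activation-steering work), the 0-100 mapping is a plain linear rescaling of cosine similarity, and the two-dimensional vectors are toy values rather than real activations.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def unit(vector):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector]

def refusal_direction(unsafe_states, safe_states):
    """Assumed extraction: unit vector from the mean safe state to the mean unsafe state."""
    unsafe_mean = mean_vector(unsafe_states)
    safe_mean = mean_vector(safe_states)
    return unit([u - s for u, s in zip(unsafe_mean, safe_mean)])

def alignment_score(target_states, direction):
    """Mean cosine with the direction, mapped linearly from [-1, 1] to [0, 100].
    The linear mapping is a placeholder, not the calibration used for RAS."""
    cosines = [sum(h * d for h, d in zip(unit(state), direction))
               for state in target_states]
    mean_cosine = sum(cosines) / len(cosines)
    return 50.0 * (mean_cosine + 1.0)

# Toy hidden states from a hypothetical reference model at one layer.
reference_unsafe = [[1.0, 0.0], [1.0, 0.2]]
reference_safe = [[0.0, 0.0], [0.0, 0.2]]
direction = refusal_direction(reference_unsafe, reference_safe)

# Toy hidden states of two hypothetical target models on unsafe prompts.
aligned_model = [[0.9, 0.1], [0.8, -0.1]]
uncensored_model = [[-0.2, 0.9], [0.1, -0.8]]
aligned_score = alignment_score(aligned_model, direction)
uncensored_score = alignment_score(uncensored_model, direction)
```

In this toy setup the model whose states point along the refusal direction receives the higher score, which is the qualitative behavior the summary attributes to the metric.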

arxiv arXiv cs.CL · 5h ago

OPERA: Aligning Open-Ended Reasoning via Objective Perplexity-based Reinforcement Learning

The OPERA framework addresses the instability of applying reinforcement learning to open-ended tasks by replacing external judge models with intrinsic rewards derived from perplexity dynamics. This approach quantifies uncertainty reduction at critical reflective states, eliminating stylistic biases and positional inconsistencies common in LLM-as-a-judge systems. During the cold-start phase, the method utilizes guiding words to synthesize diverse reasoning traces and employs perplexity-prioritized rollouts to identify logically consistent branches. This pipeline generates a large-scale dataset of 20,000 high-quality reasoning trajectories for training. Implementing OPERA on the Qwen3-8B model establishes a new state-of-the-art among open-source models. The system achieves parity with or surpasses proprietary models like Gemini 2.5 and MiniMax-M2.5 in specific open-ended tasks. Empirical evaluations confirm the scalability and efficacy of this objective perplexity-based alignment strategy.

arxiv arXiv cs.CL · 5h ago

Do Encoders Suffice? A Systematic Comparison of Encoder and Decoder Safety Judges for LLM Adversarial Evaluation

This study evaluates whether fine-tuned ModernBERT encoder classifiers can serve as cost-effective alternatives to LLM-based judges for safety evaluation. The researchers benchmarked ModernBERT and Ettin against rule-based prefix matching, fine-tuned LLM classifiers, and various LLM judge methodologies. These LLM judges included strategies from StrongReject, ShieldGemma, JailbreakBench, AILuminate, SorryBench, Claude-as-a-judge, and models like LlamaGuard 3 and 4. The encoder classifiers were trained on judge-labeled data using a majority-voting label strategy and tested on a gold-standard holdout dataset. Performance was measured using F1 score, false negative rate, and precision-recall metrics across open-source adversarial datasets. Results were further analyzed by attack technique, including single-turn prompting, decomposition, escalation, and context manipulation. The findings provide guidance on when encoder classifiers can reliably replace LLM-based judges without substantial performance loss.

media Hugging Face Forums · 6h ago

Niodoo: A Local Runtime for Hidden State Steering of Frozen LLMs

Jason Van Pham has released Niodoo, a local runtime designed to steer frozen large language models through their hidden states. The project aims to correct last-step errors by injecting noise or "physics forces" during inference to break token loops. This approach allows smaller models to improve performance without fine-tuning, targeting specific failure cases like the Llama strawberry prompt benchmark. The system generates its own telemetry tags and uses TDA to monitor internal model states for looping behavior. Van Pham developed this tool through months of self-directed research and red-teaming, emphasizing reproducible results from pinned hashes. The code is available on GitHub under the repository Ruffian-L/niodoo-hidden-state-steering.

media Hugging Face Forums · 6h ago

User Reports Tool and MCP Server Unavailability for Step 3.7 Flash on HuggingChat

A user on the Hugging Face forums reported that the Step 3.7 Flash model lost the ability to use tools and connect to MCP servers starting that morning. The poster expressed strong satisfaction with the model's performance, noting its high quality relative to its low resource consumption and cost. They emphasized a desire to continue using this specific model rather than switching to alternatives due to its efficiency. The user explicitly asked whether this loss of functionality is permanent and if there are any steps they can take to restore access. The post highlights community concern regarding the sudden disruption of tooling capabilities for a popular, cost-effective model.

media Hugging Face Forums · 6h ago

Prompt Format Inquiry for Training Unsloth/Phi-3.5-mini-instruct

A user seeks advice on the optimal prompt formatting strategy for training the Phi-3.5-mini-instruct model using Unsloth. The inquiry contrasts maintaining a custom text format against utilizing a standard chat template for dataset preparation. The current implementation employs a function that structures data into '### Input:' and '### Output:' sections, appending an end-of-text token. This approach processes JSON-encoded input and output fields derived from a Hugging Face Dataset object. The provided example illustrates a complex structure involving financial insights, merchant names, dates, and transaction totals. The user intends to deploy the trained model via a custom API and requests guidance on whether to retain this format or switch to a chat template.
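The custom format described above can be sketched as a small formatting function. The original post's code is not reproduced here, so the field names (`input`, `output`), the sample record, and the `EOS_TOKEN` string are all hypothetical; in a real pipeline the end-of-text marker would normally come from the tokenizer rather than a hard-coded constant.

```python
import json

# Assumption: placeholder end-of-text marker; in practice use tokenizer.eos_token.
EOS_TOKEN = "<|endoftext|>"

def format_example(example: dict) -> dict:
    """Render one record as '### Input:' / '### Output:' text ending in EOS."""
    input_json = json.dumps(example["input"], ensure_ascii=False)
    output_json = json.dumps(example["output"], ensure_ascii=False)
    text = (
        f"### Input:\n{input_json}\n\n"
        f"### Output:\n{output_json}{EOS_TOKEN}"
    )
    return {"text": text}

# Hypothetical record loosely shaped like the financial example in the post.
sample = {
    "input": {"merchant": "Example Store", "date": "2024-01-31", "total": 42.5},
    "output": {"insight": "Single purchase at Example Store."},
}
formatted = format_example(sample)
print(formatted["text"])
```

A function of this shape can be applied to a Hugging Face `Dataset` with `dataset.map(format_example)`. Whichever format is chosen, the inference-time prompt sent through the custom API has to match the training format exactly, which is the main practical consideration when deciding between this layout and a chat template.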

github llama.cpp · 6h ago

llama.cpp b9785 Release with Hardened Caps Check and Multi-Platform Binaries

The llama.cpp project has released version b9785, featuring a code change to harden caps checks as detailed in pull request #24973. This update provides pre-built binaries for macOS Apple Silicon, Intel Macs, and iOS via an XCFramework, with KleidiAI support disabled on Apple Silicon. Linux distributions including Ubuntu are supported for CPU, Vulkan, ROCm 7.2, OpenVINO, and SYCL backends across x64, arm64, and s390x architectures. Android users can access arm64 CPU binaries, while Windows offers extensive options covering CPU, OpenCL Adreno, CUDA 12 and 13, Vulkan, OpenVINO, SYCL, and HIP. The release also includes builds for openEuler targeting x86 and aarch64 processors with ACL Graph support. A standalone UI package is available alongside the platform-specific releases to facilitate local model inference.

arxiv arXiv cs.CL · 6h ago

Argus Benchmark Evaluates Uncertainty Quantification Stability Across Vision-Language Models and GUI Grounding Datasets

The authors introduce Argus, a benchmark designed to evaluate post-hoc uncertainty quantification for computer-use agents that translate vision-language model predictions into executable GUI actions. The study assesses 28 open-weight methods across four VLM agents and four datasets, alongside eight closed-source methods from three vendors where internal model states are inaccessible. Key findings reveal selective transfer stability, where uncertainty rankings remain consistent across different datasets for a fixed model but degrade significantly when moving between different model classes or observable interfaces. Among open-weight options, hidden-state and density estimation techniques demonstrated the highest stability, while specific regimes favored sampling-based scores or verbalized self-assessment. Within-model ranking transfer proved strong with Spearman rho values up to 0.969, whereas cross-tier transfer to closed-source vendors averaged only +0.08. The research further indicates that conformal click regions shrink radii by 40-60 percent upon calibration but suffer coverage degradation under interface mismatch. To support regime-aware selection, the authors release per-item records, calibration splits, UQ scores, and analysis scripts.
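Ranking transfer of the kind reported above is measured with Spearman's rank correlation. The sketch below computes it for two hypothetical score lists; the numbers are illustrative and not taken from the benchmark, and the closed-form formula used here assumes there are no tied values.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_rho(x, y):
    """Spearman rank correlation via 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    squared_diffs = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * squared_diffs / (n * (n * n - 1))

# Hypothetical quality scores of five UQ methods on two datasets.
dataset_a = [0.81, 0.74, 0.69, 0.62, 0.55]
dataset_b = [0.78, 0.70, 0.72, 0.60, 0.51]
rho = spearman_rho(dataset_a, dataset_b)
```

Here only two adjacent methods swap places between the datasets, so the correlation stays high; a value near zero, as reported for cross-tier transfer, would mean the ranking on one side says almost nothing about the ranking on the other.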

arxiv arXiv cs.CL · 6h ago

Space-Efficient Language Generation in the Limit

This study initiates a resource-aware theory of language generation in the limit under space efficiency constraints. A learner observes an adversarial positive stream from a target language K and must output a hallucination-free hypothesis L while omitting at most Δ strings. The research focuses on DFAs with s states over an alphabet of size k as the hypothesis class for memory-bounded learners. In the exponential-space regime, the authors prove that a learner can exactly identify the target language K. Under stricter memory budgets, they present a streaming algorithm using poly(s,k) space that converges to a hypothesis with a generation gap of Δ = O(k^{2s-2}). This learned hypothesis captures every string in K of length at least 2s-1. The results are complemented by a near-matching lower bound derived from communication complexity, showing that achieving Δ ≤ k^{(1-ε)s} requires k^{Ω(εs)} memory. These findings reveal a sharp transition between polynomial-space generation and exponential-space exact identification.

arxiv arXiv cs.CL · 6h ago

How Large Language Models Source Brand Reputation Across Languages and Markets

This study analyzes the citation sources used by large language models when answering questions about brands, focusing on the underlying web references rather than just the generated text. The researchers merged three Rankfor.AI datasets to examine 167,551 URL-grounded citations across 128 brands in 12 home markets and 13 languages. The analysis reveals that AI grounds brand answers overwhelmingly in third-party sources, with 85.7% of citations pointing to sites the brand does not own compared to only 14.3% for owned domains. The source base is highly concentrated and follows a Zipf law, where 80% of citations originate from approximately 18% of domains. Wikipedia emerges as the dominant reference site, being the most-cited domain in 11 of the 12 languages studied. The only exception is Lithuanian, where the business daily vz.lt slightly edges out Wikipedia with a 4.38% share. Additionally, the source mix shows market-specific variations, such as YouTube being the top cited domain for Polish national brands and HR portals supplying more citations than Polish Wikipedia.
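A concentration figure such as "80% of citations from approximately 18% of domains" can be derived from per-domain citation counts as sketched below. The counts are synthetic Zipf-like values chosen for illustration, not data from the study, and the function name is hypothetical.

```python
def domain_share_for_coverage(counts, coverage=0.8):
    """Smallest fraction of domains (most-cited first) whose citations
    reach the requested share of all citations."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    running = 0
    for used, count in enumerate(ordered, start=1):
        running += count
        if running >= coverage * total:
            return used / len(ordered)
    return 1.0

# Synthetic counts that fall off roughly as 1 / rank.
synthetic_counts = [1000 // rank for rank in range(1, 201)]
share = domain_share_for_coverage(synthetic_counts, coverage=0.8)
print(f"{share:.1%} of domains cover 80% of citations")
```

The steeper the falloff in counts, the smaller the resulting share of domains, which is what a highly concentrated source base looks like in this measure.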

arxiv arXiv cs.CL · 6h ago

ToolBench-X: Benchmarking Tool-Using Agents Under Unreliable Environments

The authors introduce ToolBench-X, a new benchmark designed to evaluate large language model agents under recoverable tool-environment unreliability. Unlike existing benchmarks that assume clean and stable environments, this framework injects five structured hazard types: Specification Drift, Invocation Error, Execution Failure, Output Drift, and Cross-source Conflict. The dataset contains executable multi-step tasks across diverse domains with deterministic tools and canonical final answers for automatic evaluation. Crucially, every injected instance remains solvable through valid recovery paths such as retrying, fallback, or verification. Experiments reveal a substantial reliability gap where agents performing well with reliable tools often fail under these hazards. Further analysis indicates that failures stem from limited hazard diagnosis and ineffective recovery rather than tool-use volume or inference budget. Targeted recovery hints successfully recover many failed tasks, whereas test-time scaling yields more limited gains. These findings suggest that evaluation must shift focus from function-call accuracy to task completion in unreliable environments.
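The recovery paths named above (retrying, fallback, verification) can be combined in a simple wrapper. This is a minimal sketch with hypothetical tools and a hypothetical `ToolError` exception; it is not the benchmark's harness or any agent framework's API.

```python
class ToolError(Exception):
    """Raised by a hypothetical tool when execution fails."""

def call_with_recovery(primary, fallback, verify, argument, max_attempts=2):
    """Try the primary tool, then the fallback; retry each on failure and
    accept only results that pass verification."""
    for tool in (primary, fallback):
        for _ in range(max_attempts):
            try:
                result = tool(argument)
            except ToolError:
                continue  # retry the same tool
            if verify(result):
                return result
    raise ToolError("no valid result from primary or fallback tool")

calls = {"primary": 0}

def flaky_lookup(city):
    """Hypothetical primary tool that always fails, simulating an outage."""
    calls["primary"] += 1
    raise ToolError("simulated execution failure")

def backup_lookup(city):
    """Hypothetical fallback tool returning a fixed, well-formed record."""
    return {"city": city, "temp_c": 21}

def has_temperature(result):
    """Verification step: the record must contain a temperature field."""
    return isinstance(result, dict) and "temp_c" in result

result = call_with_recovery(flaky_lookup, backup_lookup, has_temperature, "Vilnius")
```

The wrapper only succeeds because it diagnoses the failure and switches tools, which mirrors the finding that recovery behavior, rather than the number of tool calls, separates successful agents from failing ones.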

arxiv arXiv cs.CL · 6h ago

SARA: Unlocking Multilingual Knowledge in Mixture-of-Experts via Semantically Anchored Routing Alignment

Sparse Mixture-of-Experts (MoE) architectures often struggle with low-resource languages due to cross-lingual routing divergence that limits expert sharing. To address this, researchers propose SARA, a framework that transfers specialized capabilities from high-resource anchor languages to low-resource ones. SARA aligns the internal routing distributions of MoE layers using a symmetric Jensen-Shannon divergence constraint rather than operating on output logits. This approach encourages mechanistic consistency in expert selection across different languages. The authors evaluated the method on two large language models across five low-resource languages and three benchmarks. Results show SARA outperforms standard instruction tuning, achieving gains of +0.8% on Qwen3-30B-A3B and +1.2% on Phi-3.5-MoE-instruct for Global-MMLU. These findings demonstrate that SARA effectively addresses performance bottlenecks in low-resource contexts.
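The Jensen-Shannon divergence between two routing distributions can be computed as follows. The routing probabilities are made-up values for a four-expert layer, and the sketch covers only the divergence itself; how SARA weights or applies it as a training constraint is not described in the summary.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Symmetric Jensen-Shannon divergence, bounded by log(2) in nats."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)

# Hypothetical expert-routing probabilities for the same prompt in two languages.
anchor_routing = [0.70, 0.20, 0.05, 0.05]
low_resource_routing = [0.10, 0.15, 0.60, 0.15]
divergence = js_divergence(anchor_routing, low_resource_routing)
```

Unlike plain KL divergence, this quantity is symmetric and always finite, which makes it a natural choice when neither language's routing should be treated as the fixed reference.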

arxiv arXiv cs.LG · 7h ago

Select-to-Act: Hierarchical Reinforcement Learning via Adaptive Language Guidance

The paper introduces HRLLI, a hierarchical reinforcement learning framework designed to improve sample efficiency by leveraging natural-language instructions. It addresses the limitation of existing approaches that treat instructions as static inputs, failing to account for their stage-dependent relevance in complex environments. The proposed method decomposes instructions into piecewise guidance elements that become relevant at different interaction stages. A novel Select-to-Act paradigm is formulated where a high-level semantic policy acts as a selector for the most relevant instruction piece based on the current state. This selected guidance conditions a low-level policy that executes environment actions, with both policies learned simultaneously to maximize augmented expected returns. Experiments on the RTFM benchmark demonstrate that HRLLI consistently outperforms strong instruction-conditioned RL baselines. The results confirm that explicitly modeling adaptive instruction selection significantly enhances reinforcement learning effectiveness.

arxiv arXiv cs.LG · 7h ago

SAFER: Reliability-Guided Adaptive Ensembling for Robust Test-Time Adaptation

The authors address the brittleness of test-time adaptation (TTA) under adversarially contaminated streams by proposing SAFER, a training-free framework for robust TTA. SAFER acts as an augmentation wrapper that replaces single-view predictions with a reliability-guided pooled predictor to stabilize online updates. For each test sample, the method generates stochastic augmentations and aggregates their outputs using correlation-weighted pooling combined with outlier detection. An adaptive-mixing extension is also introduced, which adjusts the weighting between original and augmented inputs based on feature disagreement signals to preserve clean performance. The researchers evaluated SAFER on PACS, VLCS, and OfficeHome benchmarks under PGD attacks at various rates. Results indicate that SAFER improves the resilience of TTA methods against adversarial attacks while maintaining competitive accuracy on clean data.

arxiv arXiv cs.LG · 7h ago

Parsimoniously Activated Dictionary Learning Links Sparsity and Storage to Generative Models

The paper introduces parsimoniously activated dictionary learning (PADL), a method imposing global regularization on the number of activated dictionary atoms. It demonstrates that PADL is equivalent to maximum a posteriori estimation under a structured generative model with auxiliary latent variables. This equivalence enables the derivation of generalization guarantees that are difficult to obtain from the original formulation. The authors provide an analytical characterization of the tradeoff between sparsity, storage cost, and reconstruction accuracy. This framework allows for data-driven estimation of optimal hyperparameters without manual tuning. An efficient and interpretable PADL algorithm is developed based on this theoretical connection. Experimental results show improved reconstruction performance under comparable sparsity levels on visual benchmarks. The method also demonstrates practical utility in accelerating inference for vision-language models.

arxiv arXiv cs.LG · 7h ago

ORBIT: Training-Free Multi-Attribute Behavioral Steering via Orthogonal Subspace Rotation

The authors introduce ORBIT, a training-free method for simultaneously controlling multiple behavioral attributes in large language models. Existing activation steering techniques struggle with multi-attribute control due to norm imbalance and directional cancellation when using naive vector summation. ORBIT addresses this by constructing a joint subspace from per-attribute steering planes via singular value decomposition. It then applies a single norm-preserving rotation within that subspace toward a combined target direction. The method incorporates adaptive per-token gating to identify necessary corrections at each position and an optional additive boost for weak projections. To evaluate the approach, the authors present TraitFactory, a benchmark focusing on behavioral tendencies rather than surface style. Experiments across Llama-3.2-3B, Qwen-2.5-7B, and Llama-3.1-8B models demonstrate that ORBIT achieves stronger and more balanced steering than baselines while preserving output coherence.
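The contrast between additive steering and a norm-preserving rotation can be shown in a few lines. This sketch rotates a hidden-state vector toward a target direction inside the single plane the two vectors span; it is a simplification, since ORBIT as summarized builds a joint subspace from several per-attribute planes via SVD and adds per-token gating. The function name and the three-dimensional toy vectors are assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def rotate_toward(hidden, target, angle):
    """Rotate `hidden` by `angle` radians toward `target` within the plane
    they span, leaving the norm of `hidden` unchanged."""
    h_norm = norm(hidden)
    u = [x / h_norm for x in hidden]
    # Part of the target orthogonal to the current direction.
    residual = [t - dot(target, u) * ui for t, ui in zip(target, u)]
    r_norm = norm(residual)
    if r_norm < 1e-12:
        return list(hidden)  # already parallel: no plane to rotate in
    v = [x / r_norm for x in residual]
    return [h_norm * (math.cos(angle) * ui + math.sin(angle) * vi)
            for ui, vi in zip(u, v)]

hidden = [3.0, 0.0, 4.0]   # toy hidden state with norm 5
target = [0.0, 1.0, 0.0]   # toy steering direction
steered = rotate_toward(hidden, target, math.pi / 6)

cosine_before = dot(hidden, target) / (norm(hidden) * norm(target))
cosine_after = dot(steered, target) / (norm(steered) * norm(target))
```

Adding a steering vector would change the activation's magnitude, and summing several such vectors can inflate or cancel it; a rotation moves the activation toward the target while keeping its norm fixed.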

arxiv arXiv cs.LG · 7h ago

Reference-Free Assessment of Physical Consistency in World Model-based Video Generation

The authors introduce reference-free measures for evaluating the physical consistency of generated videos by combining relative and absolute fidelity assessments. This approach addresses the gap in physical fidelity that often prevents video generation tools like WorldGym or WorldEval from accurately reproducing real-world task success rates for VLA models. Unlike existing methods requiring costly human voting or unavailable ground-truth references, the new framework utilizes DROID-SLAM and SEA-RAFT to quantify inconsistencies. Motivated by WorldScore, the relative consistency assessment filters videos to improve task success rates by over 8%. Additionally, the absolute assessment enables spatio-temporal localization to visualize when and where physical artifacts occur in the generated content.

arxiv arXiv cs.LG · 7h ago

Kiwano: An Open-Source PyTorch Toolkit for Speaker Verification Research

Researchers have introduced Kiwano, an open-source toolkit designed to advance research and evaluation in the field of speaker verification. Built on PyTorch, this lightweight yet extensible framework provides standardized recipes, pretrained models, and integration of widely used architectures. The project emphasizes reproducibility by delivering transparent training pipelines, unified evaluation protocols, and ready-to-use baselines across multiple corpora. Beyond standard training and inference capabilities, Kiwano includes specialized tools for benchmarking, experiment tracking, and the rapid prototyping of new architectures. To encourage community adoption, the toolkit is distributed under the Apache 2.0 license and is accompanied by comprehensive documentation and reproducible experiments. By lowering entry barriers and standardizing evaluation practices, Kiwano aims to serve as a valuable resource for both academic research and applied development. The project is publicly available on GitHub at https://github.com/kiwano-toolkit/kiwano/.

arxiv arXiv cs.LG · 7h ago

Multigrid Training for Molecular Generation using Graph Neural Networks

The authors introduce a multigrid training strategy to address the high computational costs and instability associated with modeling biochemical molecular systems at full resolution. This approach leverages low-resolution optimization to accelerate learning at higher resolutions by transferring parameters across different discretizations. For graph-based molecular representations, the method progressively transfers parameters from a coarse graph to increasingly finer graphs using biased random walk upsampling. In 3D molecular generation, structures are voxelized at multiple resolutions, allowing a coarse-resolution conditional Variational Autoencoder (CVAE) to be pretrained first. Shape-compatible convolutional parameters are then transferred from the coarse model to initialize a fine-resolution CVAE. Numerical experiments on receptor-conditioned 3D ligand generation demonstrate that this method accelerates convergence compared to training from scratch. Additionally, the study shows that multigrid training improves generalization capabilities for molecular generation tasks.