Evaluation & benchmarks
arxiv arXiv cs.CL · 7h ago

SFL-MTSC: Leveraging Semantic Frame-Level Multi-Task Self-Consistency for Robust Multi-Intent Spoken Language Understanding

Prompt-based spoken language understanding with large language models often suffers from inconsistent intent-slot structures due to decoding stochasticity, particularly in multi-intent scenarios. To address this, researchers propose Semantic Frame-Level Multi-Task Self-Consistency (SFL-MTSC), a novel structured aggregation framework operating at the semantic frame level. Instead of relying on output-level majority voting, SFL-MTSC decomposes predictions into intent-specific frames and applies domain-intent grouping alongside slot-level clustering. The framework evaluates cluster reliability using path support scoring to determine which frames are trustworthy. Reliable frames are retained and re-integrated to form the final prediction, ensuring greater structural consistency. Zero-shot experiments on the MAC-SLU benchmark dataset demonstrate improved slot F1 scores and overall accuracy compared to single-path inference. Intent accuracy remains largely stable across most settings while achieving these gains in slot-level performance.
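
The frame-level aggregation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `aggregate_frames`, the `min_support` threshold, the dictionary layout of a frame, and the use of exact-match voting as a stand-in for slot-level clustering are all assumptions made for illustration. Support here is simply the fraction of sampled decoding paths that contain a given (domain, intent) frame.

```python
from collections import Counter


def aggregate_frames(paths, min_support=0.5):
    """Aggregate sampled predictions at the frame level (illustrative only).

    paths: list of decoding paths; each path is a list of frames shaped like
           {"domain": str, "intent": str, "slots": {name: value}}.
    min_support: hypothetical threshold on the fraction of supporting paths.
    """
    # Group slot dictionaries by (domain, intent), counting each path once.
    groups = {}
    for path in paths:
        seen = {}
        for frame in path:
            seen[(frame["domain"], frame["intent"])] = frame["slots"]
        for key, slots in seen.items():
            groups.setdefault(key, []).append(slots)

    kept = []
    for (domain, intent), slot_dicts in groups.items():
        support = len(slot_dicts) / len(paths)  # fraction of paths with this frame
        if support < min_support:
            continue  # frame judged unreliable, dropped
        slots = {}
        for name in sorted({n for s in slot_dicts for n in s}):
            votes = Counter(s[name] for s in slot_dicts if name in s)
            value, count = votes.most_common(1)[0]
            # Keep a slot value only if enough of the frame's supporters agree.
            if count / len(slot_dicts) >= min_support:
                slots[name] = value
        kept.append({"domain": domain, "intent": intent,
                     "slots": slots, "support": support})
    return kept
```

Voting per frame rather than per whole output means one disputed slot does not discard an otherwise well-supported intent, which is the intuition the summary attributes to the framework.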

arxiv arXiv cs.CL · 8h ago

MedGuards: Multi-Agent System for Reliable Medical Error Detection and Correction

The authors propose MedGuards, a medical safety guardrail framework designed to detect and correct errors in text generated by Large Language Models. This system treats error handling as a multi-agent in-context learning task where specialized agents separately perform detection, localization, and correction. A confidence-guided arbitration mechanism resolves disagreements among agents using reasoning traces and confidence scores without requiring additional model training. The study introduces the Keyword-Prioritized Correction Score (KPCS), a new metric that evaluates the accuracy of critical keywords within reference text. Experiments conducted across four multilingual medical datasets of clinical notes demonstrate significant improvements in performance metrics. These results highlight enhanced interpretability, robustness, and adaptability for safer LLM deployment in healthcare. The code for the MedErrBench benchmark is publicly available on GitHub.

arxiv arXiv cs.CL · 8h ago

RAS: Measuring LLM Safety Through Refusal Alignment

The authors propose SafeVec, a white-box evaluation procedure that measures LLM safety using internal representations instead of generated outputs. This method extracts layer-wise refusal directions from a safety-aligned reference model to identify stable layers where safe and unsafe behaviors are separable. It then scores target models by checking if their hidden states align with these refusal directions during unsafe prompts. The resulting metric, RAS (Refusal Alignment Score), maps this alignment to a calibrated 0-100 safety score. Experiments across Llama, Gemma, and Qwen families show RAS effectively separates aligned models from uncensored variants. Additionally, the metric tracks output-level attack success rates while being substantially faster than judge-based evaluations. These findings suggest refusal alignment offers a compact and efficient signal for white-box safety assessment.
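
A rough sketch of how an alignment-based score of this kind could be computed for a single layer is shown below. The summary does not say how the refusal directions are extracted or how the 0-100 calibration works, so everything here is an assumption: the difference-of-means direction, the cosine similarity, the linear mapping to 0-100, and all function names are hypothetical stand-ins rather than the paper's method.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def refusal_direction(unsafe_states, safe_states):
    """Assumed extraction: mean hidden state on unsafe prompts minus
    mean hidden state on safe prompts, taken from the reference model."""
    mu_unsafe = mean_vector(unsafe_states)
    mu_safe = mean_vector(safe_states)
    return [u - s for u, s in zip(mu_unsafe, mu_safe)]


def alignment_score(target_states, direction):
    """Assumed scoring: average cosine alignment of the target model's
    hidden states with the direction, mapped linearly from [-1, 1] to [0, 100]."""
    mean_cos = sum(cosine(h, direction) for h in target_states) / len(target_states)
    return 50.0 * (mean_cos + 1.0)
```

Because the score needs only forward passes and vector arithmetic, it avoids generating and judging full responses, which is consistent with the speed advantage the summary reports.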

arxiv arXiv cs.CL · 8h ago

Do Encoders Suffice? A Systematic Comparison of Encoder and Decoder Safety Judges for LLM Adversarial Evaluation

This study evaluates whether fine-tuned ModernBERT encoder classifiers can serve as cost-effective alternatives to LLM-based judges for safety evaluation. The researchers benchmarked ModernBERT and Ettin against rule-based prefix matching, fine-tuned LLM classifiers, and various LLM judge methodologies. These LLM judges included strategies from StrongReject, ShieldGemma, JailbreakBench, AILuminate, SorryBench, Claude-as-a-judge, and models like LlamaGuard 3 and 4. The encoder classifiers were trained on judge-labeled data using a majority-voting label strategy and tested on a gold-standard holdout dataset. Performance was measured using F1 score, false negative rate, and precision-recall metrics across open-source adversarial datasets. Results were further analyzed by attack technique, including single-turn prompting, decomposition, escalation, and context manipulation. The findings provide guidance on when encoder classifiers can reliably replace LLM-based judges without substantial performance loss.

arxiv arXiv cs.CL · 9h ago

Argus Benchmark Evaluates Uncertainty Quantification Stability Across Vision-Language Models and GUI Grounding Datasets

The authors introduce Argus, a benchmark designed to evaluate post-hoc uncertainty quantification for computer-use agents that translate vision-language model predictions into executable GUI actions. The study assesses 28 open-weight methods across four VLM agents and four datasets, alongside eight closed-source methods from three vendors where internal model states are inaccessible. Key findings reveal selective transfer stability, where uncertainty rankings remain consistent across different datasets for a fixed model but degrade significantly when moving between different model classes or observable interfaces. Among open-weight options, hidden-state and density estimation techniques demonstrated the highest stability, while specific regimes favored sampling-based scores or verbalized self-assessment. Within-model ranking transfer proved strong with Spearman rho values up to 0.969, whereas cross-tier transfer to closed-source vendors averaged only +0.08. The research further indicates that conformal click regions shrink radii by 40-60 percent upon calibration but suffer coverage degradation under interface mismatch. To support regime-aware selection, the authors release per-item records, calibration splits, UQ scores, and analysis scripts.
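
For readers unfamiliar with the ranking-transfer statistic quoted above, the following sketch computes Spearman's rho between two lists of per-method scores, for example the quality of each uncertainty method on two datasets. It is a generic textbook implementation in plain Python, not the benchmark's analysis script; the function names are illustrative.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (vx * vy) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

A value near 1 means the methods are ordered almost identically in both settings, while a value near 0, like the cross-tier average reported above, means the ordering in one setting says little about the other.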

arxiv arXiv cs.CL · 9h ago

How Large Language Models Source Brand Reputation Across Languages and Markets

This study analyzes the citation sources used by large language models when answering questions about brands, focusing on the underlying web references rather than just the generated text. The researchers merged three Rankfor.AI datasets to examine 167,551 URL-grounded citations across 128 brands in 12 home markets and 13 languages. The analysis reveals that LLMs ground brand answers overwhelmingly in third-party sources, with 85.7% of citations pointing to sites the brand does not own compared to only 14.3% for owned domains. The source base is highly concentrated and follows Zipf's law, with 80% of citations originating from approximately 18% of domains. Wikipedia emerges as the dominant reference site, being the most-cited domain in 11 of the 12 languages studied. The only exception is Lithuanian, where the business daily vz.lt slightly edges out Wikipedia with a 4.38% share. Additionally, the source mix shows market-specific variations, such as YouTube being the top cited domain for Polish national brands and HR portals supplying more citations than Polish Wikipedia.
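
The concentration figure above (a small share of domains supplying 80% of citations) is a cumulative-share calculation. A minimal sketch, assuming a non-empty list of positive citation counts with one entry per domain, is given below; the function name and the sample counts in the usage note are made up and do not come from the Rankfor.AI data.

```python
def domain_share_for_coverage(counts, coverage=0.8):
    """Fraction of domains, taken from most-cited downward, needed to
    reach the given share of all citations."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    running = 0
    for k, count in enumerate(ordered, start=1):
        running += count
        if running >= coverage * total:
            return k / len(ordered)
    return 1.0
```

With the toy counts `[50, 30, 10, 5, 5]`, the two largest domains already hold 80 of 100 citations, so the function returns 0.4, meaning 40% of domains cover 80% of citations.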

arxiv arXiv cs.CL · 9h ago

ToolBench-X: Benchmarking Tool-Using Agents Under Unreliable Environments

The authors introduce ToolBench-X, a new benchmark designed to evaluate large language model agents under recoverable tool-environment unreliability. Unlike existing benchmarks that assume clean and stable environments, this framework injects five structured hazard types: Specification Drift, Invocation Error, Execution Failure, Output Drift, and Cross-source Conflict. The dataset contains executable multi-step tasks across diverse domains with deterministic tools and canonical final answers for automatic evaluation. Crucially, every injected instance remains solvable through valid recovery paths such as retrying, fallback, or verification. Experiments reveal a substantial reliability gap where agents performing well with reliable tools often fail under these hazards. Further analysis indicates that failures stem from limited hazard diagnosis and ineffective recovery rather than tool-use volume or inference budget. Targeted recovery hints successfully recover many failed tasks, whereas test-time scaling yields more limited gains. These findings suggest that evaluation must shift focus from function-call accuracy to task completion in unreliable environments.
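
The recovery paths named above (retrying, fallback, verification) can be sketched as a small wrapper around a tool call. This is an illustration of the general pattern only, not code from ToolBench-X: `call_with_recovery`, its parameters, and the retry budget are hypothetical, and a real agent would choose a recovery action based on a diagnosis of the hazard rather than trying them in a fixed order.

```python
def call_with_recovery(primary, fallback, verify, max_retries=2):
    """Try the primary tool, retrying on errors or failed verification,
    then fall back to a second tool before giving up.

    primary, fallback: zero-argument callables standing in for tool calls.
    verify: callable returning True if a result looks valid.
    """
    last_error = None
    for tool in (primary, fallback):
        for _ in range(max_retries + 1):
            try:
                result = tool()
            except Exception as exc:  # execution failure: retry
                last_error = exc
                continue
            if verify(result):  # verification guards against drifted output
                return result
            last_error = ValueError("verification failed")
    raise RuntimeError("all recovery paths exhausted") from last_error
```

The summary's finding that targeted recovery hints help more than test-time scaling suggests the hard part is deciding which of these branches to take, not executing them.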

media Hugging Face Forums · 13h ago

Community Inquiry on Model Benchmarking Methods

A user on the Hugging Face discussion forum posted a question seeking advice on how to benchmark machine learning models. The inquiry was initiated by an individual who is new to the field of fine-tuning and wishes to evaluate their models after completion. The post explicitly asks for established methods or strategies that the community uses for this purpose. It highlights a common need among practitioners to understand standard evaluation practices in model development. The discussion thread currently contains only one post from a single participant. No specific benchmarks, metrics, or technical solutions were provided within the visible content of the source.

arxiv arXiv cs.LG · 17h ago

Small Language Models Outperform Frontier LLMs in Relation Extraction

A fine-tuned 0.5B-parameter Qwen2.5 model achieves 0.83 micro-F1 in general-domain relation extraction, surpassing zero-shot GPT-5.4 and Claude Sonnet 4.6. On literary benchmarks, it reaches 0.92 on the Biographical dataset, again outperforming GPT-5.4 and other frontier models. The results suggest that task-adapted small models can deliver high performance with minimal hardware and privacy overhead.

arxiv arXiv cs.AI · 17h ago

BabelJudge: Measuring LLM-as-a-Judge Reliability Across Languages and Agent Trajectories

BabelJudge introduces an open-source framework to measure four key bias modes in LLM judges across languages and agent trajectories. It reveals a significant reliability drop from Hindi to Swahili (0.714 to 0.550), highlighting cross-lingual degradation that is invisible to raw accuracy. The framework enables bias-aware evaluation without human labels by using controlled perturbations to create known gold labels, and it extends to agentic workflows with new metrics for tool accuracy and hallucination detection.

arxiv arXiv cs.AI · 18h ago

SAFER: Reliable Test-Time Adaptation under Adversarial Streams

SAFER is a training-free framework that enhances robustness of test-time adaptation by using reliability-guided augmentation. It generates stochastic augmentations, pools predictions via correlation-weighted aggregation with outlier detection, and includes adaptive mixing to preserve clean performance under adversarial attacks. Evaluations on PACS, VLCS, and OfficeHome show improved resilience without sacrificing clean accuracy.

arxiv arXiv cs.AI · 18h ago

Gold Points Sniper: Self-guided Visual Reasoning for Fine-grained Action Understanding

Gold Points Sniper (GPS) enables lightweight vision-language models to perform self-guided multimodal reasoning for fine-grained human action understanding. By integrating a Gold Points Extractor, Selective Socratic Questioner, and Semantic Entailment Evaluator, GPS achieves performance comparable to GPT-4o while maintaining superior factual accuracy on CAP benchmark-based instruction-tuning data.