Evaluation & benchmarks
arxiv arXiv cs.CL · 8h ago

How Large Language Models Source Brand Reputation Across Languages and Markets

This study analyzes the citation sources used by large language models when answering questions about brands, focusing on the underlying web references rather than just the generated text. The researchers merged three Rankfor.AI datasets to examine 167,551 URL-grounded citations across 128 brands in 12 home markets and 13 languages. The analysis reveals that AI grounds brand answers overwhelmingly in third-party sources, with 85.7% of citations pointing to sites the brand does not own compared to only 14.3% for owned domains. The source base is highly concentrated and follows a Zipf law, where 80% of citations originate from approximately 18% of domains. Wikipedia emerges as the dominant reference site, being the most-cited domain in 11 of the 12 languages studied. The only exception is Lithuanian, where the business daily vz.lt slightly edges out Wikipedia with a 4.38% share. Additionally, the source mix shows market-specific variations, such as YouTube being the top cited domain for Polish national brands and HR portals supplying more citations than Polish Wikipedia.

arxiv arXiv cs.CL · 8h ago

ToolBench-X: Benchmarking Tool-Using Agents Under Unreliable Environments

The authors introduce ToolBench-X, a new benchmark designed to evaluate large language model agents under recoverable tool-environment unreliability. Unlike existing benchmarks that assume clean and stable environments, this framework injects five structured hazard types: Specification Drift, Invocation Error, Execution Failure, Output Drift, and Cross-source Conflict. The dataset contains executable multi-step tasks across diverse domains with deterministic tools and canonical final answers for automatic evaluation. Crucially, every injected instance remains solvable through valid recovery paths such as retrying, fallback, or verification. Experiments reveal a substantial reliability gap where agents performing well with reliable tools often fail under these hazards. Further analysis indicates that failures stem from limited hazard diagnosis and ineffective recovery rather than tool-use volume or inference budget. Targeted recovery hints successfully recover many failed tasks, whereas test-time scaling yields more limited gains. These findings suggest that evaluation must shift focus from function-call accuracy to task completion in unreliable environments.

media Hugging Face Forums · 11h ago

Community Inquiry on Model Benchmarking Methods

A user on the Hugging Face discussion forum posted a question seeking advice on how to benchmark machine learning models. The inquiry was initiated by an individual who is new to the field of fine-tuning and wishes to evaluate their models after completion. The post explicitly asks for established methods or strategies that the community uses for this purpose. It highlights a common need among practitioners to understand standard evaluation practices in model development. The discussion thread currently contains only one post from a single participant. No specific benchmarks, metrics, or technical solutions were provided within the visible content of the source.

arxiv arXiv cs.LG · 15h ago

Small Language Models Outperform Frontier LLMs in Relation Extraction

A fine-tuned 0.5B-parameter Qwen2.5 model achieves 0.83 micro-F1 in general-domain relation extraction, surpassing zero-shot GPT-5.4 and Claude Sonnet 4.6. On literary benchmarks, it reaches 0.92 on the Biographical dataset, outperforming GPT-5.4 and exceeding frontier models in accuracy, demonstrating that task-adapted small models can deliver high performance with minimal hardware and privacy overhead.

arxiv arXiv cs.AI · 16h ago

BabelJudge: Measuring LLM-as-a-Judge Reliability Across Languages and Agent Trajectories

BabelJudge introduces an open-source framework to measure four key bias modes in LLM judges across languages and agent trajectories. It reveals a significant reliability drop from Hindi to Swahili—0.714 to 0's 0.550—highlighting cross-lingual degradation invisible to raw accuracy. The framework enables bias-aware evaluation without human labels, using controlled perturbations to create known gold labels, and extends to agentic workflows with new metrics on tool accuracy and hallucination detection.

arxiv arXiv cs.AI · 17h ago

SAFER: Reliable Test-Time Adaptation under Adversarial Streams

SAFER is a training-free framework that enhances robustness of test-time adaptation by using reliability-guided augmentation. It generates stochastic augmentations, pools predictions via correlation-weighted aggregation with outlier detection, and includes adaptive mixing to preserve clean performance under adversarial attacks. Evaluations on PACS, VLCS, and OfficeHome show improved resilience without sacrificing clean accuracy.

arxiv arXiv cs.AI · 17h ago

Gold Points Sniper: Self-guided Visual Reasoning for Fine-grained Action Understanding

Gold Points Sniper (GPS) enables lightweight vision-language models to perform self-guided multimodal reasoning for fine-grained human action understanding. By integrating a Gold Points Extractor, Selective Socratic Questioner, and Semantic Entailment Evaluator, GPS achieves performance comparable to GPT-4o while maintaining superior factual accuracy on CAP benchmark-based instruction-tuning data.

arxiv arXiv cs.AI · 18h ago

MMGist: A Comprehensive Multimodal Benchmark for 2027

MMGist is a curated multimodal benchmark with 7,262 items, designed to address flaws in existing vision-language benchmarks. It reduces evaluation size by 69% and improves cross-model discrimination by 78%, while preserving model rankings with a Spearman correlation of 0.98. The benchmark highlights visual logic as a key weakness and emphasizes the importance of visual dependency, discriminative power, and reliability in evaluation.

arxiv arXiv cs.AI · 19h ago

PRIME: Evaluating Prompt Resolution in Conflicting Instructions

PRIME introduces a framework to analyze how large language models handle conflicting instructions by generating calibrated conflicts in response length, format, and reasoning. The study finds that conflict type has a greater impact on model behavior than model size, revealing diverse failure modes across conflict categories. Results highlight the need for conflict awareness and suggest instruction following cannot be reliably assessed through isolated benchmarks alone.