Evaluation & benchmarks
media Hugging Face Forums · 10h ago

Community Inquiry on Model Benchmarking Methods

A user on the Hugging Face discussion forum asked how to benchmark machine learning models. The poster is new to fine-tuning and wants to evaluate models once training is finished, and explicitly asks which established methods or strategies the community uses. The question reflects a common need among practitioners to understand standard evaluation practices in model development. The thread currently contains only this one post, and the visible content offers no specific benchmarks, metrics, or technical solutions.

arxiv arXiv cs.AI · 15h ago

BabelJudge: Measuring LLM-as-a-Judge Reliability Across Languages and Agent Trajectories

BabelJudge introduces an open-source framework to measure four key bias modes in LLM judges across languages and agent trajectories. It reveals a significant reliability drop from Hindi to Swahili (0.714 to 0.550), a cross-lingual degradation that is invisible to raw accuracy. The framework enables bias-aware evaluation without human labels by using controlled perturbations to create known gold labels, and it extends to agentic workflows with new metrics on tool accuracy and hallucination detection.
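
The idea of building gold labels from controlled perturbations can be illustrated with a minimal sketch. It is not BabelJudge's code: the summary does not name the four bias modes, so position bias is used here purely as an assumed example, and the `judge` callable, the `evaluate_judge` helper, and the sample items are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical judge interface: (question, answer_a, answer_b) -> "A" or "B".
Judge = Callable[[str, str, str], str]


def evaluate_judge(judge: Judge, items: List[Tuple[str, str, str]]) -> Dict[str, float]:
    """Score a judge on (question, good_answer, degraded_answer) triples.

    The degraded answer is a deliberately damaged copy of the good one, so
    the gold label is known without any human annotation.
    """
    correct = 0
    consistent = 0
    for question, good, degraded in items:
        picks_good_first = judge(question, good, degraded) == "A"   # good shown first
        picks_good_second = judge(question, degraded, good) == "B"  # good shown second
        correct += int(picks_good_first) + int(picks_good_second)
        # An order-insensitive judge makes the same choice in both layouts.
        consistent += int(picks_good_first == picks_good_second)
    n = len(items)
    return {"accuracy": correct / (2 * n), "order_consistency": consistent / n}


def always_first(question: str, a: str, b: str) -> str:
    """Toy judge with extreme position bias: always prefers the first answer."""
    return "A"


items = [
    ("What is 2 + 2?", "2 + 2 equals 4.", "It is 5."),
    ("Capital of France?", "The capital is Paris.", "Rome."),
]
report = evaluate_judge(always_first, items)
```

For the toy judge, `report` shows chance-level accuracy and zero order consistency, which is how a perturbation-based check can expose a bias without human labels.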

arxiv arXiv cs.AI · 15h ago

SAFER: Reliable Test-Time Adaptation under Adversarial Streams

SAFER is a training-free framework that enhances robustness of test-time adaptation by using reliability-guided augmentation. It generates stochastic augmentations, pools predictions via correlation-weighted aggregation with outlier detection, and includes adaptive mixing to preserve clean performance under adversarial attacks. Evaluations on PACS, VLCS, and OfficeHome show improved resilience without sacrificing clean accuracy.
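
The pooling step can be sketched roughly as follows. This is an assumption-laden illustration rather than SAFER's actual procedure: the consensus vector, the `min_corr` cutoff used as the outlier test, and the sample predictions are all hypothetical, and the adaptive mixing step is left out.

```python
from typing import List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def pool_predictions(views: List[List[float]], min_corr: float = 0.0) -> List[float]:
    """Pool class-probability vectors predicted for augmented copies of one input.

    Each view is weighted by its correlation with the plain average of all
    views. Views at or below `min_corr` (an assumed threshold) are treated
    as outliers and receive zero weight.
    """
    num_classes = len(views[0])
    consensus = [sum(v[c] for v in views) / len(views) for c in range(num_classes)]
    corrs = [pearson(v, consensus) for v in views]
    weights = [r if r > min_corr else 0.0 for r in corrs]
    total = sum(weights)
    if total == 0:
        return consensus  # nothing looks reliable; fall back to the plain average
    return [
        sum(w * v[c] for w, v in zip(weights, views)) / total
        for c in range(num_classes)
    ]


# Made-up predictions: three augmented views agree, one disagrees strongly.
views = [
    [0.70, 0.20, 0.10],
    [0.65, 0.25, 0.10],
    [0.60, 0.30, 0.10],
    [0.05, 0.15, 0.80],
]
pooled = pool_predictions(views)
```

In this toy case the fourth view correlates negatively with the consensus, so it is dropped and the pooled prediction stays on the first class.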

arxiv arXiv cs.AI · 16h ago

Gold Points Sniper: Self-guided Visual Reasoning for Fine-grained Action Understanding

Gold Points Sniper (GPS) enables lightweight vision-language models to perform self-guided multimodal reasoning for fine-grained human action understanding. By integrating a Gold Points Extractor, Selective Socratic Questioner, and Semantic Entailment Evaluator, GPS achieves performance comparable to GPT-4o while maintaining superior factual accuracy on CAP benchmark-based instruction-tuning data.

arxiv arXiv cs.AI · 16h ago

MMGist: A Comprehensive Multimodal Benchmark for 2027

MMGist is a curated multimodal benchmark with 7,262 items, designed to address flaws in existing vision-language benchmarks. It reduces evaluation size by 69% and improves cross-model discrimination by 78%, while preserving model rankings with a Spearman correlation of 0.98. The benchmark highlights visual logic as a key weakness and emphasizes the importance of visual dependency, discriminative power, and reliability in evaluation.
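
For readers unfamiliar with the rank-preservation check, a Spearman correlation between model scores on a full benchmark and on a reduced subset can be computed as in the sketch below. The scores are invented for illustration and are not MMGist results; the helper names are hypothetical.

```python
from typing import List


def ranks(values: List[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = shared
        i = j + 1
    return result


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


# Made-up accuracy scores for five models (not taken from the paper).
full_scores = [71.2, 64.5, 58.9, 80.1, 49.3]
subset_scores = [69.8, 65.0, 55.1, 78.4, 50.2]
rho = spearman(full_scores, subset_scores)
```

A value of `rho` near 1 means the smaller benchmark orders the models almost exactly as the full one does, which is the property the reported 0.98 describes.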

arxiv arXiv cs.AI · 18h ago

PRIME: Evaluating Prompt Resolution in Conflicting Instructions

PRIME introduces a framework to analyze how large language models handle conflicting instructions by generating calibrated conflicts in response length, format, and reasoning. The study finds that conflict type has a greater impact on model behavior than model size, revealing diverse failure modes across conflict categories. Results highlight the need for conflict awareness and suggest instruction following cannot be reliably assessed through isolated benchmarks alone.