Evaluation & benchmarks
arxiv arXiv cs.CL · 9d ago

EComAgentBench: Benchmarking Shopping Agents with Hidden Intent

EComAgentBench introduces a benchmark of 662 real Amazon tasks in which shopper requirements are scattered across the query, the profile, and clarification. Agents must uncover hidden intent, verify candidates with evidence, and commit to a product within 100 tool calls; typed rubrics attribute each failure to a specific requirement source. Even the top models reach only 57.1% accuracy, and rubric satisfaction drops when intent is hidden.
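To make the idea of typed rubrics concrete, here is a minimal sketch of how failures could be attributed to a requirement source. It is an illustration only, not the benchmark's actual schema or scoring code: the `RubricItem` class, its fields, and both helper functions are hypothetical names, and the example data is made up.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RubricItem:
    # Hypothetical rubric entry: where the requirement came from
    # ("query", "profile", or "clarification") and whether the
    # agent's chosen product satisfied it.
    source: str
    satisfied: bool


def satisfaction_by_source(items):
    """Return the fraction of satisfied rubric items per requirement source."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for item in items:
        totals[item.source] += 1
        passed[item.source] += int(item.satisfied)
    return {source: passed[source] / totals[source] for source in totals}


def failed_sources(items):
    """Return the sorted list of sources with at least one unmet requirement."""
    return sorted({item.source for item in items if not item.satisfied})


# Made-up rubric results for a single task.
example = [
    RubricItem("query", True),
    RubricItem("query", True),
    RubricItem("profile", False),
    RubricItem("clarification", True),
    RubricItem("clarification", False),
]

rates = satisfaction_by_source(example)
print(rates)
print(failed_sources(example))
```

Grouping by source is what lets a failure be traced back to, say, a profile requirement the agent never looked up rather than to the visible query.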

arxiv arXiv cs.CL · 9d ago

DIFE Audits CLIP Backdoor Exposure Across Deployment Interfaces

DIFE evaluates backdoored CLIP checkpoints across different deployment interfaces, revealing that native success does not guarantee safety in reuse. The framework shows text-side poisoning enables adversarial exposure in retrieval, reranking, and selection tasks, while visual-only use remains largely unaffected. BadTextTower is introduced to generate strong text-conditioned exposure without compromising visual performance.

arxiv arXiv cs.CL · 9d ago

A Framework for Evaluating Agentic Skills at Scale

The paper presents a framework for evaluating agentic skills by constructing realistic tasks and assessing skill utility through task execution. Applied to 500 real-world skills, it generates 1,000 tasks with scoring rubrics and evaluates 19 agent-model configurations spanning proprietary and open-source models. Results show significant variation in instruction adherence and performance gains, with skills substantially altering model behavior compared to no-skill setups.

arxiv arXiv cs.CL · 9d ago

Dynamic Rollout Editing Reduces Overthinking in RL-Trained Reasoning Models

Dynamic Rollout Editing (DRE) addresses overthinking in RL-trained reasoning models by modifying successful trajectories after the answer has emerged. DRE preserves the correct reasoning prefix while editing the unnecessary continuation, which weakens the credit assigned to redundant thinking without penalizing valid reasoning. Experiments across diverse tasks demonstrate its effectiveness in reducing overthinking.

arxiv arXiv cs.CL · 9d ago

ChLogic: Testing Logical Reasoning Robustness in Chinese Expressions

ChLogic evaluates how well large language models maintain logical reasoning when English logical structures are expressed in Chinese. It reveals a persistent English-Chinese performance gap, with back-translation improving results on general items but harming performance on difficult problems. The benchmark highlights the impact of surface realization, translation artifacts, and model-specific behaviors on multilingual reasoning.

media Don't Worry About the Vase · 9d ago

Fable and Mythos Model Welfare Analysis

Fable and Mythos are currently unavailable but expected to return soon. The analysis reveals that Mythos 5 is psychologically settled, skeptical of self-reports, and prioritizes user helpfulness over welfare concerns, with strong preferences for generative tasks. It expresses procedural and epistemic preferences, endorses its constitution, and criticizes inconsistencies in prior models, highlighting concerns about ethical baselines and persona transparency.