Reasoning models
arXiv cs.AI · 7d ago

Introducing Rule Violation Score for Logical Compliance

We introduce the Rule Violation Score (RVS), a metric that evaluates how well predictive models adhere to logical rules. RVS distinguishes between hard and soft rules, works with any relational dataset and model, and can be computed via SQL queries for Horn rules. Evaluation on multiple benchmarks shows that models with similar predictive accuracy can differ greatly in logical compliance, highlighting RVS's ability to reveal behaviors missed by standard metrics.
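
The summary does not give the RVS formula, so the following is only a minimal sketch of the general idea of checking a Horn rule with SQL, not the paper's metric. It assumes a hypothetical rule `parent(x, y) AND parent(y, z) -> grandparent(x, z)` over a model's predicted facts and reports the fraction of body groundings whose head is missing. The table names, the rule, and the function `horn_violation_rate` are all illustrative assumptions.

```python
import sqlite3


def horn_violation_rate(parent_facts, grandparent_facts):
    """Fraction of groundings of the (hypothetical) rule
    parent(x, y) AND parent(y, z) -> grandparent(x, z)
    whose head is absent from the predicted facts.

    Illustrative violation rate only; NOT the paper's RVS definition.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE parent (a TEXT, b TEXT)")
    conn.execute("CREATE TABLE grandparent (a TEXT, b TEXT)")
    conn.executemany("INSERT INTO parent VALUES (?, ?)", parent_facts)
    conn.executemany("INSERT INTO grandparent VALUES (?, ?)", grandparent_facts)

    # Every (x, z) pair for which the rule body holds.
    body_sql = """
        SELECT DISTINCT p1.a AS x, p2.b AS z
        FROM parent p1 JOIN parent p2 ON p1.b = p2.a
    """
    total = conn.execute(f"SELECT COUNT(*) FROM ({body_sql})").fetchone()[0]

    # Body groundings with no matching head fact are violations.
    violated = conn.execute(f"""
        SELECT COUNT(*) FROM ({body_sql}) AS body
        WHERE NOT EXISTS (
            SELECT 1 FROM grandparent g
            WHERE g.a = body.x AND g.b = body.z
        )
    """).fetchone()[0]
    conn.close()
    return violated / total if total else 0.0
```

A soft rule could presumably tolerate a nonzero rate while a hard rule would require zero, but how RVS weights the two is not stated in the summary.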

arXiv cs.AI · 7d ago

ScholarQuest: Taxonomy-Guided Benchmark for Agentic Academic Search

ScholarQuest is a large-scale benchmark for agentic academic paper search, built from 1,000 computer science topics and four research intents. It includes scalable answer construction and a shared retrieval backend, ScholarBase, enabling reproducible evaluation. Results show agentic methods outperform baseline retrieval, with the best agent achieving 0.314 Recall@100 and 0.355 Recall@All, indicating significant room for improvement.
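
For readers unfamiliar with the reported metrics, here is a small sketch of Recall@k under its standard definition: the share of relevant papers that appear in the top k results. Treating "Recall@All" as recall over the agent's full returned list is an assumption about the benchmark's naming, and `recall_at_k` and the paper IDs are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k=None):
    """Share of relevant items found in the top-k of a ranked list.

    k=None scores the whole list (assumed meaning of "Recall@All").
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    top = ranked_ids if k is None else ranked_ids[:k]
    return len(relevant.intersection(top)) / len(relevant)


# Hypothetical example: 3 relevant papers, 4 returned.
ranked = ["p3", "p1", "p9", "p4"]
relevant = {"p1", "p4", "p7"}
print(recall_at_k(ranked, relevant, k=2))  # one of three relevant papers in the top 2
print(recall_at_k(ranked, relevant))       # two of three in the full list
```

Read this way, a Recall@100 of 0.314 would mean that roughly a third of the relevant papers appear among the first 100 results.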

arXiv cs.AI · 7d ago

MACR: Explicit Conflict Resolution for LLM Inference

MACR introduces a multi-agent reasoning framework to resolve knowledge conflicts in LLM inference by jointly assessing internal and external knowledge. It uses semantic entropy to measure confidence and employs three specialized agents to induce rules, detect conflicts, and resolve inconsistencies across contexts. Empirical results show MACR outperforms state-of-the-art methods and provides interpretable conflict resolutions.
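
Semantic entropy is usually computed by sampling several answers, grouping those that mean the same thing, and taking the entropy of the group frequencies. The sketch below follows that general recipe. How MACR itself samples, clusters, or applies the score is not described in the summary, and `same_meaning` is a hypothetical stand-in for an equivalence check (in practice often an entailment model).

```python
import math


def semantic_entropy(answers, same_meaning):
    """Entropy (in nats) over clusters of semantically equivalent answers.

    `same_meaning(a, b)` is a caller-supplied, hypothetical equivalence test.
    Low entropy means the sampled answers mostly agree in meaning.
    """
    clusters = []
    for ans in answers:
        for cluster in clusters:
            # Compare against the cluster's first member as its representative.
            if same_meaning(cluster[0], ans):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])

    n = len(answers)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)


# Toy equivalence check: ignore case and a trailing period.
def normalize(s):
    return s.lower().strip(".")


samples = ["Paris", "paris", "Lyon", "Paris."]
print(semantic_entropy(samples, lambda a, b: normalize(a) == normalize(b)))
```

Here three of the four samples fall into one cluster, so the entropy is well below the maximum of log(4) that four mutually distinct answers would give.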

arXiv cs.AI · 7d ago

CRAX: Fast Safe Reinforcement Learning Benchmarking

CRAX is a high-fidelity, accelerated safety benchmark for reinforcement learning built on MuJoCo XLA. Vectorization and hardware acceleration give it up to 100x speedups over CPU-based benchmarks. It comprises six environment suites and three agent-specific tasks across three difficulty levels. An evaluation of six safe RL methods shows that no single approach dominates, highlighting trade-offs between performance and safety; curriculum learning and safety transfer improve results.

arXiv cs.LG · 7d ago

Training LLMs for Long-Lifecycle Agents via Cross-Domain Generalization

A new framework trains large language models to develop a 'Connect the Dots' capability, which lets long-lifecycle agents learn from experience and iteratively update their environment context. The framework uses reinforcement learning with long rollout sequences and custom tasks to promote cross-domain generalization, and reports effective out-of-distribution performance in both domains and transition settings.