Topic · Open weights
arXiv cs.CL · 7d ago

PragReST: Self-Reinforcing Counterfactual Reasoning for Pragmatic Language Understanding

PragReST is a self-supervised framework that enhances large language models' pragmatic reasoning by generating counterfactual reasoning traces and training via supervised fine-tuning and reinforcement learning. It outperforms baseline models on four pragmatic benchmarks, improving Qwen3-8B and Qwen3-14B by 5.37% and 5-5.50% accuracy respectively, and maintains strong performance on general-knowledge and mathematical reasoning tasks.

arXiv cs.CL · 7d ago

Misfired Alignment in LLMs: A Quantitative Study

A new study introduces VETO, a benchmark of 2,032 BBQ-derived contrastive pairs, to quantify misfired alignment in large language models. It defines the Misfired Alignment Rate (MAR) and finds that all benchmarked LLMs exhibit MARs between 4.7% and 18.9%, while human participants achieve 0%. The research shows alignment cues can amplify these failures, with evidence suppression occurring in late layers of models and emerging after instruction training.

arXiv cs.CL · 8d ago

RubricsTree: Scalable Evaluation Framework for Personal Health Agents

RubricsTree introduces a hierarchical taxonomy of over 100 clinically verifiable Boolean rubrics, evolved from 4,000 real user queries via human-in-the-loop curation. It enables scalable, expert-aligned evaluation of personal health agents by dynamically routing queries to relevant rubrics. It outperforms baseline methods in alignment and context sensitivity, with model performance gains of up to 66% on HealthBench.

arXiv cs.AI · 8d ago

RubricsTree: Scalable Evaluation Framework for Personal Health Agents

RubricsTree introduces a hierarchical taxonomy of over 100 clinically verifiable Boolean rubrics, evolved from 4,000 real user queries via human-in-the-loop curation. It enables scalable, expert-aligned evaluation of personal health agents by dynamically routing queries to relevant rubrics. It outperforms baseline methods in alignment and context degradation detection, with model performance gains of up to 66% on HealthBench.

arXiv cs.CL · 9d ago

LOGOS: A General-Purpose Generative Model for Natural Sciences

LOGOS is a unified generative language model that represents scientific objects and their interactions as token sequences in a shared grammar. It achieves consistent or superior performance across diverse natural science tasks, demonstrating the feasibility of a single model serving multiple domains. The model scales positively with parameter count, and its design suggests that AI for Science should align deeply with large language models through shared architectures and training.

arXiv cs.CL · 7d ago

TW-LegalBench: Evaluating LLMs on Taiwanese Law

TW-LegalBench introduces a benchmark using Taiwan's public legal corpus to assess large language models' performance in Taiwanese law. It includes 16,000+ multiple-choice questions, 117 open-ended essay questions with scoring rubrics, and 14,000+ judgment prediction instances. Evaluation shows top models exceed lawyer passing thresholds (11%) but fall short of judge/prosecutor levels (1-2%), and struggle with precise legal article citations in sentencing predictions.

arXiv cs.LG · 8d ago

Baseline Evaluation of Open-Source LLMs for Multi-Label ATT&CK Classification

A ground-truth dataset of 2,076 human-annotated sentences from 83 complex CTI reports was constructed and mapped to 114 ATT&CK techniques with κ = 0.68 inter-annotator agreement. Seven open-source LLMs ranging from 8B to 236B parameters were evaluated, achieving a maximum micro-averaged F1 score of 0.22. Parameter size showed a statistically significant positive correlation with F1 score, while prompt strategy and temperature did not yield significant improvements, indicating current open-source LLMs are insufficient for production-grade ATT&CK classification.
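
For readers unfamiliar with the metric, here is a minimal sketch of how a micro-averaged F1 score is typically computed for multi-label predictions: true positives, false positives, and false negatives are pooled across all sentences and labels before precision and recall are taken. The function name `micro_f1` and the toy labels are illustrative assumptions, not the study's code or data.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over parallel lists of label sets (one set per sentence)."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # labels predicted and present in the annotation
        fp += len(p - g)   # labels predicted but not annotated
        fn += len(g - p)   # labels annotated but not predicted
    if tp == 0:
        return 0.0         # avoids division by zero when nothing matches
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up example: three sentences tagged with ATT&CK technique IDs.
gold = [{"T1059", "T1027"}, {"T1566"}, {"T1105"}]
pred = [{"T1059"}, {"T1566", "T1204"}, set()]

print(round(micro_f1(gold, pred), 4))  # 0.5714
```

Because counts are pooled, frequent techniques weigh more heavily than rare ones, which is worth keeping in mind when reading a single micro-averaged figure across 114 labels.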

arXiv cs.AI · 8d ago

IUU+DB: LLM-Driven Database for Illegal Fishing and Supply Chain Crimes

IUU+DB is a large language model-driven system that tracks illegal, unreported, and unregulated fishing, seafood fraud, and labor abuse. It extracts key data elements from diverse documents, classifies relevant incidents, and enables trend analysis to identify geographic and behavioral hotspots. The system supports research, risk assessments, and policy enforcement in fisheries and supply chains.