arXiv cs.CL · 7h ago

The Complexity Ceiling Benchmark: A Multi-Domain Evaluation of Sequential Reasoning Under Depth Scaling

The Complexity Ceiling Benchmark (CCB) evaluates how language model reasoning accuracy decays as the number of required sequential steps increases, holding semantic content fixed while varying task depth from 5 to 50 steps. The study reports consistent geometric per-step decay across three distinct regimes: grounded spatial state-tracking, abstract symbolic pointer manipulation, and transitive relational inference.
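As a rough illustration of what geometric per-step decay implies, the sketch below assumes each step succeeds independently with the same probability, so accuracy at depth d is the per-step accuracy raised to the power d. The function names and the 0.98 per-step value are hypothetical and are not taken from the paper; only the depth range of 5 to 50 comes from the summary.

```python
def chain_accuracy(per_step_accuracy: float, depth: int) -> float:
    """Accuracy after `depth` sequential steps under geometric decay.

    Assumes every step succeeds independently with the same probability,
    which is an illustrative simplification, not the paper's fitted model.
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be in [0, 1]")
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return per_step_accuracy ** depth


def decay_table(per_step_accuracy: float, depths=(5, 10, 20, 50)) -> dict:
    """Map each depth to its expected accuracy under the same assumption."""
    return {d: chain_accuracy(per_step_accuracy, d) for d in depths}


if __name__ == "__main__":
    # 0.98 is a made-up per-step accuracy chosen only for illustration.
    for depth, accuracy in decay_table(0.98).items():
        print(f"depth={depth:>2}  expected accuracy={accuracy:.3f}")
```

Under this assumption, even a small per-step error rate compounds into a large accuracy drop at the deepest tasks.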

arXiv cs.CL · 7h ago

Deterministic Decisions for High-Stakes AI

The article identifies "intervention bias" as a critical failure mode of zero-shot LLM-based educational advisory agents: they recommend action even when the oracle policy mandates inaction. Using the Open University Learning Analytics Dataset, the study shows that zero-shot GPT-4o has a 43% false-positive rate at day 56, which corresponds to approximately 4,300 unnecessary advisor contacts per cycle for 10,000 students.

arXiv cs.LG · 8h ago

AsyncOPD: How Stale Can On-Policy Distillation Be?

This article presents AsyncOPD, a fully asynchronous on-policy distillation pipeline that decouples rollout generation from learner updates to alleviate training bottlenecks in large language model post-training. The authors provide the first systematic study of staleness effects in this context, demonstrating that teacher-weighted forward KL is robust to stale rollouts while student-weighted reverse KL is vulnerable.
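For readers unfamiliar with the two objectives, the sketch below computes forward KL (weighted by the teacher distribution) and reverse KL (weighted by the student distribution) for a single discrete next-token distribution. It only illustrates the textbook definitions; the function names and the example probabilities are hypothetical, and nothing here reproduces the AsyncOPD pipeline or its handling of stale rollouts.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # terms with zero weight contribute nothing
        if q_i == 0.0:
            return math.inf  # p puts mass where q has none
        total += p_i * math.log(p_i / q_i)
    return total


def forward_kl(teacher, student):
    """KL(teacher || student): the expectation is weighted by the teacher."""
    return kl_divergence(teacher, student)


def reverse_kl(teacher, student):
    """KL(student || teacher): the expectation is weighted by the student."""
    return kl_divergence(student, teacher)


if __name__ == "__main__":
    # Made-up next-token probabilities over a three-token vocabulary.
    teacher = [0.7, 0.2, 0.1]
    student = [0.5, 0.3, 0.2]
    print(f"forward KL: {forward_kl(teacher, student):.4f}")
    print(f"reverse KL: {reverse_kl(teacher, student):.4f}")
```

The two quantities differ because KL divergence is asymmetric: each one averages the log-ratio under a different distribution.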

arXiv cs.LG · 9h ago

Lightweight Transformer Models for On-Device Fault Detection: A Benchmark Study on Resource-Constrained Deployment

This study benchmarks traditional machine learning methods against lightweight transformer architectures for binary fault detection across three public datasets, evaluating tradeoffs between accuracy, model size, and latency. The research assesses classification performance using F1-score and AUC, while also testing INT8 dynamic quantization and a two-stage adaptive inference pipeline to optimize deployment on resource-constrained hardware.