Topic · Research paper
arxiv arXiv cs.CL · 2d ago

OpenBioRQ: Benchmark for Agentic Biomedical Research Faithfulness

OpenBioRQ introduces a benchmark of 12,553 unsolved biomedical research questions across 12 domains, designed to test agentic models' faithfulness and abstention. It evaluates models in a tool-using setting without answer keys, using real follow-up evidence rather than parametric knowledge, and reveals significant agentic collapse on the hardest questions where tools are no longer used despite being critical.

media Hugging Face Forums · 3d ago

I built a novel triple-hybrid LLM under 1B parameters for ~$50

Mateusz has developed a full pre-trained language model, Project Inkblot's Titan v1, combining Mamba SSM, Multi-Head Attention, and 32-expert MoE in a single decoder-only architecture under 1B parameters. The model, trained on a single NVIDIA L4 GPU for ~$50, achieves 27.5 validation perplexity and demonstrates efficient scaling via a single-line config update, with all components implemented from scratch in PyTorch. Titan v2's first training cycle is now complete, and dataset expansion is underway.

arxiv arXiv cs.AI · 6d ago

ScaffoldAgent: Utility-Guided Dynamic Outline Optimization

ScaffoldAgent introduces a utility-guided framework for dynamic outline optimization in open-ended deep research. It models outline evolution through Expansion, Contraction, and Revision operations, guided by a feedback mechanism that evaluates retrieval gain, structural coherence, and generation quality. Experiments show it improves long-form report generation and factual grounding compared to existing agents.

arxiv arXiv cs.AI · 16h ago

Grounded Scaling: Determinism as a Core Limit in Agentic AI

Agentic AI performance degrades exponentially in non-deterministic environments, with k-step success falling as δ^k when per-step determinism δ < 1. The paper introduces a framework linking environment determinism to task success, verifiability, and skill evolution, proposing a Supply Certainty Index and a five-level Determinism Maturity Model. It challenges prevailing views by identifying determinism as a binding constraint across compute, data, embodiment, and alignment.

arxiv arXiv cs.AI · 17h ago

Concept-Constrained Prompt Learning for Few-Shot CLIP Adaptation

CCPL introduces a lightweight framework that anchors class prompts to frozen concept prototypes, improving few-shot CLIP adaptation. It achieves better base-to-new performance on DTD and EuroSAT compared to CoOp, with consistent gains from text-space concept regularization, while maintaining neutrality on OxfordPets. The method uses concept dropout and controllable ensemble fusion at inference, with results sensitive to dataset semantics and protocol.

media r/LocalLLaMA · 18h ago

Baidu's Unlimited-OCR Transcribes Dozens of Pages in One Forward Pass

Baidu has released Unlimited-OCR, a model that transcribes dozens of pages in a single forward pass using Reference Sliding Window Attention (R-SWA). It builds on DeepSeek-OCR, inheriting its encoder, image compression, and MoE architecture, with only 500M active parameters per token. The model achieves 93.92% accuracy on OmniDocBench v1.6, outperforming DeepSeek-OCR's 87.01% on v1.5, though vendor-reported results warrant independent validation.

arxiv arXiv cs.LG · 18h ago

TeaNet Improves Few-Shot Learning in Vibrational Spectroscopy

TeaNet, a task-enhanced augmentation network, reconstructs randomly masked spectra to generate augmented samples that preserve original spectral features while introducing domain-specific variations. This approach enables deep neural networks to identify discriminant wavenumbers more effectively, outperforming CNNs by 17% in challenging synthetic scenarios and offering improved interpretability in few-shot learning tasks.