Topic · Research paper
arXiv cs.CL · 2d ago

OpenBioRQ: Benchmark for Agentic Biomedical Research Faithfulness

OpenBioRQ introduces a benchmark of 12,553 unsolved biomedical research questions across 12 domains, designed to test agentic models' faithfulness and abstention. It evaluates models in a tool-using setting without answer keys, judging them against real follow-up evidence rather than parametric knowledge. The results reveal significant agentic collapse on the hardest questions: models stop using tools exactly where tool use is most critical.

Hugging Face Forums · 3d ago

I built a novel triple-hybrid LLM under 1B parameters for ~$50

Mateusz has developed a full pre-trained language model, Project Inkblot's Titan v1, combining Mamba SSM, Multi-Head Attention, and 32-expert MoE in a single decoder-only architecture under 1B parameters. The model, trained on a single NVIDIA L4 GPU for ~$50, achieves 27.5 validation perplexity and demonstrates efficient scaling via a single-line config update, with all components implemented from scratch in PyTorch. Titan v2's first training cycle is now complete, and dataset expansion is underway.
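
For readers unfamiliar with the metric, perplexity is conventionally the exponential of the mean per-token cross-entropy loss. The sketch below assumes that standard definition with loss measured in nats; the post does not say how the 27.5 figure was computed, and the function name `perplexity` is illustrative, not taken from the project.

```python
import math


def perplexity(token_nlls):
    """Return exp(mean negative log-likelihood), with NLLs given in nats per token."""
    if not token_nlls:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_nlls) / len(token_nlls))


# Under this definition, a perplexity of 27.5 corresponds to a mean
# validation loss of ln(27.5), roughly 3.31 nats per token.
equivalent_loss = math.log(27.5)
```

Perplexity depends on the tokenizer and the validation set, so figures from different models are only comparable when both match.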

arXiv cs.AI · 6d ago

ScaffoldAgent: Utility-Guided Dynamic Outline Optimization

ScaffoldAgent introduces a utility-guided framework for dynamic outline optimization in open-ended deep research. It models outline evolution through Expansion, Contraction, and Revision operations, guided by a feedback mechanism that evaluates retrieval gain, structural coherence, and generation quality. Experiments show it improves long-form report generation and factual grounding compared to existing agents.

arXiv cs.CL · 2d ago

Language-Model Panel for Political Position Measurement in Data-Sparse Regions

A new method uses large language models as fallible raters in a panel to measure political positions in regions with sparse data. Adding written axis definitions improves score consistency and agreement among raters, and a Krippendorff's alpha of 0.86 indicates high reliability across models and labs. Disagreements highlight interpretive issues, suggesting the method detects referent problems rather than measurement errors.

arXiv cs.CL · 2d ago

AI Recommendation Ownership: Empirical Map of Brand Category Ownership

A study of 3,750 queries across five industries finds moderate recommendation concentration, with a mean Gini coefficient of 0.28. Cross-model agreement on top-recommended brands was only 41.6%, and displacement scores varied by industry, ranging from 0.4:1 to 4.3:1. The results challenge the 'winner-takes-all' narrative and introduce three reproducible metrics for competitive-intelligence analysis.
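
As a rough illustration of the concentration metric, a Gini coefficient can be computed from how often each brand is recommended. This is a minimal sketch using the standard sorted-values formula; the summary does not say which variant the study uses, and the function name and sample counts are hypothetical.

```python
def gini(counts):
    """Gini coefficient of non-negative values: 0 means evenly spread, higher means more concentrated."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula on ascending-sorted values, ranks i = 1..n:
    # G = 2 * sum(i * x_i) / (n * total) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical recommendation counts for four brands in one category.
even_split = gini([5, 5, 5, 5])       # every brand recommended equally often
single_winner = gini([0, 0, 0, 10])   # one brand takes every recommendation
```

With this formula an even split gives 0 and a single winner among four brands gives 0.75, so a mean of 0.28 sits much closer to the even end of the range.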

arXiv cs.CL · 2d ago

Do LLM Embedding Spaces Recover Expert Structure?

Pretrained LLM embeddings show measurable alignment with expert-defined mental-health symptom structure. Fine-tuning enhances this alignment, especially at fine category levels, with larger model sizes improving both zero-shot performance and supervised gains. Residual alignment persists after controlling for linguistic and stylistic confounds, indicating expert structure recovery is level-dependent and requires explicit confound testing.

arXiv cs.CL · 2d ago

Militarized Language Rising in Scientific Abstracts

Between 2010 and 2025, militaristic terms in scientific abstracts increased by 48% in OpenAlex and 32% in PubMed, with a sharp rise after 2019. The use of such language tracks global conflict levels and grows fastest in Global South publications, particularly in social sciences and engineering. A controlled experiment shows that war-framing reduces perceived credibility, funding willingness, and policy support, while producing only a minor increase in perceived urgency.