arXiv cs.CL · 2d ago

OpenBioRQ: Benchmark for Agentic Biomedical Research Faithfulness

OpenBioRQ introduces a benchmark of 12,553 unsolved biomedical research questions across 12 domains, designed to test the faithfulness and abstention behavior of agentic models. It evaluates models in a tool-using setting without answer keys, relying on real follow-up evidence rather than parametric knowledge. The results reveal significant agentic collapse on the hardest questions: models stop using tools exactly where tools are most critical.

Hugging Face Forums · 3d ago

I built a novel triple-hybrid LLM under 1B parameters for ~$50

Mateusz has built a complete pre-trained language model, Project Inkblot's Titan v1, that combines a Mamba SSM, multi-head attention, and a 32-expert MoE in a single decoder-only architecture under 1B parameters. Trained on a single NVIDIA L4 GPU for ~$50, the model reaches a validation perplexity of 27.5, and the author reports that it can be scaled up with a single-line config update. All components are implemented from scratch in PyTorch. Titan v2's first training cycle is now complete, and dataset expansion is underway.
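
For readers who want to relate the reported perplexity to a training loss, here is a minimal sketch. It assumes the figure is the usual per-token perplexity, that is, the exponential of the mean cross-entropy measured in nats; the post does not state this explicitly, and the function names are illustrative.

```python
import math

def perplexity_from_nll(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    if not token_nlls:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_nlls) / len(token_nlls))

def mean_nll_from_perplexity(ppl):
    """Inverse mapping: mean per-token loss (nats) implied by a perplexity."""
    return math.log(ppl)

# Assumption: 27.5 is a natural-log, per-token perplexity.
implied_loss = mean_nll_from_perplexity(27.5)
print(f"implied mean loss: {implied_loss:.2f} nats/token")
```

Under that assumption, a validation perplexity of 27.5 corresponds to a mean cross-entropy of roughly 3.31 nats per token.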

arXiv cs.AI · 6d ago

ScaffoldAgent: Utility-Guided Dynamic Outline Optimization

ScaffoldAgent introduces a utility-guided framework for dynamic outline optimization in open-ended deep research. It models outline evolution through Expansion, Contraction, and Revision operations, guided by a feedback mechanism that evaluates retrieval gain, structural coherence, and generation quality. Experiments show it improves long-form report generation and factual grounding compared to existing agents.
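
The idea of choosing among outline operations by a utility score can be sketched as follows. This is not ScaffoldAgent's implementation: the equal-weight linear utility, the `Feedback` fields, and the candidate scores are all assumptions made for illustration. Only the three operation names and the three feedback signals come from the summary above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Feedback:
    # The three signals named in the summary; the [0, 1] scale is an assumption.
    retrieval_gain: float
    structural_coherence: float
    generation_quality: float

def utility(fb, weights=(1.0, 1.0, 1.0)):
    """Hypothetical linear utility; the paper's actual scoring may differ."""
    w_r, w_c, w_q = weights
    return (w_r * fb.retrieval_gain
            + w_c * fb.structural_coherence
            + w_q * fb.generation_quality)

def pick_operation(candidates, weights=(1.0, 1.0, 1.0)):
    """Return the operation name whose estimated feedback has the highest utility."""
    return max(candidates, key=lambda op: utility(candidates[op], weights))

# Made-up feedback estimates for one outline node.
candidates = {
    "expansion": Feedback(0.6, 0.2, 0.3),
    "contraction": Feedback(0.1, 0.5, 0.2),
    "revision": Feedback(0.3, 0.4, 0.5),
}
best = pick_operation(candidates)
print(best)
```

Changing the weights shifts the preference: emphasizing retrieval gain, for example, favors expansion on these made-up numbers.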

arXiv cs.CL · 2d ago

Do LLM Embedding Spaces Recover Expert Structure?

Pretrained LLM embeddings show measurable alignment with expert-defined mental-health symptom structure. Fine-tuning enhances this alignment, especially at fine category levels, with larger model sizes improving both zero-shot performance and supervised gains. Residual alignment persists after controlling for linguistic and stylistic confounds, indicating expert structure recovery is level-dependent and requires explicit confound testing.

arXiv cs.CL · 2d ago

Militarized Language Rising in Scientific Abstracts

Between 2010 and 2025, militaristic terms in scientific abstracts increased by 48% in OpenAlex and 32% in PubMed, with a sharp rise after 2019. Use of such language tracks global conflict levels and grows fastest in Global South publications, particularly in the social sciences and engineering. A controlled experiment shows that war framing reduces perceived credibility, funding willingness, and policy support, while only slightly increasing perceived urgency.

arXiv cs.CL · 2d ago

NL2Scratch: Executable Benchmark for NL-to-Scratch Generation

NL2Scratch introduces an executable benchmark with 311,648 parser-valid NL-program pairs derived from real Scratch projects. It proposes Semantic Alignment Consistency (SAC) to measure semantic agreement, validating 23,594 examples and creating an 800-slot-balanced diagnostic benchmark. Experiments show a significant gap between lexical similarity and semantic alignment, with models achieving high token-level F1 often failing to reach perfect SAC, especially on longer examples.
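
To illustrate why high token-level F1 need not imply semantic agreement, here is a small sketch. It computes a standard bag-of-tokens F1 and does not implement the paper's SAC metric, whose definition is not given in this summary. The Scratch-like strings and whitespace tokenization are hypothetical.

```python
from collections import Counter

def token_f1(prediction, reference):
    """Bag-of-tokens F1 over whitespace-split tokens."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Hypothetical program texts that differ in a single token.
reference = "when flag clicked move 10 steps turn right 15 degrees"
prediction = "when flag clicked move 10 steps turn left 15 degrees"
score = token_f1(prediction, reference)
print(round(score, 2))
```

The two programs share nine of ten tokens, so F1 is 0.9, yet the sprite turns in the opposite direction. An execution-based or semantic check would flag the mismatch that the lexical score hides.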

arXiv cs.CL · 2d ago

Lexical Consensus Framework Shows Perceptual Distance Drives Word Learning

A study finds that artificial agents learn visual word meanings best when concepts are perceptually close, with acquisition accuracy strongly predicted by perceptual distance (partial R² = 0.245). Bidirectional evaluations reveal that retrieval performance depends on exemplar-based memory rather than prototype matching. Frozen visual embeddings enable grounding but limit how much can be learned without changes to the representations.
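
As a reminder of what a partial R² expresses, the sketch below uses the common definition based on residual sums of squares: the share of variance left unexplained by a reduced model that the added predictor accounts for. Whether the study computes it this way is an assumption, and the numbers are invented to reproduce the reported value.

```python
def partial_r2(sse_reduced, sse_full):
    """Partial R^2 = (SSE_reduced - SSE_full) / SSE_reduced.

    sse_reduced: residual sum of squares without the predictor of interest.
    sse_full:    residual sum of squares with it included.
    """
    if sse_reduced <= 0:
        raise ValueError("sse_reduced must be positive")
    if sse_full > sse_reduced:
        raise ValueError("the full model cannot fit worse than the reduced one")
    return (sse_reduced - sse_full) / sse_reduced

# Invented residual sums of squares, chosen to match the reported 0.245.
value = partial_r2(sse_reduced=100.0, sse_full=75.5)
print(round(value, 3))
```

Read this way, a partial R² of 0.245 would mean perceptual distance explains about a quarter of the variance in acquisition accuracy that the other predictors leave unexplained.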

arXiv cs.CL · 2d ago

Large Language Models Fail to Translate Fongbe Accurately

Evaluations show that Fongbe translations are of poor quality (1.0-2.2/5) compared with Hausa's acceptable scores (4.0-4.5/5), with a consistent 3x BLEU gap between the two languages. Automatic metrics such as BERTScore show embedding collapse and correlate weakly with human ratings, especially for Hausa. In human judgments, Gemini performs best for Fongbe and GPT-4o for Hausa. At least 2,500 sentences are needed for stable model rankings.