Topic · Research paper
arxiv arXiv cs.CL · 2d ago

OpenBioRQ: Benchmark for Agentic Biomedical Research Faithfulness

OpenBioRQ introduces a benchmark of 12,553 unsolved biomedical research questions across 12 domains, designed to test the faithfulness and abstention behavior of agentic models. It evaluates models in a tool-using setting without answer keys, using real follow-up evidence rather than parametric knowledge. The results reveal a significant agentic collapse on the hardest questions: models stop using tools exactly where tools are most critical.

media Hugging Face Forums · 3d ago

I built a novel triple-hybrid LLM under 1B parameters for ~$50

Mateusz has developed a complete pre-trained language model, Project Inkblot's Titan v1, that combines Mamba SSM, Multi-Head Attention, and a 32-expert MoE in a single decoder-only architecture under 1B parameters. Trained on a single NVIDIA L4 GPU for ~$50, the model reaches a validation perplexity of 27.5 and can be scaled with a single-line config update. All components are implemented from scratch in PyTorch. The first training cycle of Titan v2 is now complete, and dataset expansion is underway.
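To put the reported perplexity in context, here is a minimal sketch that converts it to an implied loss. It assumes the usual definition (perplexity is the exponential of the mean per-token cross-entropy in nats); the post does not state how the figure was computed, and the function names are illustrative only.

```python
import math

def perplexity_from_loss(mean_nll_nats: float) -> float:
    """Perplexity as exp of the mean per-token negative log-likelihood (nats)."""
    return math.exp(mean_nll_nats)

def loss_from_perplexity(ppl: float) -> float:
    """Inverse of the above: the mean per-token loss implied by a perplexity."""
    return math.log(ppl)

# 27.5 is the validation perplexity quoted in the post; the exp(loss)
# definition is an assumption, not something the post confirms.
reported_ppl = 27.5
implied_loss = loss_from_perplexity(reported_ppl)

print(f"implied mean loss: {implied_loss:.3f} nats/token")
print(f"equivalently: {implied_loss / math.log(2):.3f} bits/token")
```

Under that assumption, a perplexity of 27.5 corresponds to a mean validation loss of about 3.31 nats per token.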

arxiv arXiv cs.AI · 6d ago

ScaffoldAgent: Utility-Guided Dynamic Outline Optimization

ScaffoldAgent introduces a utility-guided framework for dynamic outline optimization in open-ended deep research. It models outline evolution through Expansion, Contraction, and Revision operations, guided by a feedback mechanism that evaluates retrieval gain, structural coherence, and generation quality. Experiments show it improves long-form report generation and factual grounding compared to existing agents.
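The selection step described above can be sketched roughly as follows. This is an illustration only, not the paper's method: the weighted-sum utility, the weights, the scores, and every name here are hypothetical, since the summary does not say how the three feedback signals are combined.

```python
from dataclasses import dataclass

# Hypothetical weights for the three feedback signals named in the summary.
WEIGHTS = {"retrieval_gain": 0.4, "coherence": 0.3, "quality": 0.3}

@dataclass(frozen=True)
class Candidate:
    op: str                # "expand", "contract", or "revise"
    outline: tuple         # section titles after applying the operation
    retrieval_gain: float  # illustrative scores in [0, 1]
    coherence: float
    quality: float

def utility(c: Candidate) -> float:
    """Weighted sum of the three signals (an assumed combination rule)."""
    return (WEIGHTS["retrieval_gain"] * c.retrieval_gain
            + WEIGHTS["coherence"] * c.coherence
            + WEIGHTS["quality"] * c.quality)

def choose_operation(candidates, current_utility):
    """Pick the highest-utility candidate from a non-empty list, or return
    None if no candidate improves on the current outline."""
    best = max(candidates, key=utility)
    return best if utility(best) > current_utility else None

candidates = [
    Candidate("expand", ("Intro", "Methods", "Results", "Limitations"), 0.8, 0.6, 0.7),
    Candidate("contract", ("Intro", "Results"), 0.1, 0.9, 0.5),
    Candidate("revise", ("Intro", "Approach", "Results"), 0.3, 0.8, 0.8),
]

chosen = choose_operation(candidates, current_utility=0.6)
print(chosen.op if chosen else "keep current outline")
```

With these made-up scores the expansion wins, because its retrieval gain outweighs its lower coherence. Raising `current_utility` above every candidate's score leaves the outline unchanged.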

arxiv arXiv cs.AI · 1d ago

SOHET: Self-Supervised Transformer for Heterogeneous Event Streams

SOHET introduces a hierarchical transformer architecture with event-type-specific tabular encoders and self-supervised pre-training objectives. It outperforms existing methods by 5.8% on Booking.com's fraud detection task and converges faster, with pre-training contributing an additional 2.4% gain. On the EBES benchmark, bidirectional SOHET matches or exceeds the best published results on six out of eight tasks.

arxiv arXiv cs.AI · 1d ago

Graph-of-Differences for Anatomy-Structured MedReID

Graph-of-Differences (GoD) introduces anatomy-graph representations to enable medical image re-identification with explicit structural grounding. It computes differences across named anatomical regions and aligns them with global backbone differences, providing clinically auditable, structure-level explanations. GoD improves Rank-1 accuracy by 7.1 pp on fundus images and 3.1 pp on chest X-rays (CXR), and also performs better in zero-shot transfer.

media Hugging Face Forums · 2d ago

Seeking arXiv cs.LG Endorsement for PsiLogic Optimizer

Ali, a 16-year-old independent researcher, has developed PsiLogic, a chaos-aware active cancellation optimizer based on Adam. Evaluated against AdamW and Lion using FairBench on an NVIDIA H100, PsiLogic achieved top validation metrics in three out of four tasks and was statistically tied in the fourth, though it incurs step-time overhead. The author is seeking endorsement for an arXiv submission under cs.LG and provides a GitHub repository and endorsement code 4ACC37.