Topic · Research paper
arXiv cs.CL · 2d ago

OpenBioRQ: Benchmark for Agentic Biomedical Research Faithfulness

OpenBioRQ introduces a benchmark of 12,553 unsolved biomedical research questions across 12 domains, designed to test the faithfulness and abstention behavior of agentic models. It evaluates models in a tool-using setting without answer keys, scoring them against real follow-up evidence rather than parametric knowledge. The results reveal significant agentic collapse on the hardest questions: models stop using tools exactly where tool use is most critical.

Hugging Face Forums · 3d ago

I built a novel triple-hybrid LLM under 1B parameters for ~$50

Mateusz has built Project Inkblot's Titan v1, a fully pre-trained language model that combines Mamba SSM, multi-head attention, and a 32-expert MoE in a single decoder-only architecture under 1B parameters. Trained on a single NVIDIA L4 GPU for ~$50, the model reaches a validation perplexity of 27.5 and is reported to scale via a single-line config update, with all components implemented from scratch in PyTorch. Titan v2's first training cycle is now complete, and dataset expansion is underway.
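For readers who want to relate the reported perplexity to a training loss, here is a minimal sketch. It assumes the usual definition (perplexity is the exponential of the mean per-token negative log-likelihood in nats); the post does not say how the 27.5 figure was computed, and the function names are illustrative only.

```python
import math

def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def loss_from_perplexity(ppl):
    """Mean cross-entropy (nats per token) implied by a perplexity value."""
    return math.log(ppl)

# Under this assumption, a perplexity of 27.5 corresponds to a mean
# cross-entropy of roughly 3.31 nats per token.
implied_loss = loss_from_perplexity(27.5)
```

If the loss were reported in bits rather than nats, the base of the exponential would be 2 instead of e.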

arXiv cs.AI · 6d ago

ScaffoldAgent: Utility-Guided Dynamic Outline Optimization

ScaffoldAgent introduces a utility-guided framework for dynamic outline optimization in open-ended deep research. It models outline evolution through Expansion, Contraction, and Revision operations, guided by a feedback mechanism that evaluates retrieval gain, structural coherence, and generation quality. Experiments show it improves long-form report generation and factual grounding compared to existing agents.
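One way to picture utility-guided selection among outline operations is a weighted score over the three feedback signals named above. The sketch below is hypothetical: the `Candidate` fields, the linear form of the utility, and the equal default weights are assumptions for illustration, not the paper's formulation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    op: str                # "expand", "contract", or "revise"
    retrieval_gain: float  # assumed score for newly retrieved evidence
    coherence: float       # assumed score for structural coherence
    quality: float         # assumed score for generation quality

def utility(c, weights=(1.0, 1.0, 1.0)):
    """Assumed linear utility over the three feedback signals."""
    w_r, w_c, w_q = weights
    return w_r * c.retrieval_gain + w_c * c.coherence + w_q * c.quality

def pick_operation(candidates, weights=(1.0, 1.0, 1.0)):
    """Return the candidate operation with the highest utility."""
    return max(candidates, key=lambda c: utility(c, weights))

candidates = [
    Candidate("expand", 0.6, 0.2, 0.3),
    Candidate("contract", 0.1, 0.7, 0.4),
    Candidate("revise", 0.2, 0.4, 0.3),
]
best = pick_operation(candidates)
```

Raising the retrieval weight shifts the choice toward expansion, which shows how the feedback signals could trade off against each other.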

arXiv cs.CL · 1d ago

AdversaBench: Automated LLM Red-Teaming with Multi-Judge Confirmation

AdversaBench introduces an end-to-end red-teaming pipeline that generates adversarial prompts via five structured operators, evaluates target models, and confirms failures through a three-judge panel with a meta-judge tiebreaker. In experiments on 45 seed prompts spanning reasoning, instruction-following, and tool use, every seed produces a confirmed failure. The paper analyzes operator effectiveness, failure iteration counts, judge agreement, and cross-model transferability to identify patterns in LLM vulnerability.
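A panel-plus-tiebreaker confirmation step can be sketched as follows. This is an illustration under assumptions: the verdict labels ("fail", "pass", "unsure"), the rule that two agreeing judges settle the case, and the `meta_judge` callable are all hypothetical, since the summary does not describe the actual protocol.

```python
from collections import Counter
from typing import Callable, Sequence

def confirm_failure(
    verdicts: Sequence[str],
    meta_judge: Callable[[Sequence[str]], str],
) -> bool:
    """Return True when the panel (or the meta-judge) confirms a failure."""
    if len(verdicts) != 3:
        raise ValueError("expected exactly three judge verdicts")
    label, count = Counter(verdicts).most_common(1)[0]
    # Two or more agreeing judges decide; otherwise defer to the meta-judge.
    final = label if count >= 2 else meta_judge(verdicts)
    return final == "fail"

# Hypothetical meta-judge that resolves three-way splits as failures.
def strict_meta_judge(verdicts: Sequence[str]) -> str:
    return "fail"

majority_case = confirm_failure(["fail", "fail", "pass"], strict_meta_judge)
split_case = confirm_failure(["fail", "pass", "unsure"], strict_meta_judge)
```

With purely binary verdicts, three judges always yield a majority, so a tiebreaker only matters when judges can abstain or use more than two labels, as assumed here.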

arXiv cs.AI · 1d ago

MedLayXPlain: Benchmarking Expert-Lay Gap in Medical Vision-Language Models

MedLayXPlain introduces the first large-scale benchmark for medical lay language generation, featuring 122,789 region-grounded samples across eight imaging modalities. It evaluates medical vision-language models on expert-lay alignment using a hierarchical ontology system and a lightweight evaluator, revealing a systematic gap: expert-level performance in captioning coexists with significant degradation in lay language, while general-purpose models lack clinical precision.

arXiv cs.AI · 1d ago

QBioFusion-QSAR: Quantum Kernel Learning for Small-Data Ligand Classification

QBioFusion-QSAR integrates a quantum fidelity kernel with Morgan/Tanimoto fingerprints to improve ligand classification. On the PsychLight-A benchmark, the combined approach (QMKL) achieved higher accuracy and MCC than Morgan/Tanimoto alone, with the improvement attributed to better predictions for molecules at activity cliffs, such as N-Me-5-HT and N-Me-tryptamine. An auditable analysis confirms localized quantum-kernel contributions in small-data settings.
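The two kernel ingredients can be illustrated with a small, dependency-free sketch. It assumes fingerprints are given as sets of on-bit indices, quantum states as normalized amplitude vectors, and a fixed convex weight `w` for mixing; how the paper encodes molecules into states or learns the kernel weights is not stated in the summary, so all of these are placeholders.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity between two fingerprints given as on-bit sets."""
    union = len(a | b)
    if union == 0:
        return 1.0  # convention chosen for this sketch: empty vs. empty
    return len(a & b) / union

def fidelity(u, v) -> float:
    """Fidelity kernel |<u|v>|^2 for two normalized state vectors."""
    overlap = sum(complex(x).conjugate() * complex(y) for x, y in zip(u, v))
    return abs(overlap) ** 2

def combined_kernel(fp_a, fp_b, state_a, state_b, w: float = 0.5) -> float:
    """Convex mix of the two kernels; w is an assumed, fixed weight."""
    return w * fidelity(state_a, state_b) + (1.0 - w) * tanimoto(fp_a, fp_b)

# Toy inputs: two fingerprints sharing 2 of 4 bits, and two single-qubit
# states with squared overlap of about 0.5.
s = 2 ** -0.5
k = combined_kernel({1, 2, 3}, {2, 3, 4}, [1, 0], [s, s])
```

A convex combination of positive semidefinite kernels is itself positive semidefinite, so a mixed kernel of this form can be passed to a standard kernel classifier.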