Research paper — korshunov.ai

OpenBioRQ: Benchmark for Agentic Biomedical Research Faithfulness

OpenBioRQ introduces a benchmark of 12,553 unsolved biomedical research questions across 12 domains, designed to test the faithfulness and abstention of agentic models. It evaluates models in a tool-using setting without answer keys, scoring them against real follow-up evidence rather than parametric knowledge. The results reveal agentic collapse on the hardest questions: models stop using tools exactly where tools are most critical.

media Hugging Face Forums · 3d ago

I built a novel triple-hybrid LLM under 1B parameters for ~$50

Mateusz has developed a complete pretrained language model, Project Inkblot's Titan v1, which combines Mamba SSM, multi-head attention, and a 32-expert MoE in a single decoder-only architecture under 1B parameters. Trained on a single NVIDIA L4 GPU for ~$50, the model reaches a validation perplexity of 27.5 and is reported to scale through a single-line config update, with all components implemented from scratch in PyTorch. Titan v2's first training cycle is now complete, and dataset expansion is underway.

arxiv arXiv cs.LG, cs.CL · 6d ago

LLM Alignment Using Implicit User Feedback

A new dataset, IFLLM, collects mouse-trajectory and eye-gaze data from users interacting with LLMs. It shows that implicit feedback significantly improves LLM alignment, raising text-based reward model accuracy from 55% to 64% and nearly tripling response quality improvements after DPO training on eight LLMs.
arxiv arXiv cs.AI · 6d ago

ScaffoldAgent: Utility-Guided Dynamic Outline Optimization

ScaffoldAgent introduces a utility-guided framework for dynamic outline optimization in open-ended deep research. It models outline evolution through Expansion, Contraction, and Revision operations, guided by a feedback mechanism that evaluates retrieval gain, structural coherence, and generation quality. Experiments show it improves long-form report generation and factual grounding compared to existing agents.

arxiv arXiv cs.CL · 9d ago

Routing Accuracy Degradation and Recovery in Enterprise Agent Systems

As enterprise agent tool catalogs scale from 10 to 110 agents, routing accuracy drops 16–23 percentage points on under-specified requests. An oracle analysis identifies retrieval and confusion gaps, with embedding-based shortlisting recovering 10–11 pp of F1. A human-annotated study of 1,435 utterances confirms a real-world recovery of 10–17 pp despite lower absolute performance.
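
To make the idea of embedding-based shortlisting concrete, here is a minimal sketch, not the paper's implementation. It assumes each agent description and each request has already been embedded as a vector, ranks agents by cosine similarity, and keeps the top k as candidates for the router. The function names, the agent names, and the toy three-dimensional vectors are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def shortlist(query_vec, agent_vecs, k=3):
    """Return the names of the k agents whose embeddings are closest to the query."""
    ranked = sorted(
        agent_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]

# Hypothetical, hand-written embeddings standing in for a real encoder's output.
agents = {
    "billing": [1.0, 0.0, 0.0],
    "hr": [0.0, 1.0, 0.0],
    "it_support": [0.0, 0.0, 1.0],
    "invoices": [0.9, 0.1, 0.0],
}

candidates = shortlist([1.0, 0.05, 0.0], agents, k=2)
print(candidates)
```

The router then only has to choose among the shortlisted candidates instead of the full catalog, which is one plausible reading of why shortlisting helps as the catalog grows.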

arxiv arXiv cs.CL · 2d ago

NL2Scratch: Executable Benchmark for NL-to-Scratch Generation

NL2Scratch introduces an executable benchmark with 311,648 parser-valid NL-program pairs derived from real Scratch projects. It proposes Semantic Alignment Consistency (SAC) to measure semantic agreement, validating 23,594 examples and creating an 800-slot-balanced diagnostic benchmark. Experiments show a significant gap between lexical similarity and semantic alignment, with models achieving high token-level F1 often failing to reach perfect SAC, especially on longer examples.

arxiv arXiv cs.CL · 2d ago

Web Data Recipe for Medical Encoder Pretraining

A new method uses medical-term density filtering and signal-amplifying rephrasing to enhance French medical encoder pretraining. The approach outperforms educational quality filters and yields FineMed and DoctoBERT, achieving state-of-the-art results on DrBenchmark and a clinical NER task.

arxiv arXiv cs.CL · 2d ago

Plural Epistemologies in AI Language Technology

The paper argues that cultural alignment in NLP requires plural epistemologies, not just diverse data. It proposes a socio-technical model to analyze how multiple, locally grounded ways of knowing can be integrated into language technology, emphasizing that current approaches often fail to address deeper issues of power and governance.

arxiv arXiv cs.CL · 2d ago

BioMatrix: First Natively Multimodal Biological Foundation Model

BioMatrix integrates sequences, structures, and language for molecules and proteins in a single decoder-only architecture. It achieves state-of-the-art or competitive performance on 77 out of 80 downstream tasks, demonstrating effective multimodal generalist capabilities without external components.

arxiv arXiv cs.CL · 2d ago

Lexical Consensus Framework Shows Perceptual Distance Drives Word Learning

A study finds that artificial agents learn visual word meanings best when concepts are perceptually close, with acquisition accuracy strongly predicted by perceptual distance (partial R² = 0.245). Bidirectional evaluations reveal that retrieval performance depends on exemplar-based memory, not prototype matching, and frozen visual embeddings enable grounding while limiting learning without representational changes.

arxiv arXiv cs.CL · 2d ago

Large Language Models Fail to Translate Fongbe Accurately

Evaluations show that Fongbe translations are of poor quality (1.0–2.2/5) compared with Hausa's acceptable scores (4.0–4.5/5), with a consistent 3x BLEU gap. Automatic metrics like BERTScore show embedding collapse and weak human correlation, especially for Hausa, while Gemini outperforms others for Fongbe and GPT-4o for Hausa in human judgments. Minimum sample sizes of 2,500 sentences are needed for stable model rankings.

arxiv arXiv cs.CL · 2d ago

MixedPEFT: Unified PEFT for Unsupervised Domain Adaptation

MixedPEFT combines invertible adapters and LoRA within a single framework to enable unsupervised domain adaptation. It simultaneously optimizes classification on source data and masked language modeling on target data, achieving a 1.41% improvement over UDapter, 1.26% over DANN, and 0.86% over DSN while using only 7% of the model's parameters.
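
One common way to write such a joint objective is as a weighted sum of the two losses. The form below is an illustrative assumption, not the formula from the paper: the weight $\lambda$, the symbol $\theta$ for the trainable adapter and LoRA parameters, and the dataset symbols $\mathcal{D}_S$ and $\mathcal{D}_T$ are all introduced here for the sketch.

```latex
\mathcal{L}(\theta)
  = \mathcal{L}_{\mathrm{cls}}(\theta; \mathcal{D}_S)
  + \lambda \, \mathcal{L}_{\mathrm{mlm}}(\theta; \mathcal{D}_T)
```

Here $\mathcal{L}_{\mathrm{cls}}$ is the classification loss on labeled source data $\mathcal{D}_S$, and $\mathcal{L}_{\mathrm{mlm}}$ is the masked language modeling loss on unlabeled target data $\mathcal{D}_T$.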

arxiv arXiv cs.CL · 2d ago

BabelJudge: Measuring LLM-as-a-Judge Reliability Across Languages and Agent Trajectories

BabelJudge introduces an open-source framework to measure four key bias modes in LLM judges across languages and agent trajectories. It reports a significant reliability drop from Hindi to Swahili (0.714 to 0.550) and shows that raw accuracy alone misses critical failures such as order inconsistency, with the corresponding score collapsing to 0.480 in Swahili. The framework also extends to agentic evaluation with nine perturbations and three new metrics, and supports 11 judge backends via a Python package.

arxiv arXiv cs.CL · 2d ago

SciTraj: Claim-Grounded Typed Citation Graph for Research Evolution

SciTraj is the first claim-grounded typed citation corpus that links each citation to a specific claim sentence. It includes 32,559 papers from NLP, ML, and Vision (2015–2024) with 573,126 directed edges across six relation types, and 287M typed trajectories of length ≥3, covering 72.8% of papers. The corpus enables analysis of disciplinary siloing and topic emergence, with validated claim seeds and a temporally split link-prediction benchmark.

arxiv arXiv cs.CL · 2d ago

Curiosity as Linguistic Intervention in LLM Tutoring

CURIOBOT uses Berlyne's collative variables to create curiosity-driven linguistic interventions in tutoring dialogues. Across 270 conversations, these interventions increased exploratory behaviors by up to 2.4x in conversational turns under fixed time budgets, with gains persisting despite unchanged tutor instruction quality.

arxiv arXiv cs.CL · 2d ago

Character Variety in LLM-Generated Stories

This study compares characters in LLM-generated and human-written stories along narratological dimensions. It finds that LLMs produce characters with similar basic traits but less diversity in complex features such as wholeness and stylization, so LLM-generated stories show more limited character variety than human-written narratives.

arxiv arXiv cs.CL · 2d ago

ROMEVA: Geometry-Preserving Vocabulary Expansion for Roman Urdu Language Models

ROMEVA addresses sub-word fragmentation in Roman Urdu by combining sub-word-average initialization and PCA-guided anchor loss to stabilize embeddings. While ROMEVA best preserves pretrained embeddings, naive fine-tuning achieves superior sentiment classification performance, indicating a trade-off between embedding stability and downstream performance in morphologically inconsistent languages.
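
Sub-word-average initialization can be sketched in a few lines. The version below is a minimal illustration, not ROMEVA's implementation: it assumes a new vocabulary entry is initialized to the mean of the pretrained embeddings of the sub-word pieces it previously fragmented into. The function name, the toy embedding table, and the piece ids are hypothetical, and the PCA-guided anchor loss is not covered.

```python
def subword_average_init(piece_ids, embedding_table):
    """Initialize a new token's embedding as the mean of its sub-word piece embeddings."""
    if not piece_ids:
        raise ValueError("a new token must map to at least one existing sub-word piece")
    vectors = [embedding_table[i] for i in piece_ids]
    dim = len(vectors[0])
    # Average each dimension across the piece vectors.
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dim)]

# Hypothetical pretrained embedding table: piece id -> 2-dimensional vector.
table = {
    0: [1.0, 0.0],
    1: [0.0, 1.0],
    2: [2.0, 2.0],
}

# Suppose a new whole-word token used to be split into pieces 0 and 2.
new_vec = subword_average_init([0, 2], table)
print(new_vec)
```

Starting the new row at this average keeps it inside the region of the embedding space the pretrained model already uses, which fits the summary's emphasis on preserving pretrained embeddings.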

arxiv arXiv cs.CL · 3d ago

MacAgentBench Launches macOS AI Agent Benchmark

MacAgentBench introduces a comprehensive benchmark with 676 tasks across 25 applications, 60% of which involve both GUI and CLI interactions. It uses deterministic rule-based evaluation and fine-grained multi-checkpoint scoring, revealing that Claude Opus 4.6 on OpenClaw achieves 73.7% Pass@1, primarily due to its skill library rather than framework design.

arxiv arXiv cs.CL · 3d ago

Concept-Constrained Prompt Learning for Few-Shot CLIP Adaptation

CCPL introduces a lightweight framework that anchors class prompts to frozen concept prototypes, improving few-shot CLIP adaptation by reducing overfitting. It achieves better base-to-new performance than CoOp on DTD and EuroSAT, with consistent gains from text-space concept regularization, while remaining roughly neutral on OxfordPets.
