Research paper — korshunov.ai

Native binary embeddings outperform post-hoc binarization

A small-scale experiment shows that native binary embedding models retrieve better than float models binarized post hoc. On SciFact, measured by Recall@10, native binary models (2048-dim and 4096-dim) outperform post-hoc binary models by 17% and 25% respectively, with significant speed and memory advantages in indexing.
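For readers unfamiliar with the baseline, here is a minimal sketch of what post-hoc binarization and Hamming-distance retrieval typically look like. It assumes simple sign thresholding and uses made-up 4-dimensional toy vectors; the function names (`binarize`, `hamming`, `search`) are illustrative and are not taken from the experiment, whose exact binarization scheme is not described here.

```python
from typing import List, Sequence


def binarize(vec: Sequence[float]) -> int:
    """Pack the sign pattern of a float vector into one int (1 bit per dimension).

    Assumption: a dimension maps to 1 when it is strictly positive, else 0.
    """
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Count the bits that differ between two packed codes."""
    return bin(a ^ b).count("1")


def search(query: int, index: List[int], k: int) -> List[int]:
    """Return positions of the k indexed codes closest to the query."""
    order = sorted(range(len(index)), key=lambda i: hamming(query, index[i]))
    return order[:k]


# Toy float embeddings (hypothetical data, for illustration only).
docs = [
    [0.9, -0.2, 0.4, -0.7],
    [-0.5, 0.3, -0.1, 0.8],
    [0.6, -0.1, -0.3, -0.9],
]
index = [binarize(d) for d in docs]
query = binarize([0.8, -0.4, 0.2, -0.5])
top2 = search(query, index, k=2)
```

Each dimension costs one bit instead of a 32-bit float, and distance reduces to an XOR plus a bit count, which is where the indexing speed and memory savings of binary codes generally come from.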

arxiv arXiv cs.CL · 2d ago

OpenBioRQ: Benchmark for Agentic Biomedical Research Faithfulness

OpenBioRQ introduces a benchmark of 12,553 unsolved biomedical research questions across 12 domains, designed to test agentic models' faithfulness and abstention. It evaluates models in a tool-using setting without answer keys, using real follow-up evidence rather than parametric knowledge, and reveals significant agentic collapse on the hardest questions where tools are no longer used despite being critical.

media Hugging Face Forums · 3d ago

I built a novel triple-hybrid LLM under 1B parameters for ~$50

Mateusz has developed a full pre-trained language model, Project Inkblot's Titan v1, combining Mamba SSM, Multi-Head Attention, and 32-expert MoE in a single decoder-only architecture under 1B parameters. The model, trained on a single NVIDIA L4 GPU for ~$50, achieves 27.5 validation perplexity and demonstrates efficient scaling via a single-line config update, with all components implemented from scratch in PyTorch. Titan v2's first training cycle is now complete, and dataset expansion is underway.

arxiv arXiv cs.LG · 6d ago

LLM Alignment Using Implicit User Feedback

A new dataset, IFLLM, collects mouse trajectories and eye-gaze data from users interacting with LLMs. It shows that implicit feedback significantly improves LLM alignment, boosting text-based reward model accuracy from 55% to 64% and nearly tripling response quality improvements after DPO training on eight LLMs.
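As background on the DPO step mentioned above, the sketch below computes the standard DPO loss for a single preference pair. It shows only the generic objective; how IFLLM turns mouse and gaze signals into preference pairs is not described in this summary, and the argument names and the default `beta=0.1` are assumptions for illustration.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    # How much more the policy prefers the chosen response than the reference does.
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    # -log(sigmoid(beta * margin)) written as a numerically stable softplus.
    z = -beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy and the reference agree exactly, the margin is zero and the loss equals log 2; the loss falls as the policy shifts probability toward the chosen response relative to the reference.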

arxiv arXiv cs.AI · 6d ago

ScaffoldAgent: Utility-Guided Dynamic Outline Optimization

ScaffoldAgent introduces a utility-guided framework for dynamic outline optimization in open-ended deep research. It models outline evolution through Expansion, Contraction, and Revision operations, guided by a feedback mechanism that evaluates retrieval gain, structural coherence, and generation quality. Experiments show it improves long-form report generation and factual grounding compared to existing agents.

arxiv arXiv cs.CL · 8d ago

Routing Accuracy Degradation and Recovery in Enterprise Agent Systems

As enterprise agent tool catalogs scale from 10 to 110 agents, routing accuracy drops 16 to 23 percentage points on under-specified requests. An oracle analysis identifies retrieval and confusion gaps, with embedding-based shortlisting recovering +10 to 11 pp F1. A human-annotated study of 1,435 utterances confirms real-world recovery of +10 to 17 pp despite lower absolute performance.
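A rough sketch of what embedding-based shortlisting can look like: rank agents by cosine similarity to the request and pass only the top k to the router. The agent names, vectors, and the `shortlist` helper are hypothetical; the paper's actual embedding model and shortlist size are not given in this summary.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def shortlist(
    request_vec: Sequence[float],
    agent_vecs: Dict[str, Sequence[float]],
    k: int,
) -> List[str]:
    """Return the k agent names whose embeddings are most similar to the request."""
    ranked = sorted(
        agent_vecs,
        key=lambda name: cosine(request_vec, agent_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical agent description embeddings (toy 3-dim vectors).
agents = {
    "billing": [1.0, 0.0, 0.0],
    "hr": [0.0, 1.0, 0.0],
    "it_support": [0.0, 0.0, 1.0],
    "payroll": [0.8, 0.6, 0.0],
}
candidates = shortlist([0.9, 0.2, 0.1], agents, k=2)
```

The router then only has to choose among `candidates` rather than the full catalog, which is the intuition behind recovering accuracy as the number of agents grows.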

arxiv arXiv cs.CL · 1d ago

Quality-Aware Training Data Selection for Scientific Summarization

We construct and release a large biomedical dataset with 1.88 million PMC articles. Analysis shows author-written abstracts vary in quality and alignment with source articles, enabling effective training-data selection. Training on high-quality subsets outperforms random sampling and matches larger random subsets on factuality metrics.

arxiv arXiv cs.CL · 1d ago

PORTER: Language-Grounded Event Representations for Portable EHR Foundation Models

PORTER introduces a language-grounded structured EHR foundation model that represents clinical events via descriptions instead of fixed vocabularies. It achieves superior performance across 74 pediatric prediction tasks and transfers effectively to new vocabularies without retraining, recovering 97.1% of target AUROC and outperforming fixed-vocabulary models on MIMIC, with 329-fold lower compute than text serialization approaches.

arxiv arXiv cs.CL · 1d ago

LoRA Monitor Calibration Fails with Top-1 in Diffusion LMs

Top-1 argmax concentration fails as a collapse warning in LoRA-optimized diffusion language models, showing zero precision across 816 configurations. Max LoRA gradient norm outperforms this baseline, achieving 0.68 precision and 0.79 F1 on a held-out LLaDA split, though results are limited to short-horizon, family-specific inspections.

arxiv arXiv cs.CL · 1d ago

Holistic Data Scheduler for LLM Pre-training via Multi-Objective Reinforcement Learning

HDS introduces a multi-objective reinforcement learning framework for online data mixing in LLM pre-training. It achieves 44% fewer training iterations on The Pile benchmark and improves MMLU 0-shot performance by 7.2%, with consistent gains across other benchmarks.

arxiv arXiv cs.CL · 1d ago

InterAligner: Progressive Alignment for ASR

InterAligner introduces an intermediate aligner objective and InterCTC loss to enable progressive alignment formation in deep ASR models. On LibriSpeech with a 17-layer Conformer, it reduces WER from 5.0/7.8 to 3.1/5.6, with significant improvements on long utterances.

arxiv arXiv cs.CL · 1d ago

BehaviorBench Launches Benchmark for Behavioral AI Models

BehaviorBench introduces a comprehensive benchmark to evaluate foundation models across four behavioral science capabilities: behavior prediction, strategic decision-making, subject-trait inference, and knowledge application. It assesses models at both individual and distributional levels, revealing that behavioral foundation models like Be.FM-1.5 achieve stronger distributional alignment than general-purpose models, highlighting the need for distributional evaluation in behavioral AI.

arxiv arXiv cs.CL · 1d ago

CORE-BREW: LLR-Based Soft Decoding for Robust Multi-Bit LLM Watermarking

CORE-BREW introduces a soft-decision decoding method using calibrated log-likelihood ratios to enable robust multi-bit watermarking in LLMs. It achieves consistent hit rates and improved false-positive control through strict and FPR-calibrated detection modes, outperforming prior baselines under token-level edits and paraphrasing while preserving semantic quality.

arxiv arXiv cs.CL · 1d ago

Pāninian Foundation for Indic Language Processing

A new benchmark suite proposes leveraging Pānini's ancient grammar as a unifying framework for Indic language processing. This approach aims to improve accuracy, data efficiency, and transferability by grounding NLP tools in a shared morphosyntactic architecture. The framework raises questions about whether neural models internally represent Pānini's linguistic categories.

arxiv arXiv cs.CL · 1d ago

Agon: Autonomous Research System via Prompt Economy

Agon is an autonomous research system that uses prompt economy to validate checkable claims in workflows, leaving judgment to human scientists. It operates across 444 iterations with minimal prompts and no human-written code, revealing a taxonomy of failures by severity, fixability, visibility, and capability locus. The system demonstrates scalability and advances research toward a paradigm where machines handle scale and humans guide judgment.

arxiv arXiv cs.CL · 1d ago

Decoherence as Defence in Quantum Neural Networks for Intrusion Detection

A rigorous N-qubit theory proves that depolarising noise in stochastic quantum neural networks contracts Pauli read-outs exponentially, enabling robust anomaly detection. On the NSL-KDD dataset, such noise achieves significant adversarial resilience without catastrophic collapse, outperforming noiseless models and classical detectors under FGSM and PGD attacks, with reduced robustness variance and a train-test gap reduction of approximately 0.01.
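To make the contraction claim concrete, here is a short worked derivation under a simplifying assumption: a global depolarising channel with strength p applied after each of L circuit layers. The paper's own noise model (for example local or stochastic noise) may differ, so this is an illustration of the mechanism and not a restatement of its theorem.

```latex
% Assumption: global depolarising channel on N qubits with strength p.
\mathcal{D}_p(\rho) = (1-p)\,\rho + p\,\frac{I}{2^N}

% Any non-identity Pauli string P is traceless, so one application
% shrinks its expectation value by a factor (1-p):
\operatorname{Tr}\!\bigl[P\,\mathcal{D}_p(\rho)\bigr]
  = (1-p)\,\operatorname{Tr}[P\rho] + \frac{p}{2^N}\operatorname{Tr}[P]
  = (1-p)\,\operatorname{Tr}[P\rho]

% The channel commutes with unitary conjugation,
% \mathcal{D}_p(U\rho U^\dagger) = U\,\mathcal{D}_p(\rho)\,U^\dagger,
% so L noisy layers contract the read-out exponentially in L:
\langle P \rangle_{\text{noisy}} = (1-p)^{L}\,\langle P \rangle_{\text{ideal}}
```

Under this assumption every Pauli read-out is scaled toward zero by the same factor, which bounds how far any input perturbation can move the output.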

arxiv arXiv cs.CL · 1d ago

SURGELLM: Task-Aware Feature Gating with Class-Balanced Normalization

SURGELLM introduces a unified transformer framework with surgical feature gating, task-conditioned prefix tokens, and Instance-Weighted Normalization to address inductive bias mismatches, class imbalance, and lack of lexical knowledge integration. The IWN variant achieves macro-F1 of 0.940 across four tasks, outperforming baselines by 0.036 overall and 0.130 on authorship detection, with gains confirmed as lexical rather than parametric.

arxiv arXiv cs.CL · 1d ago

Transformer Models: Architectures, Applications, and Critical Assessment

This review presents a taxonomy of transformer-based language models across domain verticals, covering encoder-only, decoder-only, encoder-decoder, long-context, permutation-based, and generator-discriminator variants. It evaluates post-2023 advancements like instruction tuning and mixture-of-experts scaling, and assesses model deployments in healthcare, finance, legal, education, customer service, creative writing, and scientific work, linking each to specific capabilities. The paper critically analyzes model architectures on four key deployment axes, quantifies parameter count versus energy cost, and examines how alignment methods, data provenance, and benchmark saturation define 'state of the art'.

arxiv arXiv cs.CL · 1d ago

PETRA: Dataset and Pipeline for Petroleum Engineering Text Adaptation

PETRA transforms public web text into a curated petroleum engineering corpus with synthetic supervision for dense retrieval and reranking. It improves in-domain nDCG from 0.703 to 0.763 and boosts Earth Science benchmark performance by 44% and a six-task reasoning panel by 23%.
