Research paper — korshunov.ai

OpenBioRQ: Benchmark for Agentic Biomedical Research Faithfulness

OpenBioRQ introduces a benchmark of 12,553 unsolved biomedical research questions across 12 domains, designed to test the faithfulness and abstention of agentic models. It evaluates models in a tool-using setting without answer keys, scoring them against real follow-up evidence rather than parametric knowledge. The results reveal significant agentic collapse on the hardest questions, where models stop using tools even though tools are critical.

media Hugging Face Forums · 3d ago

I built a novel triple-hybrid LLM under 1B parameters for ~$50

Mateusz has built a fully pre-trained language model, Project Inkblot's Titan v1, that combines Mamba SSM, multi-head attention, and a 32-expert MoE in a single decoder-only architecture under 1B parameters. Trained on a single NVIDIA L4 GPU for ~$50, the model reaches a validation perplexity of 27.5 and can be scaled through a single-line config update; all components are implemented from scratch in PyTorch. Titan v2's first training cycle is now complete, and dataset expansion is underway.
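
For readers unfamiliar with the metric, validation perplexity is conventionally the exponential of the mean per-token negative log-likelihood. The sketch below illustrates that definition only; the function name and the sample values are hypothetical and are not taken from the Titan code.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


# Hypothetical values: a mean NLL of ln(27.5) nats per token
# corresponds to a perplexity of 27.5.
example_nlls = [math.log(27.5)] * 4
print(round(perplexity(example_nlls), 1))
```

A lower perplexity means the model assigns higher probability to the held-out tokens; a perplexity of 1.0 would mean every token was predicted with certainty.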

arxiv arXiv cs.LG · 6d ago

LLM Alignment Using Implicit User Feedback

A new dataset, IFLLM, collects mouse-trajectory and eye-gaze data from users interacting with LLMs. It shows that implicit feedback significantly improves LLM alignment, boosting text-based reward model accuracy from 55% to 64% and nearly tripling response quality improvements after DPO training on eight LLMs.

arxiv arXiv cs.AI · 6d ago

ScaffoldAgent: Utility-Guided Dynamic Outline Optimization

ScaffoldAgent introduces a utility-guided framework for dynamic outline optimization in open-ended deep research. It models outline evolution through Expansion, Contraction, and Revision operations, guided by a feedback mechanism that evaluates retrieval gain, structural coherence, and generation quality. Experiments show it improves long-form report generation and factual grounding compared to existing agents.

arxiv arXiv cs.CL · 8d ago

Routing Accuracy Degradation and Recovery in Enterprise Agent Systems

As enterprise agent tool catalogs scale from 10 to 110 agents, routing accuracy drops 16–23 percentage points on under-specified requests. An oracle analysis identifies retrieval and confusion gaps, with embedding-based shortlisting recovering +10–11 pp F1. A human-annotated study of 1,435 utterances confirms real-world recovery of +10–17 pp despite lower absolute performance.
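
Embedding-based shortlisting can be pictured as ranking agents by similarity to the request and passing only the top few to the router. The sketch below is a minimal illustration using cosine similarity over hand-written vectors; the agent names, vectors, and the `shortlist` helper are all hypothetical and do not reproduce the paper's setup.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def shortlist(query_vec, agent_vecs, k=3):
    """Return the names of the k agents most similar to the query."""
    ranked = sorted(
        agent_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical toy embeddings; real systems would use a learned encoder.
agents = {
    "billing": [1.0, 0.0, 0.0],
    "hr": [0.0, 1.0, 0.0],
    "it_support": [0.7, 0.0, 0.7],
    "travel": [0.0, 0.0, 1.0],
}
print(shortlist([0.9, 0.0, 0.1], agents, k=2))
```

The router then chooses among the shortlisted agents only, which shrinks the candidate set as the catalog grows.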

arxiv arXiv cs.CL · 2d ago

Entity-level Membership Inference via LLM Interrogation

Researchers propose entity-level membership inference to determine if an LLM has been exposed to information about a real-world entity during training. By constructing prompts with limited entity clues and analyzing semantic features in generated responses, their five interrogation strategies achieve up to 0.97 AUC and improve Balanced Accuracy by 6.0%–17.5% over adapted baselines on person entities.
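
Balanced Accuracy, as commonly defined for binary classification, is the mean of the true-positive rate and the true-negative rate, which keeps the score meaningful when member and non-member entities are imbalanced. The sketch below shows that standard definition; the labels are made-up sample data, not results from the paper.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of recall on the positive class and recall on the negative class."""
    positives = sum(1 for t in y_true if t == 1)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    true_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return 0.5 * (true_pos / positives + true_neg / negatives)


# Hypothetical labels: 1 = entity seen in training, 0 = not seen.
labels = [1, 1, 1, 1, 0, 0]
predictions = [1, 1, 1, 0, 0, 1]
print(balanced_accuracy(labels, predictions))
```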

arxiv arXiv cs.CL · 2d ago

Language-Model Panel for Political Position Measurement in Data-Sparse Regions

A new method uses large language models as fallible raters in a panel to measure political positions in regions with sparse data. Adding written axis definitions improves score consistency and agreement among raters, while Krippendorff's alpha of 0.86 indicates high reliability across models and labs. Disagreements highlight interpretive issues, suggesting the method detects referent problems rather than measurement errors.

arxiv arXiv cs.CL · 2d ago

AI Recommendation Ownership: Empirical Map of Brand Category Ownership

A study of 3,750 queries across five industries finds moderate recommendation concentration, with a mean Gini coefficient of 0.28. Cross-model agreement on top-recommended brands was only 41.6%, and displacement scores varied by industry, ranging from 0.4:1 to 4.3:1. The results challenge the 'winner-takes-all' narrative and introduce three reproducible metrics for competitive-intelligence analysis.
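
As a reminder of what the concentration figure measures, the Gini coefficient is 0 when every brand receives the same share of recommendations and approaches 1 when a single brand receives nearly all of them. The sketch below uses the standard mean-absolute-difference formula; how the study aggregates recommendation counts is an assumption here, and the sample counts are invented.

```python
def gini(values):
    """Gini coefficient via the mean absolute difference between all pairs."""
    n = len(values)
    total = sum(values)
    if n == 0 or total <= 0 or any(v < 0 for v in values):
        raise ValueError("need non-negative values with a positive sum")
    abs_diff_sum = sum(abs(a - b) for a in values for b in values)
    # Equivalent to abs_diff_sum / (2 * n**2 * mean), since mean = total / n.
    return abs_diff_sum / (2 * n * total)


# Hypothetical recommendation counts per brand within one category.
print(gini([5, 5, 5, 5]))  # perfectly even
print(gini([0, 0, 0, 4]))  # one brand takes everything
```

With only four brands the maximum attainable value is 0.75 rather than 1, which is worth keeping in mind when comparing categories with different numbers of brands.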

media Hugging Face Forums · 2d ago

Coolest Theoretical AI Topics with Realistic AI System Basis

The discussion explores theoretical AI topics that have mathematical foundations and plausible implementation in current AI systems, such as large language models. Topics include reasoning chains, knowledge graphs, and probabilistic reasoning, all of which are grounded in formal math and show potential for real-world AI applications.

arxiv arXiv cs.CL · 2d ago

Flow-Matching TTS Model Simulates Lombard Effect

A flow-matching based text-to-speech model is introduced to simulate the Lombard effect, where humans speak louder and clearer in noisy environments. The model enables continuous, disentangled control of vocal effort and articulation, with word-level emphasis for clarity. Experiments show improved acoustic clarity and intelligibility in noisy conditions compared to baseline systems.

arxiv arXiv cs.CL · 2d ago

KDoS: Distribution-Optimized Synthesis for LLM Knowledge Expansion

KDoS introduces knowledge density to guide synthetic data generation through a three-stage feedback mechanism. Experiments on models from 0.6B to 16B and data scales from 1B to 5B tokens show that an optimal knowledge distribution consistently maximizes knowledge boundary expansion, is stable across model backbones, and outperforms baselines on six knowledge benchmarks.

arxiv arXiv cs.CL · 2d ago

CTC Oracle Gap: Acoustic Exhaustion and Linguistic Recovery

CTC-internal scoring shows no WER improvement over greedy decoding on LibriSpeech, with acoustic confidence failing to correlate with linguistic plausibility. MBR decoding using RoBERTa PLL achieves a 5.42% WER, outperforming greedy decoding by 0.535 pp, demonstrating that linguistic information can overcome CTC's saturation limit.
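
Word error rate (WER), the metric quoted above, is conventionally the word-level edit distance between hypothesis and reference divided by the number of reference words. The sketch below implements that standard definition; the sentences are illustrative and not drawn from LibriSpeech.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substitution and one deletion over six words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```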

arxiv arXiv cs.CL · 2d ago

Tmax: A Simple RL Recipe for Terminal Agents

Tmax presents the strongest open RL recipe for terminal agents, achieving 27% on Terminal-Bench 2.0 with only 9B parameters. It uses a novel data taxonomy to generate over 2.5x more terminal environments than prior datasets, enabling efficient training with a simple, outcome-only recipe. The dataset, models, and code are open-sourced at https://github.com/hamishivi/tmax.

arxiv arXiv cs.CL · 2d ago

WaveDetect: Framework for Machine-Generated Text Detection via Wavelet Transform

WaveDetect introduces a signal processing approach using continuous wavelet transforms to detect machine-generated text by identifying spectral fingerprints. It outperforms existing methods in accuracy and robustness across adversarial attacks, domain shifts, and evolving LLMs, demonstrating strong generalization on RAID, EvoBench, and Domain-Shift datasets.

arxiv arXiv cs.CL · 2d ago

Do LLM Embedding Spaces Recover Expert Structure?

Pretrained LLM embeddings show measurable alignment with expert-defined mental-health symptom structure. Fine-tuning enhances this alignment, especially at fine category levels, with larger model sizes improving both zero-shot performance and supervised gains. Residual alignment persists after controlling for linguistic and stylistic confounds, indicating expert structure recovery is level-dependent and requires explicit confound testing.

arxiv arXiv cs.CL · 2d ago

Militarized Language Rising in Scientific Abstracts

Between 2010 and 2025, militaristic terms in scientific abstracts increased by 48% in OpenAlex and 32% in PubMed, with a sharp rise after 2019. The use of such language is aligned with global conflict levels and grows fastest in Global South publications, particularly in social sciences and engineering. A controlled experiment shows that war-framing reduces perceived credibility, funding willingness, and policy support, with a minor increase in urgency.

arxiv arXiv cs.CL · 2d ago

SVD-Surgeon: Optimal Singular-Value Surgery for LLM Compression

SVD-Surgeon is a training-free method that applies the Optimal Brain Surgeon framework to singular-value decomposition. It computes a closed-form update for retained singular values to compensate for truncation, improving the perplexity-compression trade-off on OPT and LLaMA 2-7B models without retraining.

arxiv arXiv cs.CL · 2d ago

Tapered Language Models Improve Performance

Tapered Language Models (TLMs) allocate more parameters to earlier layers and fewer to later ones, reducing perplexity and boosting benchmark performance across architectures. This depth-aware capacity allocation improves language model outputs without adding compute or parameters, offering a simple, universal design principle.

arxiv arXiv cs.CL · 2d ago

NL2Scratch: Executable Benchmark for NL-to-Scratch Generation

NL2Scratch introduces an executable benchmark with 311,648 parser-valid NL-program pairs derived from real Scratch projects. It proposes Semantic Alignment Consistency (SAC) to measure semantic agreement, validating 23,594 examples and creating an 800-slot-balanced diagnostic benchmark. Experiments show a significant gap between lexical similarity and semantic alignment, with models achieving high token-level F1 often failing to reach perfect SAC, especially on longer examples.
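
The gap between lexical similarity and semantic alignment is easy to reproduce with a bag-of-tokens F1. The sketch below assumes a multiset-overlap definition of token-level F1, which may differ from the benchmark's exact scoring; the Scratch-like token sequences are invented for illustration.

```python
from collections import Counter


def token_f1(predicted_tokens, reference_tokens):
    """Bag-of-tokens F1: harmonic mean of token precision and recall."""
    if not predicted_tokens or not reference_tokens:
        return 0.0
    overlap = sum((Counter(predicted_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical programs: same tokens, swapped arguments, different behavior.
reference = ["repeat", "10", "move", "5"]
predicted = ["repeat", "5", "move", "10"]
print(token_f1(predicted, reference))
```

Here the two sequences share every token, so the lexical score is perfect even though the programs repeat a different number of times and move a different distance.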
