Research paper — korshunov.ai

Native binary embeddings outperform post-hoc binarization

A small-scale experiment shows that native binary embedding models achieve better retrieval than post-hoc binarization of float models. On SciFact Recall@10, native binary models (2048-dim and 4096-dim) outperform post-hoc binary models by 17% and 25% respectively, with significant speed and memory advantages in indexing.
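For readers unfamiliar with the baseline, here is a minimal sketch of post-hoc binarization followed by Hamming-distance retrieval. The summary does not say how the float models were binarized; sign thresholding is assumed here as one common scheme, and the vectors and the helper names `binarize`, `hamming`, and `top_k` are hypothetical.

```python
def binarize(vec):
    """Post-hoc binarization (assumed scheme): keep one bit per dimension,
    1 if the float value is positive, else 0, packed into a Python int."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two packed binary codes."""
    return bin(a ^ b).count("1")


def top_k(query, docs, k):
    """Return indices of the k documents closest to the query in Hamming distance."""
    q = binarize(query)
    codes = [binarize(d) for d in docs]
    ranked = sorted(range(len(docs)), key=lambda i: hamming(q, codes[i]))
    return ranked[:k]


# Toy float embeddings, invented for illustration only.
docs = [
    [0.9, -0.2, 0.4, -0.7],
    [-0.5, 0.3, -0.1, 0.8],
    [0.6, -0.1, -0.3, -0.9],
]
query = [0.8, -0.4, 0.5, -0.6]

print(top_k(query, docs, 2))
```

Each dimension shrinks from a 32-bit float to a single bit, which is where the memory savings of any binary index come from. A natively binary model is trained to produce such codes directly rather than having them derived after the fact.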

arXiv cs.CL · 2d ago

OpenBioRQ: Benchmark for Agentic Biomedical Research Faithfulness

OpenBioRQ introduces a benchmark of 12,553 unsolved biomedical research questions across 12 domains, designed to test agentic models' faithfulness and abstention. It evaluates models in a tool-using setting without answer keys, using real follow-up evidence rather than parametric knowledge, and reveals significant agentic collapse on the hardest questions, where models stop using tools even though tools are critical.

Hugging Face Forums · 3d ago

I built a novel triple-hybrid LLM under 1B parameters for ~$50

Mateusz has developed a full pre-trained language model, Project Inkblot's Titan v1, combining Mamba SSM, Multi-Head Attention, and 32-expert MoE in a single decoder-only architecture under 1B parameters. The model, trained on a single NVIDIA L4 GPU for ~$50, achieves 27.5 validation perplexity and demonstrates efficient scaling via a single-line config update, with all components implemented from scratch in PyTorch. Titan v2's first training cycle is now complete, and dataset expansion is underway.

arXiv cs.LG · 6d ago

LLM Alignment Using Implicit User Feedback

A new dataset, IFLLM, collects mouse-trajectory and eye-gaze data from users interacting with LLMs. It shows that implicit feedback significantly improves LLM alignment, boosting text-based reward model accuracy from 55% to 64% and nearly tripling response quality improvements after DPO training on eight LLMs.

arXiv cs.AI · 6d ago

ScaffoldAgent: Utility-Guided Dynamic Outline Optimization

ScaffoldAgent introduces a utility-guided framework for dynamic outline optimization in open-ended deep research. It models outline evolution through Expansion, Contraction, and Revision operations, guided by a feedback mechanism that evaluates retrieval gain, structural coherence, and generation quality. Experiments show it improves long-form report generation and factual grounding compared to existing agents.

arXiv cs.CL · 8d ago

Routing Accuracy Degradation and Recovery in Enterprise Agent Systems

As enterprise agent tool catalogs scale from 10 to 110 agents, routing accuracy drops 16 to 23 percentage points on under-specified requests. An oracle analysis identifies retrieval and confusion gaps, with embedding-based shortlisting recovering 10 to 11 pp of F1. A human-annotated study of 1,435 utterances confirms a real-world recovery of 10 to 17 pp despite lower absolute performance.
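As a rough illustration of embedding-based shortlisting, the sketch below ranks agents by cosine similarity between a request embedding and each agent's description embedding, then passes only the top k to the router. The summary does not specify the paper's pipeline; the agent names, the three-dimensional vectors, and the `shortlist` helper are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def shortlist(request_vec, agent_vecs, k):
    """Return the k agent names whose embeddings are most similar to the request."""
    ranked = sorted(
        agent_vecs,
        key=lambda name: cosine(request_vec, agent_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy description embeddings, invented for illustration only.
agent_vecs = {
    "billing": [0.9, 0.1, 0.0],
    "hr": [0.1, 0.9, 0.1],
    "it_support": [0.0, 0.2, 0.9],
    "invoicing": [0.8, 0.3, 0.1],
}
request_vec = [1.0, 0.2, 0.0]

print(shortlist(request_vec, agent_vecs, 2))
```

Narrowing a catalog of 110 agents to a handful of candidates means the final routing decision only has to separate a few similar agents, which is the confusion gap the summary refers to.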

arXiv cs.CL · 1d ago

MorfFlex: Managing Rich Morphology in Czech

MorfFlex is a morphological dictionary architecture designed for languages with complex inflection and derivation. MorfFlex CZ, its primary implementation, contains over 100 million wordforms and more than 1 million lemmas, stored compactly through encoded inflectional and derivational patterns. It supports annotation consistency in the Prague Dependency Treebanks and powers tools like MorphoDiTa.

arXiv cs.CL · 1d ago

Stability of Prompt Ranking in LLM Evaluation

Prompt rankings in large language model evaluation are often unstable under minor variations like random seeds and limited subsets. A stability-aware selection strategy using lower confidence bounds improves robustness by accounting for both performance and variance, while maintaining competitiveness in stable settings.
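To make the idea of a lower confidence bound concrete, here is a minimal sketch that scores each prompt by its mean minus a multiple of its standard error across repeated runs. The exact bound used in the paper is not given in the summary; the formula, the `z` multiplier, and the scores below are assumptions for illustration.

```python
import math
import statistics


def lower_confidence_bound(scores, z=1.0):
    """Mean minus z standard errors: penalizes prompts whose scores vary across runs."""
    mean = statistics.mean(scores)
    stderr = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean - z * stderr


def select_prompt(runs, z=1.0):
    """Pick the prompt with the highest lower confidence bound."""
    return max(runs, key=lambda name: lower_confidence_bound(runs[name], z))


# Hypothetical accuracy per random seed for two prompts.
runs = {
    "prompt_a": [0.90, 0.60, 0.95, 0.59],  # higher mean, unstable
    "prompt_b": [0.75, 0.74, 0.76, 0.75],  # slightly lower mean, stable
}

best_by_mean = max(runs, key=lambda name: statistics.mean(runs[name]))
best_by_lcb = select_prompt(runs)
print(best_by_mean, best_by_lcb)
```

In this toy data, ranking by mean alone favors the unstable prompt, while the bound favors the stable one. When both prompts are equally stable, the penalty terms are similar and the choice falls back to the mean.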

arXiv cs.CL · 1d ago

AutoSpecNER: Fine-Grained NER Dataset for Vehicle Specifications

AutoSpecNER is a dataset of 659 car advertisements with over 10,000 entities annotated across 15 categories. It achieves 91.5% inter-annotator agreement and shows that DeBERTa outperforms both rule-based methods and large language models in vehicle specification extraction, reaching a 90% micro-F1 score.

arXiv cs.CL · 1d ago

LLM-based Two-Stage Transformer for Bearing Fault Diagnosis

A lightweight GPT-2-style Transformer enables hierarchical feature extraction from vibration signals. The framework achieves 92.61% average accuracy using only 10% labeled data, outperforming state-of-the-art methods by 17.24 percentage points in cross-domain bearing fault diagnosis.

arXiv cs.CL · 1d ago

RaDaR: AI Model Improves Rare Disease Diagnosis

RaDaR, a compact reasoning large language model, outperformed other open-source models in rare disease diagnosis. In a randomized trial, RaDaR improved physicians' diagnostic accuracy by 21.44 percentage points over internet search alone.

arXiv cs.CL · 1d ago

Poster: Exploring Audio-Based Scam Detection in Turkish

This research introduces the first public multi-modal dataset of 100 aligned audio-transcript pairs for Turkish scam and benign calls. It evaluates seven large language models under raw audio, automatic, and human-corrected transcript inputs, finding that transcript-based inputs outperform direct audio processing, with human correction having minimal impact.

arXiv cs.CL · 1d ago

AdversaBench: Automated LLM Red-Teaming with Multi-Judge Confirmation

AdversaBench introduces an end-to-end red-teaming pipeline that generates adversarial prompts via five structured operators, evaluates target models, and confirms failures through a three-judge panel with meta-judge tiebreaker. Experiments on 45 seed prompts across reasoning, instruction-following, and tool use show every seed produces a confirmed failure, with operator effectiveness, failure iteration counts, judge agreement, and cross-model transferability revealing key patterns in LLM vulnerability.

arXiv cs.CL · 1d ago

Qwen-AgentWorld: Language World Models for General Agents

Qwen-AgentWorld-35B-A3B and Qwen-AgentWorld-397B-A17B are the first language world models that simulate agentic environments across seven domains using long chain-of-thought reasoning. Trained via a three-stage pipeline (CPT, SFT, and RL), these models outperform existing frontier models on AgentWorldBench, a benchmark derived from real-world interactions of five models on nine established tasks.

arXiv cs.CL · 1d ago

SIFT and WSP Improve Fact-Checking Accuracy

SIFT introduces claim-conditioned re-scoring of evidence spans to better align with full claims, recovering up to 27.6 points in accuracy on FEVER, SciFact, 5PILS, and DP. WSP, an automatic NLI check, achieves AUC 0.92 and precision 0.98 when calibrating against human gold evidence.

arXiv cs.AI · 1d ago

MedLayXPlain: Benchmarking Expert-Lay Gap in Medical Vision-Language Models

MedLayXPlain introduces the first large-scale benchmark for medical lay language generation, featuring 122,789 region-grounded samples across eight imaging modalities. It evaluates medical vision-language models on expert-lay alignment using a hierarchical ontology system and a lightweight evaluator, revealing a systematic gap: expert-level performance in captioning coexists with significant degradation in lay language, while general-purpose models lack clinical precision.

arXiv cs.AI · 1d ago

QBioFusion-QSAR: Quantum Kernel Learning for Small-Data Ligand Classification

QBioFusion-QSAR integrates a quantum fidelity kernel with Morgan/Tanimoto fingerprints to improve ligand classification. On the PsychLight-A benchmark, QMKL increased accuracy and MCC compared to Morgan/Tanimoto alone, with improvements attributed to better predictions of molecules with activity cliffs, such as N-Me-5-HT and N-Me-tryptamine. Auditable analysis confirms localized quantum-kernel contributions in small-data settings.
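The classical side of this comparison rests on Tanimoto similarity between molecular fingerprints. A minimal sketch follows, representing each fingerprint as the set of its on-bit indices; the bit sets are invented for illustration, and the quantum fidelity kernel and how the paper fuses it with this similarity are not shown because the summary does not describe them.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two fingerprints given as sets of on-bit indices."""
    a, b = set(a), set(b)
    union = len(a | b)
    if union == 0:
        return 1.0  # convention assumed here: two empty fingerprints count as identical
    return len(a & b) / union


# Hypothetical on-bit indices standing in for Morgan fingerprints of two molecules.
fp_x = {1, 4, 7, 9}
fp_y = {1, 4, 8, 9, 12}

print(tanimoto(fp_x, fp_y))
```

Activity cliffs are pairs of molecules that score as highly similar under a measure like this yet differ in activity, which is why a similarity-only kernel tends to mispredict them.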

arXiv cs.AI · 1d ago

Topological Neural Dynamics: Neuron-wise Sequence Modeling

Topological Neural Dynamics (TND) introduces a neuron-wise framework for sequence modeling, where each neuron evolves independently through a directed graph structure. In a single-player Pong behavior cloning task, TND achieves a mean of 17.47 consecutive catches per round, surpassing all baseline models by more than a factor of three.

arXiv cs.AI · 1d ago

NASDAQ: Normalized Observation Space Dynamics-Augmented Q-Learning

NASDAQ addresses low-dimensional observation challenges in reinforcement learning by normalizing observation spaces to balance reconstruction losses across dimensions. The framework combines value learning with short-term value and next observation prediction, achieving competitive or superior performance with less training time compared to existing methods.
