Research paper — korshunov.ai

Native binary embeddings outperform post-hoc binarization

A small-scale experiment finds that native binary embedding models achieve better retrieval than float models binarized post hoc. On SciFact, measured by Recall@10, native binary models (2048-dim and 4096-dim) outperform post-hoc binary models by 17% and 25%, respectively, with significant speed and memory advantages in indexing.
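For readers unfamiliar with the terms, here is a minimal sketch of what post-hoc binarization and Recall@k usually mean. It assumes the common sign-threshold scheme (one bit per dimension, Hamming distance for ranking); the summary does not say which scheme the experiment used. The function names and toy vectors are hypothetical.

```python
from typing import List, Sequence, Set


def binarize(vec: Sequence[float]) -> int:
    """Post-hoc binarization (assumed scheme): keep only the sign of each
    dimension and pack the resulting bits into a single integer."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two packed codes differ."""
    return bin(a ^ b).count("1")


def top_k(query: int, docs: List[int], k: int) -> List[int]:
    """Indices of the k documents closest to the query in Hamming distance."""
    return sorted(range(len(docs)), key=lambda i: hamming(query, docs[i]))[:k]


def recall_at_k(retrieved: List[int], relevant: Set[int], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


# Toy float embeddings (hypothetical values, 4 dimensions each).
doc_codes = [binarize(v) for v in (
    [0.9, -0.2, 0.4, -0.7],
    [-0.5, 0.3, -0.1, 0.8],
    [0.7, -0.6, 0.2, -0.1],
)]
query_code = binarize([0.8, -0.1, 0.5, -0.3])
ranking = top_k(query_code, doc_codes, k=2)
```

A native binary model would emit such codes directly from training rather than thresholding a float vector afterwards; the ranking and Recall@k steps stay the same.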

arxiv arXiv cs.CL · 2d ago

OpenBioRQ: Benchmark for Agentic Biomedical Research Faithfulness

OpenBioRQ introduces a benchmark of 12,553 unsolved biomedical research questions across 12 domains, designed to test agentic models' faithfulness and abstention. It evaluates models in a tool-using setting without answer keys, using real follow-up evidence rather than parametric knowledge. It reveals significant agentic collapse on the hardest questions, where models stop using tools even though tools are critical.

media Hugging Face Forums · 3d ago

I built a novel triple-hybrid LLM under 1B parameters for ~$50

Mateusz has developed a full pre-trained language model, Project Inkblot's Titan v1, combining Mamba SSM, Multi-Head Attention, and 32-expert MoE in a single decoder-only architecture under 1B parameters. The model, trained on a single NVIDIA L4 GPU for ~$50, achieves 27.5 validation perplexity and demonstrates efficient scaling via a single-line config update, with all components implemented from scratch in PyTorch. Titan v2's first training cycle is now complete, and dataset expansion is underway.

arxiv arXiv cs.LG · 6d ago

LLM Alignment Using Implicit User Feedback

A new dataset, IFLLM, collects mouse-trajectory and eye-gaze data from users interacting with LLMs. It shows that implicit feedback significantly improves LLM alignment, boosting text-based reward model accuracy from 55% to 64% and nearly tripling response quality improvements after DPO training on eight LLMs.

arxiv arXiv cs.AI · 6d ago

ScaffoldAgent: Utility-Guided Dynamic Outline Optimization

ScaffoldAgent introduces a utility-guided framework for dynamic outline optimization in open-ended deep research. It models outline evolution through Expansion, Contraction, and Revision operations, guided by a feedback mechanism that evaluates retrieval gain, structural coherence, and generation quality. Experiments show it improves long-form report generation and factual grounding compared to existing agents.

arxiv arXiv cs.CL · 8d ago

Routing Accuracy Degradation and Recovery in Enterprise Agent Systems

As enterprise agent tool catalogs scale from 10 to 110 agents, routing accuracy drops 16 to 23 percentage points on under-specified requests. An oracle analysis identifies retrieval and confusion gaps, with embedding-based shortlisting recovering 10 to 11 pp of F1. A human-annotated study of 1,435 utterances confirms real-world recovery of 10 to 17 pp despite lower absolute performance.

arxiv arXiv cs.LG · 20h ago

JS Divergence Enhances GRPO Autoregressive Text-to-Image Alignment

A study introduces JS divergence in GRPO-style autoregressive text-to-image alignment, showing it effectively balances policy optimization and generation diversity. Experiments on LlamaGen and Janus-7B demonstrate JS divergence achieves top or competitive performance across metrics while preserving diverse outputs.
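As background, the standard Jensen-Shannon (JS) divergence between two discrete distributions can be sketched as follows. This is the textbook definition only; how the study plugs it into the GRPO objective is not described in the summary, and the function names are illustrative.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """JS(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Toy next-token distributions for a policy and a reference model (hypothetical).
policy = [0.7, 0.2, 0.1]
reference = [0.5, 0.3, 0.2]
penalty = js_divergence(policy, reference)
```

Unlike KL, JS is symmetric and bounded above by ln 2, so the penalty stays finite even when the two distributions barely overlap.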

arxiv arXiv cs.LG · 21h ago

Deep Learning with O(log N) Parallel Time Complexity

Hierarchical Block-Local Learning (HBLL) enables deep neural network training in O(log N) parallel time complexity, eliminating the need for full backpropagation. HBLL decomposes networks into hierarchically linked blocks and achieves competitive performance on vision and language tasks, with extensions to recurrent architectures.

arxiv arXiv cs.LG · 21h ago

Privacy-Preserving Federated Temporal Graph Learning for Cyber-Resilient IoMT

The paper introduces Federated TGCN-A2C, a privacy-preserving framework that achieves 99.48% and 99.61% test accuracy on the CICDDoS 2019 and TON-IoT benchmarks, respectively, outperforming Fed-Inforce-Fusion by 0.21 percentage points. The framework combines anomaly detection, digital twin-based scoring, adaptive action selection, and an enhanced honeypot layer. All major attack classes achieve F1 scores above 0.92 and 0.94, respectively, and post-hoc explainability is provided via SHAP, LIME, Grad-CAM, and counterfactual analysis.

arxiv arXiv cs.CL · 22h ago

AI-PAVE-Br: LLM-Based PAVE for Brazilian E-Commerce

AI-PAVE-Br uses large language models to enhance product attribute value extraction in Brazilian e-commerce. The system outperforms traditional NER methods, with a new Golden Set dataset providing a manually annotated benchmark for Portuguese product data.

arxiv arXiv cs.CL · 22h ago

DREAM: Autoregressive Training for Dense Retrieval Embeddings

DREAM uses autoregressive next-token prediction to supervise dense retrieval embedding training. It injects query-document similarity scores into a frozen LLM's attention heads, enabling gradient backpropagation for retriever optimization. DREAM outperforms baselines on BEIR and RTEB benchmarks across model scales.

arxiv arXiv cs.CL · 22h ago

CANDLE: Lightweight Arabic Noise Deduplication via CTC

CANDLE is a lightweight system that uses Connectionist Temporal Classification to deduplicate repeated characters in Arabic text, without relying on handcrafted rules or morphological analyzers. It achieves a Sentence Error Rate of 5.37% and reduces tokenizer fertility by up to 12.8%, lowering inference costs and improving context window usage.

arxiv arXiv cs.CL · 22h ago

Micro-Transaction Markets for Verified Product Info in Agentic E-Commerce

The bottleneck for autonomous agents in e-commerce is a scarcity of trustworthy product information, not product matching. A proposed micro-transaction model lets agents pay fractions of a cent to access verified data such as service histories and test reports, with pricing and trust scored via reputation. This system prioritizes genuine product quality and real-time information acquisition over chatbot fluency.

arxiv arXiv cs.CL · 23h ago

L3Cube-MahaPOS: Marathi POS Tagging Dataset and BERT Models

L3Cube-MahaPOS introduces a gold-standard part-of-speech tagging dataset for Marathi, comprising 32,354 manually annotated sentences from news text. It uses a 16-tag Universal Dependencies scheme and benchmarks six model families, achieving 88.67% token-level accuracy and 81.67% macro-F1 on 15 tag classes using MahaBERT-v2.

arxiv arXiv cs.CL · 23h ago

Quality-Aware Training Data Selection for Scientific Summarization

The authors construct and release a large biomedical dataset of 1.88 million PMC articles. Their analysis shows that author-written abstracts vary in quality and in alignment with the source articles, which enables effective training-data selection. Training on high-quality subsets outperforms random sampling and matches larger random subsets on factuality metrics.

arxiv arXiv cs.CL · 23h ago

PORTER: Language-Grounded Event Representations for Portable EHR Foundation Models

PORTER introduces a language-grounded structured EHR foundation model that represents clinical events via descriptions instead of fixed vocabularies. It achieves superior performance across 74 pediatric prediction tasks and transfers effectively to new vocabularies without retraining, recovering 97.1% of target AUROC and outperforming fixed-vocabulary models on MIMIC, with 329-fold lower compute than text serialization approaches.

arxiv arXiv cs.CL · 23h ago

LoRA Monitor Calibration Fails with Top-1 in Diffusion LMs

Top-1 argmax concentration fails as a collapse warning in LoRA-optimized diffusion language models, showing zero precision across 816 configurations. Max LoRA gradient norm outperforms this baseline, achieving 0.68 precision and 0.79 F1 on a held-out LLaDA split, though results are limited to short-horizon, family-specific inspections.

arxiv arXiv cs.CL · 23h ago

Holistic Data Scheduler for LLM Pre-training via Multi-Objective Reinforcement Learning

HDS introduces a multi-objective reinforcement learning framework for online data mixing in LLM pre-training. It achieves 44% fewer training iterations on The Pile benchmark and improves MMLU 0-shot performance by 7.2%, with consistent gains across other benchmarks.

arxiv arXiv cs.CL · 23h ago

InterAligner: Progressive Alignment for ASR

InterAligner introduces an intermediate aligner objective and InterCTC loss to enable progressive alignment formation in deep ASR models. On LibriSpeech with a 17-layer Conformer, it reduces WER from 5.0/7.8 to 3.1/5.6, with significant improvements on long utterances.
