Research paper — korshunov.ai

No-Free-Fairness: Fundamental Limits in Learning Systems

The paper introduces 'No-Free-Fairness' theorems that prove three fundamental limits in learning systems. These include inherent fairness-cost trade-offs, unavoidable subgroup disparity in finite samples, and model expressivity constraints that prevent fairness regardless of data. The results show fairness is constrained by problem structure, data limits, and model capacity, not just biased data.

arxiv arXiv cs.LG · 8d ago

Meta-classification of one-class models via ranking and nearest neighbor

This paper proposes a meta-classification method for one-class classification models by representing them as normality rankings and using ranking correlation and nearest neighbor metrics. The approach achieves high accuracy in classifying models based on training datasets, algorithms, and hyperparameters, and works even when datasets share the same class. The method effectively classifies datasets by treating multiple samples as a single input, offering a unified solution for OCC models, datasets, and rankings.
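
As a rough illustration of the idea summarized above, the sketch below compares normality rankings with Spearman correlation and labels a query ranking by its single nearest neighbor. It is not the paper's implementation: the function names, the tie-free ranking assumption, and the toy labels are all hypothetical.

```python
from typing import Dict, List, Sequence


def spearman(rank_a: Sequence[int], rank_b: Sequence[int]) -> float:
    """Spearman correlation between two tie-free rankings of the same items."""
    n = len(rank_a)
    if n != len(rank_b) or n < 2:
        raise ValueError("rankings must have equal length of at least 2")
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def nearest_label(query: Sequence[int],
                  labelled: Dict[str, List[Sequence[int]]]) -> str:
    """Return the label of the stored ranking most correlated with the query."""
    best_label, best_rho = None, -2.0  # Spearman is always >= -1
    for label, rankings in labelled.items():
        for ranking in rankings:
            rho = spearman(query, ranking)
            if rho > best_rho:
                best_label, best_rho = label, rho
    if best_label is None:
        raise ValueError("no labelled rankings supplied")
    return best_label


# Hypothetical example: each list ranks the same 5 probe points by normality.
reference = {
    "model_family_a": [[1, 2, 3, 4, 5]],
    "model_family_b": [[5, 4, 3, 2, 1]],
}
predicted = nearest_label([1, 3, 2, 4, 5], reference)
```

Here the query ranking differs from the first reference by one adjacent swap, so it is assigned to `model_family_a`.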

arxiv arXiv cs.LG · 8d ago

McWC: Forecasting with Cyclicity, Trend, and Channel Correlation

McWC introduces a model that separately captures cyclicity, trend, and inter-channel correlations in long-term time series forecasting. It uses multi-layer cyclicity construction, wavelet decomposition, and a multi-layer perceptron to extract and fuse high- and low-frequency information, while decoupling intra-channel autocorrelations via frequency-domain loss. Experiments on six real-world datasets show McWC achieves state-of-the-art performance with high computational efficiency.

arxiv arXiv cs.LG · 8d ago

BLITZ: Fast and Calibrated Nonparametric Conditional Independence Test

BLITZ introduces a two-stage regression method for nonparametric conditional independence testing. It first removes broad smooth dependencies using polynomial regression, then applies shallow tree regressions to residualize nonlinear features, enabling accurate and fast testing with improved null calibration compared to existing methods.

arxiv arXiv cs.AI · 8d ago

Security and Privacy Prompts in User-LLM Conversations

A study of 14,727 security and privacy prompts from 3.2M real-world user-LLM conversations identifies nine categories of S&P questions. Thematic analysis and response testing show commercial LLMs outperform open models, with GPT 5.5 providing good responses on 98% of prompts versus Llama 4 at 47%, though some commercial models produce inconsistent responses across runs.

arxiv arXiv cs.AI · 8d ago

First Proof Second Batch: AI Tested on Research-Level Math Problems

A study evaluated several AI systems on ten research-level mathematics problems created by prominent mathematicians. The results include AI-generated solutions, human solutions, and referee reports, offering a detailed assessment of AI performance in solving advanced mathematical problems.

arxiv arXiv cs.CL · 8d ago

Can Language Models Discover Zero?

Language models of GPT-2 size cannot independently discover zero during testing, regardless of pretraining. However, performance improves significantly with training on tens to hundreds of zero examples, and language pretraining reduces required examples by about 50%.

arxiv arXiv cs.CL · 8d ago

Routing Accuracy Degradation and Recovery in Enterprise Agent Systems

As enterprise agent tool catalogs scale from 10 to 110 agents, routing accuracy drops 16 to 23 percentage points on under-specified requests. An oracle analysis identifies retrieval and confusion gaps, with embedding-based shortlisting recovering +10 to 11 pp F1. A human-annotated study of 1,435 utterances confirms real-world recovery of +10 to 17 pp despite lower absolute performance.
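
Embedding-based shortlisting, as mentioned in the summary, can be pictured as ranking agents by similarity to the request and passing only the top few to the router. The sketch below assumes cosine similarity over precomputed vectors; the agent names, vectors, and the `shortlist` helper are hypothetical and not taken from the paper.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def shortlist(query_vec: Sequence[float],
              agent_vecs: Dict[str, Sequence[float]],
              k: int = 3) -> List[str]:
    """Return the k agent names whose embeddings are closest to the query."""
    ranked = sorted(agent_vecs,
                    key=lambda name: cosine(query_vec, agent_vecs[name]),
                    reverse=True)
    return ranked[:k]


# Hypothetical 3-dimensional embeddings for three agents.
agents = {
    "billing": [1.0, 0.0, 0.0],
    "hr": [0.0, 1.0, 0.0],
    "it_support": [0.7, 0.3, 0.0],
}
candidates = shortlist([1.0, 0.05, 0.0], agents, k=2)
```

The router then chooses among `candidates` instead of the full catalog, which is the step the summary credits with recovering part of the lost accuracy.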

arxiv arXiv cs.CL · 8d ago

Prompt Perturbation for Reliable LLM Evaluation

A new framework uses prompt perturbation to identify and filter structurally inconsistent pairwise comparisons in large language model evaluations. By incorporating graph-level consistency checks before ranking aggregation, the method reduces cyclic preferences and improves the reliability of LLM rankings.
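
One way to picture the filtering step described above: collect each pair's verdicts across perturbed prompts, keep only pairs whose verdicts agree often enough, then check the remaining preference graph for cycles. This is a minimal sketch under those assumptions; the agreement threshold, data layout, and function names are hypothetical and do not reproduce the paper's framework.

```python
from collections import Counter
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def stable_edges(votes: Dict[Pair, List[str]],
                 min_agreement: float = 0.8) -> List[Pair]:
    """Keep a (winner, loser) edge only when enough perturbed prompts agree."""
    edges = []
    for (a, b), winners in votes.items():
        top, count = Counter(winners).most_common(1)[0]
        if count / len(winners) >= min_agreement:
            edges.append((top, b if top == a else a))
    return edges


def has_cycle(edges: List[Pair]) -> bool:
    """Depth-first search for a directed cycle in the preference graph."""
    graph: Dict[str, List[str]] = {}
    for winner, loser in edges:
        graph.setdefault(winner, []).append(loser)
        graph.setdefault(loser, [])
    state = {node: 0 for node in graph}  # 0 unvisited, 1 on stack, 2 done

    def visit(node: str) -> bool:
        state[node] = 1
        for nxt in graph[node]:
            if state[nxt] == 1 or (state[nxt] == 0 and visit(nxt)):
                return True
        state[node] = 2
        return False

    return any(state[n] == 0 and visit(n) for n in graph)


# Hypothetical verdicts: winner named by the judge under 5 perturbed prompts.
votes = {
    ("A", "B"): ["A"] * 5,
    ("B", "C"): ["B"] * 5,
    ("C", "A"): ["C", "A", "C", "A", "C"],  # unstable under perturbation
}
kept = stable_edges(votes)
```

With simple majority voting the three pairs form the cycle A > B > C > A; dropping the unstable C/A comparison leaves an acyclic graph that can be aggregated into a ranking.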

arxiv arXiv cs.CL · 9d ago

A Framework for Evaluating Agentic Skills at Scale

We present a framework for evaluating agentic skills by constructing realistic tasks and assessing skill utility through task execution. Applied to 500 real-world skills, it generates 1,000 tasks and scoring rubrics, evaluating 19 agent-model configurations across proprietary and open-source models. Results show significant variation in instruction adherence and performance gains, with skills substantially altering model behavior compared to no-skill setups.

arxiv arXiv cs.CL · 9d ago

Bilingual fine-tuning improves low-resource ASR with language identification

A study finds bilingual fine-tuning enhances automatic speech recognition in low-resource languages when language identification is accurate. Including a language identification token at inference improves ASR performance when identification accuracy is low, especially in diverse language pairs across different families and writing systems.

arxiv arXiv cs.CL · 9d ago

Non-negative Elastic Net Decoding for Information Retrieval

NNN decoding selects documents as a set whose embeddings jointly reconstruct the query embedding via a sparse non-negative linear combination. It strictly extends dense retrieval by handling queries that dense retrieval fails on, especially in corpora with correlated documents, and achieves superior performance through end-to-end training of embeddings.
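
To make the reconstruction idea concrete, the sketch below solves a generic non-negative elastic net by coordinate descent: it looks for weights w >= 0 minimizing 0.5*||q - sum_j w_j d_j||^2 + l1*sum(w) + 0.5*l2*||w||^2, and treats documents with non-zero weight as the selected set. The objective scaling, penalty values, and function names are assumptions for illustration, not the paper's decoder or training procedure.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def nn_elastic_net(query: Sequence[float],
                   docs: List[Sequence[float]],
                   l1: float = 0.05,
                   l2: float = 0.01,
                   sweeps: int = 200) -> List[float]:
    """Coordinate descent for a non-negative elastic net over document vectors."""
    weights = [0.0] * len(docs)
    residual = list(query)  # query minus the current weighted reconstruction
    for _ in range(sweeps):
        for j, d in enumerate(docs):
            norm = dot(d, d)
            # Correlation of doc j with the residual, excluding its own term.
            rho = dot(d, residual) + weights[j] * norm
            new_w = max(0.0, (rho - l1) / (norm + l2))
            delta = new_w - weights[j]
            if delta:
                residual = [r - delta * x for r, x in zip(residual, d)]
                weights[j] = new_w
    return weights


# Hypothetical 2-d embeddings: the query mixes two topics.
docs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
weights = nn_elastic_net([1.0, 1.0], docs)
selected = [j for j, w in enumerate(weights) if w > 0.0]
```

In this toy case the first two documents together reconstruct the query and receive positive weight, while the third, which points away from it, is clipped to zero by the non-negativity constraint.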

media r/LocalLLaMA · 9d ago

Glimmer 1: A 10,000-parameter foundational language model

Glimmer 1 is a 10,000-parameter language model trained on 500K tokens from FineWeb-Edu. It features a 512-token context window, a standard Llama architecture with 16 hidden dimensions, 2 layers, 4 attention heads, and 1 KV head using GQA, and is available on Hugging Face.
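
The stated shape allows a quick back-of-the-envelope count of the attention weights. The sketch below assumes bias-free Q, K, V, and output projections, as in a standard Llama block; the vocabulary size and MLP width are not given in the summary, so embeddings, MLP, and norm parameters are left out.

```python
def gqa_attention_params(hidden: int, n_heads: int, n_kv_heads: int) -> int:
    """Weights in one grouped-query attention block, assuming no biases."""
    if hidden % n_heads or n_heads % n_kv_heads:
        raise ValueError("hidden must divide by n_heads, n_heads by n_kv_heads")
    head_dim = hidden // n_heads
    q_proj = hidden * n_heads * head_dim
    kv_proj = 2 * hidden * n_kv_heads * head_dim  # K and V share the KV heads
    o_proj = n_heads * head_dim * hidden
    return q_proj + kv_proj + o_proj


per_layer = gqa_attention_params(hidden=16, n_heads=4, n_kv_heads=1)
attention_total = 2 * per_layer  # two layers, per the summary
full_mha = gqa_attention_params(hidden=16, n_heads=4, n_kv_heads=4)
```

Under these assumptions each layer holds 640 attention weights (1,280 across both layers), compared with 1,024 per layer if all four heads had their own keys and values.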

arxiv arXiv cs.CL · 9d ago

Post-Hoc Operators Fail to Improve Accuracy in Small Code Models

A measurement study finds that 26 semantic post-hoc operators do not improve held-out accuracy over Best-of-N in frozen small code models. While two operators—expression-layer recovery and adaptive consensus early-stop—offer benefits in compute efficiency or program recovery, none outperform BoN in accuracy. The results highlight systemic limitations in error detection and coverage, suggesting that model harnesses and error coverage must be improved before post-hoc reasoning is considered.

arxiv arXiv cs.AI · 9d ago

IMPACTeen Dataset Released with English and Polish Versions

IMPACTeen is a dataset of 1,021 texts annotated from five perspectives—teenagers, parents, psychologists, communication experts, and teachers. It includes 5,100 annotation records covering social influence techniques, intentions, consequences, and resistance, with annotations validated through human editing. The dataset, created using LLM generation and human validation, is available in both Polish and English and supports research on social influence and language model training.

arxiv arXiv cs.AI · 9d ago

MA-SBI: Calibration-Free SBI via Side-Channel Guidance

MA-SBI introduces a calibration-free simulation-based inference framework that uses side-channel text, like regime labels or instructions, to correct for simulator misspecification. It employs a learned corrector to apply observation-space shifts before posterior inference, without needing ground-truth parameter pairs or retraining. On hide-the-calibration benchmarks, MA-SBI matches the oracle posterior with text alone, outperforming RoPE under limited data, and shows robustness on real-world epidemiological and cognitive-science datasets.

arxiv arXiv cs.AI · 9d ago

AI research documentation improves over decade

Analysis of 56,800 AI conference papers shows documentation practices improved from 2014 to 2024. Papers sharing both code and data increased from 11% to 64%, and estimated reproducibility rose from 28% to 64%. These improvements predate formal reproducibility checklists, indicating a broader shift toward open science.

arxiv arXiv cs.AI · 9d ago

AI-Enabled Progress in Stable Menus of Public Goods

Experiments on EC 2025's 'Stable Menus of Public Goods' show that human-intuition prompts improve LLM performance and multi-turn interactions enhance ambitious steps. However, when compared to a first-year PhD student using an unpublished manuscript, the LLM is found to be slightly less effective.

arxiv arXiv cs.AI · 9d ago

Bayesian Audits Reveal Inconsistent AI Evaluation Timelines

Public AI evaluation archives show that a single terminal result can arise from two distinct pre-terminal histories, with estimated times to reach 95% of performance ceilings at 23.03 or 75.13. A candidate selection-aware frontier model fails synthetic recovery and uncertainty calibration, and is rejected by fixed audit gates. An archive-and-adjudication protocol verifies timing boundaries and falsifies unsupported frontier claims.