All articles — korshunov.ai

On the Stability of Prompt Ranking in Large Language Model Evaluation

This paper systematically studies how stable prompt rankings remain under common sources of variability, such as random seeds and limited evaluation subsets, across three open-weight LLMs and two benchmark tasks.

arxiv arXiv cs.AI · 8h ago

Cycle-Consistent Neural Explanation of Formal Verification Certificates

Researchers propose a cycle-consistent neural architecture that generates faithful natural language explanations for formal verification certificates, addressing the opacity of these machine-checkable proofs for non-specialists. The system achieves 90.0% cycle-verified soundness on test data from a financial compliance domain, significantly outperforming multi-LLM baselines in both accuracy and inference speed.

media r/LocalLLaMA · 8h ago

Ornith 35B works reasonably well with Qwen3.6 35B DFlash speculative model

A user reports a 30-40% increase in token generation speed when running the Ornith-1.0-35B model with Qwen3.6-35B-A3B-DFlash as the speculative (draft) model in llama-server.

arxiv arXiv cs.AI · 9h ago

PHANTOM: A Large-Scale Dataset of Multimodal Adversarial Attacks for Vision-Language Models

Researchers have introduced PHANTOM, a large-scale, open-source dataset containing 47,524 pre-generated adversarial attacks designed to evaluate the safety and robustness of vision-language models (VLMs). This resource consolidates and extends prior benchmarks by covering 10 high-level categories and 55 subcategories of harmful intents, aiming to lower the computational barriers for adversarial research.

arxiv arXiv cs.AI · 9h ago

Female-RHINO: Real-Time Scanner-Integrated Framework for Automated Uterine MRI Analysis

This article introduces Female-RHINO, a real-time AI-assisted framework that integrates with MRI scanners to perform automated quantitative uterine analysis and structured reporting during image acquisition. The system combines deep learning models for segmentation and landmark detection to derive biomarkers from sagittal T2-weighted pelvic MRI without manual interaction.

arxiv arXiv cs.AI · 9h ago

Age of LLM: A Strategic 1v1 Benchmark for Reasoning, Diplomacy and Reliability

The authors introduce Age of LLM, a turn-based 1v1 benchmark where two large language models compete on a 13x7 grid to destroy an enemy base under conditions of fog of war and full diplomacy. This private engine mitigates data contamination by using fresh random map seeds and opponents for each match.

arxiv arXiv cs.AI · 9h ago

ATRIA: Adaptive Traceable ECG Reporting with Iterative Agents

The article introduces ATRIA, a multi-agent system for ECG reporting that addresses the limitations of existing end-to-end models and single-pass agents by mirroring the clinician's iterative workflow.

arxiv arXiv cs.AI · 9h ago

Average Rankings Mask Per-Subject Optimality: A Friedman-Nemenyi Benchmark of EEG Motor-Imagery BCI Decoders

This study evaluates whether any single decoding pipeline dominates across subjects in motor imagery brain-computer interfaces, testing 1,056 configurations on three public datasets with Friedman-Nemenyi statistical analysis.

arxiv arXiv cs.AI · 9h ago

Entity Resolution via Batched Oracle Queries

This article addresses the problem of resolving entities in large datasets using an oracle that clusters records in limited batches, aiming for a pay-as-you-go approach to control costs while maximizing recall.

arxiv arXiv cs.AI · 9h ago

Agentic AI for Bilevel Long-Term Optimization of Policy-Driven Physical Layer Systems

This paper introduces Agentic-LTPO, a nested bilevel optimization framework designed to address the limitations of fixed-objective methods in physical layer systems facing dynamic operator policies and real-time constraints. The framework utilizes agentic AI to generate upper-level configurations that translate evolving policies and historical experiences into structured lower-level problems for immediate decision-making.

media r/LocalLLaMA · 9h ago

Second Circuit: An NGO for digital freedom of thought

Chris Tidesson announces the founding of Second Circuit, an NGO dedicated to supporting self-determined AI use and encouraging open-source software adoption among governments, companies, and private individuals. The organization was originally established in response to the ChatGPT 4o situation and has operated a Discord community for over six months.

media r/LocalLLaMA · 9h ago

on Dario’s statement

This r/LocalLLaMA post discusses a statement made by Dario Amodei. The source provides only the title and metadata, with no detailed text or analysis.

arxiv arXiv cs.AI · 10h ago

Can Aggregate Invariants Accelerate Continuous Subgraph Matching? Limits, Laws, and a Dynamic Spectral Index

This study evaluates whether spectral filtering can accelerate continuous subgraph matching (CSM) on dynamic graphs, finding that while lazy maintenance is ineffective, selective exact maintenance offers significant performance gains.

arxiv arXiv cs.AI · 10h ago

Detecting AI Coding Agents in Open Source: A Validated Multi-Method Census of 180 Million Repositories

A multi-layered detection framework analyzing 180 million Git repositories reveals that single-signal methods significantly underestimate the prevalence of generative AI coding agents, missing up to 97% of activity. The study identifies over 320,000 commits per month from agents like Claude Code, which dominates silent adoption through configuration files rather than bot accounts.

arxiv arXiv cs.AI · 10h ago

Transformation Behavior of Images in Latent Space

This paper investigates how classical image transformations affect embeddings in latent space using encoder networks from Lunit Inc., Bioptimus, and Meta Research Team.

arxiv arXiv cs.AI · 10h ago

MedPCFM: Improving Medical Point Cloud Completion by Integrating Point Transformers and Flow Matching

This article introduces MedPCFM, a flow matching approach for medical point cloud completion that integrates Point Transformer v3 (PTv3), addressing the limited study of generative modeling in this domain. The method is evaluated on the SkullFix, SkullBreak, and Mandibular Defect datasets against strong deterministic and diffusion baselines.

arxiv arXiv cs.AI · 10h ago

ReM-MoA: Reasoning Memory Sustains Mixture-of-Agents Scaling

The authors propose ReM-MoA, a memory-augmented Mixture-of-Agents framework designed to sustain performance gains as model depth increases, addressing the degradation and saturation issues found in existing variants. The system utilizes a Ranked Reasoning Memory and a Curated Diversified Memory Routing scheme to preserve exploration diversity while propagating high-quality reasoning traces across layers.

arxiv arXiv cs.AI · 10h ago

NoContactNoWorries: Estimating Contact through Vision and Proprioception for In-Hand Dexterous Manipulation

Researchers propose NoContactNoWorries, a transformer-based framework that infers binary contact states during in-hand manipulation by fusing RGB-D vision with robot proprioception. This approach serves as a scalable pseudo-tactile signal, avoiding the cost and fragility associated with dedicated hardware tactile sensors.

arxiv arXiv cs.AI · 10h ago

Bayesian control for coding agents

This article introduces a Bayesian controller for orchestrating modern coding agents, addressing the limitations of fixed-rule systems that ignore uncertainty during tool use.

media r/LocalLLaMA · 10h ago

What happened to Petals (Decentralized Inference) by BigScience?

This Reddit post asks what happened to Petals, the decentralized inference project from BigScience. The source is a submission link only and contains no article text or discussion details.