Evaluation & benchmarks — korshunov.ai

Looped World Models Achieve 100x Parameter Efficiency

Looped World Models (LoopWM) iteratively refine latent environment states with a parameter-shared transformer. By adapting computation depth to the complexity of each prediction, the approach achieves up to 100x better parameter efficiency than conventional world models.

arXiv cs.AI · 8d ago

Learning Red Agent Policy from Observations for Neurosymbolic Cyber Agents

The paper proposes an imitation-learning technique for predicting red agent actions in partially observable cyber environments. The method learns red agent policies from network observations and defender actions, so that neurosymbolic cyber-defense agents can accurately predict attacks and adapt their defenses across diverse simulated scenarios.

arXiv cs.AI · 8d ago

EvolveNav: Self-Evolving Memory for Zero-Shot Navigation

EvolveNav introduces a self-evolving framework for zero-shot object-goal navigation that improves at test time. It builds a rule memory from past trajectories and uses a confidence-based retrieval strategy to select effective actions, which reduces redundant exploration. The method achieves a 10.1% higher success rate than existing baselines while taking fewer unnecessary steps.

arXiv cs.AI · 8d ago

ReproRepo: Scaling Reproducibility Audits with GitHub Issues

ReproRepo introduces a scalable framework that uses GitHub issues to evaluate the reproducibility of ML papers. It shows that LLM agents such as Codex with GPT-5.5 identify at least one blocker in 90% of paper-repository pairs without executing code, though localizing the blocker exactly remains challenging.

arXiv cs.AI · 8d ago

Visual Verification Enables Inference-time Steering and Autonomous Policy Improvement

VERITAS introduces a generator-verifier framework that enables robots to improve policies in real time without additional training. A visual verifier evaluates actions at inference time, allowing consistent performance gains through verified rollouts that serve as effective supervision for offline policy improvement. Post-training with these verified rollouts matches expert demonstrations in efficiency, without human intervention.

MLLP-VRAIN's Simultaneous Speech Translation Submission for IWSLT 2026

Word2Vec's Performance in Toki Pona's Minimal Vocabulary

SpeechDx: Multi-Task Benchmark for Clinical Speech AI

LLM-Generated Stories Show Low Diversity

Implicit vs. Explicit Prompting in LVLMs for Referential Communication

Visuals Lie, Consistency Speaks: Disentangling Spatial Attention from Reliability in Vision-Language Models

NarrativeWorldBench and N-VSSM for Long-Horizon Audio Drama

LLM Recommendation Bias and Brand Competition Dynamics

PARSE: Real-Document Defense for LLM Agents

AIPatient Arena: EHR-grounded evaluation of LLMs in clinical workflows

STATEWITNESS: Activation Explainer for Deception Auditing in LLMs

Second-Order Bias in LLMs: Evaluating Judgment-Based Bias

Routing Accuracy Degradation and Recovery in Enterprise Agent Systems

Expressivity Analysis of Hierarchical Modelling in Deep Transformers

NAR-MBR Decoding for Fast and Accurate Speech Recognition