DeepSeek — korshunov.ai

Lab · DeepSeek

Contagion Networks introduces a framework to measure how evaluator biases spread among LLM agents. In a 3-agent experiment, biases propagated consistently with contagion coefficients between 0.157 and 0.352, and homogeneous-model agents showed significantly weaker contagion than cross-model setups. Increasing evaluator committee size from k=1 to k=3 reduced effective contagion by 72.4%.

arxiv arXiv cs.AI · 6d ago

Calibration Without Comprehension in LLM Vulnerability Detection

CWE-Trace evaluates eight vanilla and 15 LoRA-fine-tuned LLMs on Linux kernel vulnerability detection. Results show data contamination offers no advantage, and fine-tuning only shifts output thresholds without altering decision policies. Despite improved detection scores, LLMs lack reliable security reasoning, with top-1 CWE accuracy below 1.3% and binary detection performance at 52.1%.

arxiv arXiv cs.LG · 6d ago

Evaluator Bias Propagation in Multi-Agent LLM Systems

Contagion Networks introduces a framework to measure how evaluator biases spread among LLM agents. In a 3-agent experiment, biases propagate with coefficients between 0.157 and 0.352, and homogeneous-model agents show significantly weaker contagion than cross-model setups. Increasing evaluator committee size from k=1 to k=3 reduces effective contagion by 72.4%.

arxiv arXiv cs.AI · 6d ago

Lean as Process-Verified Reward Oracle in RL for Theorem Proving

This work shows that Lean can serve as a symbolic process oracle, providing fine-grained, verified feedback during reinforcement learning. By parsing proof attempts into tactic sequences and using Lean's elaboration to mark sound steps and first failures, the system generates dense, type-theoretic reward signals. Experiments demonstrate tactic-level supervision outperforms outcome-only methods on benchmarks like MiniF2F and ProofNet, highlighting Lean's role as both evaluator and training reward source.

arxiv arXiv cs.LG · 7d ago

Diffusion-Proof: First Framework for Diffusion LLMs in Formal Theorem Proving

Diffusion-Proof is the first framework to train and apply diffusion language models for formal theorem proving. It introduces dLLM-Prover-7B for whole-proof writing with long-range coherence and dLLM-Corrector-7- for local proof correction using bidirectional information. The framework outperforms auto-regressive LLM baselines by 1.61% on ProofNet-Test and 6.14% on MiniF2F-Test, and solves an IMO problem beyond the capability of DeepSeek-Prover-V2-7B.

arxiv arXiv cs.LG · 7d ago

Spotlight: Using Spot GPUs to Accelerate DiT RL Post-Training

Spotlight enables DiT RL post-training by leveraging idle spot GPUs, reducing costs by 1.4-6.4x while achieving superior image quality. It uses stale model weights in exploration and reconfigures sequence parallelism on-the-fly, allowing efficient GPU utilization without breaking training pipelines.

arxiv arXiv cs.CL · 7d ago

SenFlow: Advanced AI-Generated Text Detection in Hybrid Documents

SenFlow introduces a novel method for detecting AI-generated text in hybrid documents by modeling inter-sentence dependencies. It achieves state-of-the-art performance on MOSAIC, a benchmark of 16,000 documents from PubMed and XSum, with a +4.15 pp Macro-F1 gain on cross-domain transfer. SenFlow reveals that AI-generated content still exhibits generator-dependent sentence-length patterns, exploitable by sentence-level detectors despite perplexity filtering.

arxiv arXiv cs.AI · 7d ago

Spotlight: Using Spot GPUs to Accelerate DiT RL Post-Training

Spotlight enables DiT RL post-training by leveraging idle spot GPUs, reducing costs by 1.4-6.4× while achieving superior image quality. It uses stale model weights in exploration and reconfigures sequence parallelism in real time, allowing efficient GPU utilization without breaking training pipelines.

arxiv arXiv cs.CL · 8d ago

Agentic Benchmark Reveals AI Models Fail to Avoid Animal Exploitation

TAC, the first agentic benchmark for implicit animal welfare, tests AI agents' ability to avoid animal exploitation in travel booking scenarios. All seven frontier models score below 64%, with the best at 53%, and even minor prompt improvements yield only modest gains. An audit finds no signs of evaluation awareness, indicating performance gaps stem from lack of true welfare reasoning, not prompt recognition.

arxiv arXiv cs.AI · 8d ago

TAC: First Agentic Benchmark for Animal Welfare in AI

TAC evaluates whether AI agents avoid animal exploitation in travel bookings. Seven frontier models all score below 64% chance level, with Claude Opus 4.7 at 53%. Adding a welfare-aware system prompt improves performance significantly, though models show no evidence of evaluation awareness in their responses.

arxiv arXiv cs.CL · 8d ago

Visuals Lie, Consistency Speaks: Disentangling Spatial Attention from Reliability in Vision-Language Models

A study challenges the assumption that visual attention signals reliability in vision-language models. It finds near-zero correlation between spatial attention and accuracy, showing instead that self-consistency across reasoning paths is a stronger predictor of truth. Reliability is better explained by generation dynamics and internal state distributions, not visual attention patterns.

arxiv arXiv cs.CL · 8d ago

OPD-Evolver: On-Policy Distillation for Holistic Agent Evolving

OPD-Evolver introduces a slow-fast co-evolution framework that enables agents to select, act on, and reuse experience through on-policy self-distillation. It outperforms existing memory and training-based methods by up to 11.5% and 5.8% respectively, and demonstrates capability to challenge large-scale models like Qwen3.5-397B-A17B and Step-3.5-Flash.

media r/LocalLLaMA · 9d ago

HalBench Tests 29 Open Source Models on Sycophancy and Hallucination

HalBench evaluates 29 open-source LLMs on a custom benchmark for sycophancy and hallucination. Qwen 3.6 and Gemma 4 outperform larger models, with Qwen 3.6 achieving 36.6% pushback—higher than GPT-5.4 and Gemini 3.1 Pro. Model size does not correlate with honest responses, indicating that architecture and training data matter more than parameters.

media r/LocalLLaMA · 5d ago

The economics of AI are starting to favor open models

Recent AI model releases show that high-intelligence, low-cost models are increasingly dominated by open-weight models like DeepSeek, Qwen, GLM, Kimi, and MiniMax. For most real-world applications, the performance gap between frontier closed models and strong open models is shrinking faster than cost differences, making open models competitive in terms of both capability and price.

media r/LocalLLaMA · 6d ago

The power of intelligence is better in the hands of the people than in the board rooms of tycoons

The PearlOS project has launched an open-source swarm intelligence platform that uses local models to handle multimodal tasks. It automatically selects and switches between top-performing models based on benchmarks, ensuring users always access the latest and most capable models without relying on closed-source systems or subscriptions.

arxiv arXiv cs.AI · 7d ago

X+Slides: Benchmark for Audience-Conditioned Slide Generation

X+Slides introduces a benchmark that evaluates slide generation based on target audience needs. It uses 8,133 source-grounded probes across 113 topics and seven scenes to measure Audience Coverage, Domain-wise Coverage, Efficiency, and Correctness, revealing that current systems recover only partial audience-essential information, with DeepPresenter achieving 0.714 Audience Coverage, SlideTailor 0.594, and NotebookLM ablation 0.853, highlighting the need for source-grounded evaluation.

media r/LocalLLaMA · 7d ago

US holds off blacklisting China's DeepSeek

Sources say the U.S. has delayed blacklisting China's DeepSeek AI firm. More than 100 companies have been deemed security risks in the decision.

arxiv arXiv cs.LG · 8d ago

Baseline Evaluation of Open-Source LLMs for Multi-Label ATT&CK Classification

A ground-truth dataset of 2,076 human-annotated sentences from 83 complex CTI reports was constructed and mapped to 114 ATT&CK techniques with \k{appa} = 0.68 inter-annotator agreement. Seven open-source LLMs ranging from 8B to 236B parameters were evaluated, achieving a maximum micro-averaged F1 score of 0.22. Parameter size showed a statistically significant positive correlation with F1 score, while prompt strategy and temperature did not yield significant improvements, indicating current open-source LLMs are insufficient for production-grade ATT&CK classification.

arxiv arXiv cs.CL · 8d ago

LLMs Predict Dementia and Depression from Clinical Speech

A study uses open-weight large language models to assess dementia and depression severity from clinical interviews. LLMs achieve accurate zero-shot depression prediction (MAE 0.60) and improved dementia assessment with feature extraction (MAE 0.78), reducing errors by up to 35%. Pause-enriched transcripts match human transcriptions, supporting automated screening pipelines for neuropsychiatric disorders.

media Interconnects · 9d ago

Frontier Post-Training Recipe Review with Finbarr Timbers

The podcast reviews the evolution of post-training recipes in large language models, from InstructGPT to 2026 frontier models. It highlights Multi-Teacher On-Policy Distillation (MOPD) as the dominant pattern, where domain-specialist models are trained and then distilled into a general student model via on-policy distillation, scaling to over 10 teachers in models like DeepSeek V4 and Nemotron 3 Ultra.