korshunov.ai — ML news

Claude v2.1.178 Release Notes

Claude v2.1.178 introduces new permission rules using Tool(param:value) syntax, improved workflow and skill loading in nested directories, and enhanced auto mode and error messaging. It fixes critical issues including crashes, authentication errors, and UI behavior in Chrome and VSCode, while refining tool prompts and undo functionality.

arxiv arXiv cs.CL · 10d ago

SuCo: Sufficiency-guided Continuous Adaptive Reasoning

SuCo introduces Minimal Sufficient CoT (MSC) as the shortest reasoning prefix adequate for correct answers. It employs a two-stage training framework—MSC-Aligned Fine-Tuning and Sufficiency-Aware Policy Optimization—to reduce reasoning length while maintaining or improving accuracy across math, code, and science tasks.
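The MSC definition above can be illustrated with a short sketch. This is not SuCo's method, only a toy reading of "shortest reasoning prefix adequate for a correct answer". The names `minimal_sufficient_prefix`, `answer_from_prefix`, and `toy_answer` are hypothetical, and the answer extractor stands in for a model that would be asked to answer from a truncated chain of thought.

```python
from typing import Callable, Optional, Sequence


def minimal_sufficient_prefix(
    steps: Sequence[str],
    answer_from_prefix: Callable[[Sequence[str]], Optional[str]],
    gold: str,
) -> Optional[int]:
    """Return the smallest n such that the first n steps yield the gold answer.

    `answer_from_prefix` is a hypothetical stand-in for querying a model
    with a truncated reasoning chain. Returns None if no prefix suffices.
    """
    for n in range(len(steps) + 1):
        if answer_from_prefix(steps[:n]) == gold:
            return n
    return None


def toy_answer(prefix: Sequence[str]) -> Optional[str]:
    """Toy extractor: the value after '=' in the most recent step that has one."""
    for step in reversed(prefix):
        if "=" in step:
            return step.split("=")[-1].strip()
    return None


steps = ["12 * 3 = 36", "36 + 4 = 40", "double-check: 40"]
msc_length = minimal_sufficient_prefix(steps, toy_answer, "40")
print(msc_length)
```

In this toy case the third step adds nothing once the answer is reached, which is the kind of redundant continuation the summary says SuCo trains the model to drop.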

arxiv arXiv cs.CL · 10d ago

Vision-language models don't always need images for chest X-ray accuracy

A causal audit shows that text-only models match multimodal models in chest radiography accuracy. Across nine systems, a text-only model performs within 5.7 points of the best multimodal model, and a 119-billion-parameter model is indistinguishable from a 7-billion-parameter text-only baseline. Grounding audits, not accuracy, should determine clinical deployment.

arxiv arXiv cs.CL · 10d ago

Automated Prompt Optimization for LLM Game Agents

A new framework automates prompt refinement for LLM agents by splitting the observation-to-action pipeline into goal-conditioned and action selection modules. It uses an LLM-driven evolutionary loop to iteratively improve prompts based on environment feedback, achieving up to 72.5% success on PutNext where prior agents failed, without model fine-tuning.

arxiv arXiv cs.CL · 10d ago

Dynamic Rollout Editing Reduces Overthinking in RL-Trained Reasoning Models

Dynamic Rollout Editing (DRE) addresses overthinking in RL-trained reasoning models by editing successful trajectories after the point where the answer first emerges. It preserves the correct reasoning prefix and edits the unnecessary continuation, which weakens the credit assigned to redundant thinking without penalizing valid reasoning. Experiments across diverse tasks show that it reduces overthinking.

media r/LocalLLaMA · 10d ago

GLM-5.2 crosses 80% on Terminal-Bench

GLM-5.2 is the first open-weights model to achieve 80% accuracy on Terminal-Bench and outperforms all other available open models. It also surpasses Gemini, positioning it as a frontier-level model at a significantly lower cost.

github llama.cpp · 10d ago

llama.cpp releases b9669 with backend sampling for Eagle3

llama.cpp version b9669 adds backend sampling support for Eagle3. The release includes binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures and hardware acceleration options, including Vulkan, CUDA, ROCm, OpenVINO, and SYCL.

github llama.cpp · 10d ago

llama.cpp Release b9670: Fixes and New Builds

llama.cpp release b9670 includes fixes for NVFP4 edge cases in llama-graph, such as moving post-GEMM MUL operations and restricting build_ffn to supported combinations. The release provides binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures and backend options, including CUDA, Vulkan, SYCL, and OpenVINO.

github llama.cpp · 10d ago

llama.cpp Release b9667 Adds Vulkan and CUDA Support

llama.cpp release b9667 introduces Vulkan support with S_v=16 via gated_delta_net. It includes binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures, with options for Vulkan, CUDA 12.4 and 13.3, ROCm, OpenVINO, and SYCL.

github llama.cpp · 10d ago

llama.cpp release b9665 adds --offline flag and new binary builds

llama.cpp version b9665 introduces a new --offline flag for benchmarking. The release includes binary builds for macOS, Linux, Android, Windows, and openEuler across multiple architectures and hardware acceleration options, including Vulkan, CUDA, ROCm, OpenVINO, and SYCL.

github llama.cpp · 10d ago

llama.cpp Release b9663 Adds SYCL Support and New Binary Builds

llama.cpp release b9663 adds support for OP EXPM1 and all unit test cases for FLOOR, TRUNC, and ROUND. It includes updated binaries for macOS, Linux, Android, Windows, and openEuler, with support for SYCL (FP32 and FP16), Vulkan, CUDA 12.4 and 13.3, and ROCm 7.2, along with an updated UI.

github llama.cpp · 10d ago

sycl: support reordered Q4_K/Q5_K/Q6_K MoE MUL_MAT_ID

The SYCL update extends support for reordered expert tensor handling in MoE MUL_MAT_ID to Q4_K, Q5_K, and Q6_K. Unsupported 3D reorder cases now fall back instead of aborting.

github llama.cpp · 10d ago

Vulkan adds col2im_1d op and supports multiple platforms

The llama.cpp release b9661 adds GGML_OP_COL2IM_1D support for Vulkan, using a bounded gather loop instead of full-K scan with modulo. It returns nullptr for unsupported types and includes builds for macOS, Linux, Android, Windows, and openEuler across CPU, Vulkan, CUDA, and SYCL.
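The difference between a full-K scan with modulo and a bounded gather loop can be sketched in plain Python. This is an illustration of the general idea only, not the Vulkan shader or the ggml operator signature. The function names, the `(channels * kernel, l_in)` column layout, and the restriction to dilation 1 are all assumptions made for the example.

```python
def col2im_1d_scan(col, channels, kernel, l_in, stride, pad, l_out):
    """Full-K scan: test every kernel tap k with a modulo check."""
    out = [[0.0] * l_out for _ in range(channels)]
    for c in range(channels):
        for t in range(l_out):
            acc = 0.0
            for k in range(kernel):
                pos = t + pad - k  # must equal l * stride for some input index l
                if pos >= 0 and pos % stride == 0 and pos // stride < l_in:
                    acc += col[c * kernel + k][pos // stride]
            out[c][t] = acc
    return out


def col2im_1d_gather(col, channels, kernel, l_in, stride, pad, l_out):
    """Bounded gather: visit only input indices l that can contribute."""
    out = [[0.0] * l_out for _ in range(channels)]
    for c in range(channels):
        for t in range(l_out):
            p = t + pad
            # l * stride + k == p with 0 <= k < kernel bounds l on both sides.
            lo = max(0, -((kernel - 1 - p) // stride))  # ceil((p - kernel + 1) / stride)
            hi = min(l_in - 1, p // stride)
            acc = 0.0
            for l in range(lo, hi + 1):
                acc += col[c * kernel + (p - l * stride)][l]
            out[c][t] = acc
    return out


channels, kernel, l_in, stride, pad = 2, 3, 4, 2, 1
l_out = (l_in - 1) * stride + kernel - 2 * pad
col = [[float(r * 10 + l) for l in range(l_in)] for r in range(channels * kernel)]
scan_out = col2im_1d_scan(col, channels, kernel, l_in, stride, pad, l_out)
gather_out = col2im_1d_gather(col, channels, kernel, l_in, stride, pad, l_out)
print(scan_out == gather_out)
```

Under these assumptions the gather loop runs about `kernel / stride` iterations per output element instead of `kernel`, and it needs no modulo test inside the loop.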

arxiv arXiv cs.CL · 10d ago

LOGOS: A General-Purpose Generative Model for Natural Sciences

LOGOS is a unified generative language model that represents scientific objects and their interactions as token sequences in a shared grammar. It achieves consistent or superior performance across diverse natural science tasks, demonstrating the feasibility of a single model serving multiple domains. The model scales positively with parameter count, and its design suggests that AI for Science should align deeply with large language models through shared architectures and training.

arxiv arXiv cs.CL · 10d ago

TokenPilot: Cache-Efficient Context Management for LLM Agents

TokenPilot reduces inference costs by 61% to 87% in both isolated and continuous modes, outperforming prior systems in cost efficiency while maintaining competitive performance. It uses ingestion-aware compaction and lifecycle-aware eviction to preserve prompt cache continuity and minimize token footprints.

arxiv arXiv cs.CL · 10d ago

DeepRubric: Efficient RL for Deep Research Agents

DeepRubric introduces a data construction framework that builds query-rubric pairs by first defining verifiable evaluation targets through an evidence tree. It generates 9K supervision examples and trains an 8B model with GRPO, achieving performance comparable to state-of-the-art models using 13x fewer RL GPU-hours.

arxiv arXiv cs.CL · 10d ago

KVEraser: Efficient Localized Context Erasing in LLMs

KVEraser enables efficient localized context erasing in large language models by replacing only the KV cache states of an erased span with learned steering states. It achieves near-full-recomputation performance on in-domain tasks across 1K to 32K context lengths, with only a 24% latency increase, and outperforms other approximate methods in long-document QA with a 3-4x speedup over full recomputation.

arxiv arXiv cs.CL · 10d ago

MetaSyn: Benchmarking LLM Agents on Meta-Analysis Articles

MetaSyn introduces a dataset of 442 expert-curated meta-analyses from Nature Portfolio. It evaluates twelve LLM agent configurations and reveals a critical bottleneck in study screening, where no system recovers more than 52.7% of ground-truth included literature despite high retrieval recall.

arxiv arXiv cs.CL · 10d ago

ContextRL: Context-Aware RL for LLMs

ContextRL introduces an indirect auxiliary objective to improve long-horizon reasoning and multimodal performance in LLMs. It rewards models for selecting the context that supports a query-answer pair, using contrastive context data from coding agent trajectories and image-based visual questions. ContextRL achieves +2.2% and +1.8% gains over standard methods on long-horizon and visual QA benchmarks, with gains attributed to the selection objective, not data augmentation.

arxiv arXiv cs.CL · 10d ago

Language Models Encode Value of Their Current Trajectory

Qwen3-8B internally tracks the value of its current trajectory, defined as the likelihood of achieving its goals. This 'value' axis distinguishes confidence levels, backtracking behavior, and code correctness, and shows that preference optimization boosts confidence in rewarded behaviors. The model assigns low value to politically sensitive queries post-training, and fine-tuning increases confidence within specific domains.