All articles — korshunov.ai


GLM-5.2 Max is currently the third best model

GLM-5.2 Max ranks third among all available models, open and proprietary alike, according to current large language model benchmarks and evaluations.

blog Simon Willison · 15d ago

Datasette 1.0a34 Adds Row Editing and Deletion Tools

Datasette 1.0a34 adds tools to insert, edit, and delete rows directly in the interface. They are available on table pages and as action items on row pages, filling a long-overdue gap in the UI.

media r/LocalLLaMA · 15d ago

Looking for locally hosted tool to create English subtitles from videos

A user is seeking a locally hosted, self-contained app to generate English subtitles (in .srt or .ass format) from video files. They consider Qwen-ASR and Whisper as strong options but report poor subtitle timing in ComfyUI implementations and unreliable performance with older models like those in storytoolkitAI. They ask for recommendations that work well on Windows and can handle multiple languages.

blog Simon Willison · 15d ago

click-to-play — a still that plays

The click-to-play Web Component displays a still image with a click-to-play button that loads a GIF on demand. It supports progressive enhancement, allowing GIFs to be loaded only when users interact with the image.

media Latent Space · 15d ago

GLM-5.2 Claims Top Position in Frontend Coding with Speculative Decoding

GLM-5.2, a 744B parameter model from Z.ai, has been evaluated as the top frontend coding model globally, outperforming all Opus versions including Opus 4.8. This achievement is highlighted in third-party evaluations that validate official offline tests, marking a significant milestone for a model of its size, particularly in the competitive frontend coding domain.

media r/LocalLLaMA · 15d ago

RTX 5060 Ti 16GB vs RX 9060 XT 16GB Benchmark Comparison

A benchmark comparison shows the NVIDIA RTX 5060 Ti 16GB outperforming the AMD RX 9060 XT 16GB across multiple LLMs, with higher response and prompt token speeds. The gains are consistent across models such as Gemma3, Llama3.2, and Qwen3, and the RTX 5060 Ti is notably faster at prompt processing, especially with larger models.

media r/LocalLLaMA · 15d ago

Elias in the Lighthouse: Diagnosing Low Diversity in LLM Stories

A new study examines the limited diversity in stories generated by large language models, using the recurring character Elias in the lighthouse as a case study. The research highlights how such patterns suggest systemic biases in training data and model outputs.

arxiv arXiv cs.LG · 15d ago

LegalHalluLens: Auditing Hallucinations in Legal AI

LegalHalluLens introduces a framework to audit AI hallucinations in legal contexts by analyzing typed hallucination profiles across four claim categories. It reveals a 38-40 point gap between obligation/numeric and temporal claims, and shows two systems with identical 52% hallucination rates can have opposite risk directions. The framework uses a Risk Direction Index and calibrated debate pipelines to reduce fabricated detections by 45%, offering actionable diagnostics for trustworthy legal AI deployment.

arxiv arXiv cs.LG · 15d ago

Recursive Masked Diffusion Models Introduce New Scaling Axis

Recursive Masked Diffusion Models (R-MDMs) introduce recursive depth as a third scaling axis by reapplying a denoising transformer within each diffusion step. This recursion enables iterative output refinement without increasing parameter count, achieving performance comparable to non-recursive models with up to L times more parameters, where L is the number of iterations. R-MDMs also reduce inference compute by partially replacing denoising steps with recursive refinement.

arxiv arXiv cs.LG · 15d ago

LoopCoder-v2 Achieves Optimal Two-Loop Performance

LoopCoder-v2, a parallel loop Transformer model, achieves superior code generation and reasoning performance with two loops, improving SWE-bench Verified from 43.0 to 64.4 points and Multi-SWE from 14.0 to 31.0 points. Variants with three or more loops perform worse, indicating a non-monotonic loop-count effect due to growing positional mismatch and diminishing returns.

arxiv arXiv cs.LG · 15d ago

Catastrophic Forgetting is Low-Rank: A Function-Space Theory

A function-space theory reveals that catastrophic forgetting in continual adaptation concentrates in a small number of old-task NTK eigenmodes. In frozen-backbone linear-head PEFT-CL, the forgetting vector is exactly predictable up to numerical precision, with a Kronecker scaling rule for the vulnerable rank.

arxiv arXiv cs.LG · 15d ago

INI-VPINN: Physics-Informed Neural Network with Implicit Boundary Handling

INI-VPINN is a variational physics-informed neural network that implicitly enforces Neumann and interface conditions using compact support weighting functions and integration by parts. It achieves higher accuracy and faster convergence than existing PINN methods in solving multi-material problems with geometric singularities and mixed boundary conditions, and is publicly available on GitHub.

arxiv arXiv cs.LG · 15d ago

Baseline Evaluation of Open-Source LLMs for Multi-Label ATT&CK Classification

A ground-truth dataset of 2,076 human-annotated sentences from 83 complex CTI reports was constructed and mapped to 114 ATT&CK techniques with κ = 0.68 inter-annotator agreement. Seven open-source LLMs ranging from 8B to 236B parameters were evaluated, achieving a maximum micro-averaged F1 score of 0.22. Parameter size showed a statistically significant positive correlation with F1 score, while prompt strategy and temperature did not yield significant improvements, indicating current open-source LLMs are insufficient for production-grade ATT&CK classification.
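For readers unfamiliar with the metric, the micro-averaged F1 score mentioned above pools true positives, false positives, and false negatives over every sentence and label before computing a single F1. The sketch below is only an illustration of that standard definition, not the paper's evaluation code; the function name `micro_f1`, the toy sentences, and the technique IDs used as labels are assumptions made for the example.

```python
from typing import List, Set


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1 for multi-label data.

    Each element of `gold` / `pred` is the set of labels for one sentence.
    Counts are pooled across all sentences and labels, then F1 is computed once.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of sentences")

    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # labels predicted and present in the annotation
        fp += len(p - g)   # labels predicted but not annotated
        fn += len(g - p)   # labels annotated but not predicted

    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical toy data: three sentences with illustrative technique IDs.
gold = [{"T1059", "T1566"}, {"T1003"}, set()]
pred = [{"T1059"}, {"T1003", "T1021"}, {"T1566"}]

score = micro_f1(gold, pred)
print(round(score, 3))  # pooled counts: TP=2, FP=2, FN=1, so F1 = 4/7
```

Because counts are pooled, frequent techniques dominate the score, which is one reason a micro-averaged figure can differ noticeably from a per-technique (macro) average on a 114-label problem.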

arxiv arXiv cs.LG · 15d ago

Uncertainty Quantification for Flow-Based Vision-Language-Action Models

The authors propose a method that uses velocity-field disagreement to quantify epistemic uncertainty in flow-matching vision-language-action models. This uncertainty estimate enables failure detection during deployment and active fine-tuning via the SAVE framework, which reduces expert demonstrations by at least 22% compared to baselines, with better-calibrated predictions on the LIBERO benchmark.

arxiv arXiv cs.LG · 15d ago

ConTex: Global Counterfactual Generation for Time Series Forecasting

ConTex reformulates counterfactual generation for time series forecasting as a globally consistent intervention problem. It achieves state-of-the-art validity with sparse, interpretable interventions, reduces computational cost by 12-36×, and enables real-time inference in approximately 0.007 seconds.

arxiv arXiv cs.LG · 15d ago

ScaFE: Using LLMs to Extract Clinically Meaningful Scar Features

ScaFE repositions large language models as feature engineers for scar classification, generating executable Python code from clinical criteria to extract interpretable features. The framework achieves superior performance with limited data, preserves privacy by processing images locally, and produces clinically grounded features aligned with established scoring systems like the Vancouver Scar Scale.

arxiv arXiv cs.LG · 15d ago

NoiseTilt: Noise-Tilted Reverse Kernels for Diffusion Reward Alignment

NoiseTilt introduces NTRK, a reward-guided diffusion sampler that injects reward gradients via the noise term without altering the reverse kernel. By using a whitening operator, NTRK safely biases noise toward high reward, preserving sample quality while maintaining strong guidance. On aesthetic generation, NTRK achieves superior reward performance with 25 NFEs, reducing compute by 20× compared to state-of-the-art baselines.

arxiv arXiv cs.LG · 15d ago

Volterra Generative Models Introduce Fractional Noise for Score-Based Generation

Volterra generative models propose a continuous-time score-based framework that uses fractional kernels to inject path-dependent noise, avoiding the memoryless noising of traditional diffusion models. The approach introduces finite-dimensional Markovian lifts and proves squared error bounds, demonstrating improved generation on MNIST and potential for natural images, with a bridge sampler enhancing stability for larger models.

arxiv arXiv cs.LG · 15d ago

Tensor-based Second-order Causal Discovery Algorithm

TSCD uses covariance matrices from observational and interventional data to identify causal structures in linear structural equation models on DAGs. It requires only uncorrelated noise and achieves identifiable causal orders and parameters with logarithmic intervention counts, scaling to hundreds of variables while remaining robust to noise and competitive with existing methods.

arxiv arXiv cs.LG · 15d ago

Edge Flow: A Continuous-Time Model for Gradient Descent at Edge of Stability

Edge Flow is a tractable, predictive continuous-time model that captures gradient descent dynamics at the edge of stability. It decomposes dynamics into center, oscillation direction, and magnitude, with self-stabilization of sharpness emerging from coupled feedback. The model requires only two gradient evaluations and one Hessian-vector product per iteration and outperforms prior models in tracking oscillations and explaining instabilities at EoS.