All articles — korshunov.ai


Running Llama 3.1 405B on a Single 8xA100 Node with Hot-Loaded LoRA Adapters

A user demonstrates running the Llama 3.1 405B model, quantized to AWQ-INT4, on a single node with eight A100 80GB GPUs, enabling up to 30 fine-tuned LoRA specialists to be hot-loaded and switched in under 200 ms.

media r/LocalLLaMA · 3h ago

Ubuntu, CUDA, llama.cpp, nvcc versioning

A user shares their experience resolving CUDA toolkit versioning issues on Ubuntu in order to target the compute capabilities of newer Blackwell-architecture GPUs such as the RTX 5060 Ti. The post notes that the default apt repository ships an outdated CUDA version, so the Debian package must be installed manually from NVIDIA's website.

arxiv arXiv cs.LG · 4h ago

Simulation-Free Estimation of Traffic Flows from Sparse Count Data

The authors propose a method for estimating time-varying traffic flow patterns from sparse aggregated vehicle counts by partitioning the study area and solving a weighted least-squares optimization problem. This approach uses a weighted contribution matrix to encode sensor coverage, steering the optimizer toward flow configurations that are directly observable.

arxiv arXiv cs.LG · 4h ago

SQLConductor: Search-to-Policy Learning for Step-wise Text-to-SQL Orchestration

The paper introduces SQLConductor, a step-wise orchestration learning framework for Text-to-SQL that formulates subtasks as specialized actions and trains a policy model to select the next action based on intermediate artifacts and feedback.

arxiv arXiv cs.LG · 4h ago

VeriEvol: Scaling Multimodal Mathematical Reasoning via Verifiable Evol-Instruct

VeriEvol is an iterative framework designed to scale multimodal mathematical reasoning by decoupling prompt difficulty from answer reliability during data construction. It employs a type-aware evolution module to generate harder prompts and the HTV-Agent verifier to ensure answer correctness through multi-source counter-evidence.

arxiv arXiv cs.LG · 4h ago

The Energy Consumption of Transformer Fine-Tuning: A Roofline-Inspired Scaling Model

This paper introduces a roofline-inspired scaling model for the energy consumption of Transformer fine-tuning on multiple GPUs, motivated by growing computational costs and the need for sustainable system design.

arxiv arXiv cs.LG · 4h ago

SuperCond-GNN: Scalable Graph Neural Network Surrogate for Superconducting Circuit Simulations

This paper introduces SuperCond-GNN, a graph neural network surrogate model designed to predict voltage distribution in high-temperature superconducting magnets by mapping lumped-element circuits to graph representations. The model achieves a mean MAPE of 4.3% on tape stacks and enables fast inference of current redistribution across various circuit configurations.

arxiv arXiv cs.LG · 4h ago

Approximating velocity fields with planted attractors via Neural-ODEs for classification

This work employs Neural ODEs equipped with a curated collection of equilibrium points to perform classification tasks. The planted attractors serve as indicators for target classes, while the velocity field shapes the dynamical landscape to direct inputs toward their corresponding destinations.

arxiv arXiv cs.LG · 4h ago

Scheduling Thoughts: Learning the Order of Thought in Diffusion Language Models

Researchers propose Self-Aware Scheduling (SAS), a method that learns an optimal token unmasking order for masked diffusion language models to improve generation quality. By deriving a tractable upper bound on sequential decoding mismatch, the approach casts order selection as a policy optimization problem using Group Relative Policy Optimization.

media r/LocalLLaMA · 4h ago

Minimax M3 vs M2.7

A Reddit user is requesting feedback from individuals who have updated to the Minimax M3 model from version M2.7. The post seeks community insights on the differences and performance between these two iterations.

media r/LocalLLaMA · 4h ago

High-quality GLM-5.2 Quant on 4x DGX Spark - Guide, Results, and Comps

The author demonstrates running the GLM-5.2 NVFP4 model on four NVIDIA GB10 DGX Spark nodes with a 128K context window, achieving usable serving performance through aggressive system optimization.

media r/LocalLLaMA · 4h ago

MLX Fine-Tune Example Guide

A user demonstrates fine-tuning a 7B instruction model on Apple Silicon using MLX to shift its style to high-fantasy literature. The experiment shows that a small, curated dataset can significantly alter a model's register and diction with minimal computational resources.

arxiv arXiv cs.LG · 5h ago

SVD-Surgeon: Optimal Singular-Value Surgery for Large Language Model Compression

Researchers have introduced SVD-Surgeon, a training-free method that applies the Optimal Brain Surgeon framework to singular-value decomposition for compressing large language models. This approach computes closed-form updates for retained singular values to compensate for truncation errors and determines which values to prune based on saliency.

arxiv arXiv cs.LG · 5h ago

Patient-Aware Contrastive Learning Preserves Per-Patient Structure in RR-Interval Representations

The article addresses the challenge of contrastive representation learning on physiological signals where subject-specific baselines interfere with class-level objectives, causing models to lose individual variation necessary for generalization. The authors propose a patient-aware contrastive objective for Paroxysmal Atrial Fibrillation detection that forms positive pairs only from same-patient segments to preserve sinus rhythm baselines while separating classes.

arxiv arXiv cs.LG · 5h ago

A Spectral Theory of Normalized Corrected GNN Propagation

This paper develops a spectral theory for normalized corrected Graph Neural Network (GNN) propagation, focusing on the symmetric normalized adjacency matrix with its degree-stationary component removed to isolate the direction tied to oversmoothing.

arxiv arXiv cs.LG · 5h ago

MORL-A2C: Multi-Objective Reinforcement Learning Reranker for Health

Researchers introduce MORL-A2C, a sequential decision-making extension to the MOPI-HFRS system that uses an Advantage Actor-Critic algorithm to optimize the trade-off between user preference and nutritional health in food recommendations.

media r/LocalLLaMA · 5h ago

I built an agent Harness for Small Models. I got Qwen 3.5 4b managing servers.

The author developed a specialized agent harness designed to address the specific failure modes of small local models, such as failed tool calls and poor state tracking. This custom framework enables smaller models like Qwen 3.5 4b to effectively manage remote servers.

media r/LocalLLaMA · 5h ago

Locally running model turns an Image into a Cute Controllable Character you can Play as

The author presents the 800M version of a model that converts images into controllable characters, designed to run comfortably on consumer GPUs. This iteration increases context to 12 latent frames and improves stability while maintaining high performance, achieving over 60 fps on an RTX 5090.

media Hugging Face Forums · 5h ago

HoLo-ToLk: Tokenizer-Free Speech Models on 0-Parameter HSL Substrate

The author introduces HoLo-ToLk, a research project building speech-to-text (STT) and text-to-speech (TTS) models using the zero-parameter HSL byte substrate without tokenizers or learned input embeddings. The work demonstrates that raw HSL bytes can serve as a viable signal for audio processing when combined with specific architectural modifications.

github llama.cpp · 5h ago

llama.cpp b9837 release adds --reasoning-preserve flag and new binaries

The llama.cpp project has released version b9837, which introduces a new `--reasoning-preserve` flag for the Jinja chat template to retain reasoning tokens. This update also includes corrected help messages and provides pre-built binaries for macOS, Linux, Windows, Android, and openEuler across various hardware backends.