Inference efficiency — korshunov.ai


Variable-Width Transformers Outperform Uniform Architectures

A new ×-shaped transformer architecture allocates varying layer widths, widening early and late layers while narrowing middle ones. It reduces average layer width, leading to 22% fewer FLOPs and 15% less KV cache memory, while outperforming uniform baselines on language modeling loss across 200M to 2B parameter models.

arxiv arXiv cs.LG · 8d ago

MGUP: Momentum-Gradient Alignment for Selective Optimization

MGUP introduces a selective update mechanism that applies larger step-sizes to a fixed proportion of parameters in stochastic optimization, while using smaller, non-zero step-sizes for the rest. It integrates seamlessly with optimizers like AdamW, Lion, and Muon, providing theoretical convergence guarantees for MGUP-AdamW and demonstrating superior or more stable performance in training large language models and MAE pretraining tasks.
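
To make the selective-update idea concrete, here is a minimal sketch, not the paper's algorithm. It assumes plain momentum SGD rather than AdamW, Lion, or Muon, and it assumes the "fixed proportion" is chosen by ranking the elementwise product of momentum and gradient, a criterion guessed from the title only. The function name and the parameters `top_frac` and `small_scale` are hypothetical.

```python
def selective_momentum_step(params, grads, momentum,
                            lr=0.1, beta=0.9, top_frac=0.5, small_scale=0.1):
    """One momentum-SGD step with two step sizes (illustrative only).

    The top `top_frac` of coordinates, ranked by momentum * gradient
    (an assumed alignment score), take the full step `lr`; the rest
    take the smaller but non-zero step `lr * small_scale`.
    """
    # Exponential moving average of gradients.
    new_momentum = [beta * m + (1.0 - beta) * g
                    for m, g in zip(momentum, grads)]

    # Assumed alignment score: positive when momentum and gradient agree in sign.
    scores = [m * g for m, g in zip(new_momentum, grads)]

    # Pick a fixed proportion of coordinates with the highest scores.
    k = max(1, round(top_frac * len(params)))
    ranked = sorted(range(len(params)), key=lambda i: scores[i], reverse=True)
    selected = set(ranked[:k])

    # Selected coordinates get the large step, the rest a small non-zero one.
    new_params = [
        p - lr * (1.0 if i in selected else small_scale) * m
        for i, (p, m) in enumerate(zip(params, new_momentum))
    ]
    return new_params, new_momentum
```

Every coordinate still moves on each step, which is the property the summary emphasizes: unselected parameters are slowed down rather than frozen.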

arxiv arXiv cs.LG · 8d ago

AoiZora: Topology-Aware Auto-Parallel Optimization for Video Diffusion Inference

AoiZora is a compiler-mediated topology planner that improves low-latency video diffusion inference on TPU sub-slices. By aligning logical sharding with physical placement through the compilation flow, it reduces one-step denoising latency by up to 1.42x on TPU v5e sub-slices compared to existing methods.

arxiv arXiv cs.AI · 8d ago

S4oP: Operator-level Pruning for Efficient SSM Deployment

S4oP introduces an incremental, operator-level pruning method for S4 and S4D models, reducing inference cost by up to 70% while maintaining performance. The approach combines structured masking with fine-tuning and jointly tracks accuracy and latency, enabling efficient deployment of SSMs on resource-constrained devices.

arxiv arXiv cs.AI · 8d ago

Ternary Mamba: Pretrained QAT for Efficient SSM Compression

Ternary Mamba achieves 3.61x compression of Mamba-2 using grouped quantization-aware training from a pretrained checkpoint, reducing memory from 2,687 to 744 MB. It reaches 48.1% zero-shot accuracy with only 102M tokens and 4 GPU-hours, matching Bi-Mamba within 0.9 percentage points, while revealing new instability from learnable quantization scales and error accumulation in recurrence.
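
As a rough illustration of grouped ternary quantization, the sketch below maps each group of weights to codes in {-1, 0, +1} with one scale per group, and checks the reported compression ratio from the two memory figures. The absolute-mean scale, the group size, and the function names are assumptions for illustration; the paper trains learnable scales with quantization-aware training, which this does not attempt.

```python
def ternarize_grouped(weights, group_size=4):
    """Quantize a flat list of weights to ternary codes, one scale per group.

    The scale is the mean absolute value of the group (an assumed choice);
    codes are round(w / scale) clipped to {-1, 0, +1}.
    """
    groups = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale = sum(abs(w) for w in group) / len(group)
        if scale == 0.0:
            codes = [0] * len(group)  # all-zero group: nothing to encode
        else:
            codes = [max(-1, min(1, round(w / scale))) for w in group]
        groups.append((scale, codes))
    return groups


def dequantize(groups):
    """Reconstruct approximate weights from (scale, codes) pairs."""
    return [scale * c for scale, codes in groups for c in codes]


# Compression ratio implied by the reported memory footprints (in MB).
compression_ratio = 2687 / 744
```

Dividing the two reported footprints gives roughly 3.61, which agrees with the stated compression figure.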

arxiv arXiv cs.AI · 8d ago

Embedded ML Workflow for Microcontroller Edge Devices

This paper outlines a systems-oriented workflow for embedded machine learning on microcontroller-class devices. It details key engineering decisions such as data sampling, feature extraction, class imbalance validation, model-runtime co-design, and streaming deployment, using inertial motion recognition and keyword spotting as case studies. The work provides practical design rules for robust on-device inference, including data curation, quantization, thresholding, scheduling, and field monitoring.

arxiv arXiv cs.CL · 8d ago

SwiftTrans Improves LLM Code Translation Efficiency

SwiftTrans addresses runtime efficiency gaps in LLM-based code translation by introducing Multi-Perspective Exploration and Difference-Aware Selection. The framework extends CodeNet and F2SBench and introduces SwiftBench to evaluate runtime performance, showing consistent improvements in both correctness and efficiency across benchmarks.

media r/LocalLLaMA · 8d ago

Someone awhile ago did a quant shootout for Qwen3.6

A Reddit post features a quantization performance comparison for Qwen3.6, with a user noting they performed rough mathematical calculations on the results. The post includes a visual chart and links to the original image and comments.

media r/LocalLLaMA · 8d ago

Quantitative Comparison of Qwen3.6 Model Performance

A Reddit post presents a quantitative comparison of Qwen3.6's performance in reduced-precision (quantized) versions. The author notes a rough calculation suggesting Qwen3.6 maintains strong performance even at lower bit depths, though the math is described as shoddy and not rigorously validated.

media r/LocalLLaMA · 8d ago

I didn't know it was possible to compile llamacpp to run CUDA + Vulkan at the same time

A user compiled llama.cpp with both CUDA and Vulkan support to use two GPUs together, a W7800 and another card. The setup delivered 10% more tokens/sec in decoding on the MiniMax-M3-UD-IQ2_M-00001-of-00004.gguf model, and the user plans to run benchmarks to assess the real performance gains.

media r/LocalLLaMA · 8d ago

Minimax M3 (4-bit MLX) Initial Benchmark on Mac Studio M3 with 512GB

The Minimax M3 (4-bit MLX) model was benchmarked on a Mac Studio M3 with 512GB of memory. Results report token throughput and latency across prompt sizes, peaking at 269.1 tok/s for 8192-token prompts and reaching 172.8 tok/s for a 65k-token prompt, with peak memory use of 228GB.

media r/LocalLLaMA · 8d ago

Cheapest hardware for Qwen 3.6: 27B and 35B-A3B models

A Reddit post discusses a cost-effective hardware setup for running the Qwen 3.6 27B and 35B-A3B models, arguing that an RTX 3090 24GB offers better long-term value than a Tesla V100 due to discontinuation and upcoming Chinese alternatives. The proposed build totals $1,995.65 and includes a Ryzen 5 5600X, an RTX 3090 24GB, and essential components; keeping the total price low is the main concern.

media r/LocalLLaMA · 8d ago

Anyone running Qwen 3.6 27b UD Q8 on multiple GPUs?

A user asks whether anyone has successfully run Qwen 3.6 27b UD Q8 on multiple GPUs, reporting problems with both llama.cpp and vLLM. The model crashes or hangs during multi-turn requests, with llama.cpp showing CUDA errors and vLLM failing mid-turn, even though the same setup works well with Q5 quantization.

github llama.cpp · 9d ago

llama.cpp releases b9669 with backend sampling for Eagle3

llama.cpp version b9669 adds backend sampling support for Eagle3. The release includes binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures and hardware acceleration options, including Vulkan, CUDA, ROCm, OpenVINO, and SYCL.

github llama.cpp · 9d ago

llama.cpp Release b9670: Fixes and New Builds

llama.cpp release b9670 includes fixes for NVFP4 edge cases in llama-graph, such as moving post-GEMM MUL operations and restricting build_ffn to supported combinations. The release provides binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures and backend options, including CUDA, Vulkan, SYCL, and OpenVINO.

github llama.cpp · 9d ago

llama.cpp Release b9667 Adds Vulkan and CUDA Support

llama.cpp release b9667 introduces Vulkan support with S_v=16 via gated_delta_net. It includes binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures, with options for Vulkan, CUDA 12.4 and 13.3, ROCm, OpenVINO, and SYCL.

media r/LocalLLaMA · 9d ago

Qwen3.6 27B Quantization Performance Test Results

A test comparing Q8 and IQ3 XXS turbo4 quantized versions of Qwen3.6 27B shows that Q8 excels in API safety and input sanitization, while IQ3 XXS turbo4 performs better in thread management and modular code design. The model recommends merging both approaches: using Q8 for initial launch protection and IQ3 XXS for atomic writes and thread lifecycle, forming a combined Phase 1 foundation.

github llama.cpp · 9d ago

llama.cpp release b9665 adds --offline flag and new binary builds

llama.cpp version b9665 introduces a new --offline flag for benchmarking. The release includes binary builds for macOS, Linux, Android, Windows, and openEuler across multiple architectures and hardware acceleration options, including Vulkan, CUDA, ROCm, OpenVINO, and SYCL.

github llama.cpp · 9d ago

llama.cpp Release b9663 Adds SYCL Support and New Binary Builds

llama.cpp release b9663 adds support for OP EXPM1 and all unit test cases for FLOOR, TRUNC, and ROUND. It includes updated binaries for macOS, Linux, Android, Windows, and openEuler, with support for SYCL (FP32 and FP16), Vulkan, CUDA 12.4 and 13.3, and ROCm 7.2, along with an updated UI.

github llama.cpp · 9d ago

sycl: support reordered Q4_K/Q5_K/Q6_K MoE MUL_MAT_ID

The SYCL update extends support for reordered expert tensor handling in MoE MUL_MAT_ID to Q4_K, Q5_K, and Q6_K. Unsupported 3D reorder cases now fall back instead of aborting.