Inference efficiency — korshunov.ai

Topic · Inference efficiency

OpenAI and Broadcom unveil LLM-optimized inference chip

OpenAI and Broadcom have introduced Jalapeño, a custom AI chip designed for large language model inference. The chip aims to enhance performance, efficiency, and scalability in AI systems.

github llama.cpp · 5d ago

ggml optimizes AMX with partition flattening

The ggml project has optimized AMX performance by flattening the partition over n_batch * M, ensuring all threads participate in quantization. This change improves speed by up to 1.47x across various models and hardware configurations on CPU and GPU platforms, with results showing consistent gains in inference time.

lab Claude Code Releases · 7d ago

Claude Code v2.1.181 Release Notes

Claude Code v2.1.181 introduces support for setting config settings via prompt syntax like /config thinking=false, adds sandbox Apple Events support on macOS, and improves streaming, auto-retry, and subagent behavior. It also fixes numerous bugs related to startup, file handling, clipboard, and UI responsiveness across platforms.

github llama.cpp · 1h ago Live

llama.cpp b9785 Release with Hardened Caps Check and Multi-Platform Binaries

The llama.cpp project has released version b9785, featuring a code change to harden caps checks as detailed in pull request #24973. This update provides pre-built binaries for macOS Apple Silicon, Intel Macs, and iOS via an XCFramework, with KleidiAI support disabled on Apple Silicon. Linux distributions including Ubuntu are supported for CPU, Vulkan, ROCm 7.2, OpenVINO, and SYCL backends across x64, arm64, and s390x architectures. Android users can access arm64 CPU binaries, while Windows offers extensive options covering CPU, OpenCL Adreno, CUDA 12 and 13, Vulkan, OpenVINO, SYCL, and HIP. The release also includes builds for openEuler targeting x86 and aarch64 processors with ACL Graph support. A standalone UI package is available alongside the platform-specific releases to facilitate local model inference.

github llama.cpp · 7h ago

LLaMA.cpp Release b9784: Hexagon MM Optimizations and Cross-Platform Binaries

LLaMA.cpp releases version b9784 with major optimizations for hexagon-based MM operations, including 32x32 tiled weight repack, improved dyn.quant handling, and unified kernel parameter management. The release includes new binaries for macOS (arm64 and x64), iOS, and multiple Linux architectures with support for Vulkan, ROCm, and OpenVINO.

github llama.cpp · 9h ago

llama.cpp releases b9782 with new binaries and support

llama.cpp releases version b9782, including binaries for macOS, Linux, Android, Windows, and openEuler. The release adds support for Vulkan, OpenVINO, SYCL, ROCm, and CUDA across multiple architectures, with updated UI and disabled features such as KleidiAI and openEuler support.

github llama.cpp · 12h ago

llama.cpp releases b9781 with Vulkan and multi-platform support

llama.cpp releases version b9781, adding Vulkan support for Linux and Windows, and expanding to multiple architectures including ARM64 and x64 across macOS, Linux, Android, and Windows. The release includes CPU, CUDA, OpenVINO, SYCL, and ROCm builds, with a UI component available.

github llama.cpp · 18h ago

LLaMA.cpp Release b9777 Adds New Models and Cross-Platform Binaries

LLaMA.cpp release b9777 adds LFM2.5-ColBERT-350M and LFM2.5-Embedding-350M models. The release includes pre-built binaries for macOS, Linux, Android, Windows, and openEuler, supporting various architectures and acceleration technologies like CUDA, Vulkan, OpenVINO, and SYCL.

github llama.cpp · 23h ago

llama.cpp release b9776 adds Vulkan and multiple hardware support

llama.cpp version b9776 introduces Vulkan support for Linux and Windows, along with CPU, OpenCL, CUDA, and SYCL variants across macOS, Linux, Android, and Windows. The release also includes support for OpenVINO and ROCm, with UI available in a standalone package.

github llama.cpp · 1d ago

Vulkan backend updates and new binary releases for llama.cpp

llama.cpp release b9774 adds Vulkan backend support for SQR, SQRT, SIN, COS, CLAMP, LEAKY_RELU, and NORM operations, with support for noncontiguous inputs. The release includes binary builds for macOS, Linux, Android, Windows, and openEuler across multiple architectures and backends including CUDA, OpenVINO, SYCL, and ROCm.

github llama.cpp · 1d ago

LLaMA.cpp Release b9775: New Binaries and Support for Multiple Platforms

LLaMA.cpp has released version b9775, introducing binaries for macOS, Linux, Android, Windows, and openEuler across various architectures. The release includes CPU, Vulkan, OpenVINO, SYCL, and ROCm support, with updated CUDA versions (12.4 and 13.3) and iOS XCFramework availability. A UI package is also provided.

github llama.cpp · 2d ago

LLaMA.cpp Release b9771 Adds Vulkan Support and Optimizations

LLaMA.cpp release b9771 introduces Vulkan support across Linux and Windows, reducing shader variants and binary size by making mul_mm ALIGNED a spec constant. The release includes binaries for macOS, Linux, Android, Windows, and openEuler, with variants for CPU, Vulkan, OpenVINO, SYCL, and ROCm.

github llama.cpp · 2d ago

Fix for Vulkan result checking and test linking in llama.cpp

llama.cpp now links ggml-cpu when GGML_VULKAN_CHECK_RESULTS or GGML_VULKAN_RUN_TESTS are enabled to resolve linking failures. This fix restores debug functionality for Vulkan result verification and testing after the ggml-cpu library was split.

github llama.cpp · 2d ago

llama.cpp release b9767 adds GPU and multi-platform support

llama.cpp release b9767 improves MTP inference using mat-vec paths for small batches and includes updated GPU support. The release provides binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures and APIs including Vulkan, CUDA, OpenVINO, and SYCL.

github llama.cpp · 2d ago

llama.cpp Release b9763 Adds ID to Tool Call Responses

llama.cpp version b9763 introduces an ID field in tool call responses. The release includes binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures and hardware acceleration options, with a UI component also available.

media Hugging Face Forums · 3d ago

I built a novel triple-hybrid LLM under 1B parameters for ~$50

Mateusz has developed a full pre-trained language model, Project Inkblot's Titan v1, combining Mamba SSM, Multi-Head Attention, and 32-expert MoE in a single decoder-only architecture under 1B parameters. The model, trained on a single NVIDIA L4 GPU for ~$50, achieves 27.5 validation perplexity and demonstrates efficient scaling via a single-line config update, with all components implemented from scratch in PyTorch. Titan v2's first training cycle is now complete, and dataset expansion is underway.

github llama.cpp · 4d ago

llama.cpp Release b9741 Adds New Binaries and Support

llama.cpp version b9741 introduces new binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures. The release includes support for Vulkan, CUDA 12.4 and 13.3, OpenVINO, SYCL, and ROCm, with updated versions for iOS and Ubuntu.

github llama.cpp · 4d ago

Fix for test-args-parser random failures on Windows

A patch addresses random failures in the test-args-parser on Windows by modifying argv override to only apply when argc matches, preventing clobbering of programmatic arguments. This fixes a fastfail assertion in the OpenVINO Windows workflow while preserving UTF-8 handling for real binaries.

github llama.cpp · 5d ago

llama.cpp release b9738: fixes CORS auth header forwarding and new binary builds

llama.cpp version b9738 fixes the CORS proxy to avoid forwarding authentication headers. The release includes binary builds for macOS, Linux, Android, Windows, and openEuler across multiple architectures and hardware acceleration options, including Vulkan, CUDA, OpenVINO, and SYCL.

github llama.cpp · 5d ago

GLM-5.2 DSA indexer fix: tensors marked not required

The GLM-5.2 model's DSA indexer was incorrectly loaded on all layers, causing failures due to missing tensors. The update marks indexer tensors as TENSOR_NOT_REQUIRED, allowing layers without an indexer to load as nullptr and enabling full MLA attention. DeepSeek-V3.2, with uniform indexing, is unaffected.