OpenAI and Broadcom unveil LLM-optimized inference chip
OpenAI and Broadcom have introduced Jalapeño, a custom AI chip designed for large language model inference. The chip aims to enhance performance, efficiency, and scalability in AI systems.
The ggml project has optimized AMX performance by flattening the partition over n_batch * M, ensuring all threads participate in quantization. This change improves speed by up to 1.47x across various models and hardware configurations on CPU and GPU platforms, with results showing consistent gains in inference time.
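As a rough illustration of why flattening can help, the sketch below compares splitting only the M rows of each batch across threads with splitting the flattened n_batch * M index range. The function names, the chunking scheme, and the "before" layout are assumptions made for illustration; this is not ggml's actual code.

```python
def split_range(total, n_threads):
    """Split [0, total) into n_threads contiguous chunks, as evenly as possible."""
    base, extra = divmod(total, n_threads)
    chunks, start = [], 0
    for t in range(n_threads):
        size = base + (1 if t < extra else 0)
        chunks.append((start, start + size))
        start += size
    return chunks


def busy_threads(chunks):
    """Count the threads that received a non-empty chunk."""
    return sum(1 for lo, hi in chunks if hi > lo)


def per_batch_partition(n_batch, m, n_threads):
    # Hypothetical "before" layout: each batch's M rows are split on their own,
    # so a small M leaves most threads with nothing to do.
    return [split_range(m, n_threads) for _ in range(n_batch)]


def flattened_partition(n_batch, m, n_threads):
    # "After" layout as described in the summary: one split over n_batch * M.
    return split_range(n_batch * m, n_threads)


if __name__ == "__main__":
    n_batch, m, n_threads = 8, 2, 8
    print(busy_threads(per_batch_partition(n_batch, m, n_threads)[0]))
    print(busy_threads(flattened_partition(n_batch, m, n_threads)))
```

With 8 batches of 2 rows and 8 threads, the per-batch split keeps only 2 threads busy at a time, while the flattened split gives every thread 2 rows.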
Claude Code v2.1.181 adds support for changing settings with prompt syntax such as /config thinking=false, adds sandbox Apple Events support on macOS, and improves streaming, auto-retry, and subagent behavior. It also fixes numerous bugs related to startup, file handling, clipboard, and UI responsiveness across platforms.
The llama.cpp project has released version b9785, featuring a code change to harden caps checks as detailed in pull request #24973. This update provides pre-built binaries for macOS Apple Silicon, Intel Macs, and iOS via an XCFramework, with KleidiAI support disabled on Apple Silicon. Linux distributions including Ubuntu are supported for CPU, Vulkan, ROCm 7.2, OpenVINO, and SYCL backends across x64, arm64, and s390x architectures. Android users can access arm64 CPU binaries, while Windows offers extensive options covering CPU, OpenCL Adreno, CUDA 12 and 13, Vulkan, OpenVINO, SYCL, and HIP. The release also includes builds for openEuler targeting x86 and aarch64 processors with ACL Graph support. A standalone UI package is available alongside the platform-specific releases to facilitate local model inference.
llama.cpp releases version b9784 with major optimizations for Hexagon-based matrix multiplication (MM) operations, including 32x32 tiled weight repack, improved dyn.quant handling, and unified kernel parameter management. The release includes new binaries for macOS (arm64 and x64), iOS, and multiple Linux architectures with support for Vulkan, ROCm, and OpenVINO.
llama.cpp releases version b9782, including binaries for macOS, Linux, Android, Windows, and openEuler. The release adds support for Vulkan, OpenVINO, SYCL, ROCm, and CUDA across multiple architectures, with updated UI and disabled features such as KleidiAI and openEuler support.
llama.cpp releases version b9781, adding Vulkan support for Linux and Windows, and expanding to multiple architectures including ARM64 and x64 across macOS, Linux, Android, and Windows. The release includes CPU, CUDA, OpenVINO, SYCL, and ROCm builds, with a UI component available.
llama.cpp release b9777 adds support for the LFM2.5-ColBERT-350M and LFM2.5-Embedding-350M models. The release includes pre-built binaries for macOS, Linux, Android, Windows, and openEuler, supporting various architectures and acceleration technologies such as CUDA, Vulkan, OpenVINO, and SYCL.
llama.cpp version b9776 introduces Vulkan support for Linux and Windows, along with CPU, OpenCL, CUDA, and SYCL variants across macOS, Linux, Android, and Windows. The release also includes support for OpenVINO and ROCm, with UI available in a standalone package.
llama.cpp release b9774 adds Vulkan backend support for SQR, SQRT, SIN, COS, CLAMP, LEAKY_RELU, and NORM operations, with support for noncontiguous inputs. The release includes binary builds for macOS, Linux, Android, Windows, and openEuler across multiple architectures and backends including CUDA, OpenVINO, SYCL, and ROCm.
llama.cpp has released version b9775, with binaries for macOS, Linux, Android, Windows, and openEuler across various architectures. The release includes CPU, Vulkan, OpenVINO, SYCL, and ROCm support, with updated CUDA versions (12.4 and 13.3) and iOS XCFramework availability. A UI package is also provided.
llama.cpp release b9771 introduces Vulkan support across Linux and Windows, reducing shader variants and binary size by making mul_mm ALIGNED a specialization constant. The release includes binaries for macOS, Linux, Android, Windows, and openEuler, with variants for CPU, Vulkan, OpenVINO, SYCL, and ROCm.
llama.cpp now links ggml-cpu when GGML_VULKAN_CHECK_RESULTS or GGML_VULKAN_RUN_TESTS is enabled, resolving linking failures. This fix restores debug functionality for Vulkan result verification and testing after the ggml-cpu library was split.
llama.cpp release b9767 improves MTP inference using mat-vec paths for small batches and includes updated GPU support. The release provides binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures and APIs including Vulkan, CUDA, OpenVINO, and SYCL.
llama.cpp version b9763 introduces an ID field in tool call responses. The release includes binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures and hardware acceleration options, with a UI component also available.
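A minimal sketch of what attaching an ID to each tool call might look like, assuming an OpenAI-style tool call shape (a dict with "type" and "function" keys). The helper name with_tool_call_ids, the "call_" prefix, and the use of a random UUID are all assumptions for illustration; the summary does not describe the format llama.cpp actually uses.

```python
import uuid


def with_tool_call_ids(tool_calls):
    """Return copies of the tool calls, adding an "id" where one is missing."""
    out = []
    for call in tool_calls:
        call = dict(call)  # shallow copy so the caller's dicts are not mutated
        # Hypothetical ID scheme: "call_" followed by a random hex string.
        call.setdefault("id", "call_" + uuid.uuid4().hex)
        out.append(call)
    return out


if __name__ == "__main__":
    calls = [{"type": "function",
              "function": {"name": "get_weather", "arguments": "{}"}}]
    print(with_tool_call_ids(calls))
```

An ID per call lets a client match each tool result back to the call that requested it when several calls are returned in one response.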
Mateusz has developed a full pre-trained language model, Project Inkblot's Titan v1, combining Mamba SSM, Multi-Head Attention, and 32-expert MoE in a single decoder-only architecture under 1B parameters. The model, trained on a single NVIDIA L4 GPU for ~$50, achieves 27.5 validation perplexity and demonstrates efficient scaling via a single-line config update, with all components implemented from scratch in PyTorch. Titan v2's first training cycle is now complete, and dataset expansion is underway.
llama.cpp version b9741 introduces new binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures. The release includes support for Vulkan, CUDA 12.4 and 13.3, OpenVINO, SYCL, and ROCm, with updated versions for iOS and Ubuntu.
A patch addresses random failures in the test-args-parser on Windows by modifying argv override to only apply when argc matches, preventing clobbering of programmatic arguments. This fixes a fastfail assertion in the OpenVINO Windows workflow while preserving UTF-8 handling for real binaries.
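The guard described above can be sketched as follows. The function name effective_argv and its two parameters are hypothetical; the real patch is C++ and is only paraphrased here under the assumption that the override source is the process command line re-decoded as UTF-8.

```python
def effective_argv(passed_argv, process_argv):
    """Pick which argument list the parser should use.

    passed_argv:  arguments handed to the parser by the caller (possibly
                  built programmatically, as in a unit test).
    process_argv: arguments recovered from the real process command line,
                  decoded as UTF-8 (the override source).
    """
    # Only trust the override when it plausibly describes the same call:
    # a differing count means the caller supplied its own arguments.
    if len(passed_argv) == len(process_argv):
        return list(process_argv)
    return list(passed_argv)


if __name__ == "__main__":
    # A test passing synthetic arguments keeps them.
    print(effective_argv(["test", "--flag", "1"], ["test-args-parser.exe"]))
    # A real binary still gets the UTF-8 decoded command line.
    print(effective_argv(["app", "caf?"], ["app", "café"]))
```

The count comparison is a heuristic: it keeps UTF-8 handling for real binaries while leaving programmatic argument lists of a different length untouched.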
llama.cpp version b9738 fixes the CORS proxy to avoid forwarding authentication headers. The release includes binary builds for macOS, Linux, Android, Windows, and openEuler across multiple architectures and hardware acceleration options, including Vulkan, CUDA, OpenVINO, and SYCL.
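A minimal sketch of the idea, assuming the proxy copies request headers into the upstream request. The header set and the function name headers_to_forward are assumptions; the summary does not list which headers llama.cpp strips.

```python
# Assumed set of credential-bearing headers, compared case-insensitively.
SENSITIVE_HEADERS = {"authorization", "proxy-authorization", "cookie"}


def headers_to_forward(incoming):
    """Return the incoming headers minus any that carry credentials."""
    return {
        name: value
        for name, value in incoming.items()
        if name.lower() not in SENSITIVE_HEADERS
    }


if __name__ == "__main__":
    request_headers = {
        "Authorization": "Bearer secret-token",
        "Accept": "application/json",
    }
    print(headers_to_forward(request_headers))
```

Dropping these headers keeps a credential meant for the local server from being sent to whatever third-party host the proxy is asked to fetch.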
The GLM-5.2 model's DSA indexer was incorrectly loaded on all layers, causing failures due to missing tensors. The update marks indexer tensors as TENSOR_NOT_REQUIRED, allowing layers without an indexer to load as nullptr and enabling full MLA attention. DeepSeek-V3.2, with uniform indexing, is unaffected.
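The loading behavior can be sketched in a few lines. The flag name TENSOR_NOT_REQUIRED comes from the summary, but its value, the tensor names, and the helper functions below are hypothetical stand-ins, with Python's None playing the role of nullptr.

```python
TENSOR_NOT_REQUIRED = 1  # flag name from the summary; the value is an assumption


def create_tensor(weights, name, flags=0):
    """Look up a tensor; a missing optional tensor yields None instead of an error."""
    if name in weights:
        return weights[name]
    if flags & TENSOR_NOT_REQUIRED:
        return None
    raise KeyError(f"missing required tensor: {name}")


def load_layers(weights, n_layers):
    layers = []
    for i in range(n_layers):
        # Hypothetical tensor names, for illustration only.
        attn = create_tensor(weights, f"blk.{i}.attn.weight")
        indexer = create_tensor(weights, f"blk.{i}.indexer.weight", TENSOR_NOT_REQUIRED)
        layers.append({
            "attn": attn,
            "indexer": indexer,
            # Layers without an indexer fall back to full attention.
            "mode": "indexed" if indexer is not None else "full",
        })
    return layers


if __name__ == "__main__":
    weights = {"blk.0.attn.weight": "A0", "blk.0.indexer.weight": "I0",
               "blk.1.attn.weight": "A1"}
    for layer in load_layers(weights, 2):
        print(layer["mode"])
```

A model whose every layer has an indexer, as the summary says of DeepSeek-V3.2, takes the "indexed" path everywhere and is unaffected by the flag.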