All articles — korshunov.ai

Ornith-1.0-35B GGUF update: native MTP speculative-decode graft + full serving/TTFT/long-context numbers (llama.cpp, tp=1)

This article reports on an update to the Ornith-1.0-35B model, featuring a native MTP draft head grafted onto the IQ4_XS body for self-speculative decoding in llama.cpp. The author provides comprehensive performance metrics including throughput, time-to-first-token (TTFT), and long-context capabilities on a single RTX PRO 6000 Blackwell GPU.

media r/LocalLLaMA · 6h ago

Apple Refurbished Adds M5 Pro and Max Options

Following Apple's recent price increase, the company has added numerous top-of-the-line 14-inch MacBook Pro models equipped with M5 Pro and M5 Max chips to its refurbished store.

media r/LocalLLaMA · 6h ago

China Has Matched Anthropic in Cybersecurity, Resetting AI Race

A Wall Street Journal report indicates that Chinese artificial intelligence models have achieved parity with Anthropic's Claude in cybersecurity tasks.

media r/LocalLLaMA · 7h ago

Reddit user refutes Dario Amodei's claims against open-source AI

A Reddit post challenges Dario Amodei's assertion that open-source models are inferior to proprietary systems by arguing he misunderstands the technology. The author contends that Amodei is unaware of the transparency and capabilities of current open-weight models.

media Hugging Face Forums · 7h ago

Hypothetical Inquiry on AI Learning Binary Code

A forum user poses a speculative question regarding whether training neural networks or AI systems to understand binary code would significantly enhance their overall capabilities, particularly in coding tasks.

media Hugging Face Forums · 7h ago

Concept: Trading data for data to train AI models

A user proposes a concept for a website where individuals exchange data for data to train AI models, eliminating the need for monetary transactions. The system operates on a credit-based economy where users start with a set amount of credits and post bounties for specific data needs.

media Interconnects · 7h ago

Artifacts 22: Zyphra, Cohere, and Poolside are expanding the breadth of the ecosystem

The open AI model landscape is becoming increasingly diverse, shifting from dominance by a few Chinese players to a broader mix of organizations including sovereign AI initiatives, Big Tech, and product companies.

github llama.cpp · 8h ago

llama.cpp b9833 release: MiniCPM5 parser and multi-platform binaries

The llama.cpp project has released version b9833, introducing a dedicated parser for the MiniCPM5 model alongside various bug fixes and refactoring. This update includes support for tool call parsing, grammar simplification, and corrected Jinja API behavior to ensure compatibility with Jinja2 standards.

github llama.cpp · 9h ago

llama.cpp b9832 release adds --dump-prog debugging flag

The llama.cpp project has released version b9832, introducing a new `--dump-prog` command-line option for the Jinja template engine to aid in debugging. This update also includes pre-built binaries for macOS, Linux, Android, Windows, and openEuler across various CPU and GPU architectures.

media r/LocalLLaMA · 9h ago

Proposal for crowd-sourced, open-source distilled LLMs via distributed training

A Reddit user proposes a system to create truly open-source distilled large language models by wrapping existing command-line AI services. This approach would collect user inputs and outputs from applications like coding assistants or chatbots to build massive datasets through volunteer participation.

media r/LocalLLaMA · 10h ago

DeepSpec: A DeepSeek AI Collection for Speculative Decoding Draft Models

DeepSpec is a full-stack codebase released by deepseek-ai for training and evaluating draft models used in speculative decoding. The project provides data preparation utilities, implementation code, and evaluation scripts to facilitate the development of these auxiliary models.

github llama.cpp · 10h ago

llama.cpp b9831 release adds DFlash support and new binaries

The llama.cpp b9831 release introduces DFlash v2 support, including sliding window attention set per layer type, alongside a comprehensive set of pre-built binaries for multiple platforms.

media r/LocalLLaMA · 11h ago

DFlash support merged into llama.cpp

Support for the DFlash format has been merged into the llama.cpp repository. This update enables users to utilize DFlash files within the framework.

media r/LocalLLaMA · 11h ago

Step-3.7-Flash (198B-A11B vision MoE) on 4×3090: fully-resident IQ3_XXS beats the spilled IQ4 by 2.4×, and MTP speculative decode silently breaks vision

A user runs StepFun's 198B-parameter Step-3.7-Flash model on a consumer 4×RTX 3090 setup and reports two findings: a fully resident IQ3_XXS quantization beats a spilled IQ4 quantization by 2.4×, and multi-token prediction (MTP) speculative decoding silently breaks the model's vision capabilities.

media r/LocalLLaMA · 12h ago

What would it take to create /r/localllama's own LLM?

A Reddit user worries that open-weight models sized for 96GB to 128GB hardware may stop being released and asks whether a community-driven large language model is feasible.

media r/LocalLLaMA · 12h ago

Sell ddr5 for vram?

A Reddit user asks whether they should sell half of their 768GB DDR5 6400 ECC RAM to purchase RTX 6000 Pro GPUs, citing current RAM prices.

media r/LocalLLaMA · 12h ago

Seeking advice on cases for dual RTX 3090 LLM workstation

A user is building a local LLM workstation using an ASUS Crosshair VIII Hero motherboard and two power-limited RTX 3090 GPUs, seeking recommendations for compatible computer cases.

media r/LocalLLaMA · 12h ago

Qwen3.6 27B local vs Opus 4.8, voxel engine in raw C with zero frameworks

A comparison experiment pitted Claude Code on Opus 4.8 against a locally running Qwen3.6 27B model to build a voxel world engine in plain C without any external frameworks or libraries.

media r/LocalLLaMA · 12h ago

User questions existence of closed vs open LLM rankings and value of 70B-350B models

A Reddit user asks whether a solid leaderboard exists that compares closed-source and open-weight large language models side by side. They note that most available benchmarks feel fragmented and fail to address the practical differences between running models locally versus using API-based services.

media r/LocalLLaMA · 13h ago

Community inquiry on using Q1/Q2 quantization for large language models

A Reddit user asks the community about their experiences using Q1 or Q2 quantization levels for large language models ranging from 100 to 250 billion parameters. The post lists specific models in this size range, such as DeepSeek-V4-Flash and Qwen3-235B-A22B, and contrasts them with smaller models where lower quantization is generally discouraged.