All articles — korshunov.ai

Repurposing an Old Multi-GPU Node for Local Inference

The node has 8 NVIDIA Quadro RTX 6000 GPUs (24 GB each, 192 GB of VRAM in total) and 512 GB of system RAM, enough for large-scale local model inference. Models such as LLaMA-3 or Mistral in the 8-13 billion parameter range could run efficiently here, giving faster, private, low-latency inference than a single-GPU setup, which makes the node worth keeping for internal use.

github CrewAI · 13d ago

v1.14.8a1 Release Notes

Version 1.14.8a1 adds an optional if expression to each.do steps and fixes JSON crew issues. The snapshot and changelog for v1.14.8a have been updated. Contributors include @joaomdmoura and @vinibrsl.

media r/LocalLLaMA · 13d ago

Local Qwen isn't a worse Opus, it's a different tool

The article argues that Local Qwen is not inferior to Opus, but rather serves a different purpose. It emphasizes that each model is designed for specific use cases, and comparing them directly overlooks their distinct capabilities and intended applications.

media r/LocalLLaMA · 13d ago

Calibrating 2-bit GGUFs for agentic coding tasks

2-bit quantized versions of Qwopus3.6-27B-Coder, calibrated on real agentic coding logs, achieve a 63% pass rate on SWE-rebench. The IQ2_M quant outperforms non-calibrated versions and rivals Q5_K_M in pass rate despite being half the size, with improved robustness to loops and faster decoding due to a bundled MTP.

media Latent Space · 13d ago

Why AI Scaling Is a Systems Problem, Not Just a GPU Race

The AI scaling debate overlooks that maximizing model FLOPs utilization (MFU) matters more than buying more GPUs. Frontier labs such as xAI operate at sub-10% MFU, while historical models achieved 21% to 70% MFU, which points to systemic inefficiencies in scheduling, networking, and cluster management. Anjney Midha argues that AI infrastructure must evolve into efficient, aligned, and responsible systems, with 'output maxing' emerging as a new discipline for frontier AI.

media r/LocalLLaMA · 13d ago

North Mini Code: 4-bit quant, Ollama, and OpenRouter support

Cohere Labs has released a 4-bit quantized version of North Mini Code on Hugging Face, reducing its size to approximately 20GB for local execution on devices like Macs. The model is now supported in Ollama, local runtimes based on llama.cpp, and via the OpenRouter API, improving accessibility for developers.

media r/LocalLLaMA · 13d ago

LFM2.5-Embedding-350M and LFM2.5-ColBERT-350M Released

LFM2.5-Embedding-350M is a dense bi-encoder that provides fast multilingual retrieval with one vector per document, achieving best-in-class accuracy for its size and inference speed comparable to smaller models. LFM2.5-ColBERT-350M is a late interaction retriever with best-in-class multilingual accuracy, enabling cross-lingual retrieval by storing one vector per token and supporting retrieval in multiple languages with high precision. Both models are designed as drop-in replacements for existing RAG pipelines.
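
The difference between "one vector per document" and "one vector per token" can be illustrated with a minimal sketch. It assumes the standard ColBERT-style MaxSim operator for late interaction; the summary does not state the exact scoring these models use, and the function names and toy 2-dimensional vectors below are hypothetical.

```python
def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def dense_score(query_vec, doc_vec):
    """Bi-encoder scoring: one vector per query and one per document."""
    return dot(query_vec, doc_vec)


def maxsim_score(query_tokens, doc_tokens):
    """Late-interaction scoring (assumed MaxSim): for each query token
    vector, take its best match among the document token vectors, then sum."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy vectors, purely illustrative (not produced by any real model).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_tokens = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]

dense = dense_score([1.0, 1.0], [0.5, 0.5])
late = maxsim_score(query_tokens, doc_tokens)
print(dense, late)
```

The trade-off the sketch makes visible: the dense index stores a single vector per document, while the late-interaction index stores one per token and pays for finer-grained matching with more storage and compute.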

media r/LocalLLaMA · 13d ago

Real-world token cost savings from rtk, headroom, and caveman

A real workload analysis shows that headroom, rtk, and caveman reduce token costs by 2.8%, 0.5%, and 0.4% respectively, for a combined 3.7% of baseline spending. Savings are limited by payload diversity: most traffic is plain text or source code, and the tools compress only structured outputs. Most of the reduction lands on the cheapest token stream, cache reads, while prompt caching and output costs are unaffected. Coverage gaps also remain, especially for rtk.

media r/LocalLLaMA · 13d ago

Laguna M.1: 225B Parameter MoE Model for Agentic Coding

Laguna M.1 is a 225B-parameter mixture-of-experts model with 23B activated parameters per token, designed for agentic coding and long-horizon tasks. It achieves competitive performance on SWE-bench Verified (74.6%), SWE-bench Multilingual (63.1%), and Terminal-Bench 2.0 (45.8%), outperforming models like Devstral 2 and GLM-4.7 on key benchmarks.

github llama.cpp · 13d ago

llama.cpp Release b9703: Updates and Binary Downloads

llama.cpp version b9703 includes a rework of the server's preset handling, removing remote HF preset support and deprecated functions. The release provides binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures and hardware acceleration options, including Vulkan, CUDA, OpenVINO, and SYCL.

github llama.cpp · 13d ago

llama.cpp release b9704: fixes invalid grammar handling and adds new binaries

llama.cpp version b9704 now returns HTTP 400 for invalid grammar instead of silently dropping constraints. The release includes binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures and hardware accelerators, with support for Vulkan, ROCm, OpenVINO, SYCL, and CUDA.

media r/LocalLLaMA · 13d ago

My suitcase robot gets high from real gas sensor

A real MQ-2 gas sensor detects smoke and feeds live data to an LLM sampler, adjusting temperature, top_p, and top_k in real time. As smoke increases, the robot's speech becomes loopier and more associative, with no scripted 'stoned' mode, demonstrating live model behavior driven by physical input.
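
A sensor-to-sampler mapping of this kind might look like the sketch below. Everything in it is an assumption for illustration: the post's actual code is not shown, and the function name, the raw reading range (`clean`, `saturated`), and the parameter ranges are hypothetical.

```python
def clamp(x, lo, hi):
    """Limit x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def sampler_from_smoke(raw, clean=200, saturated=800):
    """Map a raw gas-sensor reading to sampler settings.

    `clean` and `saturated` are hypothetical calibration points for the
    sensor's ADC reading; more smoke pushes sampling toward looser,
    more associative output.
    """
    level = clamp((raw - clean) / (saturated - clean), 0.0, 1.0)
    return {
        "temperature": round(0.7 + 0.8 * level, 3),  # 0.7 -> 1.5
        "top_p": round(0.9 + 0.09 * level, 3),       # 0.9 -> 0.99
        "top_k": int(40 + 160 * level),              # 40 -> 200
    }


for reading in (150, 500, 900):
    print(reading, sampler_from_smoke(reading))
```

Clamping keeps readings outside the calibrated range from producing invalid sampler values, and the linear ramp means no separate scripted mode is needed: the behavior changes continuously with the input.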

media r/LocalLLaMA · 13d ago

mistral.rs v0.8.10 adds /v1/skills support for local models

mistral.rs v0.8.10 introduces OpenAI-compatible Agent Skills via a /v1/skills endpoint, enabling local models to execute domain-specific instructions and scripts without relying on frontier APIs. The update supports tools like file uploads and downloads via /v1/files and includes prebuilt binaries for Linux, macOS, and Windows.

media r/LocalLLaMA · 13d ago

GLM-5.2 Inference Free on Hugging Face for Next 6 Hours

Hugging Face is offering free inference access for GLM-5.2 for the next six hours. Users can access the model via the Hugging Face platform, with a recommended prompt provided in the post.

media r/LocalLLaMA · 13d ago

unsloth GLM-5.2-GGUF with 2bit quantization at 238GB

The unsloth GLM-5.2-GGUF model is available with 2bit quantization, sized at 238GB. It is hosted on Hugging Face and shared via a Reddit post in the LocalLLaMA community.

media r/LocalLLaMA · 13d ago

GLM-5.2 Is The Best Open Weight Creative Writing Model

Sam Paech's Creative Writing Benchmark on EQ Bench ranks GLM-5.2 as the top open-weight creative writing model. The assessment is based on performance metrics from the EQ Bench creative writing evaluation.

github llama.cpp · 13d ago

llama.cpp Release b9702: Fixes and New Binaries

llama.cpp version b9702 includes a fix for router args not being forwarded to child instances. The release provides binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures and hardware acceleration options, including Vulkan, ROCm, OpenVINO, and SYCL.

media r/LocalLLaMA · 13d ago

Best place to sell a barely-used RTX PRO 6000 Blackwell Max-Q

A user asks where to sell a barely-used RTX PRO 6000 Blackwell Max-Q, purchased for local AI inference with minimal usage. They consider r/hardwareswap, eBay, or niche pro/workstation markets, seeking advice on realistic pricing and buyer expectations like warranty transfer or invoice.

media Don't Worry About the Vase · 13d ago

White House Pauses AI Deployment

The U.S. White House paused the deployment of frontier AI models, including Claude Fable 5 and Claude Mythos 5, citing a reported 'jailbreak' in which the AI could identify and fix security vulnerabilities in code. Anthropic has been working with the Trump Administration to resolve the issue, but experts argue that the problem is fundamental: AI either can write secure code or it cannot, so a fix is impossible without undermining its defensive capabilities.

media r/LocalLLaMA · 13d ago

SLMs and Diffusion: The Future of Small, Specialized Models?

Users discuss whether task-specific small language models (SLMs) can outperform larger models on their target tasks, citing benchmarks where 9B models match or exceed larger ones. They propose a sequential agentic workflow built from multiple specialized models, with one coordinating and the others verifying answers, and suggest that diffusion models could accelerate such workflows despite reduced intelligence.