Code generation — korshunov.ai


Best models for a 12GB VRAM card

A user with a 12GB VRAM GPU asks for model recommendations for general chatting, roleplaying, and coding. They prioritize uncensored models for chat and roleplaying, and have a Ryzen 5600 CPU and 32GB DDR4 RAM.
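On a 12GB card, the practical question is how much of a quantized model fits in VRAM, with the rest offloaded to system RAM (llama.cpp exposes this split as `--n-gpu-layers` / `-ngl`). The sketch below is a rough estimate. It assumes weights are spread evenly across layers and reserves a hypothetical 1.5 GB for KV cache and runtime overhead. The function name and the example model sizes are illustrative only.

```python
def gpu_layers_that_fit(model_file_gb: float, n_layers: int,
                        vram_gb: float = 12.0, reserve_gb: float = 1.5) -> int:
    """Estimate how many transformer layers fit in VRAM.

    Assumes every layer is the same size (file size / layer count) and keeps
    `reserve_gb` free for KV cache, activations and driver overhead. Both are
    simplifications; real per-layer sizes and overheads vary.
    """
    if n_layers <= 0:
        raise ValueError("n_layers must be positive")
    per_layer_gb = model_file_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    # Never report more layers than the model actually has.
    return min(n_layers, int(usable_gb // per_layer_gb))


# Hypothetical 16 GB quantized file with 40 layers on a 12 GB card:
print(gpu_layers_that_fit(16.0, 40))  # → 26 (the rest run from system RAM)
```

A model whose file is comfortably under the usable VRAM (e.g. `gpu_layers_that_fit(8.0, 32)`) returns its full layer count, meaning it can run entirely on the GPU.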

lab Claude Code Releases · 8d ago

Claude Code v2.1.181 Release Notes

Claude Code v2.1.181 adds support for changing config settings with prompt syntax such as /config thinking=false, adds sandbox support for Apple Events on macOS, and improves streaming, auto-retry, and subagent behavior. It also fixes numerous bugs across platforms involving startup, file handling, the clipboard, and UI responsiveness.

github llama.cpp · 8d ago

ggml-cpu: Conditionally enable POWER11 backend based on compiler support

The ggml-cpu backend now enables its POWER11 variant only when the compiler supports -mcpu=power11. This prevents build failures with current GCC/Clang toolchains while remaining forward-compatible. CMakeLists.txt has been updated accordingly, and -mcpu=power10 is used for both P10 and P11 architectures.
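The underlying pattern is "probe, then enable": try a throwaway compile with the flag and only turn on the feature if it succeeds, which is what CMake's `check_c_compiler_flag` does. The real change lives in ggml's CMakeLists.txt; this Python sketch only illustrates the idea, and the function names and the default `cc` compiler are hypothetical.

```python
import os
import subprocess
import tempfile


def compiler_accepts_flag(flag: str, compiler: str = "cc") -> bool:
    """Return True if `compiler` can compile a trivial C file with `flag`.

    Any failure (non-zero exit, missing compiler, timeout) is treated as
    "flag unsupported", so the caller can fall back instead of breaking.
    """
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "probe.c")
        obj = os.path.join(tmp, "probe.o")
        with open(src, "w") as f:
            f.write("int main(void) { return 0; }\n")
        try:
            result = subprocess.run(
                [compiler, flag, "-c", src, "-o", obj],
                capture_output=True, timeout=30,
            )
        except (FileNotFoundError, subprocess.TimeoutExpired):
            return False  # no compiler on PATH, or it hung
        return result.returncode == 0


def power11_variant_enabled(compiler: str = "cc") -> bool:
    # Build the POWER11 variant only when the toolchain understands the flag,
    # rather than failing the whole build on older GCC/Clang.
    return compiler_accepts_flag("-mcpu=power11", compiler)


print(power11_variant_enabled())  # False on most non-POWER machines
```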

github llama.cpp · 8d ago

llama.cpp Release b9692 Adds New Binaries and Fixes

llama.cpp version b9692 introduces new binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures. The release includes updates supporting Vulkan, ROCm, OpenVINO, SYCL, and HIP, plus a fix that removes batch-dimension usage in llava_uhd.

media r/LocalLLaMA · 8d ago

Lemonade v10.8 Releases Auto Memory Management, Cloud Offload, and MCP Tool Support

Lemonade v10.8 introduces dynamic VRAM management that automatically unloads idle models and shrinks the KV cache to reclaim GPU memory. It also adds cloud offload to OpenAI-compatible providers, so models are served locally first with optional routing to the cloud. A new MCP gateway at POST /mcp exposes local models as tools in MCP-aware applications.
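The summary doesn't describe Lemonade's actual unloading policy, so the following is a hypothetical idle-timeout scheme, not Lemonade's code. Each request records a timestamp, and a periodic sweep unloads any model that has sat unused past a threshold. The `unload` callback stands in for whatever actually frees GPU memory, and the clock is injectable so the behaviour can be checked without waiting.

```python
import time
from typing import Callable, Dict, List


class IdleModelPool:
    """Track resident models and unload those idle longer than a threshold."""

    def __init__(self, idle_seconds: float, unload: Callable[[str], None],
                 clock: Callable[[], float] = time.monotonic):
        self.idle_seconds = idle_seconds
        self.unload = unload            # hypothetical hook that frees VRAM
        self.clock = clock
        self.last_used: Dict[str, float] = {}

    def touch(self, name: str) -> None:
        # Record that `name` just served a request.
        self.last_used[name] = self.clock()

    def evict_idle(self) -> List[str]:
        # Unload every model whose last request is older than the threshold.
        now = self.clock()
        idle = [n for n, t in self.last_used.items()
                if now - t >= self.idle_seconds]
        for name in idle:
            self.unload(name)
            del self.last_used[name]
        return idle


# Usage with a fake clock and a 300 s idle threshold:
t = [0.0]
unloaded: List[str] = []
pool = IdleModelPool(300, unloaded.append, clock=lambda: t[0])
pool.touch("chat-model")
t[0] = 200.0
pool.touch("code-model")
t[0] = 400.0
print(pool.evict_idle())  # → ['chat-model']
```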

media r/LocalLLaMA · 8d ago

GLM 5.2 Release Video Made with GLM 5.2

A video showcasing GLM 5.2's capabilities was created with the model and shared online. Users report that it performs well on web development tasks, though it still trails top models such as Gemini 3.1 Pro in video generation. Long outputs frequently time out on OpenRouter, so users have had to switch providers to receive complete responses.

media r/LocalLLaMA · 8d ago

We need an 80-160B model urgently for unified memory devices

Users with 80-160GB of unified memory or high-bandwidth RAM are held back by the lack of models sized for their hardware: existing models are either too small to make good use of it or too large to fit in memory. The post calls for 100B-scale sparse models like Qwen 3.5 122B or Gemma 4 122B to better serve owners of AMD AI Pro, RTX 3090/5090, or Apple devices.
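Part of why sparse (mixture-of-experts) models suit these machines is a bandwidth argument. During decoding, each generated token must stream the model's active weights from memory once, so tokens per second is capped at roughly bandwidth divided by bytes read per token. A sparse model reads only its active parameters. This back-of-the-envelope sketch uses hypothetical numbers (256 GB/s of bandwidth, 120B total and 10B active parameters), not the specs of any device or model named above.

```python
def decode_tokens_per_sec_upper_bound(active_params: float,
                                      bits_per_weight: float,
                                      bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed for a memory-bandwidth-bound model.

    tokens/s <= bandwidth / bytes_per_token, where bytes_per_token counts
    only the weights touched per token. Ignores KV-cache reads, compute
    limits and caching, so real speeds are lower.
    """
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Hypothetical 4-bit model on a 256 GB/s machine:
dense = decode_tokens_per_sec_upper_bound(120e9, 4, 256)   # all 120B active
sparse = decode_tokens_per_sec_upper_bound(10e9, 4, 256)   # 10B active per token
print(round(dense, 1), round(sparse, 1))  # → 4.3 51.2
```

Both models need the same memory capacity to hold all 120B weights, but the sparse one touches far fewer bytes per token. That is the trade-off the post argues fits 80-160GB machines.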

github llama.cpp · 8d ago

llama.cpp Release b9687 Adds New Binaries and Fixes

llama.cpp version b9687 introduces new binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures. The release includes support for Vulkan, ROCm, OpenVINO, SYCL, and HIP, with updates to improve device validation and performance on available hardware.

github llama.cpp · 8d ago

llama.cpp releases version b9688 with new APIs and cross-platform binaries

llama.cpp releases version b9688, adding APIs for model management and for real-time updates over Server-Sent Events (SSE). The release includes prebuilt binaries for macOS, Linux, Android, Windows, and openEuler, supporting various architectures and acceleration frameworks such as Vulkan, CUDA, OpenVINO, and SYCL.
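Server-Sent Events is a plain-text format. Each event is a group of `field: value` lines ended by a blank line, multiple `data:` lines in one event are joined with newlines, and lines starting with `:` are comments (often keep-alives). The summary doesn't give llama.cpp's endpoint or payload shape, so this is a simplified, standard-library parser for generic SSE text, and the sample payloads are invented.

```python
from typing import Dict, Iterable, Iterator, List


def parse_sse(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per SSE event (simplified version of the WHATWG rules)."""
    event: Dict[str, str] = {}
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # Blank line ends the event; dispatch only if it carried data.
            if data:
                event["data"] = "\n".join(data)
                yield event
            event, data = {}, []
            continue
        if line.startswith(":"):
            continue  # comment / keep-alive line
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # strip exactly one leading space
        if field == "data":
            data.append(value)
        elif field in ("event", "id", "retry"):
            event[field] = value
        # Unknown fields are ignored.


# Invented sample stream with two events and a keep-alive comment:
stream = [
    "event: update\n",
    'data: {"status": "loading"}\n',
    "\n",
    ": keep-alive\n",
    "data: line one\n",
    "data: line two\n",
    "\n",
]
for ev in parse_sse(stream):
    print(ev)
```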

media r/LocalLLaMA · 8d ago

PSA: unsloth/GLM-5.2-GGUF is uploading

A Reddit user noticed that the unsloth/GLM-5.2-GGUF repository was created just half an hour ago and currently contains only a README. They suspect that GGUF model files are being uploaded and have shared a link to the repository.

media r/LocalLLaMA · 8d ago

GLM-5.2-FP8 HGX-H200 SGLang Docker Deployment Config

A user shares a Docker configuration for running GLM-5.2-FP8 on HGX-H200 hardware with SGLang. The setup reaches a 262k context length and 70 tokens per second with a tensor-parallel size of 8 and a memory fraction of 0.83. The user notes that the official vLLM recipes do not work on the H200 because of FP8 KV-cache quantization limitations with the DSV3 architecture.
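The post's Docker file isn't reproduced here, but the reported settings map onto SGLang's `sglang.launch_server` options. This hedged sketch assembles an equivalent launch command in Python. The model path is a hypothetical placeholder, "262k" is assumed to mean 262,144 tokens, and the flags should be checked against the SGLang version you run.

```python
import shlex
import sys
from typing import List


def sglang_launch_cmd(model_path: str, tp: int = 8,
                      mem_fraction: float = 0.83,
                      context_length: int = 262_144) -> List[str]:
    """Build an SGLang server command from the settings reported in the post."""
    return [
        sys.executable, "-m", "sglang.launch_server",
        "--model-path", model_path,
        # Shard the model across 8 GPUs with tensor parallelism.
        "--tp-size", str(tp),
        # Fraction of GPU memory reserved for weights plus the KV-cache pool.
        "--mem-fraction-static", str(mem_fraction),
        "--context-length", str(context_length),
    ]


cmd = sglang_launch_cmd("zai-org/GLM-5.2-FP8")  # hypothetical repo id
print(shlex.join(cmd))
```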

github llama.cpp · 8d ago

LLaMA.cpp Release b9685 Adds SYCL Dev2Dev Memcpy and Multiple Platform Binaries

LLaMA.cpp version b9685 introduces SYCL device-to-device (dev2dev) memcpy support, moves the GGML_SYCL_DEV2DEV_MEMCPY setting into a runtime table, and improves peer-to-peer communication detection. The release includes precompiled binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures and APIs, including Vulkan, ROCm, OpenVINO, and SYCL (FP32/FP16).

media r/LocalLLaMA · 8d ago

LoopCoder-V2: Two-Loop PLT Model Achieves Best Gain-Cost Trade-Off

LoopCoder-V2 is a 7B instruction-tuned code model based on the Parallel Loop Transformer (PLT) and trained on 18T tokens of mixed text and code. The two-loop variant offers the best trade-off between gain and cost, improving SWE-bench Verified from 43.0 to 64.4, while three or more loops cause regressions due to growing positional mismatch and unstable updates.

media r/LocalLLaMA · 8d ago

GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine?

GameCraft-Bench evaluates whether large language models can build playable games end-to-end in a real game engine. The benchmark covers major models such as Opus-4.7 and GPT-5.5, and readers are interested in how medium-sized models (e.g., 30-70B parameters) would perform on the same game development tasks.

blog Simon Willison · 8d ago

AI Demands More Engineering Discipline

In 2025, the economics of code production shifted dramatically, making code generation effectively free and instant. This change caused a cultural shift in software development, where lines of code moved from being carefully curated to being disposable and regenerable.

github llama.cpp · 8d ago

LLaMA.cpp Release b9684 Adds Conv_3D and Multiple Platform Binaries

LLaMA.cpp release b9684 introduces a new 3D convolution operation (conv_3d) and includes optimized implementations. The release provides prebuilt binaries for macOS, Linux, Android, Windows, and openEuler across various architectures and hardware acceleration options, including SYCL, Vulkan, CUDA, and OpenVINO.
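A 3D convolution slides a small depth×height×width kernel through a volume and takes a weighted sum at each position. For readers unfamiliar with the operation, this is a naive single-channel, stride-1, no-padding version in pure Python, computed as cross-correlation the way ML frameworks do. It illustrates the arithmetic only and does not reflect ggml's tensor layout or the parameters of its `conv_3d` op.

```python
from typing import List

Volume = List[List[List[float]]]  # indexed [depth][height][width]


def conv3d_valid(x: Volume, k: Volume) -> Volume:
    """Naive single-channel 3D convolution, stride 1, no padding."""
    D, H, W = len(x), len(x[0]), len(x[0][0])
    kd, kh, kw = len(k), len(k[0]), len(k[0][0])
    out_d, out_h, out_w = D - kd + 1, H - kh + 1, W - kw + 1
    out = [[[0.0] * out_w for _ in range(out_h)] for _ in range(out_d)]
    for d in range(out_d):
        for h in range(out_h):
            for w in range(out_w):
                # Weighted sum of the kernel-sized block starting at (d, h, w).
                acc = 0.0
                for i in range(kd):
                    for j in range(kh):
                        for m in range(kw):
                            acc += x[d + i][h + j][w + m] * k[i][j][m]
                out[d][h][w] = acc
    return out


def ones(n: int) -> Volume:
    return [[[1.0] * n for _ in range(n)] for _ in range(n)]


# 4x4x4 volume of ones, 2x2x2 kernel of ones: output is 3x3x3, every value 8.0.
y = conv3d_valid(ones(4), ones(2))
print(len(y), len(y[0]), len(y[0][0]), y[0][0][0])  # → 3 3 3 8.0
```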

media r/LocalLLaMA · 8d ago

GLM 5.2 on 4x Sparks: Reasonable?

A user asks whether running GLM-5.2 on four ASUS Ascent GX10 units (DGX Spark-class systems) is feasible. They ask about 4-bit quantization within the combined 512GB of unified memory and want estimates of prompt-processing and output token speeds at a 100k context length, noting that no performance data is available online yet.
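Whether the weights fit is simple arithmetic: parameters × bits per weight ÷ 8. This sketch uses the 753B parameter count reported for GLM-5.2 elsewhere in this feed and the 512GB total from the question. It assumes averages of 4.0 and 4.5 bits per weight, since practical 4-bit quantizations usually land somewhat above 4 bits once scales and higher-precision layers are counted. It says nothing about speed, and the leftover memory must still hold the KV cache for a 100k context plus the OS and runtime.

```python
def weight_gb(params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


budget_gb = 4 * 128  # four units x 128 GB unified memory = the 512 GB in the question
for bpw in (4.0, 4.5):
    size = weight_gb(753e9, bpw)
    print(f"{bpw} bpw: {size:.1f} GB weights, {budget_gb - size:.1f} GB left")
# → 4.0 bpw: 376.5 GB weights, 135.5 GB left
# → 4.5 bpw: 423.6 GB weights, 88.4 GB left
```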

github llama.cpp · 8d ago

llama.cpp release b9682 ships Vulkan builds and new platform binaries

llama.cpp version b9682 includes Vulkan builds for Linux and Windows for GPU acceleration. The release provides binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures, with CPU and GPU options including CUDA, OpenVINO, SYCL, and ROCm.

media r/LocalLLaMA · 8d ago

GLM-5.2 is a win for local AI

GLM-5.2, with 753B parameters and a 1M-token context window, is now accessible on local hardware through quantization. Its MIT license and extensive training data enable community fine-tuning of smaller models, promising significant improvements for local AI setups.

media r/LocalLLaMA · 8d ago

Headless screenshot loops enable a 30B local agent to debug raytraced FPS in pure C

A local 30B agent, using headless screenshot loops, autonomously debugs a raytraced FPS demo in pure C by capturing frames at key events and iterating on fixes. The agent builds a recursive visual debugging loop, demonstrating that simple feedback mechanisms can enable small models to solve complex, visually grounded tasks.
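The loop described is a render, inspect, fix cycle repeated until the frame looks right. Here is a minimal sketch with the three stages passed in as callables. `capture`, `inspect`, and `apply_fix` are hypothetical stand-ins for running the headless build and grabbing a screenshot, having the model judge it, and having the model patch the C source. The iteration cap is an assumption added so the loop always terminates.

```python
from typing import Callable, Optional, Tuple


def visual_debug_loop(capture: Callable[[], bytes],
                      inspect: Callable[[bytes], Optional[str]],
                      apply_fix: Callable[[str], None],
                      max_iters: int = 10) -> Tuple[bool, int]:
    """Repeat capture -> inspect -> fix until no problem is reported.

    `inspect` returns None when the frame looks correct, otherwise a short
    description of the defect, which is passed to `apply_fix`.
    Returns (fixed, iterations_used).
    """
    for i in range(1, max_iters + 1):
        frame = capture()            # e.g. headless render at a key event
        problem = inspect(frame)     # e.g. model critiques the screenshot
        if problem is None:
            return True, i
        apply_fix(problem)           # e.g. model edits the code, rebuilds
    return False, max_iters


# Usage with fakes: two visual bugs, each fixed by one iteration.
bugs = ["walls render black", "sky is upside down"]
ok, n = visual_debug_loop(
    capture=lambda: b"frame",
    inspect=lambda _frame: bugs[0] if bugs else None,
    apply_fix=lambda problem: bugs.remove(problem),
)
print(ok, n)  # → True 3
```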