Code generation — korshunov.ai

GLM-5.2 (744B, 2-bit) achieves 7.3 tok/s on 4×3090 with 192GB RAM

GLM-5.2 UD-IQ2_M runs at ~7.3 tokens per second on 4×RTX 3090s with 192GB DDR5 RAM using llama.cpp expert offload. Reducing quantization from IQ2 to IQ1 provided no speed gain, while increasing CPU threads from 6 to 12 improved performance by 22%. Decode is limited by CPU compute, not memory bandwidth, and the offloaded experts must be explicitly distributed across GPUs to avoid out-of-memory errors.

media r/LocalLLaMA · 7d ago

DiffusionGemma 26B on 4090 reaches 475 t/s with limitations

DiffusionGemma 26B reaches 475 t/s on an RTX 4090 via vLLM with INT4 AWQ quantization, with throughput ranging from 290 t/s to 700 t/s depending on output length. However, it is limited to single-user operation and shows lower response accuracy, rapid context loss, and slower time-to-first-token than standard 26B models.

media r/LocalLLaMA · 7d ago

Running GLM-5.2 on CPU Only with Local Setup

A user runs GLM-5.2 locally on a Dell PowerEdge R740 with dual Xeon 6248R CPUs and 768GB RAM, using ik_llama.cpp for improved CPU inference. After isolating one NUMA node for optimal performance, they achieve 4–5.5 tokens per second in chat and about 3 tokens per second in coding tasks, noting the model shows 'frontier vibes' during code generation despite limited usability on this hardware.

media r/LocalLLaMA · 7d ago

Repurposing an Old Multi-GPU Node for Local Inference

The node has 8 NVIDIA Quadro RTX 6000 GPUs with 192 GB of total VRAM and 512 GB of RAM, enough for large-scale local model inference. Models such as Llama 3 or Mistral in the 8-13 billion parameter range could run efficiently on it, giving faster, private, low-latency inference than a single-GPU setup and making the node worth keeping for internal use.

media r/LocalLLaMA · 7d ago

Calibrating 2-bit GGUFs for agentic coding tasks

2-bit quantized versions of Qwopus3.6-27B-Coder, calibrated on real agentic coding logs, achieve a 63% pass rate on SWE-rebench. The IQ2_M quant outperforms non-calibrated versions and rivals Q5_K_M in pass rate despite being half the size, with improved robustness to loops and faster decoding due to a bundled MTP.

media r/LocalLLaMA · 7d ago

North Mini Code: 4-bit quant, Ollama, and OpenRouter support

Cohere Labs has released a 4-bit quantized version of North Mini Code on Hugging Face, reducing its size to approximately 20GB for local execution on devices like Macs. The model is now supported in Ollama, local runtimes based on llama.cpp, and via the OpenRouter API, improving accessibility for developers.

media r/LocalLLaMA · 7d ago

Real-world token cost savings from rtk, headroom, and caveman

A real workload analysis shows headroom, rtk, and caveman reduce token costs by 2.8%, 0.5%, and 0.4% respectively, totaling 3.7% of baseline spending. However, savings are limited by payload diversity, with most traffic being plain text or source code, and the tools only compress structured outputs. Most cost reduction occurs on the cheapest token stream—cache reads—while the tools do not affect prompt caching or output costs, and coverage gaps exist, especially for rtk.
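The arithmetic behind the combined figure can be checked with a minimal sketch. The per-tool percentages come from the summary above; the baseline of 1000 cost units and the function names are hypothetical and only illustrate the calculation.

```python
# Per-tool savings as a percentage of baseline token spend
# (figures taken from the summary above).
savings_pct = {"headroom": 2.8, "rtk": 0.5, "caveman": 0.4}


def total_savings(pcts):
    """Sum per-tool savings, rounded to one decimal to avoid float noise."""
    return round(sum(pcts.values()), 1)


def cost_after(baseline, pcts):
    """Spend remaining once all listed savings are applied to the baseline."""
    return baseline * (1 - total_savings(pcts) / 100)


total = total_savings(savings_pct)
# Hypothetical baseline of 1000 cost units, chosen only for illustration.
remaining = cost_after(1000.0, savings_pct)
print(f"combined savings: {total}% of baseline, remaining spend: {remaining:.1f}")
```

Because each percentage is already expressed against the same baseline, the three figures add directly rather than compounding.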

media r/LocalLLaMA · 7d ago

Laguna M.1: 225B Parameter MoE Model for Agentic Coding

Laguna M.1 is a 225B-parameter mixture-of-experts model with 23B activated parameters per token, designed for agentic coding and long-horizon tasks. It achieves competitive performance on SWE-bench Verified (74.6%), SWE-bench Multilingual (63.1%), and Terminal-Bench 2.0 (45.8%), outperforming models like Devstral 2 and GLM-4.7 on key benchmarks.

media r/LocalLLaMA · 7d ago

GLM-5.2 Is The Best Open Weight Creative Writing Model

Sam Paech's Creative Writing Benchmark on EQ Bench ranks GLM-5.2 as the top open-weight creative writing model.

media r/LocalLLaMA · 7d ago

SLMs and Diffusion: The Future of Small, Specialized Models?

Users discuss whether task-specific small language models (SLMs) can outperform larger models on their target tasks, citing benchmarks where 9B models match or exceed larger ones. They propose a sequential agentic workflow built from multiple specialized models, with one coordinating and the others verifying answers, and suggest that diffusion models could speed up such workflows despite their lower capability.

media r/LocalLLaMA · 7d ago

Llama Bench vs Real-World Performance Discrepancy

A user reports a significant gap between llama-bench results and real-world performance. The benchmark shows 754 tokens per second for prefill and 36 tokens per second for generation, but actual usage yields only 7.98 tokens per second, with high latency. The report attributes the discrepancy to real-world usage conditions rather than to the benchmark settings.
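One way real-world conditions can pull the observed rate below the benchmarked generation speed is prompt processing time being counted against output tokens. The sketch below is an illustration under that assumption, not the cause identified in the report; the prompt and output lengths are hypothetical, while the 754 and 36 tokens-per-second figures come from the summary.

```python
def end_to_end_rate(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Output tokens per second when prefill time is included in the wall clock."""
    prefill_time = prompt_tokens / prefill_tps   # seconds spent reading the prompt
    decode_time = output_tokens / decode_tps     # seconds spent generating
    return output_tokens / (prefill_time + decode_time)


PREFILL_TPS = 754.0   # benchmarked prefill speed from the report
DECODE_TPS = 36.0     # benchmarked generation speed from the report

# Hypothetical request shapes: (prompt tokens, output tokens).
for prompt, output in [(0, 200), (2_000, 200), (20_000, 200)]:
    rate = end_to_end_rate(prompt, output, PREFILL_TPS, DECODE_TPS)
    print(f"prompt={prompt:>6} output={output} -> {rate:.2f} tokens/s end to end")
```

With an empty prompt the function returns the raw generation speed; the longer the prompt relative to the output, the further the end-to-end figure falls below it.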

github llama.cpp · 7d ago

llama.cpp Release b9698 Adds Self-Update Support and Multiple Platform Binaries

llama.cpp version b9698 enables self-updates only when built with llama-install.sh. The release includes binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures and hardware acceleration options, including Vulkan, CUDA, OpenVINO, and SYCL.

github AutoGPT · 7d ago

autogpt-platform-beta-v0.6.64 Released

The autogpt-platform-beta-v0.6.64 release, dated 18th June 2026, introduces new features such as the AutoPilot Context Panel and Global Search, along with enhancements to graph saving, caching, and builder performance. It also includes security hardening, bug fixes for LLM provider issues, and UI improvements like a high-resolution touch icon.

github CrewAI · 7d ago

CrewAI v1.14.8a Releases New FlowDefinition Features

CrewAI v1.14.8a introduces script and crew actions to FlowDefinition, adds DMN mode support, and enables flow execution without Python code. It also includes experimental support for JSON-first crews and ZIP deployment fallback, along with improved memory reset and token usage tracking.

github llama.cpp · 7d ago

llama.cpp Release b9693 Adds BF16 Support and Cross-Platform Binaries

llama.cpp version b9693 introduces BF16 support in its concat kernel and provides pre-built binaries for macOS, Linux, Android, Windows, and openEuler. The release includes CPU, Vulkan, ROCm, OpenVINO, SYCL, and HIP variants across multiple architectures, with a dedicated UI package available.

media r/LocalLLaMA · 7d ago

LocalLLaMA proposes crowdsourced coding dataset

A community initiative suggests creating a crowdsourced coding dataset to enable local LLM development. The proposal aims to allow anyone with hardware to contribute data, with more powerful users helping to fine-tune or quantize models, thus reducing reliance on company-released models.

media r/LocalLLaMA · 7d ago

GLM-5.2 Review and Censorship Response

GLM-5.2 demonstrates exceptional long-context coherence and conversational fluency, outperforming Gemini-3.1-Pro on text-only tasks and matching GPT-5.5 in reasoning quality. The model responds factually to sensitive topics like Taiwan and Tiananmen Square, providing detailed historical context without overt censorship, though it adheres to Chinese government content guidelines.

arxiv arXiv cs.AI · 7d ago

LLM-as-Interface, ML-as-Predictor for Pediatric Appendicitis

ClaMPAPP, a hybrid system, uses an LLM to extract structured clinical features from free-text notes and passes them to an XGBoost classifier for diagnosis. It outperformed end-to-end LLMs in both internal and external validation, with better diagnostic performance and fewer missed cases, demonstrating superior stability and safety in pediatric appendicitis triage.

arxiv arXiv cs.AI · 7d ago

Trade-offs in Medical LLM Adaptation: French QA Study

A study compares continual pretraining (CPT), supervised fine-tuning (SFT), and their combination for French medical QA. CPT+SFT performs best in multiple-choice QA, though gains over SFT are small and often insignificant, making SFT a cost-effective default. For open-ended QA, CPT improves metrics while SFT degrades quality, with instruction tuning and CPT+SFT favored by LLM-based evaluations. Cross-lingual results show effective transfer from French to English benchmarks.

arxiv arXiv cs.AI · 7d ago

Reverse-Engineering Transformer Attention with Executable Programs

A new method uses program synthesis to generate Python programs that reproduce attention patterns in transformer models. Fewer than 1,000 such programs achieve over 75% intersection-over-union similarity on TinyStories, and replacing 25% of attention heads with these programs increases perplexity by only 16% while preserving performance on question-answering tasks.
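To make the intersection-over-union measure concrete, here is a minimal sketch of one way to score how well a program's predicted attention pattern matches a head's actual pattern. The summary does not give the paper's exact definition, so the thresholding step, the function name `attention_iou`, and the toy matrices are all assumptions for illustration.

```python
def attention_iou(pattern_a, pattern_b, threshold=0.5):
    """IoU of the (query, key) positions where each pattern exceeds `threshold`.

    Each pattern is a list of rows, one row of attention weights per query.
    Thresholding into "attended" positions is an assumption of this sketch.
    """
    def active(pattern):
        return {
            (q, k)
            for q, row in enumerate(pattern)
            for k, weight in enumerate(row)
            if weight > threshold
        }

    a, b = active(pattern_a), active(pattern_b)
    union = a | b
    if not union:
        return 1.0  # both patterns attend nowhere, so they agree trivially
    return len(a & b) / len(union)


# Toy 3-token example: weights from a hypothetical attention head ...
head_pattern = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.1],
]
# ... and the hard pattern a hypothetical synthesized program predicts.
program_pattern = [
    [1, 0, 0],
    [1, 0, 0],
    [0, 0, 1],
]

print(attention_iou(head_pattern, program_pattern))  # 2 shared positions out of 4 -> 0.5
```

A score of 1.0 means the program marks exactly the positions the head attends to; lower scores mean the two sets of positions overlap less.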