Code generation — korshunov.ai

New Qwen-27B IQ4_KS and IQ4_KS_KT Quantizations for ik_llama.cpp

Two new GGUF quantizations for Qwen-27B have been released for ik_llama.cpp, optimized for 16GB VRAM on NVIDIA GPUs. The first, Qwen3.6-27B.i1-IQ4_KS-attn_qkv-IQ4_KS.gguf, improves logical reasoning at the cost of general knowledge, with a perplexity of 7.4131. The second, Qwen3.6-27B.i1-IQ4_KS_KT-attn_qkv-IQ4_KS.gguf, applies Trellis quantization (iq4_kt) selectively to tensors with near-Gaussian distributions, achieving a perplexity of 7.4091, showing minimal performance degradation.
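To put the two perplexity figures in context, here is a minimal sketch of how perplexity is derived from per-token log-probabilities. The function name and the sample values are illustrative assumptions and are not taken from the ik_llama.cpp tooling; only the two reported perplexities come from the post.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

# Hypothetical example: every token predicted with probability 0.5.
sample = [math.log(0.5)] * 8
print(perplexity(sample))

# Relative gap between the two perplexities reported in the post.
ppl_ks, ppl_kt = 7.4131, 7.4091
relative_gap = (ppl_ks - ppl_kt) / ppl_ks
print(f"{relative_gap:.4%}")
```

By this measure the two quantizations differ by roughly 0.05%, which matches the post's framing of the gap as minimal.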

media r/LocalLLaMA · 2d ago

Can GLM5.2 be run on 4x AMD EPYC servers with 512GB RAM each?

The user asks if a 467GB GLM 5.2 model can be run on four servers, each with 512GB RAM and 409.6 GB/s memory bandwidth, using CPU-only inference with Unsloth. They consider splitting the model across nodes for token speed or using 8-bit versions in dual clusters to handle larger models and improve performance.
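A rough way to reason about this question is a memory-bandwidth ceiling on decode speed. The sketch below assumes decoding is purely bandwidth-bound and that every weight byte is read once per generated token, as for a dense model. Both are simplifying assumptions: a mixture-of-experts model would read only its active experts per token, and inter-node network cost is ignored entirely. The function name is hypothetical.

```python
def max_tokens_per_second(bytes_read_per_token, bandwidth_bytes_per_second):
    """Upper bound on decode speed when each token requires streaming
    `bytes_read_per_token` from memory at the given bandwidth."""
    if bytes_read_per_token <= 0:
        raise ValueError("bytes_read_per_token must be positive")
    return bandwidth_bytes_per_second / bytes_read_per_token

# Figures from the post: a 467 GB model and 409.6 GB/s per server.
MODEL_BYTES = 467e9
BANDWIDTH = 409.6e9

# Single server, dense assumption: all weights read once per token.
single_node = max_tokens_per_second(MODEL_BYTES, BANDWIDTH)

# Four servers each holding a quarter of the weights and reading in parallel
# (idealized: no communication overhead).
four_nodes = max_tokens_per_second(MODEL_BYTES / 4, BANDWIDTH)

print(f"single node ceiling: {single_node:.2f} tok/s")
print(f"four node ceiling:   {four_nodes:.2f} tok/s")
```

Under these assumptions the ceiling is below one token per second on a single node, so the actual number of bytes touched per token matters far more than the raw RAM capacity.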

media Together AI Blog · 2d ago

Frontier LLMs Struggle to Write Fast Multi-GPU Kernels

ParallelKernelBench evaluates LLMs on writing fast multi-GPU CUDA kernels for 87 real workloads. Kernels from the top model run at under a third of the speed of optimal implementations, though a few outputs surpass any existing public code.

lab Anthropic News · 2d ago

Introducing Claude Tag for Slack Teams

Claude Tag lets teams tag @Claude in Slack to delegate tasks, with access to selected channels, tools, and codebases. It learns from channel context, works asynchronously, and proactively updates users on relevant information. Today, 65% of the code from Anthropic's product team is written with the internal version of Claude Tag, and the feature is now available in beta for Claude Enterprise and Team customers.

media r/LocalLLaMA · 2d ago

llama-server webui not responding after recompile

The llama-server webui is not responding to prompts, showing only 'processing...' despite the model loading successfully. The CLI interface works normally, and the server health endpoints respond correctly. The issue emerged after a recompile of llama.cpp with CUDA support.

media r/LocalLLaMA · 2d ago

Reusable workflows for long-running local LLMs

Hayden has developed the knot harness to manage long-running local LLM tasks. It enables reusable workflows with agent profiles, file system event monitoring, and automatic triggers, using Pi.dev as the default agent.

media r/LocalLLaMA · 2d ago

Review of Jackrong/Qwopus3.5-9B-Coder-MTP-GGUF

A review discusses the experience with Jackrong's Qwopus Coder MTP variants, comparing them to Qwen3.5 and Qwen3.6 models across 9B, 27B, and 35B parameter sizes. The review focuses on performance and usability of the 9B-Coder-MTP-GGUF model in local LLM deployments.

media r/LocalLLaMA · 2d ago

My local server idling 99% of the time!

A user reports that their local server, which runs Qwen3.6-27B with OWU and PI for coding tasks, sits idle 99% of the time. They ask the community for ideas on putting local LLMs to work on meaningful, round-the-clock tasks.

media r/LocalLLaMA · 2d ago

Why is Gemma 4 26b not mentioned more?

Users note that Gemma 4 26b gets little discussion despite its potential suitability for personal assistant and RAG tasks on a single 3090. The model is considered a strong candidate for all-in-one local AI applications, though it receives less attention than Qwen3.6 or Gemma 4 31b.

lab Mistral AI News · 2d ago

Mistral Releases OCR 4 with Multilingual Support and Structured Output

Mistral OCR 4 introduces bounding boxes, block classification, and inline confidence scores for 170 languages across 10 language groups. It outperforms leading OCR systems in human preference evaluations with a 72% win rate and achieves the top score on OlmOCRBench (85.20), while offering self-hosted deployment in a single container and supporting enterprise use cases like RAG and document ingestion.

arxiv arXiv cs.CL · 2d ago

PRIDE: Privileged Information-enhanced Distillation for Empathetic Dialogue Generation

PRIDE introduces a knowledge distillation method that transfers empathetic reasoning from large models to smaller ones using privileged information available only during training. It achieves competitive or superior performance on empathy-related tasks by leveraging structured prompts, multi-source attention, and dual-alignment loss.

arxiv arXiv cs.CL · 2d ago

LangMAP: Language-Adaptive Tokenization for Multilingual Models

LangMAP extends UnigramLM to create language-specific tokenization from a shared vocabulary, enabling multilingual model training or adaptation without vocabulary changes. It improves morphological boundary alignment and AST leaf alignment in coding languages, and enhances grammatical acceptability in target languages, though benefits vary on knowledge-based tasks.

media r/LocalLLaMA · 2d ago

MiniMax M3 EAGLE3 GGUF model now compatible with llama.cpp

The MiniMax M3 EAGLE3 decoder has been converted to GGUF format and is now compatible with llama.cpp. Testing on a 2x3090, 128GB system with UD-Q2_K_XL quant showed performance improved from 2.3 to 5 tokens per second using --fit and keeping the model in VRAM.

media r/LocalLLaMA · 2d ago

Boogu-Image-0.1: Open-Source Unified Image Generation and Editing Model Series

Boogu-Image-0.1 is an Apache-2.0 licensed open-source family of unified image generation and editing models, with Base, Turbo, and Edit variants. It offers high-quality text-to-image generation, fast generation, image editing, and strong Chinese-English text rendering. Its training data is roughly one order of magnitude smaller than that of closed-source systems, yet it achieves competitive performance through improved model understanding and data quality.

media r/LocalLLaMA · 2d ago

Who needs GPUs? 64 t/s gen, 285 PP on 6-year-old CPUs

A gemma-4-26B-A4B model running CPU-only on two Xeon 6248R processors achieves 64 tokens per second in generation and 285 in prompt processing (PP), demonstrating viable performance on 6-year-old hardware. The user highlights the potential for CPU-optimized local LLMs to rival GPU-based systems, emphasizing cost efficiency and accessibility.

arxiv arXiv cs.CL · 2d ago

Bayesian Factorized Adaptation for Code-Switching in Multilingual ASR

A new method called Bayesian factorized adaptation enables high-performance multilingual ASR models to handle code-switching without degrading monolingual performance. It integrates switching-relevant knowledge efficiently using minimal synthetic data, reducing transcription errors by 32.87% and overall WER by 5.31%.

arxiv arXiv cs.CL · 2d ago

SamatNext v0.2-B Achieves Superior Curriculum Retention in Small Code Models

SamatNext v0.2-B, a 356M-parameter hybrid decoder, achieves a 100.0% pass rate on Stage 5 and retains 98.8% of Stage 3 semantic behavior in a controlled Python code curriculum. It outperforms a parameter-matched Transformer baseline, which reaches only 97.6% on Stage 5 and retains just 6.0% of Stage 3 behavior, indicating improved retention under sequential fine-tuning.

arxiv arXiv cs.CL · 2d ago

P4IR Framework Improves LLM-Based Code Compliance Accuracy

P4IR, a two-stage framework, uses supervised fine-tuning and Group Relative Policy Optimization to improve LLM-based automated code compliance systems. It reduces tree edit distance by up to 23.8% and token-level Levenshtein distance by up to 38.6%, outperforming leading LLMs such as Claude Opus, GPT-5.2, and GLM-4.7 in zero-shot and few-shot prompting settings. It also reduces false positives by a statistically significant margin.
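For readers unfamiliar with the token-level Levenshtein metric mentioned here, the following is a minimal sketch of the standard dynamic-programming edit distance applied to token sequences. It is a generic textbook formulation, not the paper's evaluation code, and the example tokens are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn sequence `a` into sequence `b`."""
    # prev[j] holds the distance between the first i-1 items of a and first j of b.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from first i items of a to the empty prefix of b
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

# Hypothetical predicted vs. reference token sequences.
predicted = ["if", "x", ">", "0"]
reference = ["if", "x", ">=", "0"]
print(levenshtein(predicted, reference))
```

Because the function works on any sequence, the same code measures character-level distance when given strings and token-level distance when given lists of tokens.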

media Hugging Face Forums · 2d ago

Buddy System: Rust entropy monitor with NER-gated uncertainty for tiered LLM inference

The Buddy System uses a Rust entropy monitor to detect per-token uncertainty in local Gemma 3 4B inference, routing only uncertain tokens to Sonnet via NER-gated span extraction and semantic retrieval. Benchmarks show it achieves 71.4% accuracy at $0.21, outperforming the Anthropic Advisor pattern (62.9% at $0.44) across seven Hugging Face datasets, with a key improvement on SQuAD v2 by routing source passage chunks to the cloud model.
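The per-token uncertainty signal described here can be illustrated with Shannon entropy over the next-token distribution. This is a minimal Python sketch of the general idea only: the project itself is written in Rust, and the function names, the threshold value, and the routing rule below are assumptions made for illustration, not the project's implementation.

```python
import math

def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over `logits`."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

def uncertain_positions(per_token_logits, threshold=1.0):
    """Indices of tokens whose entropy exceeds the (hypothetical) threshold."""
    return [i for i, logits in enumerate(per_token_logits)
            if token_entropy(logits) > threshold]

# Hypothetical logits for two generated tokens over a 4-word vocabulary:
# the first is confidently peaked, the second is uniform (maximally uncertain).
steps = [
    [10.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
print(uncertain_positions(steps))
```

In a tiered setup, only the flagged positions would be candidates for escalation to the larger model, which is where the cost savings reported above would come from.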

arxiv arXiv cs.CL · 2d ago

Latent Personal Memory: Dynamic Soft Prompts for LLM Personalization

Latent Personal Memory (LPM) represents user-specific memories as a compact, persistent matrix of N latent slots. These slots are mapped via a shared cross-attention network into dynamic, input-conditioned soft prompts that are prepended to a frozen LLM. LPM outperforms LoRA and Prompt Tuning by up to 8.8% and 54.4% on PersonaMem v1, reduces KV-cache usage by over 64x, matches LoRA accuracy on LoCoMo with 120x fewer parameters, and scales efficiently with context length, outperforming full-context at 128K tokens.