Code generation — korshunov.ai

IHUBERT: Persian Pretrained Model with Semantic Deduplication

IHUBERT is a monolingual Persian pretrained language model trained on a 45 GB curated subset of the Sepahr-Danesh collection. It uses vector-based semantic deduplication and a domain-balanced pretraining pipeline to improve corpus quality and reduce redundancy, achieving top performance in extractive question answering and strong results in NER and topic classification, though relation extraction remains a challenge.

arxiv arXiv cs.CL · 7d ago

Adaptive LLM Tutoring Improves Engagement and Efficiency

A new adaptive LLM tutoring system uses subject-aware prompting to enhance student engagement. It outperforms static models in both simulation and real-world A/B testing, cutting interactions by 3 turns and raising exercise conversion rates, with the stochastic router variant reaching 28.1%.

arxiv arXiv cs.CL · 7d ago

PsyScore: A Psychometrically-Aware Framework for Trait-Adaptive Essay Scoring and ZPD-Scaffolded Feedback

PsyScore integrates diagnostic scoring and instructional feedback using a shared latent ability model. It features a trait-adaptive neural IRT scorer based on GPCM, a ZPD-scaffolded feedback generator that tailors instruction by proficiency level, and a multi-perspective evaluation strategy. Experiments on ASAP++ show competitive scoring and more pedagogically aligned feedback compared to existing methods.

blog Simon Willison · 7d ago

Datasette Launches Apps Plugin for Custom HTML Applications

Datasette has released a new plugin, datasette-apps, enabling self-contained HTML+JavaScript applications to run in a secure iframe sandbox. These apps can execute read-only or write SQL queries against Datasette databases, with built-in security features like CSP headers and sandbox restrictions to prevent data exfiltration or unauthorized access.

media r/LocalLLaMA · 7d ago

GLM-5.2 (744B, 2-bit) achieves 7.3 tok/s on 4×3090 with 192GB RAM

GLM-5.2 UD-IQ2_M runs at ~7.3 tokens per second on 4×RTX 3090s with 192GB DDR5 RAM using llama.cpp expert offload. Reducing quantization from IQ2 to IQ1 provided no speed gain, while increasing CPU threads from 6 to 12 improved performance by 22%. Decode is limited by CPU compute, not memory bandwidth, and the offloaded experts must be explicitly distributed across GPUs to avoid out-of-memory errors.

media r/LocalLLaMA · 7d ago

DiffusionGemma 26B on 4090 reaches 475t/s with limitations

DiffusionGemma 26B runs at up to 475 t/s on a 4090 via vLLM with INT4 AWQ quantization, with reported speeds ranging from 290 t/s to 700 t/s depending on output length. However, it is limited to single-user operation and shows lower response accuracy, rapid context loss, and slower time-to-first-token than standard 26B models.

media r/LocalLLaMA · 7d ago

Running GLM-5.2 on CPU Only with Local Setup

A user runs GLM-5.2 locally on a Dell PowerEdge R740 with dual Xeon 6248R CPUs and 768GB RAM, using ik_llama.cpp for improved CPU inference. After isolating one NUMA node for optimal performance, they achieve 4–5.5 tokens per second in chat and about 3 tokens per second in coding tasks, noting the model shows 'frontier vibes' during code generation despite limited usability on this hardware.

media r/LocalLLaMA · 7d ago

Repurposing an Old Multi-GPU Node for Local Inference

The node features 8 NVIDIA Quadro RTX 6000 GPUs with 192 GB VRAM and 512 GB RAM, enabling large-scale local AI model inference. Models like LLaMA-3 or Mistral with 8-13 billion parameters could run efficiently here, offering faster, private, and low-latency performance compared to single-GPU setups, making it worthwhile for internal use.

media r/LocalLLaMA · 7d ago

Calibrating 2-bit GGUFs for agentic coding tasks

2-bit quantized versions of Qwopus3.6-27B-Coder, calibrated on real agentic coding logs, achieve a 63% pass rate on SWE-rebench. The IQ2_M quant outperforms non-calibrated versions and rivals Q5_K_M in pass rate at half the size, with better robustness to loops and faster decoding thanks to bundled multi-token prediction (MTP).

media r/LocalLLaMA · 7d ago

North Mini Code: 4-bit quant, Ollama, and OpenRouter support

Cohere Labs has released a 4-bit quantized version of North Mini Code on Hugging Face, reducing its size to approximately 20GB for local execution on devices like Macs. The model is now supported in Ollama, local runtimes based on llama.cpp, and via the OpenRouter API, improving accessibility for developers.

media r/LocalLLaMA · 7d ago

Real-world token cost savings from rtk, headroom, and caveman

An analysis of a real workload shows that headroom, rtk, and caveman reduce token costs by 2.8%, 0.5%, and 0.4% respectively, for a total of 3.7% of baseline spending. Savings are limited by the payload mix: most traffic is plain text or source code, while the tools compress only structured outputs. Most of the reduction falls on the cheapest token stream (cache reads), and the tools do not affect prompt caching or output costs. Coverage gaps also exist, especially for rtk.
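The combined figure is simply the sum of the three per-tool savings. A minimal sketch of that arithmetic, using the percentages quoted in the summary; the function names and the baseline of 1000 currency units are hypothetical and only there for illustration:

```python
# Per-tool savings as a share of baseline spend, taken from the summary.
savings_pct = {"headroom": 2.8, "rtk": 0.5, "caveman": 0.4}


def total_savings_pct(per_tool):
    """Sum per-tool savings, rounded to one decimal to hide float noise."""
    return round(sum(per_tool.values()), 1)


def remaining_spend(baseline, per_tool):
    """Spend left after applying the combined savings to a baseline amount."""
    return baseline * (1 - total_savings_pct(per_tool) / 100)


total = total_savings_pct(savings_pct)
print(f"combined savings: {total}% of baseline")

# Hypothetical baseline, for illustration only.
print(f"remaining spend on 1000 units: {remaining_spend(1000, savings_pct):.1f}")
```

Treating the three savings as additive assumes each is already expressed against the same baseline, which is how the summary reports them.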

media r/LocalLLaMA · 7d ago

Laguna M.1: 225B Parameter MoE Model for Agentic Coding

Laguna M.1 is a 225B-parameter mixture-of-experts model with 23B activated parameters per token, designed for agentic coding and long-horizon tasks. It achieves competitive performance on SWE-bench Verified (74.6%), SWE-bench Multilingual (63.1%), and Terminal-Bench 2.0 (45.8%), outperforming models like Devstral 2 and GLM-4.7 on key benchmarks.
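For a sense of how sparse the model is, the share of parameters active per token can be worked out from the two figures in the summary. A small sketch; the constant and function names are illustrative, not taken from the release:

```python
# Figures from the summary, in billions of parameters.
TOTAL_PARAMS_B = 225
ACTIVE_PARAMS_B = 23


def active_fraction(active_b, total_b):
    """Fraction of the model's parameters used for each token."""
    return active_b / total_b


frac = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
print(f"active per token: {frac:.1%}")
```

Roughly one parameter in ten takes part in each forward pass, although the full set of weights still has to be stored.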

media r/LocalLLaMA · 7d ago

GLM-5.2 Is The Best Open Weight Creative Writing Model

Sam Paech's Creative Writing Benchmark on EQ Bench ranks GLM-5.2 as the top open-weight creative writing model.

media r/LocalLLaMA · 7d ago

SLMs and Diffusion: The Future of Small, Specialized Models?

Users discuss whether small language models (SLMs) specialized for a single task can outperform larger general-purpose models on that task, citing benchmarks where 9B models match or exceed larger ones. They propose a sequential agentic workflow built from multiple specialized models, with one coordinating and others verifying answers, and suggest that diffusion models could speed up such workflows despite being less capable.

media r/LocalLLaMA · 7d ago

Llama Bench vs Real-World Performance Discrepancy

The user reports a significant gap between Llama Bench results and actual model performance. Benchmarks show 754 tok/s prefill and 36 tok/s generation, but real usage yields only 7.98 tok/s, with high latency and poor throughput. The discrepancy is attributed to real-world usage conditions rather than benchmark settings.
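One way an end-to-end rate can fall below a benchmarked generation rate is that wall-clock time also includes prompt processing. The sketch below shows that effect using the two benchmark figures from the summary. The prompt and output lengths are hypothetical, and this is not a diagnosis of the poster's setup: prefill time alone does not necessarily account for the 7.98 tok/s they observed.

```python
# Benchmark figures from the summary, in tokens per second.
PREFILL_TOK_S = 754.0
GENERATION_TOK_S = 36.0


def effective_rate(prompt_tokens, output_tokens,
                   prefill=PREFILL_TOK_S, generation=GENERATION_TOK_S):
    """Output tokens divided by total wall time (prefill plus generation)."""
    total_seconds = prompt_tokens / prefill + output_tokens / generation
    return output_tokens / total_seconds


# Hypothetical request shapes, for illustration only.
for prompt_len, output_len in [(0, 200), (2000, 200), (8000, 200)]:
    rate = effective_rate(prompt_len, output_len)
    print(f"prompt={prompt_len:5d} output={output_len} -> {rate:.1f} tok/s")
```

With an empty prompt the effective rate equals the benchmarked generation rate, and it drops as the prompt grows relative to the output.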

github llama.cpp · 8d ago

llama.cpp Release b9698 Adds Self-Update Support and Multiple Platform Binaries

llama.cpp version b9698 enables self-updates only when built with llama-install.sh. The release includes binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures and hardware acceleration options, including Vulkan, CUDA, OpenVINO, and SYCL.

github AutoGPT · 8d ago

autogpt-platform-beta-v0.6.64 Released

The autogpt-platform-beta-v0.6.64 release, dated 18th June 2026, introduces new features such as the AutoPilot Context Panel and Global Search, along with enhancements to graph saving, caching, and builder performance. It also includes security hardening, bug fixes for LLM provider issues, and UI improvements like a high-resolution touch icon.

github CrewAI · 8d ago

CrewAI v1.14.8a Releases New FlowDefinition Features

CrewAI v1.14.8a introduces script and crew actions to FlowDefinition, adds DMN mode support, and enables flow execution without Python code. It also includes experimental support for JSON-first crews and ZIP deployment fallback, along with improved memory reset and token usage tracking.

github llama.cpp · 8d ago

llama.cpp Release b9693 Adds BF16 Support and Cross-Platform Binaries

llama.cpp version b9693 introduces BF16 support in its concat kernel and provides pre-built binaries for macOS, Linux, Android, Windows, and openEuler. The release includes CPU, Vulkan, ROCm, OpenVINO, SYCL, and HIP variants across multiple architectures, with a dedicated UI package available.

media r/LocalLLaMA · 8d ago

LocalLLaMA proposes crowdsourced coding dataset

A community initiative suggests creating a crowdsourced coding dataset to enable local LLM development. The proposal aims to allow anyone with hardware to contribute data, with more powerful users helping to fine-tune or quantize models, thus reducing reliance on company-released models.