r/LocalLLaMA — korshunov.ai

The EU AI Act requires AI systems that generate synthetic text to mark their output with machine-readable, detectable watermarks, using robust, interoperable technical solutions with two layers. This applies to all AI models, including open-source ones, and extends to any service accessible to EU citizens, regardless of where the provider is located. Non-compliance risks fines of up to 35 million euros or a percentage of annual turnover, with providers of 'systemic risk' AI models facing heightened liability.

media r/LocalLLaMA · 6d ago

GLM-5.2 Outperforms GPT-5.5 in AA-Briefcase Evaluation

Artificial Analysis' new agentic knowledge work evaluation, AA-Briefcase, shows GLM-5.2 surpassing GPT-5.5 in performance. The benchmark assesses real-world task execution and reasoning capabilities in knowledge work scenarios.

media r/LocalLLaMA · 8d ago

GLM-5.2 crosses 80% on Terminal-Bench

GLM-5.2 is the first open-weights model to achieve 80% accuracy on Terminal-Bench and outperforms all other available open models. It also surpasses Gemini, positioning it as a frontier-level model at a significantly lower cost.

media r/LocalLLaMA · 9d ago

HalBench Tests 29 Open Source Models on Sycophancy and Hallucination

HalBench evaluates 29 open-source LLMs on a custom benchmark for sycophancy and hallucination. Qwen 3.6 and Gemma 4 outperform larger models, with Qwen 3.6 achieving 36.6% pushback, higher than GPT-5.4 and Gemini 3.1 Pro. Model size does not correlate with honest responses, indicating that architecture and training data matter more than parameter count.

media r/LocalLLaMA · 1d ago

Baidu Releases One-shot Long-horizon Parsing

Baidu has introduced a new parsing model called One-shot Long-horizon Parsing. The model enables efficient, long-range understanding of text with minimal training data, as demonstrated in a GitHub repository.

media r/LocalLLaMA · 2d ago

My new benchmark: how good are LLMs at simulating wetting behavior?

A new LLM micro-benchmark evaluates how well large language models can simulate solid-liquid interfaces using Surface Evolver, a 1992 tool for modeling liquid surfaces. The benchmark requires LLMs to write SE datafiles defining geometry and constraints through an iterative agentic process with objective grading, offering a niche task with real scientific relevance and sparse training data.

media r/LocalLLaMA · 2d ago

CPU-only TTS benchmark: Kokoro 82M vs Supertonic 3 vs Inflect-Nano-v1

A CPU-only text-to-speech benchmark compares Kokoro-82M, Supertonic-3, and Inflect-Nano-v1 on an Intel Xeon with 4 cores and 15.6GB RAM. Kokoro delivers the most natural sound (MOS 4.44-4.45) despite its slower speed, with the ONNX version outperforming PyTorch in real-time factor while maintaining identical quality. Supertonic-5-step achieves a balanced result at 3.2x real-time and MOS 4.37, making it the practical choice for usability and quality.
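
The "3.2x real-time" figure above is a speed multiple: seconds of audio produced per second of compute. A minimal sketch of the arithmetic follows; the function names and the sample timings are illustrative assumptions, not values from the benchmark.

```python
def realtime_speedup(audio_seconds: float, synthesis_seconds: float) -> float:
    """Seconds of audio produced per second of wall-clock synthesis (higher is faster)."""
    return audio_seconds / synthesis_seconds


def realtime_factor(audio_seconds: float, synthesis_seconds: float) -> float:
    """Conventional RTF: synthesis time divided by audio duration (lower is faster)."""
    return synthesis_seconds / audio_seconds


# Hypothetical timings: a 32-second clip synthesized in 10 seconds.
speedup = realtime_speedup(32.0, 10.0)
rtf = realtime_factor(32.0, 10.0)
```

Benchmarks differ in which of the two quantities they report, so it is worth checking the direction before comparing numbers across posts.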

media r/LocalLLaMA · 2d ago

Reusable workflows for long-running local LLMs

Hayden has developed the knot harness to manage long-running local LLM tasks. It enables reusable workflows with agent profiles, file system event monitoring, and automatic triggers, using Pi.dev as the default agent.

media r/LocalLLaMA · 2d ago

Best local models for reasoning in agentic AI

The creator of EverFern asks which local models work best for agentic workflows and browser/computer use. They note that model intelligence is rarely the bottleneck, with reliability and recovery systems being more critical than model choice.

media r/LocalLLaMA · 2d ago

Human Evaluation Shows GLM-5.2 Competes with Top Models

A human evaluation on Design Arena's leaderboard shows GLM-5.2 performing nearly as well as Fable 5 in game development tasks, placing just one step below it. The open-weights, MIT-licensed model is assessed as equivalent in capability to the best available Claude models, suggesting that standardized benchmarks may no longer accurately reflect real-world performance.

media r/LocalLLaMA · 2d ago

SFT or RL-first for Qwen 3.5 Tool Agent Training?

A user asks whether supervised fine-tuning (SFT) followed by reinforcement learning (RL) is still recommended for training Qwen 3.5 4B or 9B agents for multi-tool use, or if RL-only approaches yield better results. The post also seeks guidance on reward design and handling parallel tool execution in agent workflows.

media r/LocalLLaMA · 2d ago

Boogu-Image-0.1: Open-Source Unified Image Generation and Editing Model Series

Boogu-Image-0.1 is an Apache-2.0 licensed open-source family of unified image generation and editing models, with Base, Turbo, and Edit variants. It offers high-quality text-to-image generation, fast generation, image editing, and strong Chinese-English text rendering. Its training data is roughly one order of magnitude smaller than that of closed-source systems, yet it achieves competitive performance through improved model understanding and data quality.

media r/LocalLLaMA · 2d ago

Who needs GPUs? 64 t/s gen, 285 PP on 6-year-old CPUs

A gemma-4-26B-A4B model running CPU-only on two Xeon 6248R processors achieves 64 tokens per second for generation and 285 for prompt processing (PP), demonstrating viable performance on 6-year-old hardware. The user highlights the potential for CPU-optimized local LLMs to rival GPU-based systems, emphasizing cost efficiency and accessibility.

media r/LocalLLaMA · 2d ago

MCP servers consume context window via tool definitions

Each MCP server dumps its full tool list into the model's context before any prompt, using up to 24,000 tokens for 62 tools. A local gateway implementing lazy discovery reduces tool-definition overhead by 97%, cutting token usage from ~24k to ~660 per request, with 90% fewer total tokens over a task, without affecting task success rate.
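
The overhead figures above can be checked with simple arithmetic, and the lazy-discovery idea can be sketched as a gateway that puts only a small search stub in the context and returns full tool definitions on demand. This is a minimal sketch under stated assumptions: the class name `LazyToolGateway`, the `find_tools` stub, and the registered tools are hypothetical and are not taken from the post or from any MCP library.

```python
from dataclasses import dataclass, field


def overhead_reduction(before_tokens: int, after_tokens: int) -> float:
    """Fractional reduction in tool-definition tokens per request."""
    return 1 - after_tokens / before_tokens


@dataclass
class LazyToolGateway:
    """Hypothetical gateway: stores full tool definitions, exposes one stub up front."""

    tools: dict = field(default_factory=dict)

    def register(self, name: str, description: str, schema: dict) -> None:
        # Full definitions stay inside the gateway, outside the model's context.
        self.tools[name] = {"description": description, "schema": schema}

    def initial_context(self) -> list:
        # Only a single discovery tool is sent with every request.
        return [{"name": "find_tools", "description": "Search available tools by keyword."}]

    def find_tools(self, query: str) -> list:
        # Return full definitions only for tools matching the query.
        q = query.lower()
        return [
            {"name": name, **definition}
            for name, definition in self.tools.items()
            if q in name.lower() or q in definition["description"].lower()
        ]


# Figures from the summary above: ~24k tokens before, ~660 after.
reduction = overhead_reduction(24_000, 660)

gateway = LazyToolGateway()
gateway.register("read_file", "Read a file from disk", {"path": "string"})
gateway.register("search_web", "Search the web", {"query": "string"})
matches = gateway.find_tools("file")
```

With these inputs the reduction works out to about 97%, matching the reported figure. The trade-off in this kind of design is an extra discovery round trip before the first real tool call.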

media r/LocalLLaMA · 2d ago

Microsoft Releases Open Source FastContext for LLM Coding Agents

Microsoft has open-sourced FastContext-1.0, a lightweight repository-exploration subagent that separates code repository exploration from task solving in LLM coding agents. It uses parallel read-only tool calls to return compact file paths and line ranges, improving end-to-end accuracy and reducing token usage by up to 60.3%, with the 4B-RL model outperforming a 30B-SFT model on SWE-bench Pro.

media r/LocalLLaMA · 2d ago

Gemma 4's Potential to Outperform Mistral and Qwen3.6 Through Finetuning

Gemma 4 shows strong base performance and unique features like global MTP support, QAT, and out-of-the-box vision capabilities. While it currently lacks widespread finetunes, models like MeroMero, Equinox, and Gembrain have already demonstrated high quality, suggesting that with community effort, Gemma 4 could surpass Mistral or Qwen3.6 in specific tasks like coding and creative writing.

media r/LocalLLaMA · 2d ago

DeepSeek Raises $7.4B at $60B Valuation, Liang Wenfeng Invests $3B

DeepSeek has raised $7.4 billion in funding at a $60 billion valuation. Liang Wenfeng, the company's founder, personally invested $3 billion in the round, underscoring his significant stake and commitment to the company's growth.

media r/LocalLLaMA · 3d ago

TMax: A Simple Recipe for Terminal Agents

TMax introduces TMax-15k, a dataset of 14,600 RL environments, over 2.5× larger than the next-largest open terminal dataset. It also presents a simple RL recipe that trains open models from 2B to 27B parameters, with TMax-9B achieving 27.2% on Terminal Bench 2.0 and TMax-27B reaching 42.7%.

media r/LocalLLaMA · 3d ago

Updated Vision Model Benchmark Results and Recommendations

A revised benchmark of local vision language models evaluates 23 models across 30 images with 3 tests each, totaling 2,070 tests and 60 to 70 inference hours. The top-performing model is Qwen3.6 27B (nothink) at Q4 with a 79.6 score, followed by Qwen3.5 4B (nothink) at Q4, and Qwen3-VL 8B at Q8. Key findings include thinking mode degrading vision performance, MoE models underperforming compared to dense models, and Q8 quantization not universally improving results.

media r/LocalLLaMA · 4d ago

GLM-5.2 Beats Gemini and GPT-5.4 in Coding but Is Inefficient

GLM-5.2 surpasses GPT-5.4 and the entire Gemini lineup in coding performance on the DeepSWE benchmark. However, it requires significantly more output tokens, making it substantially less efficient in terms of cost-per-task compared to models like GPT-5.5 and Claude Opus 4.8.

EU AI Act mandates AI-generated text watermarking from August 2024
