r/LocalLLaMA — korshunov.ai

EU AI Act mandates AI-generated text watermarking from August 2024

The EU AI Act requires all AI systems generating synthetic text to include machine-readable, detectable watermarks using robust, interoperable technical solutions with two layers. This applies to all AI models, including open-source ones, and extends to any service accessible by EU citizens, regardless of location. Non-compliance risks fines of up to 35 million euros or a percentage of worldwide annual turnover, with providers of 'systemic risk' AI models facing heightened liability.

media r/LocalLLaMA · 6d ago

GLM-5.2 Outperforms GPT-5.5 in AA-Briefcase Evaluation

Artificial Analysis' new agentic knowledge work evaluation, AA-Briefcase, shows GLM-5.2 surpassing GPT-5.5 in performance. The benchmark assesses real-world task execution and reasoning capabilities in knowledge work scenarios.

media r/LocalLLaMA · 8d ago

GLM-5.2 crosses 80% on Terminal-Bench

GLM-5.2 is the first open-weights model to achieve 80% accuracy on Terminal-Bench and outperforms all other available open models. It also surpasses Gemini, positioning it as a frontier-level model at a significantly lower cost.

media r/LocalLLaMA · 9d ago

HalBench Tests 29 Open Source Models on Sycophancy and Hallucination

HalBench evaluates 29 open-source LLMs on a custom benchmark for sycophancy and hallucination. Qwen 3.6 and Gemma 4 outperform larger models, with Qwen 3.6 achieving a 36.6% pushback rate, higher than GPT-5.4 and Gemini 3.1 Pro. Model size does not correlate with honest responses, which suggests that architecture and training data matter more than parameter count.

media r/LocalLLaMA · 12h ago

OpenAI and Broadcom Unveil LLM-Optimized Inference Chip

Early testing shows the first-generation chip delivers significantly better performance per watt than current state-of-the-art solutions. Built from the ground up for current and future large language models, the chip expands OpenAI's full-stack platform and will be deployed at gigawatt scale with data center partners across multiple generations.

media r/LocalLLaMA · 16h ago

Qwen-AgentWorld-35B-A3B for Coding?

The Qwen-AgentWorld-35B-A3B model shows strong performance in coding tasks, scoring 65.63% on Software Writing Evaluation and 65.92% on the overall benchmark. It outperforms Qwen3.5-35B-A3B and rivals larger models in agent-based tasks, and the poster's first impression is that it is more accurate in long-running agent workflows.

media r/LocalLLaMA · 16h ago

Gemma 4 26BA4B Surprisingly Usable at IQ3_S

A user reports that Gemma 4 26B quantized to Q3 runs at 25 tokens per second on a MacBook Air, performing nearly as well as bf16 for non-coding, tool-calling tasks. They question whether this performance reflects confirmation bias or if small quantized models are genuinely usable.

media r/LocalLLaMA · 17h ago

Baidu's Unlimited-OCR Transcribes Dozens of Pages in One Forward Pass

Baidu has released Unlimited-OCR, a model that transcribes dozens of pages in a single forward pass using Reference Sliding Window Attention (R-SWA). It builds on DeepSeek-OCR, inheriting its encoder, image compression, and MoE architecture, with only 500M active parameters per token. The model reports 93.92% accuracy on OmniDocBench v1.6, against DeepSeek-OCR's 87.01% on v1.5. The two scores come from different benchmark versions, and the vendor-reported results warrant independent validation.

media r/LocalLLaMA · 17h ago

Qwen3.6 27B more dumb in vLLM compared to llama.cpp

A user reports that Qwen3.6-27B produces noticeably worse output in vLLM than in llama.cpp: it ignores messages, hallucinates tool calls, and fails to recognize prior conversation context. Despite a correct configuration and prompt template, the model loses coherence and misinterprets its own tool usage, and the errors occur consistently rather than sporadically.

media r/LocalLLaMA · 17h ago

KaLM-Reranker-V1: Fast and Efficient Document Reranking

KaLM-Reranker-V1 is a fast reranker that is not a late-interaction model: it decouples query and passage computation while still modeling relevance through cross-attention. It reports state-of-the-art performance on BEIR, outperforms industrial models such as Qwen3-Reranker, and posts strong results on MIRACL and LMEB. The 0.27B Nano model remains competitive with 7-12B models.

media r/LocalLLaMA · 21h ago

Qwen releases 35B-parameter MoE for agent environment simulation

Qwen has launched Qwen-AgentWorld-35B-A3B, a 35B-parameter MoE model with only about 3B active parameters per token. It is trained to simulate responses from MCP, terminal, software engineering, Android, web, and OS GUI environments by predicting next observations after agent actions, enabling efficient agent training and environment simulation without real tool execution.

media r/LocalLLaMA · 1d ago

Mimo 2.5 is fast at large context on dual RTX Pro 6000

Mimo 2.5 maintains fast performance at large context lengths on dual RTX Pro 6000 cards using a 5-to-1 local/global sliding-window attention mechanism, similar to Gemma 3. It completes tasks in about 4 minutes, significantly faster than MiniMax M3, which takes around 40 minutes, despite both models having similar quality under VRAM limits.
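To make the 5-to-1 local/global idea concrete, here is a minimal sketch of why such a layout keeps long-context cost down. It assumes that every sixth layer is global and the rest use a fixed sliding window; the layer count, window size, and context length are hypothetical and not taken from Mimo 2.5 or Gemma 3.

```python
def layer_pattern(num_layers, local_per_global=5):
    """Label each layer 'local' or 'global'.

    Assumption: each group of (local_per_global + 1) layers ends with
    one global layer. Real models may order the layers differently.
    """
    period = local_per_global + 1
    return ["global" if (i + 1) % period == 0 else "local"
            for i in range(num_layers)]


def cached_tokens(pattern, context_len, window):
    """Count token positions kept in the KV cache, summed over layers.

    Global layers keep the whole context; local layers keep at most
    `window` positions.
    """
    return sum(context_len if kind == "global" else min(context_len, window)
               for kind in pattern)


# Hypothetical numbers, for illustration only.
pattern = layer_pattern(num_layers=12)
mixed = cached_tokens(pattern, context_len=100_000, window=1_024)
full = cached_tokens(["global"] * 12, context_len=100_000, window=1_024)
print(pattern.count("global"), "global layers out of", len(pattern))
print(f"cache positions: {mixed} mixed vs {full} full attention")
```

With these made-up values the mixed layout keeps well under a fifth of the cache positions that full attention would, which is the kind of saving that helps at large context lengths.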

media r/LocalLLaMA · 1d ago

650+ Apache-2.0 biomedical NER/de-ID models run 30-40x faster on Apple Silicon

A new open-source project offers 650+ Apache-2.0 licensed biomedical NER and de-identification models that run on-device via MLX. On a 3-year-old MacBook Pro with M3 Max, clinical NER models achieve 30-40x speedups over PyTorch-CPU with identical fp32 outputs and entity results, due to architectural efficiency on Apple Silicon. The models, including 434M biomedical NER and PII de-ID, are publicly available on Hugging Face and GitHub, with full reproducibility provided in code and methodology.

media r/LocalLLaMA · 1d ago

MiniMax 2.7 Runs on 47TG 1200PP with 96GB VRAM

MiniMax 2.7 runs at 47 tokens per second for token generation (TG) and 1,200 tokens per second for prompt processing (PP) on a system with 96GB of VRAM and 192GB of DDR5 RAM, built on an MSI B840 board and a 9900X CPU. It runs as an agent-class model with strong instruction following and tool calling, supported by a round-robin loop with three CPU-based sequencing agents and a dense 12B model that monitors for errors.

media r/LocalLLaMA · 1d ago

Tmax-27B Terminal Agent for Small GPUs with DPPO Training

Tmax-27B is a terminal agent based on Qwen3.6-27B, trained with DPPO (RL), achieving 43% on Terminal Bench 2.0 and 69% on TB Lite. To run on consumer GPUs, it is quantized using importance-matrix-calibrated GGUF models from 2 to 5 bits per weight, with a grafted MTP head enabling speculative decoding. IQ2_XS at 8.5 GiB achieves 70% pass rate in agentic coding tasks, outperforming plain quantization and demonstrating stable tool-call generation.
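As a rough sanity check on the quantization figures, the effective bits per weight of a file can be estimated from its size and parameter count. The sketch below assumes exactly 27 billion parameters (taken from the model name, not confirmed by the post) and treats the whole 8.5 GiB file as weights, so metadata and any tensors kept at higher precision inflate the result.

```python
GIB = 2 ** 30  # bytes per GiB


def effective_bits_per_weight(file_size_gib, num_params):
    """Average bits stored per parameter, counting the whole file."""
    return file_size_gib * GIB * 8 / num_params


# 27e9 is an assumed parameter count based on the "27B" in the name.
bpw = effective_bits_per_weight(file_size_gib=8.5, num_params=27e9)
print(f"about {bpw:.2f} bits per weight")
```

Under these assumptions the result lands inside the 2 to 5 bits per weight range mentioned in the post.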

media r/LocalLLaMA · 1d ago

GLM 5.2 on Mac Studio Speedup PR

A pull request by the oMLX creator raises GLM 5.2 prefill speed on Mac Studio to over 100 t/s at higher context lengths. The update also reduces memory usage, enabling 4-bit quantized models to handle over 100k context tokens efficiently.

media r/LocalLLaMA · 1d ago

LLM Medical Scribing Benchmark: Omissions Outnumber Hallucinations

A benchmark of 8 LLMs on 300 synthetic doctor-patient dialogues found 12 high-impact hallucinations and 520 clinically relevant omissions, so omissions far outnumbered hallucinations. DeepSeek excelled in prose and cost but missed many safety facts, while Claude Opus had the fewest omissions but poorer prose quality.

media r/LocalLLaMA · 1d ago

7 Chinese companies shipping H100/H200-class AI chips, most IPO'd in last 6 months

At least seven Chinese companies are now shipping H100/H200-class AI accelerators, with most having gone public within the last six months. Huawei alone shipped 812,000 AI cards last year, accounting for 49% of China's domestic supply, and its Ascend 950 is reportedly targeted at H200-class performance. Several of these firms were founded by former NVIDIA and AMD GPU leaders, including MetaX, which saw revenue grow 3,800x in three years. Separately, Alibaba launched a server with 1.5TB of VRAM for on-premises frontier model deployment.

media r/LocalLLaMA · 1d ago

VibeThinker: 3B-parameter model beats Opus 4.5 in reasoning

VibeThinker, a 3-billion-parameter language model, outperforms Opus 4.5 in reasoning tasks using a novel SFT+GRPO training approach. The model was introduced in a paper available on arXiv, with details shared in a Reddit post.

media r/LocalLLaMA · 1d ago

KLD Analysis of KV Cache Quantization for Qwen3.6-35B-A3B and Gemma4-E2B QAT

A detailed analysis maps the KLD (Kullback-Leibler divergence) of KV cache quantization for Qwen3.6-35B-A3B and Gemma4-E2B models. Results show q8/q8 quantization is nearly lossless on both models, while q4/q4 performs well on Qwen but causes severe degradation on Gemma. Turbo quantization variants show mixed performance, with turbo3 and turbo2 enabling extreme cache compression at significant accuracy cost.
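For readers unfamiliar with the metric, the sketch below shows one common way a mean KLD between a reference run and a quantized-cache run can be computed from per-token logits. It is an illustration only: the logits are made up, the function names are hypothetical, and the post's actual measurement tooling is not described here.

```python
import math


def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; p is the reference distribution."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0)


def mean_kld(ref_logits, test_logits):
    """Average KL divergence over token positions."""
    klds = [kl_divergence(softmax(r), softmax(t))
            for r, t in zip(ref_logits, test_logits)]
    return sum(klds) / len(klds)


# Made-up logits for two token positions over a 3-token vocabulary.
reference = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]]
quantized = [[1.9, 1.1, 0.1], [0.4, 2.4, 0.6]]
print(f"mean KLD: {mean_kld(reference, quantized):.6f} nats")
```

A value near zero means the quantized cache leaves the next-token distribution almost unchanged, which is the sense in which q8/q8 is described as nearly lossless.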