All articles — korshunov.ai

GLM-5.2 Outperforms GPT-5.5 in AA-Briefcase Evaluation

Artificial Analysis' new agentic knowledge work evaluation, AA-Briefcase, shows GLM-5.2 surpassing GPT-5.5 in performance. The benchmark assesses real-world task execution and reasoning capabilities in knowledge work scenarios.

blog Simon Willison · 13d ago

datasette-apps 0.1a2 Release Notes

datasette-apps 0.1a2 introduces a new apps-set-csp permission to guard custom network and CSP origins, with an optional allow-list for non-privileged users. The release also improves keyboard navigation in the stored query picker and fixes issues with link confirmation and logging panels in full-screen mode.

blog Simon Willison · 13d ago

datasette-apps 0.1a3 Release

datasette-apps 0.1a3 fixes a bug allowing users without create-app permission to create apps. It also resolves an issue where non-owners could edit private apps, aligning edit and delete permissions with view permissions.

blog Simon Willison · 13d ago

datasette-acl 0.6a0 Release

datasette-acl 0.6a0 expands permissions from table-level to general resource sharing. The plugin enables multi-user Datasette instances to grant fine-grained access control over resources.

blog Simon Willison · 13d ago

Datasette Launches Apps Plugin for Custom HTML Applications

Datasette has released a new plugin, datasette-apps, enabling self-contained HTML+JavaScript applications to run in a secure iframe sandbox. These apps can execute read-only or write SQL queries against Datasette databases, with built-in security features like CSP headers and sandbox restrictions to prevent data exfiltration or unauthorized access.

media r/LocalLLaMA · 13d ago

GLM-5.2 (744B, 2-bit) achieves 7.3 tok/s on 4×3090 with 192GB RAM

GLM-5.2 UD-IQ2_M runs at ~7.3 tokens per second on 4×RTX 3090s with 192GB DDR5 RAM using llama.cpp expert offload. Reducing quantization from IQ2 to IQ1 provided no speed gain, while increasing CPU threads from 6 to 12 improved performance by 22%. Decode is limited by CPU compute, not memory bandwidth, and the offloaded experts must be explicitly distributed across GPUs to avoid out-of-memory errors.

media r/LocalLLaMA · 13d ago

LQ50/LQ50-24GB costs around $1200

A user reported finding the LQ50 and LQ50-24GB on Taobao for approximately $1200 and noted that both are expensive.

media r/LocalLLaMA · 13d ago

DiffusionGemma 26B on 4090 reaches 475 t/s with limitations

DiffusionGemma 26B reaches 475 t/s on a 4090 via vLLM with INT4 AWQ quantization, with throughput ranging from 290 t/s to 700 t/s depending on output length. However, it is limited to single-user operation and shows lower response accuracy, rapid context loss, and slower time-to-first-token than standard 26B models.

media r/LocalLLaMA · 13d ago

What's the best open speech to text today?

A user is seeking recommendations for real-time speech-to-text tools with diarization capabilities, asking about alternatives to Wispr Flow and MacParakeet, which uses Parakeet and Whisper models. They inquire whether newer models have emerged to support real-time performance.

media r/LocalLLaMA · 13d ago

Running GLM-5.2 on CPU Only with Local Setup

A user runs GLM-5.2 locally on a Dell PowerEdge R740 with dual Xeon 6248R CPUs and 768GB RAM, using ik_llama.cpp for improved CPU inference. After isolating one NUMA node for optimal performance, they achieve 4–5.5 tokens per second in chat and about 3 tokens per second in coding tasks, noting the model shows 'frontier vibes' during code generation despite limited usability on this hardware.

github LangGraph · 13d ago

langgraph releases version 1.2.6

LangGraph 1.2.6 fixes a regression in which nested subgraphs incorrectly inherited the parent's checkpoint_ns. The update also improves cancellation of running subgraphs when a stream is aborted and bumps the CLI to version 0.4.30.

media r/LocalLLaMA · 13d ago

Repurposing an Old Multi-GPU Node for Local Inference

The node features 8 NVIDIA Quadro RTX 6000 GPUs with 192 GB VRAM and 512 GB RAM, enabling large-scale local AI model inference. Models like LLaMA-3 or Mistral with 8-13 billion parameters could run efficiently here, offering faster, private, and low-latency performance compared to single-GPU setups, making it worthwhile for internal use.

github CrewAI · 13d ago

v1.14.8a1 Release Notes

Version 1.14.8a1 adds an optional if expression to each.do steps and fixes JSON crew issues. The snapshot and changelog for v1.14.8a have been updated. Contributors include @joaomdmoura and @vinibrsl.

media r/LocalLLaMA · 13d ago

Local Qwen isn't a worse Opus, it's a different tool

The post argues that a local Qwen model is not an inferior Opus but a tool with a different purpose. Each model is designed for specific use cases, and comparing them directly overlooks their distinct capabilities and intended applications.

media r/LocalLLaMA · 13d ago

Calibrating 2-bit GGUFs for agentic coding tasks

2-bit quantized versions of Qwopus3.6-27B-Coder, calibrated on real agentic coding logs, achieve a 63% pass rate on SWE-rebench. The IQ2_M quant outperforms non-calibrated versions and rivals Q5_K_M in pass rate despite being half the size, with improved robustness to loops and faster decoding due to a bundled MTP.

media Latent Space · 13d ago

Why AI Scaling Is a Systems Problem, Not Just a GPU Race

The AI scaling debate overlooks that maximizing model FLOP utilization (MFU) matters more than buying more GPUs. Frontier labs such as xAI operate at sub-10% MFU, while historical models achieved 21% to 70% MFU, which points to systemic inefficiencies in scheduling, networking, and cluster management. Anjney Midha argues that AI infrastructure must evolve into efficient, aligned, and responsible systems, with 'output maxing' emerging as a new discipline for frontier AI.
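
For readers unfamiliar with the metric, here is a minimal sketch of how MFU is commonly estimated for training runs. It assumes the widely used approximation of about 6 FLOPs per parameter per token (forward plus backward pass, ignoring attention overhead). The function name and all numbers in the example are hypothetical and are not taken from the article.

```python
def model_flops_utilization(tokens_per_second, n_params, n_gpus, peak_flops_per_gpu):
    """Return MFU as a fraction, under the ~6 * N FLOPs-per-token approximation."""
    # Assumption: training costs roughly 6 FLOPs per parameter per token.
    achieved_flops_per_second = 6.0 * n_params * tokens_per_second
    # Theoretical peak of the whole cluster.
    peak_flops_per_second = n_gpus * peak_flops_per_gpu
    return achieved_flops_per_second / peak_flops_per_second


# Hypothetical cluster: every value below is illustrative only.
mfu = model_flops_utilization(
    tokens_per_second=1.0e6,     # observed training throughput
    n_params=70e9,               # model size in parameters
    n_gpus=1024,                 # accelerators in the job
    peak_flops_per_gpu=989e12,   # vendor-quoted peak per accelerator
)
print(f"MFU: {mfu:.1%}")
```

Because the denominator is the quoted hardware peak, time lost to scheduling gaps, communication stalls, or restarts lowers MFU even when each GPU is fast, which is the systems argument the summary describes.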

media r/LocalLLaMA · 13d ago

North Mini Code: 4-bit quant, Ollama, and OpenRouter support

Cohere Labs has released a 4-bit quantized version of North Mini Code on Hugging Face, reducing its size to approximately 20GB for local execution on devices like Macs. The model is now supported in Ollama, local runtimes based on llama.cpp, and via the OpenRouter API, improving accessibility for developers.

media r/LocalLLaMA · 13d ago

LFM2.5-Embedding-350M and LFM2.5-ColBERT-350M Released

LFM2.5-Embedding-350M is a dense bi-encoder that provides fast multilingual retrieval with one vector per document, achieving best-in-class accuracy for its size and inference speed comparable to smaller models. LFM2.5-ColBERT-350M is a late interaction retriever with best-in-class multilingual accuracy, enabling cross-lingual retrieval by storing one vector per token and supporting retrieval in multiple languages with high precision. Both models are designed as drop-in replacements for existing RAG pipelines.
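
The difference between the two retrieval styles can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the LFM2.5 implementation: the helper names are hypothetical, the vectors are hand-written rather than model outputs, and the late-interaction score uses the standard ColBERT-style MaxSim sum.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def dense_score(query_vec, doc_vec):
    """Bi-encoder scoring: one vector per document, cosine similarity."""
    return dot(normalize(query_vec), normalize(doc_vec))


def late_interaction_score(query_tokens, doc_tokens):
    """ColBERT-style MaxSim: one vector per token.

    For each query token, take its best-matching document token,
    then sum those maxima over the query tokens.
    """
    q = [normalize(v) for v in query_tokens]
    d = [normalize(v) for v in doc_tokens]
    return sum(max(dot(qv, dv) for dv in d) for qv in q)


# Toy 2-dimensional vectors, purely illustrative.
print(dense_score([1.0, 0.0], [1.0, 1.0]))
print(late_interaction_score([[1.0, 0.0], [0.0, 1.0]],
                             [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
```

The trade-off follows from the storage layout: the dense model keeps one vector per document, so the index is small and lookups are fast, while the late-interaction model keeps one vector per token, which costs more storage but lets individual query terms match individual document terms.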

media r/LocalLLaMA · 13d ago

Real-world token cost savings from rtk, headroom, and caveman

A real workload analysis shows headroom, rtk, and caveman reduce token costs by 2.8%, 0.5%, and 0.4% respectively, totaling 3.7% of baseline spending. However, savings are limited by payload diversity, with most traffic being plain text or source code, and the tools only compress structured outputs. Most cost reduction occurs on the cheapest token stream—cache reads—while the tools do not affect prompt caching or output costs, and coverage gaps exist, especially for rtk.

media r/LocalLLaMA · 13d ago

Laguna M.1: 225B Parameter MoE Model for Agentic Coding

Laguna M.1 is a 225B-parameter mixture-of-experts model with 23B activated parameters per token, designed for agentic coding and long-horizon tasks. It achieves competitive performance on SWE-bench Verified (74.6%), SWE-bench Multilingual (63.1%), and Terminal-Bench 2.0 (45.8%), outperforming models like Devstral 2 and GLM-4.7 on key benchmarks.