Code generation — korshunov.ai

Qwen releases 35B-parameter MoE for agent environment simulation

Qwen has launched Qwen-AgentWorld-35B-A3B, a 35B-parameter MoE model with only about 3B active parameters per token. It is trained to simulate responses from MCP, terminal, software engineering, Android, web, and OS GUI environments by predicting next observations after agent actions, enabling efficient agent training and environment simulation without real tool execution.

arxiv arXiv cs.CL · 1d ago

SHERLOC: Structured Diagnostic Localization for Code Repair Agents

SHERLOC introduces a training-free framework that pairs a reasoning LLM with compact repository tools and self-recovery. It achieves state-of-the-art localization accuracy and recall on SWE-Bench, improving repair agents' resolve rate by 5.95 percentage points while reducing localization and total token usage by 36.7% and 23.1% respectively.

arxiv arXiv cs.CL · 1d ago

Match Task to Objective Framework for Encoder-Decoder Models

This study introduces the Match Task to Objective (MTO) framework to align pre-training and fine-tuning objectives with specific tasks. The framework enables automated, unsupervised data adaptation and delivers performance gains of over 120% in few-shot settings, outperforming baselines in both few-shot and full-dataset scenarios. It also enhances prompt-tuning by providing effective soft prompt engineering guidance.

github OpenAI Agents SDK · 1d ago

Release of openai-agents-python v0.17.7

Version 0.17.7 of the openai-agents-python library includes new features such as configurable WebSocket max size and buffered Chat Completions tool-call streaming. It also contains multiple fixes for issues including sandbox buffering, error handling, and tool dispatch, along with documentation updates and improved error messaging.

arxiv arXiv cs.CL · 1d ago

Metis: Bridging Text and Code Memory for Self-Evolving Agents

Metis introduces a hierarchical dual-representation memory that combines text and code memory to improve self-evolving agents. It organizes experience into execution plans, facts, and pitfalls, crystallizing reusable plans into validated tools only when justified. Evaluated on AppWorld, Metis achieves up to 20.6% higher task accuracy and 22.8% lower execution cost than ReAct, with better overall balance across accuracy, efficiency, and memory cost.

arxiv arXiv cs.CL · 1d ago

Bayesian Control for Coding Agents

Bayesian control improves tool-use decisions in coding agents by modeling uncertainty and dynamically choosing actions. It outperforms fixed-rule orchestrators, especially when verification is costly and critics provide informative but imperfect feedback. The method also produces a more interpretable correctness score than token-probability or raw tool-success metrics.
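
As a rough illustration of the idea, and not the paper's actual algorithm, the sketch below updates a belief that a patch is correct from an imperfect critic verdict, then picks the cheaper of submitting or verifying. The critic's true-positive and false-positive rates, both costs, and the function names are hypothetical values chosen for the example, and verification is assumed to be perfect.

```python
def posterior_correct(prior, critic_says_ok, tpr=0.8, fpr=0.3):
    """Bayes update of P(patch is correct) after one critic verdict.

    tpr: P(critic says ok | patch correct)    (assumed value)
    fpr: P(critic says ok | patch incorrect)  (assumed value)
    """
    if critic_says_ok:
        likelihood_correct, likelihood_wrong = tpr, fpr
    else:
        likelihood_correct, likelihood_wrong = 1 - tpr, 1 - fpr
    numerator = likelihood_correct * prior
    return numerator / (numerator + likelihood_wrong * (1 - prior))


def choose_action(p_correct, verify_cost=0.2, wrong_cost=1.0):
    """Pick the action with the lower expected cost.

    Submitting risks shipping a wrong patch; verifying always costs
    verify_cost (both costs are hypothetical).
    """
    expected_submit_cost = (1 - p_correct) * wrong_cost
    return "submit" if expected_submit_cost <= verify_cost else "verify"


belief = 0.5                                  # uninformed prior
belief = posterior_correct(belief, True)      # first "ok" verdict
first_action = choose_action(belief)
belief = posterior_correct(belief, True)      # second "ok" verdict
second_action = choose_action(belief)
print(first_action, second_action, round(belief, 3))
```

With these assumed numbers, one positive verdict is not enough to skip verification, while two are. The posterior itself doubles as a correctness score that is easier to interpret than a raw token probability.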

arxiv arXiv cs.CL · 1d ago

NatureBench Evaluates AI Coding Agents' Scientific Discovery Capabilities

NatureBench presents a benchmark of 90 tasks from Nature-family papers to assess AI coding agents' ability to achieve scientific discovery. Under a web-search-disabled protocol, the top model exceeds prior state-of-the-art on only 17.8% of tasks. Agents primarily succeed by translating scientific problems into supervised learning tasks, not through original scientific invention.

github CrewAI · 1d ago

CrewAI 1.14.8a3 Release Notes

CrewAI 1.14.8a3 introduces unified declarative flow loading and improved startup UX for crew runs. It consolidates crewai run and flow kickoff commands, adds declarative Flow CLI support, and enables @router() as a flow start method with typed output schemas for tools.

media r/LocalLLaMA · 1d ago

Mimo 2.5 is fast at large context on dual RTX Pro 6000

Mimo 2.5 maintains fast performance at large context lengths on dual RTX Pro 6000 cards using a 5-to-1 local/global sliding-window attention mechanism, similar to Gemma 3. It completes tasks in about 4 minutes, significantly faster than MiniMax M3, which takes around 40 minutes, despite both models having similar quality under VRAM limits.
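
A minimal sketch of why a 5-to-1 local/global pattern helps at long context: only every sixth layer keeps a full-length KV cache, while the others keep a fixed window. The window size, the layer count, and the function names are hypothetical; the post does not state them.

```python
def layer_schedule(num_layers, local_per_global=5):
    """Label each layer: local_per_global local layers, then one global."""
    period = local_per_global + 1
    return ["global" if (i + 1) % period == 0 else "local"
            for i in range(num_layers)]


def can_attend(query_pos, key_pos, kind, window=1024):
    """Causal attention rule for one query/key pair in a given layer kind."""
    if key_pos > query_pos:
        return False                      # never attend to the future
    if kind == "global":
        return True                       # global layers see the whole prefix
    return query_pos - key_pos < window   # local layers see a sliding window


def kv_cache_entries(context_len, schedule, window=1024):
    """Token positions held in the KV cache, summed over all layers."""
    return sum(context_len if kind == "global" else min(context_len, window)
               for kind in schedule)


schedule = layer_schedule(12)             # hypothetical 12-layer model
mixed = kv_cache_entries(100_000, schedule)
full = kv_cache_entries(100_000, ["global"] * 12)
print(schedule.count("global"), mixed, full)
```

Under these assumptions the mixed schedule stores 210,240 cached positions at a 100,000-token context, against 1,200,000 for all-global attention.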

blog Simon Willison · 1d ago

datasette 1.0a35 releases new table creation and alteration features

Datasette 1.0a35 introduces a new "Create table" interface with support for defining columns, constraints, and foreign keys via its JSON API. It also adds an "Alter table" action that allows modifying existing tables, including column changes, type adjustments, and dropping columns or tables, with a stable template context API for custom templates until Datasette 2.0.

arxiv arXiv cs.AI · 1d ago

LLMs Benchmarked for Web Vulnerability Detection

A study evaluates six LLMs on detecting real-world web vulnerabilities in WordPress plugins and finds that detection rates vary by model and prompt design. Claude Opus 4.6 achieved the highest detection rate at 63%, while Qwen 3.5 reached only 35%. No model consistently identified all baseline vulnerabilities across iterations.

media r/LocalLLaMA · 1d ago

650+ Apache-2.0 biomedical NER/de-ID models run 30-40x faster on Apple Silicon

A new open-source project offers 650+ Apache-2.0 licensed biomedical NER and de-identification models that run on-device via MLX. On a 3-year-old MacBook Pro with M3 Max, clinical NER models achieve 30-40x speedups over PyTorch-CPU with identical fp32 outputs and entity results, due to architectural efficiency on Apple Silicon. The models, including 434M biomedical NER and PII de-ID, are publicly available on Hugging Face and GitHub, with full reproducibility provided in code and methodology.

arxiv arXiv cs.AI · 1d ago

CORTIS: Text-Only Adaptation of Spoken Language Models

CORTIS enables task-oriented voice agents to generate structured speech outputs by fine-tuning spoken language models using only text-form task supervision. It outperforms ASR-LLM cascades under acoustic degradation, especially in preserving high-level task semantics, without requiring paired speech-target annotations during training.

arxiv arXiv cs.AI · 1d ago

Benchmark Evaluation of Small Language Models for Arabic NLP

A benchmark of 240 Arabic test items across eight domains and ten skills assesses twelve small language models in zero-shot settings. Gemma 3 (12B) achieved the highest overall score (4.548/5), followed by Aya and C4AI Command Arabic, with performance linked more to Arabic alignment and instruction-following than model size. Common failure modes include prompt leakage, hallucination, and weak task adherence.

media r/LocalLLaMA · 1d ago

MiniMax 2.7 Runs on 47TG 1200PP with 96GB VRAM

MiniMax 2.7 runs at about 47 tokens/s token generation (TG) and 1200 tokens/s prompt processing (PP) on a system with 96GB VRAM and 192GB DDR5 RAM, using an MSI B840 board and a 9900X CPU. It runs as an agent-class model with strong instruction following and tool calling, supported by a round-robin loop with three CPU-based sequencing agents and a dense 12B model that monitors for errors.

lab Claude Code Releases · 1d ago

Claude Code v2.1.187 Release Notes

Claude Code v2.1.187 introduces sandbox credentials blocking, org-configured model restrictions, and mouse click support in fullscreen, along with fixes for command failures, tool hangs, and UI stability. Updates also improve structured output handling, agent depth tracking, and plugin management, with enhancements to VSCode and terminal compatibility.

media r/LocalLLaMA · 1d ago

Tmax-27B Terminal Agent for Small GPUs with DPPO Training

Tmax-27B is a terminal agent based on Qwen3.6-27B and trained with DPPO (an RL method), achieving 43% on Terminal Bench 2.0 and 69% on TB Lite. To run on consumer GPUs, it is released as importance-matrix-calibrated GGUF quantizations ranging from 2 to 5 bits per weight, with a grafted MTP head that enables speculative decoding. The IQ2_XS variant, at 8.5 GiB, achieves a 70% pass rate on agentic coding tasks, outperforming plain quantization and demonstrating stable tool-call generation.
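
The relation between file size and bits per weight can be checked with simple arithmetic. The sketch below assumes exactly 27 billion weights (taken from the model name, not stated in the post) and ignores file metadata, so the result is an average over all tensors rather than the nominal rate of any one quantization type.

```python
GIB = 1024 ** 3          # bytes per GiB
NUM_WEIGHTS = 27e9       # assumed from the "27B" in the model name


def bits_per_weight(file_size_gib, num_weights=NUM_WEIGHTS):
    """Average bits stored per weight for a file of the given size."""
    return file_size_gib * GIB * 8 / num_weights


def file_size_gib(bpw, num_weights=NUM_WEIGHTS):
    """Approximate file size in GiB for a given average bits per weight."""
    return num_weights * bpw / 8 / GIB


avg_bpw = bits_per_weight(8.5)   # the 8.5 GiB IQ2_XS file from the post
five_bit = file_size_gib(5.0)    # upper end of the stated 2 to 5 bit range
print(f"{avg_bpw:.2f} bits/weight, {five_bit:.1f} GiB at 5 bits/weight")
```

Under these assumptions the 8.5 GiB file averages about 2.7 bits per weight, and a 5-bit variant would need roughly 15.7 GiB.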

blog Simon Willison · 2d ago

OPFS + Pyodide test harness for browser-based SQLite editing

A test harness has been developed to explore using OPFS (Origin Private File System) with Pyodide to enable browser-based editing of persistent SQLite files. The tool is designed to test Datasette Lite's capability to modify local SQLite databases directly in the browser across different browsers.

media r/LocalLLaMA · 2d ago

New Qwen-27B IQ4_KS and IQ4_KS_KT Quantizations for ik_llama.cpp

Two new GGUF quantizations for Qwen-27B have been released for ik_llama.cpp, optimized for 16GB VRAM on NVIDIA GPUs. The first, Qwen3.6-27B.i1-IQ4_KS-attn_qkv-IQ4_KS.gguf, improves logical reasoning at the cost of general knowledge, with a perplexity of 7.4131. The second, Qwen3.6-27B.i1-IQ4_KS_KT-attn_qkv-IQ4_KS.gguf, applies Trellis quantization (iq4_kt) selectively to tensors with near-Gaussian distributions, achieving a perplexity of 7.4091, showing minimal performance degradation.

media r/LocalLLaMA · 2d ago

Can GLM5.2 be run on 4x AMD EPYC servers with 512GB RAM each?

The user asks if a 467GB GLM 5.2 model can be run on four servers, each with 512GB RAM and 409.6 GB/s memory bandwidth, using CPU-only inference with Unsloth. They consider splitting the model across nodes for token speed or using 8-bit versions in dual clusters to handle larger models and improve performance.
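
A back-of-the-envelope estimate helps frame the question. The sketch below assumes decoding is purely memory-bandwidth-bound, treats GB as decimal gigabytes, and ignores network latency, KV-cache reads, and compute. The `active_fraction` parameter is a hypothetical knob for a mixture-of-experts model that reads only part of its weights per token; the post does not give that figure.

```python
def decode_tps(model_gb, bandwidth_gb_s, active_fraction=1.0):
    """Upper bound on tokens/s for one node that holds the whole model."""
    gb_read_per_token = model_gb * active_fraction
    return bandwidth_gb_s / gb_read_per_token


def pipeline_tps(model_gb, bandwidth_gb_s, nodes, active_fraction=1.0):
    """Single-stream bound when layers are split across nodes in sequence.

    Each node reads only its share, but the nodes run one after another,
    so the per-token time is the sum of the per-node times.
    """
    per_node_seconds = (model_gb * active_fraction / nodes) / bandwidth_gb_s
    return 1.0 / (per_node_seconds * nodes)


def tensor_parallel_tps(model_gb, bandwidth_gb_s, nodes, active_fraction=1.0):
    """Idealized bound when every node reads its share at the same time."""
    return nodes * decode_tps(model_gb, bandwidth_gb_s, active_fraction)


MODEL_GB, BANDWIDTH_GB_S, NODES = 467.0, 409.6, 4   # figures from the post
single = decode_tps(MODEL_GB, BANDWIDTH_GB_S)
pipelined = pipeline_tps(MODEL_GB, BANDWIDTH_GB_S, NODES)
parallel = tensor_parallel_tps(MODEL_GB, BANDWIDTH_GB_S, NODES)
print(f"{single:.2f} {pipelined:.2f} {parallel:.2f} tokens/s")
```

Under these assumptions a layer-wise split does not speed up a single stream (about 0.88 tokens/s either way), while ideal tensor parallelism would approach 3.5 tokens/s before interconnect overhead.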