Code generation — korshunov.ai

Donate your coding sessions to an open CC-BY-4.0 dataset

A project called Trace Commons invites users to donate their coding session traces to an open dataset licensed under CC-BY-4.0. The initiative aims to provide training data for open-weight and open-source AI models, countering potential data monopolies by Anthropic and OpenAI.

media r/LocalLLaMA · 9d ago

AeroLLM: Fast, open-source AI app for Apple Silicon

AeroLLM is a fast, optimised, open-source chat application for Apple Silicon devices, built on the MLX backend. It supports local AI tasks such as text-to-speech, speech-to-text, and large language models, with models downloaded directly from Hugging Face based on available RAM. The app is not notarised because the developer lacks an Apple Developer membership, but users can follow the provided steps to run it as a signed macOS app.

media r/LocalLLaMA · 9d ago

Nex-N2 Pro is the real deal

The user found that N2 Pro, when using Rio's chat template, performs reliably on their 128 GB Mac. It passed a private benchmark on the llama.cpp source code 100% of the time without hallucinations, a level of consistency the user had otherwise seen only from GPT 5.x.

media r/LocalLLaMA · 9d ago

Are small local models for automation a thing?

A Reddit user argues that small, efficient local LLMs (1B to 4B parameters) embedded in scripts can enable practical automation of repetitive tasks. They note this use case is underrepresented in discussions focused on coding assistants or hardware performance, suggesting a gap in community interest or visibility for task-specific, lightweight AI models.

media r/LocalLLaMA · 9d ago

Nex2 mini Phase Twin 16GB footprint, 30B model released

The Nex2 mini Phase Twin, a 30B-parameter model with a 16GB footprint, is now available for Intel users, particularly those on the A770 lineup. It runs at 89 tokens per second on a single A770 card, selects the appropriate kernel for the hardware, and performs better when paired with two cards.

arxiv arXiv cs.CL · 9d ago

Key Properties for Effective Code Interpreter Reasoning

A study identifies extrinsic (crucial tokens) and intrinsic (cognitive behaviors) properties that enhance code interpreter reasoning in large language models. Stronger reasoning models show higher prevalence of verification, backtracking, and backward chaining, with these properties improving performance during inference and training, reducing overthinking and boosting token efficiency.

arxiv arXiv cs.CL · 9d ago

Post-Hoc Operators Fail to Improve Accuracy in Small Code Models

A measurement study finds that 26 semantic post-hoc operators do not improve held-out accuracy over Best-of-N in frozen small code models. While two operators—expression-layer recovery and adaptive consensus early-stop—offer benefits in compute efficiency or program recovery, none outperform BoN in accuracy. The results highlight systemic limitations in error detection and coverage, suggesting that model harnesses and error coverage must be improved before post-hoc reasoning is considered.

arxiv arXiv cs.LG · 9d ago

Fingerprinting agent behavior through procedural trajectories

We introduce a method to identify agents by their procedural behavior fingerprints, achieving 85.7% accuracy in attributing unseen trajectories to correct agents. Using ProcGrep, we analyze coding agent behavior in SWE-Bench, finding that models from similar release periods or distilled from each other exhibit closer behavioral similarity, with a Jensen-Shannon divergence of 0.25.
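
For reference, the Jensen-Shannon divergence quoted above measures how far apart two probability distributions are. A minimal sketch, assuming each agent's behavior is summarized as a distribution over action types and that base-2 logarithms are used (so the value lies between 0 and 1); the action categories and frequencies are hypothetical and not taken from the paper:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits; terms where p[i] == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical action-type frequencies: read file, edit file, run tests, search
agent_a = [0.40, 0.30, 0.20, 0.10]
agent_b = [0.25, 0.25, 0.25, 0.25]

print(round(js_divergence(agent_a, agent_b), 4))
```

Identical distributions give 0 and fully disjoint ones give 1, so a lower value indicates more similar behavior under this convention.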

arxiv arXiv cs.LG · 9d ago

Post-Hoc Falsification Operators Fail to Improve Accuracy in Small Code Models

A measurement study finds that 26 semantic post-hoc operators do not improve held-out accuracy over Best-of-N in frozen small code models. While some operators reduce compute usage or recover correct programs, none outperform BoN in accuracy, due to systemic limitations like coverage walls and consensus traps. An expression-layer recovery (M1) improves performance on HumanEval+ by 12 tasks, with no harm or leakage, and shows consistent results across model cells.

media r/LocalLLaMA · 9d ago

Qwable-v1 Released as Distillation of Claude Fable-5

Qwable-v1, an open-weight model distilled from Anthropic's Fable-5, is now publicly available on Hugging Face. It captures 4,659 cleartext agentic-coding traces from Fable-5's public corpus and emits properly formatted <tool_use> XML calls to Claude-flavored tools, reflecting the original tool surface in its weights.

media r/LocalLLaMA · 9d ago

vLLM releases new streaming parser for Qwen3+ in nightly

vLLM has introduced a new streaming parser for Qwen3+ available in its nightly build, addressing issues like mid-turn stopping and failed streaming tool calls due to chunk boundaries. The update reportedly resolves these problems in limited testing, improving reliability for agentic workflows.
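
The chunk-boundary problem mentioned above arises when a tool-call marker arrives split across two streamed chunks. The sketch below illustrates one generic way to handle it by buffering any suffix that could be the start of a marker. It is not vLLM's implementation: the `StreamingToolCallParser` class and the `<tool_call>` markers are assumptions chosen for illustration.

```python
class StreamingToolCallParser:
    """Toy parser that turns text chunks into ("text", s) / ("tool_call", s) events."""

    OPEN, CLOSE = "<tool_call>", "</tool_call>"  # hypothetical markers

    def __init__(self):
        self.buffer = ""
        self.in_call = False

    def feed(self, chunk):
        self.buffer += chunk
        events = []
        while True:
            marker = self.CLOSE if self.in_call else self.OPEN
            idx = self.buffer.find(marker)
            if idx == -1:
                break
            body = self.buffer[:idx]
            self.buffer = self.buffer[idx + len(marker):]
            if body:
                events.append(("tool_call" if self.in_call else "text", body))
            self.in_call = not self.in_call
        if not self.in_call:
            # Emit plain text, but hold back a suffix that may be a partial OPEN marker.
            keep = self._partial_suffix(self.buffer, self.OPEN)
            safe = self.buffer[:len(self.buffer) - keep]
            if safe:
                events.append(("text", safe))
            self.buffer = self.buffer[len(self.buffer) - keep:]
        return events

    @staticmethod
    def _partial_suffix(text, marker):
        """Length of the longest proper prefix of marker that ends text."""
        for size in range(min(len(text), len(marker) - 1), 0, -1):
            if text.endswith(marker[:size]):
                return size
        return 0

# Both markers are split across chunk boundaries here.
chunks = ['Hi <to', 'ol_call>{"name": "ls"}</tool', '_call> done']
parser = StreamingToolCallParser()
events = [event for chunk in chunks for event in parser.feed(chunk)]
print(events)
```

While inside a tool call, the sketch buffers everything until the closing marker arrives, so a call is only emitted once it is complete.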

blog Simon Willison · 9d ago

datasette-agent 0.3a0 releases with user approval for write SQL operations

datasette-agent 0.3a0 introduces the execute_write_sql tool that prompts users before writing to databases, ensuring permission checks are respected. The update also enhances datasette agent chat with user approval support, new command options like --unsafe for auto-approval, and plain text tool outputs for CLI display.
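
The approve-before-write pattern described above can be sketched with Python's standard `sqlite3` module. This is an illustration of the general idea only, not datasette-agent's API: the function `execute_write_sql_with_approval`, the `approve` callback, and `ask_on_terminal` are hypothetical names.

```python
import sqlite3

def execute_write_sql_with_approval(conn, sql, approve, params=()):
    """Run a write statement only if approve(sql) returns True.

    Returns the number of affected rows, or None when approval is declined.
    """
    if not approve(sql):
        return None
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(sql, params)
    return cursor.rowcount

def ask_on_terminal(sql):
    """Interactive approver: show the statement and ask for confirmation."""
    return input(f"Run this SQL?\n{sql}\n[y/N] ").strip().lower() == "y"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (body TEXT)")

# Approvers are stubbed so the example runs unattended;
# pass ask_on_terminal instead to prompt a person.
changed = execute_write_sql_with_approval(
    conn, "INSERT INTO notes (body) VALUES (?)", lambda sql: True, ("hello",)
)
declined = execute_write_sql_with_approval(
    conn, "DELETE FROM notes", lambda sql: False
)
print(changed, declined)
```

An approver that always returns True plays a role loosely comparable to an auto-approval option, which is why such a mode is best kept opt-in.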