Code generation — korshunov.ai


Precision-Recall Controllable Radiology Report Generation

A reinforcement learning framework enables precise control over clinical precision and recall in radiology report generation. By integrating a clinical reward and group-relative training, the model improves clinical efficacy beyond language fluency metrics, outperforming state-of-the-art methods on the MIMIC-CXR dataset.
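One way to picture the combination of a clinical reward and group-relative training is sketched below. This is an illustration under stated assumptions, not the paper's method: the summary does not give the reward, so the sketch assumes an F-beta score over sets of findings (`f_beta`, with `beta` as the precision/recall knob) and a GRPO-style normalization of rewards within a group of sampled reports (`group_relative_advantages`). All names and the toy findings are hypothetical.

```python
from statistics import mean, pstdev


def f_beta(predicted: set[str], reference: set[str], beta: float) -> float:
    """F-beta over sets of findings; beta > 1 favours recall, beta < 1 favours precision."""
    tp = len(predicted & reference)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(reference)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Score each sample relative to the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy example: findings extracted from three sampled reports for one image.
reference = {"cardiomegaly", "pleural effusion"}
group = [
    {"cardiomegaly"},                                      # precise, misses a finding
    {"cardiomegaly", "pleural effusion", "pneumothorax"},  # complete, one false positive
    {"pneumothorax"},                                      # wrong
]
rewards = [f_beta(report, reference, beta=2.0) for report in group]
advantages = group_relative_advantages(rewards)
```

With `beta=2.0` the second report earns the larger reward because it recalls every reference finding; with `beta=0.5` the first report wins because it contains no false positives.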

arxiv arXiv cs.CL · 2d ago

Benchmark Evaluation of Small Language Models for Arabic NLP

A benchmark of 240 Arabic test items across eight domains and ten skills assesses twelve small language models in zero-shot settings. Gemma 3 (12B) achieved the highest overall score (4.548/5), followed by Aya and C4AI Command Arabic, with performance linked more to Arabic alignment and instruction-following than model size. Common failure modes include prompt leakage, hallucination, and weak task adherence.

media Hugging Face Forums · 2d ago

AI Music Model Runs in Real Time on Most CPUs in Browser

NanoMaestro Realtime is a 50MB AI music model with 13M parameters that generates piano music in real time using a 2-layer LSTM. It runs locally in the browser via ONNX and Transformers.js with WASM, requiring no GPU or server backend, and works on older Raspberry Pi models.

media r/LocalLLaMA · 2d ago

Microsoft Releases Open Source FastContext for LLM Coding Agents

Microsoft has open-sourced FastContext-1.0, a lightweight repository-exploration subagent that separates code repository exploration from task solving in LLM coding agents. It uses parallel read-only tool calls to return compact file paths and line ranges, improving end-to-end accuracy and reducing token usage by up to 60.3%, with the 4B-RL model outperforming a 30B-SFT model on SWE-bench Pro.

blog Simon Willison · 2d ago

Porting Moebius 0.2B Image Inpainting to Browser with Claude Code

The Moebius 0.2B image inpainting model has been successfully ported to run in the browser using WebGPU and ONNX Runtime. The project, initiated with Claude Code, converts the model's weights to ONNX and deploys them via Hugging Face, with a simple web interface available at simonw.github.io/moebius-web/.

media r/LocalLLaMA · 2d ago

Gemma 4's Potential to Outperform Mistral and Qwen3.6 Through Finetuning

Gemma 4 shows strong base performance and unique features like global MTP support, QAT, and out-of-the-box vision capabilities. While it currently lacks widespread finetunes, models like MeroMero, Equinox, and Gembrain have already demonstrated high quality, suggesting that with community effort, Gemma 4 could surpass Mistral or Qwen3.6 in specific tasks like coding and creative writing.

lab Claude Code Releases · 2d ago

Claude v2.1.186 Release Notes

Claude v2.1.186 adds CLI authentication commands for MCP servers, status filtering in workflows, and a "Skills" section in plugin settings. It includes numerous bug fixes for UI, session management, and agent behavior, along with improvements to YAML parsing, memory handling, and tool validation.

media MarkTechPost · 3d ago

Sakana AI Launches Sakana Fugu: Multi-Agent Orchestration Model

Sakana AI has launched Sakana Fugu, an orchestration model that routes tasks across a swappable pool of frontier LLMs via a single OpenAI-compatible API. Fugu Ultra outperforms individual models on key benchmarks like SWE Bench Pro and GPQA-D, and the system demonstrates superior performance on complex, multi-step tasks such as auto-research, Rubik's Cube solving, and blindfold chess.

lab OpenAI News · 3d ago

Jason Liu Uses Codex for Long-Running Project Management

Jason Liu demonstrates how Codex helps preserve context and manage complex projects, enabling work to continue seamlessly beyond a single prompt.

lab OpenAI News · 3d ago

OpenAI Launches Daybreak Security Tools

OpenAI has introduced Codex Security and GPT-5.5-Cyber as part of its Daybreak suite. These tools aim to help organizations identify, validate, and patch vulnerabilities at scale.

media r/LocalLLaMA · 3d ago

Best local model for converting text to structured JSON output

Users are seeking a local model that efficiently converts unstructured text into valid JSON based on a defined schema. Among tested models, Qwen3.6-35B-A3B shows strong performance, matching the quality of larger models like GPT-120B while being more stable on local machines than GPT-20B.
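Whichever model is used, the output still has to be checked against the schema before it is trusted. A minimal sketch using only the standard library follows; `SCHEMA` and `parse_structured` are hypothetical names, and the schema is reduced to a flat mapping of field name to Python type rather than full JSON Schema.

```python
import json

# Hypothetical flat schema: field name -> required Python type.
SCHEMA = {"name": str, "age": int, "tags": list}


def parse_structured(raw: str, schema: dict[str, type]) -> dict:
    """Parse model output and check required fields and their types."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key, expected in schema.items():
        if key not in data:
            raise ValueError(f"missing field: {key}")
        if not isinstance(data[key], expected):
            raise ValueError(f"field {key!r} should be {expected.__name__}")
    return data


record = parse_structured('{"name": "Ada", "age": 36, "tags": ["math"]}', SCHEMA)
```

A caller can catch `ValueError` and re-prompt the model with the error message, which is a common way to recover from a malformed response.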

media r/LocalLLaMA · 3d ago

NEX-N2-mini claims Pareto optimality in reasoning efficiency

The NEX-N2-mini model claims to achieve 3.5- and 3.6-level reasoning performance with significantly fewer reasoning tokens. Testing shows it outperforms other MoE models in efficiency, reducing wasted tokens while maintaining high reasoning quality.

media r/LocalLLaMA · 3d ago

Gemma4-12B-QAT Uncensored Balanced Released with 60% Speed Boost via MTP

The Gemma4-12B-QAT Uncensored Balanced model is now available, featuring a 60% speed improvement through multi-token-prediction (MTP) speculative decoding. It includes Q4_K_M quantization, vision support via mmproj, and stable generation with no looping or context drift, making it ideal for creative writing and emotional intelligence tasks.
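The speed gain from speculative decoding comes from drafting several tokens cheaply and letting the full model verify them. The sketch below shows the generic greedy variant with toy next-token functions; it is not this release's implementation. With MTP the draft would come from the model's own multi-token-prediction heads, and verification would be a single batched forward pass rather than a Python loop. `speculative_step`, `draft`, and `target` are hypothetical names.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken, target: NextToken, k: int) -> List[int]:
    """Draft k tokens, keep the prefix the target agrees with, then add one target token."""
    # 1. The cheap drafter proposes k tokens autoregressively.
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        token = draft(ctx)
        proposed.append(token)
        ctx.append(token)

    # 2. The target checks each proposal in order.
    accepted: List[int] = []
    ctx = list(prefix)
    for token in proposed:
        expected = target(ctx)
        if expected != token:
            accepted.append(expected)  # first mismatch: take the target's token and stop
            return accepted
        accepted.append(token)
        ctx.append(token)

    # 3. Every draft token was accepted, so the target contributes one extra token.
    accepted.append(target(ctx))
    return accepted


def target(ctx: List[int]) -> int:
    """Toy 'large model': count upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft(ctx: List[int]) -> int:
    """Toy drafter: agrees with the target except after token 3."""
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


tokens = speculative_step([0], draft, target, k=4)
```

The output always matches what greedy decoding with the target alone would produce; the drafter only changes how many target tokens are obtained per verification step.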

media r/LocalLLaMA · 3d ago

Same model, same prompt, 4 different agents produce varied code quality

A self-hosted Qwen3.6-27B model with identical prompt and hardware generated four different HTML/JavaScript solar system simulations. The agent scaffolding significantly influenced output: opencode produced clean, stable code with accurate physics; pi showed robustness and coordinate consistency; hermes offered visually appealing but physically flawed results; qwen code generated minimal, crude code. The results highlight how agent design shapes code quality, correctness, and stability despite shared model and prompt.

media Interconnects · 3d ago

GLM-5.2 is the step change for open agents

GLM-5.2, an open-weight AI model released by Z.ai, has set a new benchmark in coding and general agent performance. It outperforms models like Claude Fable 5 and Gemini, and matches or exceeds Anthropic's Opus 4.8 in max thinking mode, establishing itself as the first open model that feels right in coding harnesses as a general agent.

media r/LocalLLaMA · 3d ago

GLM-5.2 UD-IQ1_M Speed Test on llama.cpp with 5090 and 3090 Ti

A speed test of GLM-5.2 quantized to UD-IQ1_M using llama.cpp shows 579 t/s prefill at 8k context and 324 t/s at 57k context. Decode speed remains steady at 10.6 t/s for over 580 tokens, dropping to 9.37 t/s at 60k context.
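To put those rates in wall-clock terms, the arithmetic below converts them to seconds. It assumes "8k" and "57k" mean 8,000 and 57,000 tokens and that each reported rate holds for the whole run, neither of which the post confirms; `seconds_for` is a hypothetical helper.

```python
def seconds_for(tokens: int, tokens_per_second: float) -> float:
    """Time to process a number of tokens at a constant rate."""
    return tokens / tokens_per_second


prefill_8k = seconds_for(8_000, 579)    # prompt processing at 8k context
prefill_57k = seconds_for(57_000, 324)  # prompt processing at 57k context
decode_580 = seconds_for(580, 10.6)     # generating 580 tokens

print(f"prefill  8k: {prefill_8k:.1f} s")
print(f"prefill 57k: {prefill_57k:.1f} s")
print(f"decode 580 tokens: {decode_580:.1f} s")
```

Under these assumptions a 57k-token prompt takes roughly three minutes to ingest before the first token is generated, which matters more for agent workloads than the decode rate alone.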

media r/LocalLLaMA · 3d ago

I Built a Tool to Stop Manually Swapping Models on My 8GB GPU

I developed Prompt-Chain, a Streamlit app that chains a small Prompter model with a large Coder model into a single pipeline. It automatically swaps VRAM when transitioning from prompt refinement to code generation, eliminating manual model switching and reducing wasted tokens from poorly worded prompts.
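The chaining idea can be sketched as two sequential model sessions, each released before the next one starts. This is not Prompt-Chain's code: `model_session`, `run_chain`, and the `loaded` list (a stand-in for VRAM occupancy) are hypothetical, and the two models are plain functions so the sketch runs without a GPU.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, List

Generate = Callable[[str], str]

loaded: List[str] = []  # stand-in for what currently occupies VRAM


@contextmanager
def model_session(name: str, generate: Generate) -> Iterator[Generate]:
    """Hold a model only for the duration of the block, then release it."""
    loaded.append(name)      # a real app would load weights here
    try:
        yield generate
    finally:
        loaded.remove(name)  # a real app would free VRAM here


def run_chain(user_prompt: str, prompter: Generate, coder: Generate) -> str:
    """Refine the prompt with the small model, then generate code with the large one."""
    with model_session("prompter", prompter) as refine:
        refined = refine(user_prompt)
    # The prompter is released before the coder is loaded.
    with model_session("coder", coder) as write_code:
        return write_code(refined)


def toy_prompter(text: str) -> str:
    return f"Write a Python function that does the following: {text.strip()}"


def toy_coder(prompt: str) -> str:
    return f"# {prompt}\ndef solution():\n    pass\n"


result = run_chain("  reverse a string ", toy_prompter, toy_coder)
```

Using a context manager keeps the release step in one place, so a failure during generation still frees the slot for the next model.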

media r/LocalLLaMA · 3d ago

GLM5.2 runs at 7tg on 4x 3090s with 192GB DDR5 on budget build

A user shares their home lab setup with four GeForce RTX 3090 GPUs and 192GB of DDR5 RAM overclocked to 5600 MHz. They run GLM5.2 as a planner at 7 tokens per second of text generation (tg), MiniMax 2.7 at 45 tokens per second in VRAM for coding, and Qwen3.6 27B at q8 for testing, all on consumer-grade hardware due to cost considerations.

media r/LocalLLaMA · 3d ago

Qwen3.6-35B-A3B APEX on RTX 3090: Speed and Quality Benchmarks

A benchmark compares llama.cpp forks (ik_llama and spiritbuun) running Qwen3.6-35B-A3B APEX with I-Compact and I-Quality models. ik_llama with I-Compact achieves highest speed (~146 TPS), while spiritbuun with I-Quality and turbo8/turbo4 cache matches this speed and offers slightly better HellaSwag performance. turbo8/turbo4 KV caches outperform q8_0/q5_0, especially at longer contexts, with up to 15% speed gain and lower KLD, making them superior for quality and context length.
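KLD here refers to the Kullback-Leibler divergence between the next-token distribution of a reference model and that of the quantized or cache-compressed run, where lower means closer to the reference. A minimal sketch of the per-token quantity follows; the probability vectors are invented for illustration and `kl_divergence` is a hypothetical helper, not a llama.cpp function.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(P || Q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Invented next-token probabilities over a three-token vocabulary.
reference = [0.7, 0.2, 0.1]   # e.g. full-precision KV cache
compressed = [0.6, 0.3, 0.1]  # e.g. quantized KV cache

kld = kl_divergence(reference, compressed)
```

Benchmarks of this kind typically average the value over many token positions of a test corpus.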

media Hugging Face Forums · 3d ago

I built a novel triple-hybrid LLM under 1B parameters for ~$50

Mateusz has developed a fully pre-trained language model, Project Inkblot's Titan v1, combining Mamba SSM, Multi-Head Attention, and 32-expert MoE in a single decoder-only architecture under 1B parameters. The model, trained on a single NVIDIA L4 GPU for ~$50, achieves 27.5 validation perplexity and demonstrates efficient scaling via a single-line config update, with all components implemented from scratch in PyTorch. Titan v2's first training cycle is now complete, and dataset expansion is underway.
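For readers comparing that perplexity with a training log, perplexity is the exponential of the mean per-token cross-entropy loss in nats. The sketch below shows the conversion in both directions; `perplexity` is a hypothetical helper and the loss values are constructed to match the reported figure, not taken from the project.

```python
import math
from typing import Sequence


def perplexity(token_losses: Sequence[float]) -> float:
    """Perplexity from per-token negative log-likelihoods in nats."""
    return math.exp(sum(token_losses) / len(token_losses))


# A validation perplexity of 27.5 corresponds to this mean loss in nats.
equivalent_loss = math.log(27.5)

# Constructed example: four tokens that each have exactly that loss.
ppl = perplexity([equivalent_loss] * 4)

print(f"mean loss: {equivalent_loss:.3f} nats, perplexity: {ppl:.1f}")
```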