Code generation — korshunov.ai

P4IR Framework Improves LLM-Based Code Compliance Accuracy

P4IR is a two-stage framework that combines supervised fine-tuning with Group Relative Policy Optimization (GRPO) to improve LLM-based automated code compliance checking. It reduces tree edit distance by up to 23.8% and token-level Levenshtein distance by up to 38.6%, outperforming leading LLMs such as Claude Opus, GPT-5.2, and GLM-4.7 under zero-shot and few-shot prompting, and it reduces false positives by a statistically significant margin.
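For readers unfamiliar with the metric, token-level Levenshtein distance counts the minimum number of token insertions, deletions, and substitutions needed to turn one sequence into another. The sketch below is a generic textbook implementation, not the P4IR evaluation code; the whitespace tokenization and the example rule strings are assumptions made purely for illustration.

```python
def token_levenshtein(reference: list[str], candidate: list[str]) -> int:
    """Minimum number of token edits turning `reference` into `candidate`."""
    # previous[j] holds the distance between the processed reference prefix
    # and the first j tokens of the candidate.
    previous = list(range(len(candidate) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        current = [i]
        for j, cand_tok in enumerate(candidate, start=1):
            cost = 0 if ref_tok == cand_tok else 1
            current.append(min(
                previous[j] + 1,         # delete ref_tok
                current[j - 1] + 1,      # insert cand_tok
                previous[j - 1] + cost,  # keep or substitute
            ))
        previous = current
    return previous[-1]


# Hypothetical rule strings, tokenized on whitespace (an assumption).
reference = "door . width >= 800".split()
candidate = "door . height >= 800 mm".split()
print(token_levenshtein(reference, candidate))  # 2: one substitution, one insertion
```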

media Hugging Face Forums · 3d ago

Buddy System: Rust entropy monitor with NER-gated uncertainty for tiered LLM inference

The Buddy System uses a Rust entropy monitor to detect per-token uncertainty in local Gemma 3 4B inference, routing only uncertain tokens to Sonnet via NER-gated span extraction and semantic retrieval. Benchmarks show it achieves 71.4% accuracy at $0.21, outperforming the Anthropic Advisor pattern (62.9% at $0.44) across seven Hugging Face datasets, with a key improvement on SQuAD v2 by routing source passage chunks to the cloud model.
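The entropy gate can be pictured as follows. This is a minimal Python sketch of the general idea, not the project's Rust monitor: the 1-bit threshold, the function names, and the toy probability vectors are all assumptions, and the NER-gated span extraction and retrieval steps are omitted.

```python
import math


def token_entropy(probs: list[float]) -> float:
    """Shannon entropy, in bits, of one next-token distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def uncertain_positions(step_probs: list[list[float]],
                        threshold_bits: float = 1.0) -> list[int]:
    """Indices of decoding steps whose entropy exceeds the threshold.

    The threshold value is an assumption; a real system would tune it.
    """
    return [i for i, probs in enumerate(step_probs)
            if token_entropy(probs) > threshold_bits]


# Toy distributions over a 4-token vocabulary (illustrative only).
steps = [
    [0.97, 0.01, 0.01, 0.01],  # confident step
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain step (2 bits)
    [0.90, 0.05, 0.03, 0.02],  # fairly confident step
]
print(uncertain_positions(steps))  # [1]
```

Only the flagged positions would then be candidates for escalation to the larger model.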

arxiv arXiv cs.CL · 3d ago

Latent Personal Memory: Dynamic Soft Prompts for LLM Personalization

Latent Personal Memory (LPM) represents user-specific memories as a compact, persistent matrix of N latent slots. A shared cross-attention network maps these slots into dynamic, input-conditioned soft prompts that are prepended to the input of a frozen LLM. LPM outperforms LoRA and Prompt Tuning by up to 8.8% and 54.4%, respectively, on PersonaMem v1, reduces KV-cache usage by more than 64x, and matches LoRA accuracy on LoCoMo with 120x fewer parameters. It also scales efficiently with context length, outperforming full-context prompting at 128K tokens.
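To make the slot-to-prompt mapping concrete, here is a rough sketch of cross-attention over memory slots. It assumes that input-derived query vectors attend over the slots, uses a single head with no learned projections, and works on plain Python lists; none of these choices are confirmed by the summary, and the function names are hypothetical.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries: list[list[float]],
                 slots: list[list[float]]) -> list[list[float]]:
    """Each query attends over the slots; returns one soft prompt per query."""
    dim = len(slots[0])
    prompts = []
    for query in queries:
        scores = [sum(q * s for q, s in zip(query, slot)) / math.sqrt(dim)
                  for slot in slots]
        weights = softmax(scores)
        # The soft prompt is the attention-weighted mixture of the slots.
        prompts.append([sum(w * slot[d] for w, slot in zip(weights, slots))
                        for d in range(dim)])
    return prompts


def prepend_soft_prompts(soft_prompts, input_embeddings):
    """Soft prompts go in front of the frozen model's input embeddings."""
    return soft_prompts + input_embeddings


slots = [[1.0, 0.0], [0.0, 1.0]]  # N = 2 toy memory slots
queries = [[10.0, 0.0]]           # one input-derived query
soft = cross_attend(queries, slots)
sequence = prepend_soft_prompts(soft, [[0.3, 0.3], [0.1, 0.9]])
```

Because the queries depend on the input, the same stored slots yield different soft prompts for different inputs, which is what "input-conditioned" refers to.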

arxiv arXiv cs.CL · 3d ago

GRAG Framework Decouples Grounding and Personalization in Conversational AI

GRAG decouples content grounding and personalization in conversational models by using generic responses from large language models as a structural scaffold. This approach enables smaller, resource-limited models to achieve up to 47% improvement in ROUGE-2 and 36% in BLEU scores over state-of-the-art methods on diverse benchmarks.

arxiv arXiv cs.CL · 3d ago

CAT-Translate: Compact Japanese-English Models Outperform Multilingual Ones in Real-World Tasks

CAT-Translate introduces a family of small, open-source models specialized for Japanese-English translation. Using synthetic parallel corpora and a two-stage fine-tuning approach, the models achieve superior performance on real-world benchmarks across business, legal, medical, financial, and patent domains, outperforming large multilingual models in practical applications.

arxiv arXiv cs.CL · 3d ago

Precision-Recall Controllable Radiology Report Generation

A reinforcement learning framework enables precise control over clinical precision and recall in radiology report generation. By integrating a clinical reward and group-relative training, the model improves clinical efficacy beyond language fluency metrics, outperforming state-of-the-art methods on the MIMIC-CXR dataset.
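One way to picture the two ingredients is shown below. It is a sketch under stated assumptions: an F-beta score stands in for the clinical reward (the summary does not say which reward the authors use), and the advantage normalization follows the usual group-relative recipe of subtracting the group mean and dividing by the group standard deviation.

```python
from statistics import mean, pstdev


def f_beta(precision: float, recall: float, beta: float) -> float:
    """F-beta score; beta > 1 favors recall, beta < 1 favors precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def group_relative_advantages(rewards: list[float],
                              eps: float = 1e-8) -> list[float]:
    """Normalize rewards within one group of sampled reports."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical (precision, recall) pairs for four sampled reports.
samples = [(0.9, 0.4), (0.6, 0.8), (0.5, 1.0), (0.7, 0.7)]
rewards = [f_beta(p, r, beta=2.0) for p, r in samples]  # recall-leaning reward
advantages = group_relative_advantages(rewards)
```

Changing `beta` shifts which sampled reports receive positive advantages, which is one plausible handle for steering a model toward precision or recall.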

arxiv arXiv cs.CL · 3d ago

Benchmark Evaluation of Small Language Models for Arabic NLP

A benchmark of 240 Arabic test items across eight domains and ten skills assesses twelve small language models in zero-shot settings. Gemma 3 (12B) achieved the highest overall score (4.548/5), followed by Aya and C4AI Command Arabic, with performance linked more to Arabic alignment and instruction-following than model size. Common failure modes include prompt leakage, hallucination, and weak task adherence.

media MarkTechPost · 3d ago

xAI Launches /goal in Grok Build for Autonomous Coding

xAI has introduced /goal, a mode in Grok Build that enables long-running, autonomous execution of multi-step coding tasks. The feature plans work, executes a progress checklist, and verifies results by reviewing code, inspecting webpages, or running scripts, ensuring completion before declaring success. Access requires a SuperGrok or X Premium Plus subscription.

media Hugging Face Forums · 3d ago

AI Music Model Runs in Real Time on Most CPUs in Browser

NanoMaestro Realtime is a 50MB AI music model with 13M parameters that generates piano music in real time using a 2-layer LSTM. It runs locally in the browser via ONNX and Transformers.js with WASM, requiring no GPU or server backend, and works on older Raspberry Pi models.

media r/LocalLLaMA · 3d ago

Microsoft Releases Open Source FastContext for LLM Coding Agents

Microsoft has open-sourced FastContext-1.0, a lightweight repository-exploration subagent that separates code repository exploration from task solving in LLM coding agents. It uses parallel read-only tool calls to return compact file paths and line ranges, improving end-to-end accuracy and reducing token usage by up to 60.3%, with the 4B-RL model outperforming a 30B-SFT model on SWE-bench Pro.
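The exploration pattern described here can be sketched roughly as below. This is not FastContext's implementation or API: the in-memory `REPO` dictionary, the `grep` tool, and the `explore` function are hypothetical stand-ins that only illustrate issuing read-only lookups in parallel and returning compact path and line-range results.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory repository standing in for files on disk.
REPO = {
    "app/auth.py": "import os\n\ndef login(user):\n    return check(user)\n",
    "app/db.py": "def connect():\n    pass\n",
}


def grep(path: str, needle: str):
    """Read-only tool call: (path, first_line, last_line) of matches, or None."""
    hits = [number
            for number, line in enumerate(REPO[path].splitlines(), start=1)
            if needle in line]
    return (path, hits[0], hits[-1]) if hits else None


def explore(needle: str) -> list[tuple[str, int, int]]:
    """Run one read-only lookup per file in parallel and keep the hits."""
    paths = sorted(REPO)
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda path: grep(path, needle), paths))
    return [result for result in results if result is not None]


print(explore("user"))  # [('app/auth.py', 3, 4)]
```

Returning only paths and line ranges, rather than file contents, keeps what is handed to the task-solving agent small.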

blog Simon Willison · 3d ago

Porting Moebius 0.2B Image Inpainting to Browser with Claude Code

The Moebius 0.2B image inpainting model has been successfully ported to run in the browser using WebGPU and ONNX Runtime. The project, initiated with Claude Code, converts the model's weights to ONNX and deploys them via Hugging Face, with a simple web interface available at simonw.github.io/moebius-web/.

media r/LocalLLaMA · 3d ago

Gemma 4's Potential to Outperform Mistral and Qwen3.6 Through Finetuning

Gemma 4 shows strong base performance and unique features like global MTP support, QAT, and out-of-the-box vision capabilities. While it currently lacks widespread finetunes, models like MeroMero, Equinox, and Gembrain have already demonstrated high quality, suggesting that with community effort, Gemma 4 could surpass Mistral or Qwen3.6 in specific tasks like coding and creative writing.

lab Claude Code Releases · 3d ago

Claude Code v2.1.186 Release Notes

Claude Code v2.1.186 adds CLI authentication commands for MCP servers, status filtering in workflows, and a "Skills" section in plugin settings. It also includes numerous bug fixes for the UI, session management, and agent behavior, along with improvements to YAML parsing, memory handling, and tool validation.

media MarkTechPost · 4d ago

Sakana AI Launches Sakana Fugu: Multi-Agent Orchestration Model

Sakana AI has launched Sakana Fugu, an orchestration model that routes tasks across a swappable pool of frontier LLMs via a single OpenAI-compatible API. Fugu Ultra outperforms individual models on key benchmarks like SWE Bench Pro and GPQA-D, and the system demonstrates superior performance on complex, multi-step tasks such as auto-research, Rubik's Cube solving, and blindfold chess.

lab OpenAI News · 4d ago

Jason Liu Uses Codex for Long-Running Project Management

Jason Liu demonstrates how Codex helps preserve context and manage complex projects, enabling work to continue seamlessly beyond a single prompt.

lab OpenAI News · 4d ago

OpenAI Launches Daybreak Security Tools

OpenAI has introduced Codex Security and GPT-5.5-Cyber as part of its Daybreak suite. These tools aim to help organizations identify, validate, and patch vulnerabilities at scale.

media r/LocalLLaMA · 4d ago

Best local model for converting text to structured JSON output

Users are seeking a local model that efficiently converts unstructured text into valid JSON based on a defined schema. Among tested models, Qwen 3.6 35B a3b shows strong performance, matching the quality of larger models like GPT-120B while being more stable on local machines than GPT-20B.
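When comparing models on this task, it helps to check each output mechanically. The sketch below uses only the standard library and a deliberately simple, hypothetical schema format (key name mapped to expected Python type); a real setup would more likely use a full JSON Schema validator.

```python
import json

# Hypothetical schema: required key -> expected Python type.
SCHEMA = {"name": str, "age": int, "tags": list}


def validate(raw: str, schema: dict):
    """Return (parsed_object_or_None, list_of_error_messages)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value is not an object"]
    errors = []
    for key, expected in schema.items():
        if key not in data:
            errors.append(f"missing key: {key}")
        elif not isinstance(data[key], expected):
            errors.append(f"wrong type for {key}")
    return data, errors


good = '{"name": "Ada", "age": 36, "tags": ["math"]}'
bad = '{"name": "Ada", "age": "thirty-six"}'
print(validate(good, SCHEMA)[1])  # []
print(validate(bad, SCHEMA)[1])   # ['wrong type for age', 'missing key: tags']
```

Counting how often a model's raw output passes such a check over a fixed set of inputs gives a simple, repeatable basis for the comparison.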

media r/LocalLLaMA · 4d ago

NEX-N2-mini claims Pareto optimality in reasoning efficiency

NEX-N2-mini is claimed to reach 3.5- and 3.6-level reasoning performance with significantly fewer reasoning tokens. Testing shows it is more efficient than other MoE models, reducing wasted tokens while maintaining high reasoning quality.

media r/LocalLLaMA · 4d ago

Gemma4-12B-QAT Uncensored Balanced Released with 60% Speed Boost via MTP

The Gemma4-12B-QAT Uncensored Balanced model is now available, featuring a 60% speed improvement through multi-token-prediction (MTP) speculative decoding. It includes Q4_K_M quantization, vision support via mmproj, and stable generation with no looping or context drift, making it ideal for creative writing and emotional intelligence tasks.

media r/LocalLLaMA · 4d ago

Same model, same prompt, 4 different agents produce varied code quality

A self-hosted Qwen3.6-27B model with identical prompt and hardware generated four different HTML/JavaScript solar system simulations. The agent scaffolding significantly influenced output: opencode produced clean, stable code with accurate physics; pi showed robustness and coordinate consistency; hermes offered visually appealing but physically flawed results; qwen code generated minimal, crude code. The results highlight how agent design shapes code quality, correctness, and stability despite shared model and prompt.