Code generation — korshunov.ai

LLaMA.cpp Release b9715 Adds CUDA Col2Im 1D and Multiple Platform Binaries

LLaMA.cpp version b9715 introduces CUDA support for GGML_OP_COL2IM_1D, building on a CPU implementation. The release includes binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures and acceleration frameworks, including Vulkan, ROCm, OpenVINO, and SYCL.

arxiv arXiv cs.AI · 6d ago

Multi-LCB: Extending LiveCodeBench to 12 Programming Languages

Multi-LCB extends LiveCodeBench to twelve programming languages, preserving its contamination controls and evaluation protocol. It reveals Python overfitting, language-specific biases, and significant performance gaps among LLMs across languages, establishing a rigorous benchmark for cross-language code generation.

arxiv arXiv cs.AI · 6d ago

G2Rec: Unified Framework for Generative Recommendation

G2Rec introduces a scalable framework that combines holistic graph-based user co-engagement modeling with semantic tokenization. It enables generative recommendation models to capture comprehensive, semantically grounded user interest prototypes without ground-truth user interests, outperforming existing methods in industrial-scale sequential recommendation.

arxiv arXiv cs.LG · 6d ago

Probe-and-Refine Tuning Improves Coding Agent Performance

A new method called probe-and-refine tuning uses synthetic bug-fix probes to iteratively improve repository guidance files with single-shot LLM calls, without agent loops or tool use. On SWE-bench Verified, it achieves a 33.0% mean resolve rate, 14.5 percentage points higher than the initial static knowledge base; the gain reflects improved coverage rather than patch precision. The method lets agents use larger step budgets effectively, and performance remains stable across models when diagnostic output is sufficient.

arxiv arXiv cs.AI · 6d ago

IHUBERT: Persian Pretrained Model with Semantic Deduplication

IHUBERT is a monolingual Persian pretrained language model trained on a 45 GB curated subset of the Sepahr-Danesh collection. It uses vector-based semantic deduplication and a domain-balanced pretraining pipeline to improve corpus quality and reduce redundancy, achieving top performance in extractive question answering and strong results in NER and topic classification, though relation extraction remains a challenge.

arxiv arXiv cs.AI · 6d ago

Adaptive LLM Tutoring Improves Engagement and Efficiency

A new adaptive LLM tutoring system uses subject-aware prompting to enhance student engagement. It outperforms static models in simulation and shows real-world effectiveness, reducing interactions by 3 turns and increasing exercise conversion rates to 28.1% with a stochastic strategy.

arxiv arXiv cs.AI · 6d ago

SoftSkill: Behavioral Compression for Contextual Adaptation

SoftSkill proposes a method to compress natural-language skills into compact latent priors, improving task performance on SearchQA, LiveMath, and DocVQA. It outperforms SkillOpt by 5.2 to 12.5 points on key benchmarks while replacing hundreds to thousands of Markdown tokens with a few virtual tokens.

arxiv arXiv cs.AI · 6d ago

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

AutoPass uses runtime and compiler evidence to guide LLM-generated optimization decisions, outperforming expert heuristics and classical autotuning methods. It achieves geometric-mean speedups of 1.043x on x86-64 and 1.117x on ARM64 systems without prior training or fine-tuning.
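For readers unfamiliar with the metric, a geometric-mean speedup aggregates per-benchmark ratios multiplicatively, so a 2x win on one program and a 2x loss on another cancel out. The sketch below is a minimal illustration only: the function name and the run times are hypothetical and are not taken from the AutoPass paper.

```python
import math


def geometric_mean_speedup(baseline_times, tuned_times):
    """Geometric mean of per-benchmark speedups (baseline time / tuned time)."""
    if len(baseline_times) != len(tuned_times) or not baseline_times:
        raise ValueError("need two non-empty, equal-length sequences")
    log_sum = 0.0
    for base, tuned in zip(baseline_times, tuned_times):
        if base <= 0 or tuned <= 0:
            raise ValueError("run times must be positive")
        # Summing logs avoids overflow when many ratios are multiplied.
        log_sum += math.log(base / tuned)
    return math.exp(log_sum / len(baseline_times))


# Hypothetical run times in seconds, invented for illustration only.
baseline = [2.0, 4.0, 1.0]
tuned = [1.0, 4.0, 2.0]  # speedups: 2.0x, 1.0x, 0.5x
print(geometric_mean_speedup(baseline, tuned))  # prints 1.0
```

Read this way, a reported 1.117x means the typical benchmark ran roughly 11.7% faster than its baseline in the multiplicative sense, which is less sensitive to a single outlier than an arithmetic mean of ratios.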

arxiv arXiv cs.LG · 6d ago

LLM-Generated GPU Kernels Face Correctness Illusion

Benchmarks using fixed-shape checks miss real bugs in LLM-generated GPU kernels. A controlled corpus of 24 kernels, including 9 buggy variants with transcription errors, reveals that an op-schema-aware oracle detects all failures and passes all correct controls, with identical results across five GPU architectures.
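The failure mode described here can be sketched with a toy example in plain Python rather than GPU code. Everything below is hypothetical: `buggy_row_sums` stands in for a tiled kernel that silently drops tail elements, and the shape sweep is a simplified stand-in, not the paper's op-schema-aware oracle.

```python
def reference_row_sums(matrix):
    """Trusted reference: sum of each row."""
    return [sum(row) for row in matrix]


def buggy_row_sums(matrix, tile=4):
    """Toy 'kernel' that only processes full tiles and drops the tail."""
    out = []
    for row in matrix:
        full = (len(row) // tile) * tile  # bug: leftover elements are ignored
        out.append(sum(row[:full]))
    return out


def make_input(rows, cols):
    """Deterministic input with strictly positive entries."""
    return [[r * cols + c + 1 for c in range(cols)] for r in range(rows)]


def fixed_shape_check(kernel):
    """Benchmark-style check on a single 2x8 input."""
    m = make_input(2, 8)
    return kernel(m) == reference_row_sums(m)


def shape_sweep_check(kernel, widths=range(1, 10)):
    """Return the widths at which the kernel disagrees with the reference."""
    failures = []
    for cols in widths:
        m = make_input(2, cols)
        if kernel(m) != reference_row_sums(m):
            failures.append(cols)
    return failures


print(fixed_shape_check(buggy_row_sums))  # prints True
print(shape_sweep_check(buggy_row_sums))  # prints [1, 2, 3, 5, 6, 7, 9]
```

The single 2x8 check passes because 8 is a multiple of the tile size, while the sweep exposes every width that leaves a partial tile.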

arxiv arXiv cs.LG · 6d ago

Adaptive LLM Tutoring Improves Engagement and Efficiency

A new system uses subject-aware prompting to adapt tutoring strategies based on student performance and discipline. A/B testing with 656 student conversations shows the model reduces interactions by 3 turns and increases learning strategy conversion from 19.1% to 28.1% with a stochastic router.

lab Claude Code Releases · 6d ago

v2.1.183 Release Notes

v2.1.183 improves auto mode safety by blocking destructive git and destroy commands without explicit user consent. It adds deprecation warnings for models, introduces attribution.sessionUrl to hide session links, and fixes multiple issues including terminal behavior, subagent performance, and input handling in web and tmux environments.

arxiv arXiv cs.CL · 6d ago

AgentFinVQA: Auditable, On-Premise Financial Chart QA

AgentFinVQA introduces a multi-agent pipeline for financial chart question answering that ensures auditability and on-premise deployability without significant accuracy loss. It outperforms baseline models by +7.68 pp using a proprietary backbone and +4.84 pp with open-weights Qwen3.6-27B-FP8, while providing a confidence signal via verifier output that improves human review routing.

arxiv arXiv cs.CL · 6d ago

JAMER: Project-Level Code Framework Dataset and Benchmark

JAMER introduces JamSet and JamBench, the first project-level game code dataset and benchmark on a professional game engine. Built from 8,133 verified Game Jam projects, it enables deterministic evaluation and reveals a capability cliff in AI models as project scale increases, with runtime pass rates dropping from 80.4% to 5.7%.

arxiv arXiv cs.CL · 6d ago

Zero-Shot Agentic LLMs Extract Lung Pathology from Narratives

A zero-shot agentic workflow using open-source LLMs extracts 13 College of American Pathologists synoptic fields from lung resection pathology reports. The best model (GPT-OSS-20B) achieved a Micro-F1 of 0.893, exceeding the baseline in recall and accurately capturing complex pathologic relations without task-specific training.

arxiv arXiv cs.CL · 6d ago

STAGE: Source-Grounded Data Generation for Text-to-JSON

STAGE is a pipeline that generates text-to-JSON training data by using LLMs to synthesize reports and JSON schemas, validated against underlying spreadsheets. Evaluations on STAGE-Eval show it improves Qwen3-4B exact match from 31.37% to 74.27% and value accuracy from 45.46% to 90.69%.
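The two metrics can be read roughly as follows, though the definitions used by STAGE-Eval may differ. This sketch assumes exact match means the parsed prediction equals the parsed gold JSON, and value accuracy means the fraction of gold leaf values reproduced at the same path. The function names and the sample record are hypothetical.

```python
import json


def flatten(obj, prefix=()):
    """Map each leaf path (tuple of keys and indices) to its value."""
    if isinstance(obj, dict):
        children = obj.items()
    elif isinstance(obj, list):
        children = enumerate(obj)
    else:
        return {prefix: obj}
    leaves = {}
    for key, value in children:
        leaves.update(flatten(value, prefix + (key,)))
    return leaves


def exact_match(pred_text, gold_text):
    """True only if the prediction parses and equals the gold object."""
    try:
        return json.loads(pred_text) == json.loads(gold_text)
    except json.JSONDecodeError:
        return False


def value_accuracy(pred_text, gold_text):
    """Fraction of gold leaf values found at the same path in the prediction."""
    gold = flatten(json.loads(gold_text))
    try:
        pred = flatten(json.loads(pred_text))
    except json.JSONDecodeError:
        return 0.0  # assumption: unparseable output scores zero
    if not gold:
        return 1.0 if not pred else 0.0
    hits = sum(1 for path, value in gold.items()
               if path in pred and pred[path] == value)
    return hits / len(gold)


# Hypothetical record, invented for illustration only.
gold = '{"company": "Acme", "revenue": {"q1": 10, "q2": 12}}'
pred = '{"company": "Acme", "revenue": {"q1": 10, "q2": 15}}'
print(exact_match(pred, gold))     # prints False
print(value_accuracy(pred, gold))  # two of three leaf values match
```

Under these assumed definitions, value accuracy is always at least as forgiving as exact match, which is consistent with the reported value accuracy figures being higher than the exact match figures.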

arxiv arXiv cs.CL · 6d ago

Adaptive LLM Tutoring Improves Engagement and Efficiency

A new adaptive LLM tutoring system uses subject-aware prompting to enhance student engagement. It outperforms static models both in simulation and in real-world A/B testing, reducing interactions by 3 turns and raising exercise conversion rates, with a stochastic router reaching 28.1%.

arxiv arXiv cs.CL · 6d ago

PsyScore: A Psychometrically-Aware Framework for Trait-Adaptive Essay Scoring and ZPD-Scaffolded Feedback

PsyScore integrates diagnostic scoring and instructional feedback using a shared latent ability model. It features a trait-adaptive neural IRT scorer based on GPCM, a ZPD-scaffolded feedback generator that tailors instruction by proficiency level, and a multi-perspective evaluation strategy. Experiments on ASAP++ show competitive scoring and more pedagogically aligned feedback compared to existing methods.

blog Simon Willison · 6d ago

Datasette Launches Apps Plugin for Custom HTML Applications

Datasette has released a new plugin, datasette-apps, enabling self-contained HTML+JavaScript applications to run in a secure iframe sandbox. These apps can execute read-only or write SQL queries against Datasette databases, with built-in security features like CSP headers and sandbox restrictions to prevent data exfiltration or unauthorized access.

media r/LocalLLaMA · 6d ago

GLM-5.2 (744B, 2-bit) achieves 7.3 tok/s on 4×3090 with 192GB RAM

GLM-5.2 UD-IQ2_M runs at ~7.3 tokens per second on 4×RTX 3090s with 192GB DDR5 RAM using llama.cpp expert offload. Reducing quantization from IQ2 to IQ1 provided no speed gain, while increasing CPU threads from 6 to 12 improved performance by 22%. Decode is limited by CPU compute, not memory bandwidth, and the offloaded experts must be explicitly distributed across GPUs to avoid out-of-memory errors.