Code generation — korshunov.ai


Has anyone used VibeThinker-3B outside benchmarks?

A Reddit user asks about real-world performance of VibeThinker-3B beyond benchmark scores, focusing on debugging, coding, reasoning, latency, and usability. The model is available on Hugging Face and described in a paper on arXiv.

github llama.cpp · 6d ago

llama.cpp release b9718: consolidated slot selection and new binary builds

llama.cpp version b9718 consolidates slot selection into a single function, get_available_slot, while maintaining LCP similarity checks for prompt cache updates. The release includes binary builds for macOS, Linux, Android, Windows, and openEuler across multiple architectures and hardware acceleration options.
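
As a rough illustration of what LCP-based slot selection can look like, the sketch below picks the idle slot whose cached tokens share the longest common prefix with the incoming prompt. It is not the llama.cpp implementation: the `Slot` class, the `pick_slot` helper, the similarity formula, and the `min_similarity` threshold are all hypothetical names and choices made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Slot:
    # Hypothetical stand-in for a server slot; not llama.cpp's data structure.
    slot_id: int
    cached_tokens: List[int] = field(default_factory=list)
    busy: bool = False


def lcp_len(a: List[int], b: List[int]) -> int:
    """Length of the longest common prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_slot(slots: List[Slot], prompt: List[int],
              min_similarity: float = 0.1) -> Optional[Slot]:
    """Return the idle slot with the best prefix overlap, else any idle slot."""
    idle = [s for s in slots if not s.busy]
    if not idle:
        return None
    best, best_sim = None, 0.0
    for s in idle:
        # Assumed similarity: shared prefix length relative to prompt length.
        sim = lcp_len(s.cached_tokens, prompt) / max(len(prompt), 1)
        if sim > best_sim:
            best, best_sim = s, sim
    if best is not None and best_sim >= min_similarity:
        return best
    # Assumed fallback: first idle slot when no cache is similar enough.
    return idle[0]
```

Reusing the slot with the longest shared prefix means fewer prompt tokens need to be re-evaluated, which is the usual motivation for this kind of check.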

media r/LocalLLaMA · 7d ago

Little late thank you to the DeepSeek team!

A user thanked the DeepSeek team for releasing DeepSeek V4 Pro and its Flash version, which fits on local hardware. The post was made seven months after an initial Reddit post.

media Latent Space · 7d ago

GLM-5.2 Passes Vibe Check, Outperforms GPT-5.5

GLM-5.2 has passed a "vibe check" as a frontier open model, receiving praise from Jeremy Howard and outperforming GPT-5.5 on Artificial Analysis' new knowledge work benchmark. It was also well received by the r/LocalLLaMA community, which suggests strong real-world utility and performance.

media r/LocalLLaMA · 7d ago

How can I self host code review?

A user asks about self-hosting code review tools due to Gemini Code Assist ending consumer support and moving to enterprise only. They are exploring GitHub apps or actions for local or cloud-based solutions.

github llama.cpp · 7d ago

llama.cpp release b9715 adds CUDA Col2Im 1D and multiple platform binaries

llama.cpp version b9715 introduces CUDA support for GGML_OP_COL2IM_1D, building on a CPU implementation. The release includes binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures and acceleration frameworks, including Vulkan, ROCm, OpenVINO, and SYCL.

arxiv arXiv cs.AI · 7d ago

Multi-LCB: Extending LiveCodeBench to 12 Programming Languages

Multi-LCB extends LiveCodeBench to twelve programming languages, preserving its contamination controls and evaluation protocol. It reveals Python overfitting, language-specific biases, and significant performance gaps among LLMs across languages, establishing a rigorous benchmark for cross-language code generation.

arxiv arXiv cs.AI · 7d ago

G2Rec: Unified Framework for Generative Recommendation

G2Rec introduces a scalable framework that combines holistic graph-based user co-engagement modeling with semantic tokenization. It enables generative recommendation models to capture comprehensive, semantically grounded user interest prototypes without ground-truth user interests, outperforming existing methods in industrial-scale sequential recommendation.

arxiv arXiv cs.LG · 7d ago

Probe-and-Refine Tuning Improves Coding Agent Performance

A new method called probe-and-refine tuning uses synthetic bug-fix probes to iteratively improve repository guidance files with single-shot LLM calls, without agent loops or tool use. On SWE-bench Verified, it achieves a 33.0% mean resolve rate, 14.5 percentage points higher than the initial static knowledge base, a gain that reflects improved coverage rather than patch precision. The method enables agents to use larger step budgets effectively, and performance remains stable across models when diagnostic output is sufficient.

arxiv arXiv cs.AI · 7d ago

IHUBERT: Persian Pretrained Model with Semantic Deduplication

IHUBERT is a monolingual Persian pretrained language model trained on a 45 GB curated subset of the Sepahr-Danesh collection. It uses vector-based semantic deduplication and a domain-balanced pretraining pipeline to improve corpus quality and reduce redundancy, achieving top performance in extractive question answering and strong results in NER and topic classification, though relation extraction remains a challenge.

arxiv arXiv cs.AI · 7d ago

Adaptive LLM Tutoring Improves Engagement and Efficiency

A new adaptive LLM tutoring system uses subject-aware prompting to enhance student engagement. It outperforms static models in simulation and shows real-world effectiveness, reducing interactions by 3 turns and increasing exercise conversion rates to 28.1% with a stochastic strategy.

arxiv arXiv cs.AI · 7d ago

SoftSkill: Behavioral Compression for Contextual Adaptation

SoftSkill proposes a method to compress natural-language skills into compact latent priors, improving task performance on SearchQA, LiveMath, and DocVQA. It outperforms SkillOpt by 5.2 to 12.5 points on key benchmarks while replacing hundreds to thousands of Markdown tokens with a few virtual tokens.

arxiv arXiv cs.AI · 7d ago

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

AutoPass uses runtime and compiler evidence to guide LLM-generated optimization decisions, outperforming expert heuristics and classical autotuning methods. It achieves geometric-mean speedups of 1.043x on x86-64 and 1.117x on ARM64 systems without prior training or fine-tuning.
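
For readers unfamiliar with the metric, the sketch below shows how a geometric-mean speedup is typically computed from per-benchmark timings. The function name and the timings in the usage note are hypothetical and are not taken from the AutoPass paper.

```python
import math
from typing import Sequence


def geometric_mean_speedup(baseline_times: Sequence[float],
                           tuned_times: Sequence[float]) -> float:
    """Geometric mean of per-benchmark speedups (baseline time / tuned time)."""
    if not baseline_times or len(baseline_times) != len(tuned_times):
        raise ValueError("need two non-empty sequences of equal length")
    # Averaging in log space avoids overflow and treats ratios symmetrically.
    log_sum = sum(math.log(b / t) for b, t in zip(baseline_times, tuned_times))
    return math.exp(log_sum / len(baseline_times))
```

With hypothetical timings `[2.0, 8.0]` before and `[1.0, 16.0]` after, the per-benchmark speedups are 2.0x and 0.5x, and the geometric mean is 1.0x: one doubling and one halving cancel out, which an arithmetic mean (1.25x) would hide.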

arxiv arXiv cs.LG · 7d ago

LLM-Generated GPU Kernels Face Correctness Illusion

Benchmarks using fixed-shape checks miss real bugs in LLM-generated GPU kernels. A controlled corpus of 24 kernels, including 9 buggy variants with transcription errors, reveals that an op-schema-aware oracle detects all failures and passes all correct controls, with identical results across five GPU architectures.
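
A toy example of how a fixed-shape check can hide a bug: the "kernel" below scales a vector in tiles but silently skips the tail when the length is not a multiple of the tile width. Everything here is hypothetical (`TILE`, `buggy_tiled_scale`, the `passes` helper), written in plain Python rather than GPU code, and the shape sweep is a much simpler stand-in for the op-schema-aware oracle the paper describes.

```python
from typing import Callable, List, Sequence

TILE = 4  # hypothetical tile width


def reference_scale(xs: List[float], k: float) -> List[float]:
    """Trusted reference: multiply every element by k."""
    return [k * x for x in xs]


def buggy_tiled_scale(xs: List[float], k: float) -> List[float]:
    """Processes only full tiles; the tail is left unscaled (the bug)."""
    out = list(xs)
    full = (len(xs) // TILE) * TILE
    for i in range(full):
        out[i] = k * xs[i]
    return out


def passes(kernel: Callable[[List[float], float], List[float]],
           sizes: Sequence[int]) -> bool:
    """True if the kernel matches the reference for every tested length."""
    for n in sizes:
        xs = [float(i + 1) for i in range(n)]
        if kernel(xs, 2.0) != reference_scale(xs, 2.0):
            return False
    return True
```

Checking only a tile-aligned length such as `passes(buggy_tiled_scale, [16])` reports success, while a sweep that includes unaligned lengths such as `[1, 5, 16, 17]` exposes the failure.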

arxiv arXiv cs.LG · 7d ago

Adaptive LLM Tutoring Improves Engagement and Efficiency

A new system uses subject-aware prompting to adapt tutoring strategies based on student performance and discipline. A/B testing with 656 student conversations shows the model reduces interactions by 3 turns and increases learning strategy conversion from 19.1% to 28.1% with a stochastic router.

lab Claude Code Releases · 7d ago

v2.1.183 Release Notes

v2.1.183 improves auto mode safety by blocking destructive git and destroy commands unless the user explicitly consents. It adds deprecation warnings for models, introduces attribution.sessionUrl to hide session links, and fixes multiple issues including terminal behavior, subagent performance, and input handling in web and tmux environments.

arxiv arXiv cs.CL · 7d ago

AgentFinVQA: Auditable, On-Premise Financial Chart QA

AgentFinVQA introduces a multi-agent pipeline for financial chart question answering that ensures auditability and on-premise deployability without significant accuracy loss. It outperforms baseline models by +7.68 pp using a proprietary backbone and +4.84 pp with open-weights Qwen3.6-27B-FP8, while providing a confidence signal via verifier output that improves human review routing.

arxiv arXiv cs.CL · 7d ago

JAMER: Project-Level Code Framework Dataset and Benchmark

JAMER introduces JamSet and JamBench, the first project-level game code dataset and benchmark on a professional game engine. Built from 8,133 verified Game Jam projects, it enables deterministic evaluation and reveals a capability cliff in AI models as project scale increases, with runtime pass rates dropping from 80.4% to 5.7%.

arxiv arXiv cs.CL · 7d ago

Zero-Shot Agentic LLMs Extract Lung Pathology from Narratives

A zero-shot agentic workflow using open-source LLMs extracts 13 College of American Pathologists synoptic fields from lung resection pathology reports. The best model (GPT-OSS-20B) achieved a Micro-F1 of 0.893, outperforming baseline recall and accurately capturing complex pathologic relations without task-specific training.
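
As background on the reported metric, Micro-F1 pools true positives, false positives, and false negatives across all fields before computing F1. The sketch below assumes per-field counts are already available; the function name, field names, and counts are hypothetical and do not come from the study.

```python
from typing import Dict, Tuple


def micro_f1(counts: Dict[str, Tuple[int, int, int]]) -> float:
    """Micro-averaged F1 from per-field (tp, fp, fn) counts."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    denom = 2 * tp + fp + fn
    # Convention assumed here: no predictions and no labels scores 0.0.
    return 2 * tp / denom if denom else 0.0
```

For example, with hypothetical counts `{"field_a": (8, 2, 0), "field_b": (1, 0, 3)}` the pooled totals are 9 true positives, 2 false positives, and 3 false negatives, giving 18/23, or about 0.78. Fields with many instances dominate the score, which is the main difference from a macro average.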

arxiv arXiv cs.CL · 7d ago

STAGE: Source-Grounded Data Generation for Text-to-JSON

STAGE is a pipeline that generates text-to-JSON training data by using LLMs to synthesize reports and JSON schemas, validated against underlying spreadsheets. Evaluations on STAGE-Eval show it improves Qwen3-4B exact match from 31.37% to 74.27% and value accuracy from 45.46% to 90.69%.
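
To make the two reported metrics concrete, the sketch below gives one plausible reading of them: exact match requires the whole parsed JSON to equal the reference, while value accuracy is the fraction of reference leaf values reproduced at the same path. These definitions and the helper names are assumptions for illustration; STAGE-Eval may define the metrics differently.

```python
import json
from typing import Any, Dict, Tuple


def flatten(obj: Any, prefix: Tuple = ()) -> Dict[Tuple, Any]:
    """Map each leaf value to its path of keys and list indices."""
    out: Dict[Tuple, Any] = {}
    if isinstance(obj, dict):
        for k, v in obj.items():
            out.update(flatten(v, prefix + (k,)))
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            out.update(flatten(v, prefix + (i,)))
    else:
        out[prefix] = obj
    return out


def exact_match(pred: str, ref: str) -> bool:
    """True only if the prediction parses and equals the reference JSON."""
    try:
        return json.loads(pred) == json.loads(ref)
    except json.JSONDecodeError:
        return False


def value_accuracy(pred: str, ref: str) -> float:
    """Share of reference leaves matched at the same path (assumed metric)."""
    ref_leaves = flatten(json.loads(ref))
    try:
        pred_leaves = flatten(json.loads(pred))
    except json.JSONDecodeError:
        return 0.0  # unparseable output earns no credit
    if not ref_leaves:
        return 1.0 if not pred_leaves else 0.0
    hits = sum(1 for path, value in ref_leaves.items()
               if path in pred_leaves and pred_leaves[path] == value)
    return hits / len(ref_leaves)
```

Under these assumed definitions, a prediction with one wrong field out of four scores 0 on exact match but 0.75 on value accuracy, which is one way the two numbers in the summary can diverge.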