Code generation — korshunov.ai

Reverse-Engineering Transformer Attention with Executable Programs

A new method uses program synthesis to generate Python programs that reproduce attention patterns in transformer models. Fewer than 1,000 such programs achieve over 75% intersection-over-union similarity on TinyStories, and replacing 25% of attention heads with these programs increases perplexity by only 16% while preserving performance on question-answering tasks.
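The summary does not say how the intersection-over-union score is computed. As a rough illustration only, the sketch below assumes each attention pattern is binarized at a fixed threshold and the two resulting sets of active (query, key) positions are compared. The function name `attention_iou`, the threshold value, and the toy matrices are all hypothetical and are not taken from the paper.

```python
def attention_iou(pattern_a, pattern_b, threshold=0.3):
    """IoU of two attention matrices after binarizing at `threshold`.

    Assumption: a (query, key) position counts as "active" when its
    attention weight is at least `threshold`. The paper may use a
    different definition.
    """
    def active_positions(pattern):
        return {
            (query, key)
            for query, row in enumerate(pattern)
            for key, weight in enumerate(row)
            if weight >= threshold
        }

    active_a = active_positions(pattern_a)
    active_b = active_positions(pattern_b)
    union = active_a | active_b
    if not union:
        # Two empty patterns are treated as identical.
        return 1.0
    return len(active_a & active_b) / len(union)


# Toy 3-token causal attention head (rows are queries, columns are keys).
model_head = [
    [1.0, 0.0, 0.0],
    [0.6, 0.4, 0.0],
    [0.1, 0.2, 0.7],
]
# Hypothetical output of a synthesized program: attend only to the current token.
program_head = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

print(attention_iou(model_head, program_head))  # 0.75
```

Here the model head has four active positions and the program reproduces three of them, so the score is 3/4.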

arxiv arXiv cs.AI · 7d ago

Data Intelligence Agents Enable Autonomous Data Querying

Data Intelligence Agents (DIA) deploy autonomous coding agents to streamline enterprise data workflows. The Query Generator matches or exceeds top published results on seven SQL benchmarks across four dialects, generalizing through natural-language instructions and an execution-based architecture.

media r/LocalLLaMA · 7d ago

Benchmarking small LLMs on hard HTML data extraction

A user tested models from 2B to 35B parameters on 29 difficult HTML data extraction pages, finding that smaller models like gemma4 e2b and e4b outperform larger ones. Qwen3.6 27B led in performance, while all MoE models scored poorly, highlighting the importance of task-specific benchmarking.

arxiv arXiv cs.CL · 7d ago

LLM-as-Interface, ML-as-Predictor for Pediatric Appendicitis

ClaMPAPP, a hybrid system, uses an LLM to extract structured clinical features from free-text notes and passes them to an XGBoost classifier for diagnosis. It outperformed end-to-end LLMs in both internal and external validation, with better stability and fewer missed appendicitis cases, demonstrating superior diagnostic performance and safety in pediatric triage.

arxiv arXiv cs.CL · 7d ago

Empirical Study of Medical LLM Adaptation in French QA

A study compares continual pretraining (CPT), supervised fine-tuning (SFT), and their combination for French medical QA. CPT+SFT performs best in multiple-choice QA, though gains over SFT are minimal and often insignificant, making SFT a cost-effective default. For open-ended QA, CPT improves metrics while SFT degrades generation quality, with instruction tuning and CPT+SFT favored by LLM-based evaluations. Cross-lingual results show effective transfer from French to English benchmarks.

arxiv arXiv cs.LG · 7d ago

REVES: Augmented Training for Test-Time Scaling

REVES introduces a two-stage iterative framework that enhances LLM reasoning through sequential revision and verification. It achieves +6.5 points over RL baselines and +4.0 points over standard multi-turn training on LiveCodeBench, using a 4B base model with fewer rollouts than large evolutionary systems. The method improves error correction and generalizes to out-of-distribution puzzles like n_queens and mini_sudoku.

arxiv arXiv cs.LG · 7d ago

Unsupervised Reward Optimization for Protein Language Models

A new framework enables protein language models to generate controllable protein sequences without labeled data or wet-lab validation. It uses task-agnostic rewards based on model uncertainty and semantic consistency to guide generation, with Soft and Binarized Reward Optimization outperforming baselines in coverage and controllability across diverse conditions.

arxiv arXiv cs.LG · 7d ago

Sumi: Open Uniform Diffusion Language Model from Scratch

Sumi is a 7B-parameter uniform diffusion language model pretrained from scratch on 1.5T tokens. It competes with autoregressive models on knowledge, reasoning, and coding tasks but underperforms on commonsense benchmarks, likely due to its education-heavy data mixture. The model weights, checkpoints, and full training recipe are publicly released.

arxiv arXiv cs.LG · 7d ago

JourneyFormer: Sequence Modeling for Airbnb Guest Journeys

JourneyFormer is a sequence modeling solution deployed at Airbnb to improve search ranking. It addresses production challenges like long, exploratory guest sequences and sparse booking labels through tailored design choices in data selection, embeddings, and label attribution. The model has shown improved offline metrics and significant business gains in online A/B tests across multiple production surfaces.

arxiv arXiv cs.LG · 7d ago

OpenAnt: LLM-Powered Vulnerability Discovery System

OpenAnt uses code decomposition, adversarial verification, and dynamic testing to identify vulnerabilities in large codebases. It reduces analysis surface by up to 97% and cuts false positives while validating findings through automated, sandboxed execution. Evaluated on OpenSSL, WordPress, and Flowise, it discovers previously unknown vulnerabilities with manageable cost and scalability.

arxiv arXiv cs.CL · 7d ago

HandwritingAgent: Language-Driven Handwriting Synthesis in SVG

HandwritingAgent synthesizes natural handwriting in SVG format without style-specific training. It uses a large reasoning model to generate stroke sequences in a grid canvas, conditioned on text input and a reference style image, enabling efficient, controllable, and generalizable handwriting generation.

arxiv arXiv cs.CL · 7d ago

Approximate Structured Diffusion for Sequence Labelling

A new method uses diffusion to train CRFs on entire label sequences, conditioning on noisy labels. When combined with approximate inference, it reduces POS-tagging error by 16.5%.

arxiv arXiv cs.CL · 7d ago

Distillation with Synthetic Data for Financial Sentiment Analysis

A framework transfers knowledge from large instruction-tuned models to compact ones using synthetic data generated via structured few-shot prompting. Clustering-based seed selection produces more representative synthetic examples than random sampling, enabling compact models to achieve strong performance with minimal human labeling. On complex, noisy financial text, the student model outperforms the teacher model, while remaining competitive on formal text.

arxiv arXiv cs.AI · 7d ago

ProfiLLM: Utility-Aligned Agentic User Profiling for Industrial Ride-Hailing Dispatch

ProfiLLM introduces an agentic LLM pipeline that extracts behavioral signals from ride-hailing logs to generate user profiles. It achieves up to +6.14% relative AUC improvement and up to +4.35% GMV gain in dispatching simulations, with consistent online A/B test results showing +0.47% GMV, +0.33% Completion Rate, and -0.82% Cancel-Before-Accept rate improvements.

arxiv arXiv cs.AI · 7d ago

SAERec: Fine-grained Intent Priors via Sparse Autoencoders

SAERec constructs fine-grained, interpretable intent priors from textual corpora, using sparse autoencoders to disentangle intent-related semantics. It retrieves both personal and public intents for users, guides recommendations with human-understandable explanations, and outperforms state-of-the-art models on public datasets.

arxiv arXiv cs.AI · 7d ago

CAPRA: Multi-Agent LLM System for Software Architecture Feedback

CAPRA is a multi-agent LLM system that generates personalized, template-compliant LaTeX feedback on software architecture deliverables. It uses specialized agents, PyMuPDF, and gpt-4o to extract and analyze text and UML diagrams, with evidence anchoring and consistency management to ensure reliability. A preliminary evaluation of 10 student reports shows CAPRA met 88.8% of eight criteria and achieved moderate inter-rater agreement (kappa = 0.582), with each report processed in under 4 minutes.

arxiv arXiv cs.AI · 7d ago

Variability in AI-Generated Software: A New Product-Line Approach

An exploratory analysis of 10 vibe-coded C/C++ projects reveals near-zero in-artifact variability, with all decisions resolved at generation time. The paper proposes Variability by Regeneration (VbR), a product-line approach where an LLM acts as a derivation engine, generating tailored binaries from declarative specifications, with a variant dispatcher routing user requests to the correct binary. VbR shifts variability into specifications, not code, offering a new paradigm for SPL engineering.

blog Simon Willison · 8d ago

GLM-5.2 is the leading open weights model on the Artificial Analysis Intelligence Index

GLM-5.2, a 753B-parameter text-only model from Z.ai, is now the top open weights model on the Artificial Analysis Intelligence Index, outperforming MiniMax-M3, DeepSeek V4 Pro, and Kimi K2.6. It features a 1 million token context window and ranks second on the Code Arena WebDev leaderboard, despite lacking image input capabilities.