Code generation — korshunov.ai

GLM-5.2 Review and Censorship Response

GLM-5.2 demonstrates exceptional long-context coherence and conversational fluency, outperforming Gemini-3.1-Pro on text-only tasks and matching GPT-5.5 in reasoning quality. The model responds factually to sensitive topics like Taiwan and Tiananmen Square, providing detailed historical context without overt censorship, though it adheres to Chinese government content guidelines.

arxiv arXiv cs.AI · 8d ago

LLM-as-Interface, ML-as-Predictor for Pediatric Appendicitis

ClaMPAPP, a hybrid system, uses an LLM to extract structured clinical features from free-text notes and passes them to an XGBoost classifier for diagnosis. It outperformed end-to-end LLMs in both internal and external validation, with better diagnostic performance and fewer missed cases, demonstrating superior stability and safety in pediatric appendicitis triage.

arxiv arXiv cs.AI · 8d ago

Trade-offs in Medical LLM Adaptation: French QA Study

A study compares continual pretraining (CPT), supervised fine-tuning (SFT), and their combination for French medical QA. CPT+SFT performs best in multiple-choice QA, though gains over SFT are small and often insignificant, making SFT a cost-effective default. For open-ended QA, CPT improves metrics while SFT degrades quality, with instruction tuning and CPT+SFT favored by LLM-based evaluations. Cross-lingual results show effective transfer from French to English benchmarks.

arxiv arXiv cs.AI · 8d ago

Reverse-Engineering Transformer Attention with Executable Programs

A new method uses program synthesis to generate Python programs that reproduce attention patterns in transformer models. Fewer than 1,000 such programs achieve over 75% intersection-over-union similarity on TinyStories, and replacing 25% of attention heads with these programs increases perplexity by only 16% while preserving performance on question-answering tasks.
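As a rough illustration of the intersection-over-union comparison mentioned above, the sketch below scores how closely a program-generated attention pattern matches a model's. The summary does not say how the paper computes IoU, so the binarization threshold, the function names, and the toy matrices are all assumptions made for this example.

```python
def attended_pairs(attn, threshold=0.1):
    """Return the set of (query, key) positions whose weight reaches the threshold.

    The threshold is an assumption for this sketch, not a value from the paper.
    """
    return {
        (q, k)
        for q, row in enumerate(attn)
        for k, weight in enumerate(row)
        if weight >= threshold
    }


def attention_iou(attn_a, attn_b, threshold=0.1):
    """Intersection-over-union of two attention maps after binarizing them."""
    pairs_a = attended_pairs(attn_a, threshold)
    pairs_b = attended_pairs(attn_b, threshold)
    union = pairs_a | pairs_b
    if not union:
        # Two empty patterns are treated as identical.
        return 1.0
    return len(pairs_a & pairs_b) / len(union)


# Toy causal attention for a 3-token sequence (made-up numbers).
model_attn = [
    [1.00, 0.00, 0.00],
    [0.50, 0.50, 0.00],
    [0.05, 0.15, 0.80],
]
# Hypothetical program output: every token attends only to itself.
program_attn = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

score = attention_iou(model_attn, program_attn)
print(f"IoU = {score:.2f}")
```

Here the program covers three of the five positions the model attends to, so the score falls below a 75% bar like the one quoted in the summary.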

arxiv arXiv cs.AI · 8d ago

Data Intelligence Agents Enable Autonomous Data Querying

Data Intelligence Agents (DIA) deploy autonomous coding agents to streamline enterprise data workflows. The Query Generator matches or exceeds top published results on seven SQL benchmarks across four dialects, showing generalization through natural-language instructions and execution-based architecture.

media r/LocalLLaMA · 8d ago

Benchmarking small LLMs on hard HTML data extraction

A user tested models from 2B to 35B parameters on 29 difficult HTML data extraction pages, finding that small models such as gemma4 e2b and e4b can outperform larger ones. Qwen3.6 27B scored highest overall, while all MoE models scored poorly, highlighting the importance of task-specific benchmarking.

arxiv arXiv cs.LG · 8d ago

REVES: Augmented Training for Test-Time Scaling

REVES introduces a two-stage iterative framework that enhances LLM reasoning through sequential revision and verification. It achieves +6.5 points over RL baselines and +4.0 points over standard multi-turn training on LiveCodeBench, using a 4B base model with fewer rollouts than large evolutionary systems. The method improves error correction and generalizes to out-of-distribution puzzles like n_queens and mini_sudoku.
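To make the idea of sequential revision with verification concrete, here is a minimal inference-time sketch on the n_queens puzzle named above. It is not the REVES training procedure: the loop structure, the function names, and the toy proposer (which simply enumerates permutations and ignores the feedback it receives) are assumptions standing in for a model that would revise its answer based on the verifier's output.

```python
from itertools import permutations


def verify_queens(cols):
    """Check a placement where cols[r] is the column of the queen in row r.

    Returns (ok, conflicts), with conflicts listing the attacking row pairs.
    """
    n = len(cols)
    conflicts = [
        (r1, r2)
        for r1 in range(n)
        for r2 in range(r1 + 1, n)
        if cols[r1] == cols[r2] or abs(cols[r1] - cols[r2]) == r2 - r1
    ]
    return not conflicts, conflicts


def revise_until_verified(propose, verify, max_turns):
    """Alternate proposing and verifying until a candidate passes or turns run out."""
    feedback = None
    for turn in range(1, max_turns + 1):
        candidate = propose(feedback)
        ok, feedback = verify(candidate)
        if ok:
            return candidate, turn
    return None, max_turns


def make_toy_proposer(n):
    """Hypothetical stand-in for a model: yields the next permutation each turn."""
    candidates = permutations(range(n))

    def propose(feedback):
        # A real reviser would condition on `feedback`; this toy one does not.
        return next(candidates)

    return propose


solution, turns_used = revise_until_verified(
    make_toy_proposer(4), verify_queens, max_turns=24
)
print(solution, turns_used)
```

The verifier is the only puzzle-specific part; swapping in a checker for another task leaves the loop unchanged.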

arxiv arXiv cs.LG · 8d ago

Unsupervised Reward Optimization for Protein Language Models

A new framework enables protein language models to generate controllable protein sequences without labeled data or wet-lab validation. It uses task-agnostic rewards based on model uncertainty and semantic consistency to guide generation, with Soft and Binarized Reward Optimization outperforming baselines in coverage and controllability across diverse conditions.

arxiv arXiv cs.LG · 8d ago

Sumi: Open Uniform Diffusion Language Model from Scratch

Sumi is a 7B-parameter uniform diffusion language model pretrained from scratch on 1.5T tokens. It competes with autoregressive models on knowledge, reasoning, and coding tasks but underperforms on commonsense benchmarks, likely due to its education-heavy data mixture. The model weights, checkpoints, and full training recipe are publicly released.

arxiv arXiv cs.LG · 8d ago

JourneyFormer: Sequence Modeling for Airbnb Guest Journeys

JourneyFormer is a sequence modeling solution deployed at Airbnb to improve search ranking. It addresses production challenges like long, exploratory guest sequences and sparse booking labels through tailored design choices in data selection, embeddings, and label attribution. The model has shown improved offline metrics and significant business gains in online A/B tests across multiple production surfaces.

arxiv arXiv cs.LG · 8d ago

OpenAnt: LLM-Powered Vulnerability Discovery System

OpenAnt uses code decomposition, adversarial verification, and dynamic testing to identify vulnerabilities in large codebases. It reduces the analysis surface by up to 97% and cuts false positives while validating findings through automated, sandboxed execution. Evaluated on OpenSSL, WordPress, and Flowise, it discovers previously unknown vulnerabilities at manageable cost and scales to large projects.

arxiv arXiv cs.CL · 8d ago

HandwritingAgent: Language-Driven Handwriting Synthesis in SVG

HandwritingAgent synthesizes natural handwriting in SVG format without style-specific training. It uses a large reasoning model to generate stroke sequences on a grid canvas, conditioned on text input and a reference style image, enabling efficient, controllable, and generalizable handwriting generation.

arxiv arXiv cs.CL · 8d ago

Approximate Structured Diffusion for Sequence Labelling

A new method uses diffusion to train CRFs on entire label sequences, conditioning on noisy labels. When combined with approximate inference, it reduces POS-tagging error by 16.5%.

arxiv arXiv cs.CL · 8d ago

Distillation with Synthetic Data for Financial Sentiment Analysis

A framework transfers knowledge from large instruction-tuned models to compact ones using synthetic data generated via structured few-shot prompting. Clustering-based seed selection produces more representative synthetic examples than random sampling, enabling compact models to achieve strong performance with minimal human labeling. On complex, noisy financial text, the student model outperforms the teacher model, while remaining competitive on formal text.

arxiv arXiv cs.AI · 8d ago

ProfiLLM: Utility-Aligned Agentic User Profiling for Industrial Ride-Hailing Dispatch

ProfiLLM introduces an agentic LLM pipeline that extracts behavioral signals from ride-hailing logs to generate user profiles. It achieves up to +6.14% relative AUC improvement and up to +4.35% GMV gain in dispatching simulations. Online A/B tests are consistent with these results, showing +0.47% GMV, +0.33% Completion Rate, and a -0.82% change in Cancel-Before-Accept rate.

arxiv arXiv cs.AI · 8d ago

SAERec: Fine-grained Intent Priors via Sparse Autoencoders

SAERec constructs fine-grained, interpretable intent priors from textual corpora, using sparse autoencoders to disentangle intent-related semantics. It retrieves both personal and public intents for each user, guides recommendations with human-understandable explanations, and outperforms state-of-the-art models on public datasets.