How can I self host code review?
A user asks about self-hosting code review tools due to Gemini Code Assist ending consumer support and moving to enterprise only. They are exploring GitHub apps or actions for local or cloud-based solutions.
llama.cpp release b9715 adds CUDA support for GGML_OP_COL2IM_1D, building on the existing CPU implementation. The release includes binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures and acceleration backends, including Vulkan, ROCm, OpenVINO, and SYCL.
Multi-LCB extends LiveCodeBench to twelve programming languages, preserving its contamination controls and evaluation protocol. It reveals Python overfitting, language-specific biases, and significant performance gaps among LLMs across languages, establishing a rigorous benchmark for cross-language code generation.
G2Rec introduces a scalable framework that combines holistic graph-based user co-engagement modeling with semantic tokenization. It enables generative recommendation models to capture comprehensive, semantically grounded user interest prototypes without ground-truth user interests, outperforming existing methods in industrial-scale sequential recommendation.
A new method called probe-and-refine tuning uses synthetic bug-fix probes to iteratively improve repository guidance files with single-shot LLM calls, without agent loops or tool use. On SWE-bench Verified, it achieves a 33.0% mean resolve rate, 14.5 percentage points above the initial static knowledge base, with the gain attributed to improved coverage rather than patch precision. The method lets agents use larger step budgets effectively, and performance remains stable across models when diagnostic output is sufficient.
IHUBERT is a monolingual Persian pretrained language model trained on a 45 GB curated subset of the Sepahr-Danesh collection. It uses vector-based semantic deduplication and a domain-balanced pretraining pipeline to improve corpus quality and reduce redundancy, achieving top performance in extractive question answering and strong results in NER and topic classification, though relation extraction remains a challenge.
A new adaptive LLM tutoring system uses subject-aware prompting to enhance student engagement. It outperforms static models in simulation and shows real-world effectiveness, reducing interactions by 3 turns and increasing exercise conversion rates to 28.1% with a stochastic strategy.
SoftSkill proposes a method to compress natural-language skills into compact latent priors, improving task performance on SearchQA, LiveMath, and DocVQA. It outperforms SkillOpt by 5.2 to 12.5 points on key benchmarks while replacing hundreds to thousands of Markdown tokens with a few virtual tokens.
AutoPass uses runtime and compiler evidence to guide LLM-generated optimization decisions, outperforming expert heuristics and classical autotuning methods. It achieves geometric-mean speedups of 1.043x on x86-64 and 1.117x on ARM64 systems without prior training or fine-tuning.
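For readers unfamiliar with the metric, a geometric-mean speedup can be computed as in the minimal sketch below. The function name and the timing values are hypothetical and purely illustrative; they are not taken from AutoPass or its evaluation.

```python
import math


def geometric_mean_speedup(baseline_times, optimized_times):
    """Geometric mean of per-benchmark speedups (baseline / optimized)."""
    if not baseline_times or len(baseline_times) != len(optimized_times):
        raise ValueError("need two non-empty timing lists of equal length")
    # Average the logs of the ratios, then exponentiate, so that a 2x speedup
    # and a 2x slowdown cancel out exactly.
    log_sum = sum(math.log(b / o) for b, o in zip(baseline_times, optimized_times))
    return math.exp(log_sum / len(baseline_times))


# Hypothetical timings in seconds: speedups of 2x, 1x, and 0.5x.
baseline = [2.0, 4.0, 8.0]
optimized = [1.0, 4.0, 16.0]
speedup = geometric_mean_speedup(baseline, optimized)
```

The geometric mean is the usual choice for ratios because it is symmetric in speedups and slowdowns, unlike the arithmetic mean.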
Benchmarks using fixed-shape checks miss real bugs in LLM-generated GPU kernels. A controlled corpus of 24 kernels, including 9 buggy variants with transcription errors, reveals that an op-schema-aware oracle detects all failures and passes all correct controls, with identical results across five GPU architectures.
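As a toy illustration of why a single fixed shape can hide an indexing bug, the sketch below uses a plain-Python transpose over a flat row-major buffer. This is an assumption-laden stand-in, not the paper's oracle or corpus: the function names and the chosen shapes are hypothetical.

```python
def transpose_reference(flat, rows, cols):
    """Transpose a row-major rows x cols buffer into a cols x rows buffer."""
    out = [0] * (rows * cols)
    for r in range(rows):
        for c in range(cols):
            out[c * rows + r] = flat[r * cols + c]
    return out


def transpose_buggy(flat, rows, cols):
    """Same as the reference, but reads with the wrong stride (rows, not cols)."""
    out = [0] * (rows * cols)
    for r in range(rows):
        for c in range(cols):
            out[c * rows + r] = flat[r * rows + c]  # bug: stride should be cols
    return out


def passes(kernel, shapes):
    """Return True if the kernel matches the reference on every given shape."""
    for rows, cols in shapes:
        flat = list(range(rows * cols))
        if kernel(flat, rows, cols) != transpose_reference(flat, rows, cols):
            return False
    return True


fixed_shape_only = passes(transpose_buggy, [(4, 4)])       # square shape hides the bug
with_other_shape = passes(transpose_buggy, [(4, 4), (2, 3)])  # non-square shape exposes it
```

On the square shape the two strides coincide, so the buggy variant is indistinguishable from the reference; adding one non-square shape is enough to separate them.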
A new system uses subject-aware prompting to adapt tutoring strategies based on student performance and discipline. A/B testing with 656 student conversations shows the model reduces interactions by 3 turns and increases learning strategy conversion from 19.1% to 28.1% with a stochastic router.
v2.1.183 improves auto mode safety by blocking destructive git and destroy commands without explicit user consent. It adds deprecation warnings for models, introduces attribution.sessionUrl to hide session links, and fixes multiple issues including terminal behavior, subagent performance, and input handling in web and tmux environments.
AgentFinVQA introduces a multi-agent pipeline for financial chart question answering that ensures auditability and on-premise deployability without significant accuracy loss. It outperforms baseline models by +7.68 pp using a proprietary backbone and +4.84 pp with open-weights Qwen3.6-27B-FP8, while providing a confidence signal via verifier output that improves human review routing.
JAMER introduces JamSet and JamBench, the first project-level game code dataset and benchmark on a professional game engine. Built from 8,133 verified Game Jam projects, it enables deterministic evaluation and reveals a capability cliff in AI models as project scale increases, with runtime pass rates dropping from 80.4% to 5.7%.
A zero-shot agentic workflow using open-source LLMs extracts 13 College of American Pathologists synoptic fields from lung resection pathology reports. The best model (GPT-OSS-20B) achieved a Micro-F1 of 0.893, exceeding the baseline in recall and accurately capturing complex pathologic relations without task-specific training.
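Micro-F1 pools true positives, false positives, and false negatives across all fields before computing a single F1 score. A minimal sketch follows; the field names and counts are hypothetical and are not taken from the study.

```python
def micro_f1(per_field_counts):
    """Micro-averaged F1 from per-field dicts with 'tp', 'fp', and 'fn' counts."""
    tp = sum(c["tp"] for c in per_field_counts.values())
    fp = sum(c["fp"] for c in per_field_counts.values())
    fn = sum(c["fn"] for c in per_field_counts.values())
    denominator = 2 * tp + fp + fn
    if denominator == 0:
        return 0.0  # no predictions and no gold labels
    return 2 * tp / denominator


# Hypothetical counts for two extraction fields.
counts = {
    "histologic_type": {"tp": 8, "fp": 1, "fn": 1},
    "margin_status": {"tp": 2, "fp": 2, "fn": 2},
}
score = micro_f1(counts)  # pooled: tp=10, fp=3, fn=3
```

Because the counts are pooled, fields with more instances weigh more heavily than they would under a macro average.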
STAGE is a pipeline that generates text-to-JSON training data by using LLMs to synthesize reports and JSON schemas, validated against underlying spreadsheets. Evaluations on STAGE-Eval show it improves Qwen3-4B exact match from 31.37% to 74.27% and value accuracy from 45.46% to 90.69%.
A new adaptive LLM tutoring system uses subject-aware prompting to enhance student engagement. It outperforms static models in simulation and real-world A/B testing, reducing interactions by 3 turns and increasing exercise conversion rates, especially with a stochastic router achieving 28.1%.
PsyScore integrates diagnostic scoring and instructional feedback using a shared latent ability model. It features a trait-adaptive neural IRT scorer based on GPCM, a ZPD-scaffolded feedback generator that tailors instruction by proficiency level, and a multi-perspective evaluation strategy. Experiments on ASAP++ show competitive scoring and more pedagogically aligned feedback compared to existing methods.
Datasette has released a new plugin, datasette-apps, enabling self-contained HTML+JavaScript applications to run in a secure iframe sandbox. These apps can execute read-only or write SQL queries against Datasette databases, with built-in security features like CSP headers and sandbox restrictions to prevent data exfiltration or unauthorized access.