Hugging Face — korshunov.ai

The Buddy System uses a Rust entropy monitor to detect per-token uncertainty in local Gemma 3 4B inference, routing only uncertain tokens to Sonnet via NER-gated span extraction and semantic retrieval. Across seven Hugging Face datasets it reaches 71.4% accuracy at $0.21, compared with 62.9% at $0.44 for the Anthropic Advisor pattern. A key gain comes on SQuAD v2, where source passage chunks are routed to the cloud model.
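The per-token uncertainty check can be illustrated with a minimal Python sketch. This is not the Buddy System's Rust monitor: the function names, the toy distributions, and the 1.5-bit threshold are all assumptions chosen for illustration, and the NER gating and retrieval steps are omitted.

```python
import math

def token_entropy(probs):
    """Shannon entropy, in bits, of one next-token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

def uncertain_positions(distributions, threshold_bits=1.5):
    """Return indices of tokens whose entropy exceeds the threshold.

    threshold_bits is a hypothetical value, not one taken from the source.
    """
    return [
        i for i, probs in enumerate(distributions)
        if token_entropy(probs) > threshold_bits
    ]

# Toy next-token distributions for three generated tokens (illustrative only).
distributions = [
    [0.97, 0.01, 0.01, 0.01],  # confident: low entropy
    [0.25, 0.25, 0.25, 0.25],  # uniform over four options: 2 bits
    [0.50, 0.50, 0.00, 0.00],  # two-way tie: 1 bit
]

flagged = uncertain_positions(distributions)
print(flagged)
```

Only the flagged positions would be candidates for escalation to the larger model; everything else stays local.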

arxiv arXiv cs.CL · 2d ago

LRE: Few-Kilobytes Agent Memory with Zero Neural Cost

LRE is a CPU-only, language-model-free system that learns which interaction history units are load-bearing. It outperforms baselines in accuracy-cost balance, reducing peak context size by up to 52% and improving task completion by 37% in some cases. LRE achieves superior answer quality with 68% fewer tokens and requires no annotations or neural computation for training.

arxiv arXiv cs.CL · 2d ago

Beaver: Agent Harness for Scientific Curation from Multimodal Sources

Beaver is an agent harness that extracts structured information from scientific papers by integrating multimodal evidence tooling, task scaffolding, and artifact-grounded autoresearch. It achieves 81.0 on the Gold-Referenced Attribute Score, outperforming frontier agents by over 23 points, with key gains on high-value attributes requiring cross-modal reasoning.

arxiv arXiv cs.AI · 6d ago

UltraQuant: 4-bit KV Caching for Context-Heavy Agents

UltraQuant enables 4-bit KV caching for context-heavy agents, reducing P50 time-to-first-token by 3.47x in late rounds and boosting output throughput by 1.63x over FP8 KV baseline. It achieves this using FP8 queries, FP4 KV tensors, UE8M0 group scales, and native scaled-MFMA on AMD CDNA4 GPUs, with optimizations for decode-attention kernels and robust design choices like asymmetric K/V treatment and Walsh-Hadamard rotation.
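To make the FP4-with-group-scale idea concrete, here is a rough Python sketch of quantizing one group of values to the FP4 (E2M1) magnitude grid with a power-of-two scale, which is what an exponent-only UE8M0 scale can represent. It is an assumption-laden illustration, not UltraQuant's kernel: the helper names are hypothetical, the codes are kept as floats rather than packed 4-bit integers, and the asymmetric K/V treatment and Walsh-Hadamard rotation are left out.

```python
import math

# Magnitudes representable by FP4 E2M1 (the sign is stored separately).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_group(values):
    """Quantize one group to FP4 grid values with a shared power-of-two scale.

    Returns (exponent, codes), where scale = 2 ** exponent and each code is a
    signed FP4 grid value (kept as a float here for readability).
    """
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0.0] * len(values)
    # Smallest power-of-two scale that maps the largest magnitude to <= 6.0.
    exponent = math.ceil(math.log2(amax / 6.0))
    scale = 2.0 ** exponent
    codes = []
    for v in values:
        magnitude = min(FP4_GRID, key=lambda g: abs(abs(v) / scale - g))
        codes.append(math.copysign(magnitude, v))
    return exponent, codes

def dequantize_group(exponent, codes):
    """Reconstruct approximate values from the shared exponent and FP4 codes."""
    scale = 2.0 ** exponent
    return [c * scale for c in codes]

group = [0.1, -0.8, 2.9, 0.02]  # illustrative values only
exponent, codes = quantize_group(group)
restored = dequantize_group(exponent, codes)
print(exponent, codes, restored)
```

Small values in a group collapse to zero when one large value dominates the scale, which is one reason rotations that spread outliers are commonly paired with low-bit formats.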

arxiv arXiv cs.CL · 6d ago

H-RePlan: Hierarchical Recovery for Cross-Device Agent Systems

H-RePlan introduces a hierarchical replanning framework that separates device-local strategy recovery from global orchestrator replanning. Through scope-aware recovery in multi-device agent systems, it achieves higher completion and instruction adherence than existing baselines at reduced token cost.

arxiv arXiv cs.AI · 6d ago

See-and-Reach: Vision-Language Navigation for UAVs in Field of View

UAV-VLN-FOV isolates the see-and-reach stage for precise evaluation of UAV navigation. 3DG-VLN enhances visual grounding and spatial alignment using dynamic 3D direction cues. It improves success rate by 13.82% over baselines and is validated in real-world trials.

arxiv arXiv cs.CL · 6d ago

JAMER: Project-Level Code Framework Dataset and Benchmark

JAMER introduces JamSet and JamBench, the first project-level game code dataset and benchmark on a professional game engine. Built from 8,133 verified Game Jam projects, it enables deterministic evaluation and reveals a capability cliff in AI models as project scale increases, with runtime pass rates dropping from 80.4% to 5.7%.

arxiv arXiv cs.LG · 7d ago

OneCanvas: 3D Scene Understanding via Panoramic Reprojection

OneCanvas enables 3D scene understanding in Vision-Language Models by aggregating patch features onto a single panoramic canvas using 3D world coordinates. It achieves state-of-the-art performance on SQA3D and VSI-Bench, and generalizes to out-of-distribution data on SPBench, using significantly less training compute than existing methods.

arxiv arXiv cs.AI · 7d ago

User as Engram: Local Parametric Edits for Personal Memory

User as Engram proposes storing per-user facts as surgical, hash-keyed edits to a memory table, leaving reasoning in a shared adapter. This design achieves 5.6x higher indirect-reasoning accuracy and maintains base-level reasoning performance, with a memory footprint 33,000x smaller than per-user LoRA. The approach enables disjoint user edits that compose losslessly, outperforming retrieval pipelines beyond 100 facts.

arxiv arXiv cs.AI · 7d ago

Data Intelligence Agents Enable Autonomous Data Querying

Data Intelligence Agents (DIA) deploy autonomous coding agents to streamline enterprise data workflows. The Query Generator matches or exceeds top published results on seven SQL benchmarks across four dialects, showing generalization through natural-language instructions and execution-based architecture.

arxiv arXiv cs.LG · 8d ago

NoiseTilt: Noise-Tilted Reverse Kernels for Diffusion Reward Alignment

NoiseTilt introduces NTRK, a reward-guided diffusion sampler that injects reward gradients via the noise term without altering the reverse kernel. By using a whitening operator, NTRK safely biases noise toward high reward, preserving sample quality while maintaining strong guidance. On aesthetic generation, NTRK achieves superior reward performance with 25 NFEs, reducing compute by 20× compared to state-of-the-art baselines.

arxiv arXiv cs.AI · 9d ago

BinTrack: Open-Source Spatial QA with Binary Trajectory Search

BinTrack is a fully open-source spatial question answering agent that uses binary search over a robot's trajectory to locate answers. It achieves up to 22.8% higher accuracy than other open-source methods and matches closed-source model performance on the most challenging global category of the SpaceLocQA benchmark. The system also offers over 1.5x faster inference and introduces GangnamLoop, a real-world outdoor benchmark collected with a quadruped robot.
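Binary search over a trajectory can be sketched as follows. This is a generic illustration rather than BinTrack's actual procedure: it assumes a hypothetical monotone predicate `seen_by(i)` ("has the target been observed by frame i?"), and the frame labels are made up.

```python
def first_true(n, seen_by):
    """Return the smallest index i in [0, n) with seen_by(i) True, else None.

    Assumes seen_by is monotone: once True, it stays True for later indices.
    """
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        if seen_by(mid):
            hi = mid          # answer is at mid or earlier
        else:
            lo = mid + 1      # answer is strictly after mid
    return lo if lo < n else None

# Toy trajectory: one label per frame (illustrative only).
frames = ["hall", "hall", "kitchen", "red chair", "office", "office", "lab", "lab"]
calls = []

def seen_by(i):
    """Hypothetical check: was the target visible in any frame up to index i?"""
    calls.append(i)
    return "red chair" in frames[: i + 1]

index = first_true(len(frames), seen_by)
print(index, len(calls))
```

The predicate is evaluated a logarithmic number of times in the trajectory length, which is the appeal when each evaluation is an expensive model call.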

media r/LocalLLaMA · 2d ago

Boogu-Image-0.1: Open-Source Unified Image Generation and Editing Model Series

Boogu-Image-0.1 is an Apache-2.0 licensed open-source unified image generation and editing model family, including Base, Turbo, and Edit variants. It offers high-quality text-to-image generation, fast generation, image editing, and strong Chinese-English text rendering. Its training data is roughly one order of magnitude smaller than that of closed-source systems, yet it achieves competitive performance through improved model understanding and data quality.

lab Hugging Face Blog · 2d ago

Shipping huggingface_hub weekly with AI, open tools, and human oversight

Hugging Face is releasing huggingface_hub weekly, integrating AI models, open-source tools, and a human review process to ensure quality and safety. The update emphasizes transparency, community involvement, and responsible AI development through continuous human-in-the-loop validation.

media Hugging Face Forums · 2d ago

My Hugging Face Account Was Locked

A user reports their Hugging Face account, AntixStudioDesign, was locked unexpectedly during experimentation with AI tools. They have contacted the Safety Team via email and seek advice on account recovery, response time, and data preservation options.

arxiv arXiv cs.CL · 2d ago

CAT-Translate: Compact Japanese-English Models Outperform Multilingual Ones in Real-World Tasks

CAT-Translate introduces a family of small, open-source models specialized for Japanese-English translation. Using synthetic parallel corpora and a two-stage fine-tuning approach, the models achieve superior performance on real-world benchmarks across business, legal, medical, financial, and patent domains, outperforming large multilingual models in practical applications.

media Hugging Face Forums · 2d ago

AI Music Model Runs in Real Time on Most CPUs in Browser

NanoMaestro Realtime is a 50MB AI music model with 13M parameters that generates piano music in real time using a 2-layer LSTM. It runs locally in the browser via ONNX and Transformers.js with WASM, requiring no GPU or server backend, and works on older Raspberry Pi models.

blog Simon Willison · 2d ago

Porting Moebius 0.2B Image Inpainting to Browser with Claude Code

The Moebius 0.2B image inpainting model has been successfully ported to run in the browser using WebGPU and ONNX Runtime. The project, initiated with Claude Code, converts the model's weights to ONNX and deploys them via Hugging Face, with a simple web interface available at simonw.github.io/moebius-web/.

media r/LocalLLaMA · 5d ago

Fixing Long-Context Decode Cliff on Radeon R9700 with vLLM 0.22.1

A long-context decode performance cliff on AMD Radeon AI PRO R9700 (RDNA4) was resolved by enabling AITER Unified Attention in vLLM 0.22.1. The fix involves relaxing a CDNA gate to include RDNA4, disabling other attention backends, and using bf16 KV cache, resulting in significant speedups across all context lengths. FP8 KV is ineffective on this hardware, and the model's native 262K context is fully achievable with bf16, offering ~2.9× concurrency without needing FP8.

Buddy System: Rust entropy monitor with NER-gated uncertainty for tiered LLM inference
