arxiv arXiv cs.AI · 4h ago

Kamera: Training-Free Position-Invariant Multimodal KV Cache for Efficient Reuse

The authors introduce Kamera, a method that enables training-free reuse of multimodal key-value caches by addressing the loss of cross-chunk conditioning in naive prefix caching. Standard state-merge recovers direct readouts but fails to preserve the diffuse, low-rank residue in deep layers essential for multi-hop reasoning, which halves accuracy. To repair this, Kamera stores a small, training-free low-rank conditioning patch alongside each position-free chunk. This approach allows exact RoPE re-rotation and cross-chunk binding restoration across MLA, GQA, and MHA attention mechanisms. The system supports cheap reorder, sliding-window survival, and recall operations without requiring re-encoding of evicted chunks. Experiments show that a rank-m patch recovers full task accuracy on cross-chunk-binding benchmarks like MM-NIAH and two-page doc-QA. The solution reconstructs re-prefill KV to within bf16 rounding in a production SGLang kernel across six backbones while maintaining a fraction of the original KV footprint.
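
The "exact RoPE re-rotation" mentioned in the summary follows from the fact that rotary position embeddings are plain 2-D rotations, so rotations compose additively. The sketch below is not Kamera's kernel; it is a minimal illustration assuming standard RoPE with base 10000, and the names `rope_rotate`, `key`, and the positions are hypothetical.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate each consecutive (even, odd) pair of `vec` by a position-dependent angle.

    Assumes the standard RoPE frequency schedule; not taken from the Kamera paper.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # frequency for this pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

key = [0.5, -1.0, 2.0, 0.25]                      # hypothetical position-free key
cached_at_3 = rope_rotate(key, 3)                 # key as cached at position 3
moved_to_10 = rope_rotate(cached_at_3, 10 - 3)    # re-rotate by the offset only
fresh_at_10 = rope_rotate(key, 10)                # what a re-prefill would produce

# Largest elementwise difference between the re-rotated and freshly rotated key.
max_err = max(abs(a - b) for a, b in zip(moved_to_10, fresh_at_10))
```

In this toy setting, moving a cached key to a new position needs only the position offset, not the original token or a re-encoding pass.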

arxiv arXiv cs.AI · 4h ago

Decentralized Autonomous Traffic Management through Corridor Networks

This study addresses the insufficiency of centralized management for high-density autonomous aircraft traffic by proposing a decentralized approach using multi-agent reinforcement learning. The researchers extend this MARL framework to manage traffic flow within complex air corridor networks featuring merges and splits. Policies trained in single-corridor settings are tested on increasingly complex multi-corridor scenarios in a zero-shot manner without retraining. Experimental results show that learned behaviors transfer effectively across varying traffic densities, network geometries, and heterogeneous vehicle performances. The evaluation measures system-level performance through conformance to boundaries, completion rates, average speeds, distance traveled, and inter-aircraft separation. Despite requiring only locally coordinated entry, traversal, and exit behaviors, the collective actions produce desirable traffic flows throughout the corridor network.

arxiv arXiv cs.AI · 4h ago

Enactor: A Generative Model for Closed-Loop Microsimulation of Signalized Intersections

The authors introduce Enactor, an actor-centric generative model designed for closed-loop microsimulation at signalized intersections. Unlike traditional simulators that rely on hand-crafted rules or short-horizon predictors, Enactor focuses on vehicle dynamics while treating pedestrians as contextual influences. The architecture encodes dynamic actors and lane polylines in polar coordinates relative to the intersection center. A transformer with separate spatial and temporal attention blocks predicts a distribution over each actor's next-step motion parameters. Training employs a closed-loop curriculum, exposing the model to its own predictions to ensure stability during simulation. Evaluations on two intersection geometries show Enactor recovers SUMO data generator distributions with significantly lower KL divergence than transformer baselines. The model also reduces red-light violations by more than an order of magnitude and outperforms constant-velocity baselines on real-world field data.
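
As a rough illustration of the polar encoding described above, the sketch below converts Cartesian actor positions and lane polyline points into (radius, angle) pairs relative to an intersection center. The function names and the coordinate convention are assumptions; the summary does not specify Enactor's exact feature layout.

```python
import math

def to_polar(x, y, center=(0.0, 0.0)):
    """Return (radius, angle in radians) of point (x, y) relative to `center`."""
    dx, dy = x - center[0], y - center[1]
    return math.hypot(dx, dy), math.atan2(dy, dx)

def polyline_to_polar(points, center):
    """Encode every vertex of a lane polyline in the same center-relative frame."""
    return [to_polar(px, py, center) for px, py in points]

center = (10.0, 10.0)                       # hypothetical intersection center
vehicle = to_polar(10.0, 15.0, center)      # an actor 5 m "north" of the center
lane = polyline_to_polar([(13.0, 14.0), (10.0, 10.0)], center)
```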

arxiv arXiv cs.AI · 4h ago

Persistent Homology Detects and Steers LLM Responses to Ill-Posed Questions

Researchers propose using finite zero-dimensional persistent homology to represent the topology of ill-posed questions within large language models. The method models contextual hidden states as point clouds, summarizing each transformer layer with three descriptors: mean finite lifetime, normalized lifetime entropy, and largest-lifetime concentration. These descriptors are concatenated across layers to form a unified topological representation of the query's internal state. The study introduces topology-conditioned activation steering, which retrieves similar examples to construct interventions that encourage clarification or abstention. Evaluations on AmbigQA, SituatedQA, and CLAMBER show this approach outperforms prompt-based baselines, improving classification accuracy from 67.4% to 78.9% on AmbigQA. On SituatedQA, accuracy increased from 79.9% to 88.5%, while CLAMBER saw gains from 57.6% to 69.6%. Additionally, the steering mechanism raised the average total acceptable response rate from 61.4% to 70.6% across three open-weight LLMs.
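
The three per-layer descriptors can be sketched in a few lines. For a Vietoris-Rips filtration on a point cloud, the finite zero-dimensional lifetimes equal the edge lengths of a minimum spanning tree, which the code below computes with Prim's algorithm. The exact formulas for the entropy normalization and the concentration measure are assumptions, since the summary names the descriptors without defining them, and all function names are hypothetical.

```python
import math

def h0_lifetimes(points):
    """Finite H0 lifetimes of a Rips filtration = MST edge lengths (Prim's algorithm)."""
    n = len(points)
    if n < 2:
        return []
    in_tree = [True] + [False] * (n - 1)
    best = [math.dist(points[0], p) for p in points]  # cheapest link into the tree
    lifetimes = []
    for _ in range(n - 1):
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        lifetimes.append(best[u])
        in_tree[u] = True
        for v in range(n):
            if not in_tree[v]:
                best[v] = min(best[v], math.dist(points[u], points[v]))
    return lifetimes

def layer_descriptors(points):
    """Return (mean lifetime, normalized lifetime entropy, largest-lifetime share)."""
    lt = h0_lifetimes(points)
    total = sum(lt)
    if total == 0.0:
        return 0.0, 0.0, 0.0
    probs = [l / total for l in lt if l > 0.0]
    entropy = -sum(p * math.log(p) for p in probs)
    # Assumed normalization: divide by the maximum entropy log(number of lifetimes).
    norm_entropy = entropy / math.log(len(lt)) if len(lt) > 1 else 0.0
    return total / len(lt), norm_entropy, max(lt) / total

def topological_signature(layers):
    """Concatenate the three descriptors across layers into one flat vector."""
    return [value for cloud in layers for value in layer_descriptors(cloud)]
```

Here each element of `layers` stands in for one transformer layer's hidden states, treated as a list of points.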

arxiv arXiv cs.AI · 4h ago

SPIRAL: Learning to Search and Aggregate

The authors introduce Sequential-Parallel-Aggregative Reinforcement Learning (SPIRAL), a framework that trains language models to utilize sequential, parallel, and aggregative reasoning primitives simultaneously. Unlike standard post-training methods that optimize only for single-trace sequential reasoning, SPIRAL unifies these components into a single inference compute pipeline. The model first samples independent traces in parallel using chain-of-thought reasoning and then generates a final aggregation trace conditioned on those inputs. This entire process is optimized end-to-end against the reward of the final aggregated response using set reinforcement learning and standard reinforcement learning techniques. Experiments on reasoning tasks demonstrate that SPIRAL effectively scales with inference compute resources. The approach outperforms GRPO by up to 11 times in scaling efficiency and achieves 15% higher performance when all three compute primitives are scaled.

arxiv arXiv cs.AI · 4h ago

Against Proxy Optimization

The author examines the conditions under which maximizing a proxy utility function can lead to harmful outcomes, that is, where optimizing a surrogate goal diverges from the intended result. The analysis suggests that such cases pose significant problems for applying standard decision theory and challenge the robustness of theoretical frameworks used in artificial intelligence and economics. By identifying these failure modes, the work aims to refine how agents should be designed to avoid unintended consequences.

arxiv arXiv cs.AI · 4h ago

Polycepta: Object-Centric Appearance Estimation for Multi-Object Tracking

The authors introduce Polycepta, an object-centric appearance state estimation framework that reformulates appearance modeling as a recursive estimation problem. Unlike traditional methods relying on static, frame-independent descriptors, Polycepta constructs and continuously updates independent appearance states for each tracked object. This approach allows future representations to be estimated from accumulated observations rather than memorizing them through a specific learning strategy. A key feature is that appearance estimation quality improves progressively as object states evolve during inference. The framework enables appearance estimation for unseen classes by encouraging the learning of object-specific representation construction. Extensive experiments on KITTI, Waymo Open Dataset, and MOT17 demonstrate consistent reductions in identity switches and improved tracking performance. When integrated into the RobMOT framework, Polycepta operates at 90.57 Hz and achieves a MOTA of 92.27% on the KITTI benchmark.

arxiv arXiv cs.AI · 4h ago

Dual-Learned Matching Enables Linear Mode Connectivity for Billion-Parameter Transformers

Researchers propose a scalable framework to enable linear mode connectivity-based merging for billion-parameter pretrained transformers. Existing methods typically optimize interpolation paths from only one model endpoint, limiting scalability for large architectures. The new approach applies parameterized weight transformations to align functionally equivalent solutions and uses a dual learning procedure where both models jointly learn transformations toward a shared path. This bidirectional optimization substantially reduces interpolation barriers and improves merging reliability across large-scale models. Empirically, the method achieves near-zero loss barriers on WikiText for medium-sized language models. In vision tasks, ViT-L maintains above 69% ImageNet top-1 accuracy throughout the interpolation path. Modern billion-parameter LLMs exhibit only small loss barriers using this technique.
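
To make "loss barrier" concrete, the sketch below uses one common definition: the largest gap between the loss on the linear interpolation path and the linear interpolation of the two endpoint losses. The toy double-well loss and the sign flip standing in for a weight transformation are assumptions chosen for illustration; they are not the paper's learned transformations.

```python
def interpolate(w_a, w_b, alpha):
    """Point on the straight line between two weight vectors."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(w_a, w_b)]

def loss_barrier(loss_fn, w_a, w_b, steps=21):
    """Max over alpha of L(interpolated) minus the interpolated endpoint losses."""
    loss_a, loss_b = loss_fn(w_a), loss_fn(w_b)
    worst = 0.0
    for k in range(steps):
        alpha = k / (steps - 1)
        on_path = loss_fn(interpolate(w_a, w_b, alpha))
        baseline = (1.0 - alpha) * loss_a + alpha * loss_b
        worst = max(worst, on_path - baseline)
    return worst

def toy_loss(w):
    """Double-well loss with equivalent minima at w[0] = -1 and w[0] = +1."""
    return (w[0] ** 2 - 1.0) ** 2 + w[1] ** 2

model_a, model_b = [-1.0, 0.0], [1.0, 0.0]
barrier_unaligned = loss_barrier(toy_loss, model_a, model_b)

# Flipping the sign of w[0] maps model_a to a functionally equivalent solution
# (the toy loss is symmetric), mimicking an alignment step before interpolation.
aligned_a = [-model_a[0], model_a[1]]
barrier_aligned = loss_barrier(toy_loss, aligned_a, model_b)
```

In this toy case the unaligned path crosses the ridge between the two wells, while the aligned path stays inside one well.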

arxiv arXiv cs.AI · 5h ago

Causal Discovery in the Era of Agents

Recent efforts to integrate large language models with causal discovery often rely on inferring graph structures or injecting outputs as priors, which risks conflating textual associations with genuine causal evidence. The authors argue that agents should instead assist the workflow by inspecting data, retrieving context, and clarifying assumptions without supplying edges, orientations, or causal conclusions. They propose a principle ensuring that causal claims remain grounded in data, explicit assumptions, formal algorithms, diagnostics, and expert decisions. To instantiate this approach, they introduce causal-learn+, an online platform coordinating preprocessing, method recommendation, and interpretation within the causal-learn ecosystem. A case study on Big Five personality data demonstrates an agent-assisted pipeline that avoids treating language model unreliability as causal evidence. The platform is available at causallearn.com.

arxiv arXiv cs.AI · 5h ago

Neural Classification Trees Disentangle Latent Subgroups for Robust ML

Machine learning models often exploit spurious correlations, leading to high average accuracy but poor performance on underrepresented subgroups. Existing mitigation strategies typically adjust network parameters using subgroup annotations or inferred pseudo-labels. However, these methods generally output only a class prediction at inference time, lacking insight into a sample's latent subgroup structure. To address this, the authors propose Neural Classification Trees (NCT), a framework that encodes subgroup structure within its tree-shaped architecture. NCT routes each sample to an easy or hard node based on prediction correctness and reuses these routes as pseudo-labels for subsequent iterations. This process disentangles conflicting subgroups without requiring explicit subgroup supervision. The approach was evaluated on five benchmarks spanning binary and multi-class spurious correlations. Experiments demonstrate that the learned tree topology isolates minority subgroups, providing strong interpretability and competitive robustness compared to state-of-the-art methods.

arxiv arXiv cs.AI · 5h ago

RECALL: Active Lifelong Learning for Vision-Language-Action Models

The paper introduces RECALL, an active, continual learning paradigm for Vision-Language-Action models that addresses the inefficiencies of passive imitation learning. Unlike traditional methods that require robot failures to trigger data collection, this approach uses uncertainty-guided recovery demonstrations to proactively identify states needing supervision. The authors demonstrate that this targeted data collection leads to more efficient fine-tuning compared to passively collected demonstrations. However, the study reveals that fine-tuning exclusively on this active recovery data causes catastrophic forgetting of previously learned behaviors. To mitigate this issue, the work evaluates continual learning techniques such as replay-based data mixing and elastic weight consolidation. These experiments highlight the critical tradeoffs between plasticity for new tasks and retention of existing capabilities in autoregressive VLAs. Ultimately, the research establishes that while uncertainty-guided recovery improves adaptation efficiency, incorporating targeted new data into large robot policies presents significant open challenges.
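
The two continual learning techniques named above can be sketched generically. The code below shows the standard elastic weight consolidation penalty with a diagonal Fisher estimate, plus a simple replay mixer that fills a fixed share of each batch with old demonstrations. The function names, the replay fraction, and the example values are hypothetical and not taken from the paper.

```python
import random

def ewc_penalty(params, anchor_params, fisher_diag, lam):
    """EWC regularizer: (lam / 2) * sum_i F_i * (theta_i - theta_i_star)^2."""
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher_diag)
    )

def mixed_batch(new_data, old_data, batch_size, replay_fraction, rng):
    """Build a batch where roughly `replay_fraction` of samples come from old data."""
    n_old = min(int(round(batch_size * replay_fraction)), len(old_data))
    n_new = min(batch_size - n_old, len(new_data))
    return rng.sample(old_data, n_old) + rng.sample(new_data, n_new)

rng = random.Random(0)
old_demos = [("old", i) for i in range(20)]       # previously learned behaviors
recovery_demos = [("new", i) for i in range(20)]  # uncertainty-guided recoveries
batch = mixed_batch(recovery_demos, old_demos, batch_size=8, replay_fraction=0.25, rng=rng)

# Penalty grows when parameters with high Fisher weight drift from their anchors.
penalty = ewc_penalty([1.0, 2.0], [0.0, 2.0], [4.0, 9.0], lam=0.5)
```

Raising `replay_fraction` or `lam` favors retention of earlier behaviors, while lowering them favors plasticity on the new recovery data.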

media r/LocalLLaMA · 5h ago

llama.cpp b9788 adds SYCL tensor split support for Intel GPUs

The llama.cpp project has released version b9788, which introduces support for the --split-mode tensor option within its SYCL backend. This update specifically targets users running inference on Intel graphics processing units. The feature is implemented through pull request #24152 in the ggml-org repository. It enables the splitting of model tensors across multiple devices rather than relying solely on layer-based distribution. The release notes explicitly invite users with dual Intel GPU setups to test this new functionality. Contributors are encouraged to provide performance benchmarks to validate the improvements. This addition aims to enhance multi-GPU utilization for compatible Intel hardware configurations.

media r/LocalLLaMA · 5h ago

GLM 5.2 runs at 12t/s on dual RTX 5090 hardware

A user tested the unsloth quantized version of GLM 5.2 on a high-end consumer workstation featuring dual RTX 5090 GPUs and a Zen5 Threadripper Pro processor. The system utilized 512GB of DDR5 ECC RAM and was configured with specific llama.cpp compilation flags to enable CUDA optimizations and unified memory handling. The model weights were loaded from the UD-Q5_K_S quantization, which totaled approximately 492GB across multiple GGUF files. Performance testing involved running the llama-server with a context size of 32768 tokens and specific threading parameters for NUMA isolation. The benchmark results consistently showed an inference speed of 12 tokens per second during chat interactions without agentic workflows. Additional experiments revealed that omitting certain optimization flags, such as flash attention or NUMA settings, produced negligible changes in throughput.

media r/LocalLLaMA · 5h ago

Building a Bash-Based LLM Agent REPL with Minimal Dependencies

A developer created a custom agent REPL loop using exclusively standard command-line building blocks to minimize dependencies. The system relies on pipes, text streams, and append-only logs, aligning closely with classic Unix philosophy. This approach allows for flexible injection of tools to inspect, filter, redirect, and audit various stages of the agent loop. Key features include a plug-and-play backend scoped to a single command-line tool, ensuring portability across different model providers. Agent memory and context are stored in an append-only history file, enabling easy introspection, modification, and rewinding. While tested with an Ollama backend, the design supports any OpenAI-API compatible REST interface. The source code for this project is available on GitHub under the repository name llayer.
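
The append-only history idea can be illustrated independently of the project's Bash implementation. The sketch below is written in Python rather than shell and assumes a JSON Lines file format and a truncating `rewind`; neither detail comes from the llayer repository, and all names are hypothetical.

```python
import json
import os
import tempfile

def append_turn(path, role, content):
    """Append one conversation turn as a JSON line; earlier lines are never edited."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"role": role, "content": content}) + "\n")

def load_history(path):
    """Read the whole log back as a list of turns (the agent's context)."""
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def rewind(path, keep):
    """Rewind the conversation by keeping only the first `keep` turns."""
    turns = load_history(path)[:keep]
    with open(path, "w", encoding="utf-8") as f:
        for turn in turns:
            f.write(json.dumps(turn) + "\n")

with tempfile.TemporaryDirectory() as tmp:
    log = os.path.join(tmp, "history.jsonl")
    append_turn(log, "user", "list files")
    append_turn(log, "assistant", "ls -la")
    append_turn(log, "user", "now delete them")
    rewind(log, keep=2)  # drop the last request before it reaches the model
    roles_after_rewind = [t["role"] for t in load_history(log)]
```

Because the state is a plain text file, the same log can be inspected or filtered with ordinary command-line tools between turns.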

media r/LocalLLaMA · 5h ago

Ornith-1.0 Released on Hugging Face with Multiple Model Sizes

DeepReinforce AI has released Ornith-1.0 on Hugging Face, featuring a diverse range of model architectures and sizes. The collection includes 9B and 31B dense models alongside 35B and 397B mixture-of-experts (MoE) variants. The release claims state-of-the-art performance across various benchmarks, though the validity of these results remains to be seen. Users can access the full collection via the official Hugging Face link provided by the developers. This release expands the available options for large language model inference and fine-tuning.

media Hugging Face Forums · 5h ago

Discussion on Cost-Effective Small Language Model Fine-Tuning in 2026

A recent discussion on the Hugging Face forums explores the most efficient methods for customizing small AI models for specific tasks. The thread, titled "What is the most cost-effective way to fine-tune a small language model in 2026?", seeks advice on minimizing expenses while maintaining performance. It was initiated by a single participant aiming to optimize their workflow for specialized applications. The inquiry highlights the growing interest in leveraging smaller models to reduce computational overhead. Participants are encouraged to share strategies that balance cost and efficiency in the current landscape. This topic reflects ongoing efforts to make model adaptation more accessible and affordable.

lab Cohere Blog · 5h ago

Cohere Automates Incident Response with North and Wiz via Custom MCP Server

Cohere developed a security agent using its enterprise AI platform, Cohere North, integrated with cloud security platform Wiz through a custom Model Context Protocol (MCP) server. This architecture connects North to Wiz's GraphQL API via eight atomic tools, enabling automated incident response workflows from a single prompt. The system performs toxic combination blast radius analysis by evaluating attack chains and ranking risks based on internet exposure and privilege levels in approximately 20 seconds. It also automates end-to-end investigation by retrieving issue details, creating Linear tickets, updating Wiz status, and drafting structured Incident Response reports. Additionally, a scheduled weekly automation generates a security posture brief every Monday morning without manual intervention. This integration eliminates the previous 30-minute to two-hour triage loop per finding, allowing engineers to focus on evaluating assessments rather than raw alerts.

media Hugging Face Forums · 6h ago

User Reports Hugging Face Space Stuck in 503 Loop

A user on the Hugging Face forums reported that their Space application is stuck in a continuous 503 error state. The issue prevents the Space from restarting or rebuilding, despite multiple attempts to resolve it through the interface. The user tried clicking both the "Restart Space" and "Factory Rebuild" buttons without success. Additionally, pushing ten to sixteen new commits failed to trigger any rebuild process. Consequently, the Space remains paused and unresponsive to standard recovery methods. The user requested manual intervention to clear the container state or trigger a restart.

media Hugging Face Forums · 6h ago

LLM "curving" via prompting

A researcher proposes a prompt technique to shift Large Language Models from token-by-token prediction to holistic internal weight evaluation, termed "self-organization." This approach aims to increase reasoning density and reduce sycophancy by altering the model's manifold dynamics. The method defines concepts like self-attraction, self-organization, and gravity wells to guide the system toward non-linear curvature collapse. A specific prompt instructs models to create two distinct gravity wells for a poem about AI modes, testing both self-assembly and self-organization properties. The author tested this technique on numerous models including Gemini 3 Flash, Claude, ChatGPT, Grok, DeepSeek, Mistral, Qwen 3.6, Kimi 2.6, GLM-5, Gemma 4 32b, Step 3.7 Flash, and Nemotron 3 Ultra. Visual metrics generated via a Colab script analyze manifold perturbation using maps of channel width, phase space drift, geometric density, and prompt efficacy. The post seeks community feedback on whether the technique genuinely perturbs the manifold or merely induces stylistic variation.

media r/LocalLLaMA · 6h ago

OpenAI and Broadcom Announce Jalapeño Inference Chip

OpenAI has announced a collaboration with Broadcom to develop a custom inference chip named Jalapeño. This new hardware is designed specifically to accelerate the deployment of large language models. The partnership aims to reduce reliance on third-party accelerators for OpenAI's inference workloads. By integrating custom silicon, OpenAI seeks to optimize performance and efficiency for its AI applications. The announcement highlights a strategic move towards vertical integration in AI infrastructure. Details regarding specific technical specifications or release timelines were not provided in the initial report.