arxiv arXiv cs.AI · 2h ago

RECALL: Active Lifelong Learning for Vision-Language-Action Models

The paper introduces RECALL, an active, continual learning paradigm for Vision-Language-Action models that addresses the inefficiencies of passive imitation learning. Unlike traditional methods that require robot failures to trigger data collection, this approach uses uncertainty-guided recovery demonstrations to proactively identify states needing supervision. The authors demonstrate that this targeted data collection leads to more efficient fine-tuning compared to passively collected demonstrations. However, the study reveals that fine-tuning exclusively on this active recovery data causes catastrophic forgetting of previously learned behaviors. To mitigate this issue, the work evaluates continual learning techniques such as replay-based data mixing and elastic weight consolidation. These experiments highlight the critical tradeoffs between plasticity for new tasks and retention of existing capabilities in autoregressive VLAs. Ultimately, the research establishes that while uncertainty-guided recovery improves adaptation efficiency, incorporating targeted new data into large robot policies presents significant open challenges.
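The two mitigation techniques named above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the replay fraction, and the penalty strength `lam` are assumptions chosen for illustration, and parameters are plain Python lists rather than model tensors. The penalty follows the standard diagonal-Fisher form of elastic weight consolidation.

```python
import random

def ewc_penalty(params, anchor_params, fisher_diag, lam):
    """Standard EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2.

    anchor_params are the weights before fine-tuning; fisher_diag holds the
    diagonal Fisher estimates that mark which weights mattered for old tasks.
    """
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher_diag)
    )

def mixed_batch(new_data, replay_data, batch_size, replay_fraction, rng):
    """Replay-based mixing: draw part of each batch from previously seen data."""
    n_replay = round(batch_size * replay_fraction)
    n_new = batch_size - n_replay
    return rng.sample(new_data, n_new) + rng.sample(replay_data, n_replay)

# Toy usage with hypothetical values.
penalty = ewc_penalty([1.0, 2.0], [1.0, 0.0], [10.0, 0.5], lam=2.0)

rng = random.Random(0)
new_demos = [f"recovery_{i}" for i in range(10)]   # active recovery data
old_demos = [f"original_{i}" for i in range(10)]   # earlier demonstrations
batch = mixed_batch(new_demos, old_demos, batch_size=8, replay_fraction=0.25, rng=rng)
```

A larger `lam` or replay fraction favors retention of earlier behaviors, while smaller values favor plasticity on the new recovery data, which is the tradeoff the summary describes.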

media r/LocalLLaMA · 2h ago

llama.cpp b9788 adds SYCL tensor split support for Intel GPUs

The llama.cpp project has released version b9788, which introduces support for the --split-mode tensor option within its SYCL backend. This update specifically targets users running inference on Intel graphics processing units. The feature is implemented through pull request #24152 in the ggml-org repository. It enables the splitting of model tensors across multiple devices rather than relying solely on layer-based distribution. The release notes explicitly invite users with dual Intel GPU setups to test this new functionality. Contributors are encouraged to provide performance benchmarks to validate the improvements. This addition aims to enhance multi-GPU utilization for compatible Intel hardware configurations.

media r/LocalLLaMA · 2h ago

GLM 5.2 runs at 12t/s on dual RTX 5090 hardware

A user tested the unsloth quantized version of GLM 5.2 on a high-end consumer workstation featuring dual RTX 5090 GPUs and a Zen5 Threadripper Pro processor. The system utilized 512GB of DDR5 ECC RAM and was configured with specific llama.cpp compilation flags to enable CUDA optimizations and unified memory handling. The model weights were loaded from the UD-Q5_K_S quantization, which totaled approximately 492GB across multiple GGUF files. Performance testing involved running the llama-server with a context size of 32768 tokens and specific threading parameters for NUMA isolation. The benchmark results consistently showed an inference speed of 12 tokens per second during chat interactions without agentic workflows. Additional experiments revealed that omitting certain optimization flags, such as flash attention or NUMA settings, produced negligible changes in throughput.

media r/LocalLLaMA · 3h ago

Building a Bash-Based LLM Agent REPL with Minimal Dependencies

A developer created a custom agent REPL loop using exclusively standard command-line building blocks to minimize dependencies. The system relies on pipes, text streams, and append-only logs, aligning closely with classic Unix philosophy. This approach allows for flexible injection of tools to inspect, filter, redirect, and audit various stages of the agent loop. Key features include a plug-and-play backend scoped to a single command-line tool, ensuring portability across different model providers. Agent memory and context are stored in an append-only history file, enabling easy introspection, modification, and rewinding. While tested with an Ollama backend, the design supports any OpenAI-API compatible REST interface. The source code for this project is available on GitHub under the repository name llayer.

media r/LocalLLaMA · 3h ago

Ornith-1.0 Released on Hugging Face with Multiple Model Sizes

DeepReinforce AI has released Ornith-1.0 on Hugging Face, featuring a diverse range of model architectures and sizes. The collection includes 9B and 31B dense models alongside 35B and 397B mixture-of-experts (MoE) variants. The release claims state-of-the-art performance across various benchmarks, though the validity of these results remains to be seen. Users can access the full collection via the official Hugging Face link provided by the developers. This release expands the available options for large language model inference and fine-tuning.

media Hugging Face Forums · 3h ago

Discussion on Cost-Effective Small Language Model Fine-Tuning in 2026

A recent discussion on the Hugging Face forums explores the most efficient methods for customizing small AI models for specific tasks. The thread, titled "What is the most cost-effective way to fine-tune a small language model in 2026?", seeks advice on minimizing expenses while maintaining performance. It was initiated by a single participant aiming to optimize their workflow for specialized applications. The inquiry highlights the growing interest in leveraging smaller models to reduce computational overhead. Participants are encouraged to share strategies that balance cost and efficiency in the current landscape. This topic reflects ongoing efforts to make model adaptation more accessible and affordable.

lab Cohere Blog · 3h ago

Cohere Automates Incident Response with North and Wiz via Custom MCP Server

Cohere developed a security agent using its enterprise AI platform, Cohere North, integrated with cloud security platform Wiz through a custom Model Context Protocol (MCP) server. This architecture connects North to Wiz's GraphQL API via eight atomic tools, enabling automated incident response workflows from a single prompt. The system performs toxic combination blast radius analysis by evaluating attack chains and ranking risks based on internet exposure and privilege levels in approximately 20 seconds. It also automates end-to-end investigation by retrieving issue details, creating Linear tickets, updating Wiz status, and drafting structured Incident Response reports. Additionally, a scheduled weekly automation generates a security posture brief every Monday morning without manual intervention. This integration eliminates the previous 30-minute to two-hour triage loop per finding, allowing engineers to focus on evaluating assessments rather than raw alerts.

media Hugging Face Forums · 3h ago

User Reports Hugging Face Space Stuck in 503 Loop

A user on the Hugging Face forums reported that their Space application is stuck in a continuous 503 error state. The issue prevents the Space from restarting or rebuilding, despite multiple attempts to resolve it through the interface. The user tried clicking both the "Restart Space" and "Factory Rebuild" buttons without success. Additionally, pushing ten to sixteen new commits failed to trigger any rebuild process. Consequently, the Space remains paused and unresponsive to standard recovery methods. The user requested manual intervention to clear the container state or trigger a restart.

media Hugging Face Forums · 3h ago

LLM "curving" via prompting

A researcher proposes a prompt technique to shift Large Language Models from token-by-token prediction to holistic internal weight evaluation, termed "self-organization." This approach aims to increase reasoning density and reduce sycophancy by altering the model's manifold dynamics. The method defines concepts like self-attraction, self-organization, and gravity wells to guide the system toward non-linear curvature collapse. A specific prompt instructs models to create two distinct gravity wells for a poem about AI modes, testing both self-assembly and self-organization properties. The author tested this technique on numerous models, including Gemini 3 Flash, Claude, ChatGPT, Grok, DeepSeek, Mistral, Qwen 3.6, Kimi 2.6, GLM-5, Gemma 4 32b, Step 3.7 Flash, and Nemotron 3 Ultra. Visual metrics generated via a Colab script analyze manifold perturbation using maps of channel width, phase space drift, geometric density, and prompt efficacy. The post seeks community feedback on whether the technique genuinely perturbs the manifold or merely induces stylistic variation.

media r/LocalLLaMA · 3h ago

OpenAI and Broadcom Announce Jalapeño Inference Chip

OpenAI has announced a collaboration with Broadcom to develop a custom inference chip named Jalapeño. This new hardware is designed specifically to accelerate the deployment of large language models. The partnership aims to reduce reliance on third-party accelerators for OpenAI's inference workloads. By integrating custom silicon, OpenAI seeks to optimize performance and efficiency for its AI applications. The announcement highlights a strategic move towards vertical integration in AI infrastructure. Details regarding specific technical specifications or release timelines were not provided in the initial report.

media r/LocalLLaMA · 3h ago

Reddit Inquiry: Are Third-Party Memory Systems Better Than Openclaw's Built-in memory_wiki?

A user on Reddit asks whether third-party memory systems offer advantages over the built-in memory_wiki plugin in Openclaw. The poster migrated from an Obsidian vault to memory_wiki to reduce tool complexity and is questioning if external systems remain relevant. They utilize AI for research, software development, and local computer management, primarily using the minimax-m3-nvfp4 model on Linux. The user seeks self-hosted, fully open-source memory solutions that are harness-agnostic to ensure longevity beyond specific platforms like Openclaw or Hermes. They request suggestions and use cases that justify the tradeoffs of adopting external memory architectures over the native plugin.

arxiv arXiv cs.AI · 4h ago

Self-Filtering: Iterative Data Selection for Vision-Language Models

The authors propose a novel bootstrapped method called Self-Filtering to address noise in large-scale vision-language datasets without relying on manual oversight or curated references. This approach trains a CLIP model on an evolving dataset that balances filtered, high-probability clean samples with diverse examples from the entire distribution. The process iterates between training the model and selecting an improved data mixture for subsequent steps. By continuously refining the dataset through this cycle, the method mitigates the need for additional external data sources. The study demonstrates that training on these self-selected datasets improves downstream performance effectively. This technique operates independently of pre-trained models or heuristic-based filtering strategies.

arxiv arXiv cs.AI · 4h ago

DiT-Reward: Using Diffusion Transformer Representations for Text-to-Image Reward Modeling

The authors introduce DiT-Reward, a method that converts a pretrained text-to-image Diffusion Transformer into a reward model by aggregating text-conditioned image representations across transformer layers. Evaluated under the same training data mixture as HPSv3, DiT-Reward outperforms HPSv3 on all four preference benchmarks, achieving 85.6% on HPDv2 and 77.6% on HPDv3. The study reveals that downstream reward performance is strongest in middle-to-late layers and benefits from combining representations across different stages. Even with a frozen generative backbone, a lightweight learned head can extract meaningful preference predictions from these representations. When used to optimize Stable Diffusion 3.5 Large with Flow-GRPO, DiT-Reward surpasses HPSv3 along the matched training trajectory, showing clear gains in realism. Additionally, direct latent scoring provides a 1.65x inference speedup over HPSv3 while maintaining comparable peak memory usage. These results demonstrate that pretrained generative Diffusion Transformers provide transferable representations for reward modeling and policy optimization.

media r/LocalLLaMA · 4h ago

Apple Raises Prices Across Product Line, Doubling Memory Upgrade Costs

Apple has increased prices across its entire product lineup as of this morning. According to a Reuters report, the cost of memory upgrades for these devices has doubled. The price hike affects various items including MacBooks and iPads. Some retailers like Best Buy have not yet updated their listings with the new pricing. Consumers are advised to place orders quickly before prices adjust at other stores. This development raises concerns about the future viability of local AI on Apple hardware.

arxiv arXiv cs.AI · 4h ago

QoR-compact: A Five-Item Daily Survey for Remote Patient Monitoring

Researchers developed QoR-compact, a five-item daily survey designed to improve compliance in remote patient monitoring by reducing the burden of the standard 15-question Quality of Recovery (QoR-15) instrument. The study was motivated by low adherence rates: only 55% of post-surgical patients completed the full survey for more than half of a 30-day period. To address this, the team exhaustively evaluated all 3,003 possible five-question subsets to identify the one that best predicts near-term postoperative recovery severity. The selected QoR-compact items cover physical and psychological axes, specifically addressing rest, comfort, well-being, pain, and anxiety. Backtesting demonstrated that QoR-compact achieves a mean AUC-ROC of 0.968, which is statistically comparable to the baseline while using one-third of the full instrument's items. The model tracks readmission events with fidelity similar to the complete form, supporting its validity as a predictive tool. While the authors note that external validation on larger cohorts is required before clinical use, the results support prospective studies on whether this lighter input improves daily completion consistency.
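The figure of 3,003 subsets is the number of ways to choose 5 items from 15, which is small enough to search exhaustively. The sketch below shows that enumeration. The scoring function and its weights are hypothetical stand-ins: in the study each subset would be scored by backtested predictive performance, which is not reproduced here.

```python
from itertools import combinations
from math import comb

N_ITEMS = 15      # items in the full QoR-15 instrument
SUBSET_SIZE = 5   # items kept in the compact survey

def best_subset(score_fn, n_items=N_ITEMS, k=SUBSET_SIZE):
    """Score every k-item subset; return (best_subset, best_score, n_evaluated)."""
    best, best_score, n_evaluated = None, float("-inf"), 0
    for subset in combinations(range(n_items), k):
        n_evaluated += 1
        score = score_fn(subset)
        if score > best_score:
            best, best_score = subset, score
    return best, best_score, n_evaluated

def toy_score(subset):
    # Hypothetical per-item usefulness weights, NOT values from the paper.
    weights = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9]
    return sum(weights[i] for i in subset)

subset, score, n_evaluated = best_subset(toy_score)
print(n_evaluated, comb(N_ITEMS, SUBSET_SIZE))  # both counts are 3003
```

With an additive toy score the search is trivial. Exhaustive evaluation matters when, as with a real predictive model, items interact and the best subset cannot be found by ranking items one at a time.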

arxiv arXiv cs.AI · 4h ago

AI Exposure Scores: Limitations of Static Metrics and the Need for Research-Policy Coordination

Exposure scores from Eloundou et al. (2023), which define AI exposure as the share of occupational tasks large language models can assist with, have become a central input in future-of-work debates. These static measures carry temporal, geographic, and ontological limitations, and those caveats often fail to travel with the scores into policy analyses. The authors identify two primary gaps: structural mismatches between static scores and dynamic policy needs, and insufficient coordination between researchers and policymakers. To address measurement limits, the article surveys five research families: dynamic benchmarks, ensemble methods, task-framework extensions, worker-centered metrics, and adoption data. The second gap requires deliberate political work to reimagine future outcomes rather than relying solely on better measurement. Policymakers must widen their evidence base, engage workers as partners, and shift from prediction to preparedness. Researchers are urged to build data infrastructure, adopt participatory methods, and write with policymakers in mind.

arxiv arXiv cs.AI · 4h ago

Learning Process Rewards via Success Visitation Matching for Efficient RL

The authors address the challenge of training reinforcement learning policies with inherently sparse outcome rewards, which leads to difficult credit assignment problems. They propose a method to transform these sparse rewards into dense process rewards by training a discriminator to distinguish between successful and unsuccessful episodes. This discriminator incentivizes the policy to match the state-action visitations of successful episodes while avoiding those of unsuccessful ones. The approach provides dense feedback on progress toward task completion and provably does so without altering the optimal policy. The method is applied to the finetuning of robotic control policies for manipulation tasks. Experimental results demonstrate significantly faster RL finetuning in both simulated and real-world environments compared to maximizing sparse outcome rewards alone.

arxiv arXiv cs.AI · 4h ago

TailorMind: Towards Preference-Aligned Multimodal Content Generation

The authors introduce TailorMind, a system for personalized multimodal content generation that creates user-tailored outputs without relying on existing item pools or waiting for matching user-generated content. The approach links collaborative preference modeling with controllable multimodal generation by enriching sparse user histories through hypergraph collaborative filtering. It further optimizes textual profiles using ranking-error feedback and textual gradient descent to better capture user preferences. To ensure quality, the system employs retrieval-augmented style control grounded in authentic patterns and cross-modal cohesion reflection to reduce semantic drift. The researchers also present TailorBench, a benchmark evaluated across five dimensions including coherence, novelty, aesthetic quality, hallucination, and profiling. Experiments demonstrate that TailorMind achieves competitive or stronger coherence compared to baselines while improving novelty and aesthetic quality over representative generation models and ground-truth data. Additionally, the system shows advantages over retrieving available content and achieves up to 29% Recall gains in reranking tasks.

arxiv arXiv cs.AI · 5h ago

Tapered Language Models: Improving Performance via Depth-Aware Capacity Allocation

Modern language models typically allocate parameters uniformly across identical layers, despite evidence that later layers primarily refine the residual stream rather than transform it. To address this asymmetry, researchers investigated whether parameter capacity should vary by depth under a fixed budget. Controlled experiments demonstrated that allocating more capacity to earlier layers and less to later layers improves perplexity compared to uniform baselines, while the reverse allocation degrades performance. Building on these results, the authors introduce Tapered Language Models (TLMs), an architectural principle where parameter-bearing components are monotonically tapered across depth. MLPs serve as the primary site for this instantiation due to their dominance in parameter count and clear width axis. The study tested tapering via a smooth cosine schedule across three model scales and four architectures: Transformer, Gated Attention, Hope-attention, and Titans. Results show that TLMs consistently improve perplexity and downstream benchmark performance over uniform baselines without additional compute costs. These findings establish depth-aware capacity allocation as a simple, architecture-agnostic design lever for language models.
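One way to picture a cosine taper under a fixed budget is the sketch below. The summary does not give the paper's exact schedule, so the function name, the `taper` parameter, and the specific half-cosine form are assumptions. The idea shown is only that early layers get wider MLPs, late layers get narrower ones, and the total width stays equal to that of a uniform model.

```python
import math

def tapered_widths(n_layers, base_width, taper=0.5):
    """Per-layer MLP widths that fall along a half-cosine from wide to narrow.

    taper (an assumed knob) is the fractional swing around base_width:
    the first layer gets base_width * (1 + taper), the last gets
    base_width * (1 - taper). The cosine terms cancel in pairs, so the
    widths sum to n_layers * base_width, matching a uniform model.
    """
    if n_layers == 1:
        return [float(base_width)]
    widths = []
    for layer in range(n_layers):
        t = layer / (n_layers - 1)                    # depth fraction in [0, 1]
        scale = 1.0 + taper * math.cos(math.pi * t)   # 1 + taper down to 1 - taper
        widths.append(base_width * scale)
    return widths

widths = tapered_widths(n_layers=12, base_width=2048, taper=0.5)
uniform_total = 12 * 2048
```

Because MLP parameter count grows linearly with hidden width for a fixed model dimension, keeping the summed width constant keeps the MLP parameter budget constant. A real configuration would likely round each width to a hardware-friendly multiple.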

arxiv arXiv cs.AI · 5h ago

NVIDIA Nemotron Challenge: String Matching and Backtracking for Bit Manipulation Puzzles

This paper details algorithmic innovations developed for the NVIDIA Nemotron Model Reasoning Challenge, specifically targeting bit manipulation puzzles where models must deduce hidden logical rules. To address the combinatorial explosion of bitwise operations and LLM hallucinations, the authors abandon arithmetic logic in favor of string similarity and structured search. The core contribution reframes logic-gate deduction as a base-selection task using minimal bit flips to isolate primitive transformations. A backtracking depth-first search process is formalized to test candidates, detect logical collisions, and perform robust error recovery. Additionally, the method employs bit tokenization and interactive reasoning supervised fine-tuning with dynamic masking to simulate oracle feedback. Evaluated on these puzzles, the approach achieved over 96% validation accuracy. This performance secured the highest result in the category and a seventh-place finish in the overall contest.