media r/LocalLLaMA · 3h ago

LFM2.5 230M Runs In-Browser at 1,400 tok/s via Custom WebGPU Kernels

LiquidAI's LFM2.5-230M model now runs locally in the browser on custom WebGPU kernels. According to the post, the kernels were originally developed by Fable 5, prior to its shutdown, and by Opus 4.8. The demonstration was recorded on an M4 Max and reached a generation speed of 1,400 tokens per second. All processing happens inside the user's browser, with no external server dependencies. A GGUF version of the model is available on Hugging Face alongside the standard checkpoint, and a live demo is hosted by webml-community on Hugging Face Spaces.

media r/LocalLLaMA · 3h ago

Apple to Skip M6 Pro/Max Chips, Fast-Track M7 for Local AI

A recent report indicates that Apple plans to skip the M6 Pro and M6 Max chips in its upcoming lineup and instead fast-track the M7 series to better support local AI workloads. If accurate, the shift would prioritize on-device AI capability over an incremental performance update for the Pro tier. The accelerated M7 timeline is framed as aimed at stronger Neural Engine performance for running large language models locally, and as a pivot in the Apple Silicon roadmap toward AI-centric design.

arxiv arXiv cs.AI · 3h ago

AOHP: An Open-Source OS-Level Agent Harness for Personalized, Efficient and Secure Interaction

The Android Open Harness Project (AOHP) is an open-source operating system-level agent harness built on the Android Open Source Project. It addresses the mismatch between current application-centric operating systems and the needs of autonomous AI agents by treating agents as first-class OS actors. The design introduces three key mechanisms: personalized service composition, efficient agent interfaces, and secure information flow. These features enable adaptive user interfaces and agent-friendly runtime environments while preserving the existing Android ecosystem. Preliminary experiments on challenging tasks demonstrate significant performance improvements over conventional systems. Specifically, AOHP achieved a 21.12% increase in task completion rates compared to baseline methods. It also reduced token execution costs by 51.55%, highlighting its efficiency gains. Furthermore, the system showed improved compliance with security policies during agent-mediated interactions.

arxiv arXiv cs.AI · 3h ago

Rise of Militarized Language in Scientific Abstracts Erodes Credibility

A study of 21.4 million papers from OpenAlex and PubMed finds that militaristic terms in scientific abstracts rose by 48% in OpenAlex and 32% in PubMed between 2010 and 2025. The increase accelerated sharply after 2019 and correlates strongly with global conflict data at both country and annual scales. Social sciences show the highest prevalence of such language, while engineering and computer science show the fastest growth. The analysis also notes that the COVID era and the post-2022 large-language-model period narrowed the linguistic gap between native-English and non-English authors. To assess the impact of the trend, the researchers ran a within-subject war-framing experiment with 801 participants and over 32,000 trials. War framing significantly reduced perceived credibility, funding willingness, and policy support among readers. Although the sense of urgency rose at trend level, the overall findings suggest that militaristic language may undermine the persuasive power of scientific communication.

media r/LocalLLaMA · 3h ago

Reddit Post: Fully Local AI Assistant Memory Layer

A Reddit user from the r/LocalLLaMA community shared a post titled 'After 2.5 years of evenings and weekends, my fully local AI assistant is finally usable.' The submission focuses on explaining how the memory layer of this personal AI system functions. The content was submitted by the user /u/PAiERAlabs to the subreddit dedicated to local large language models. The post includes a link to a gallery containing additional details about the project. Readers are directed to the comments section for further discussion and technical insights. This entry highlights a long-term personal project aimed at creating a functional, locally hosted AI assistant.

media r/LocalLLaMA · 3h ago

Hugging Face Blocks Multi-Threaded Downloads, Impacting GGUF Ecosystem

Hugging Face has implemented a recent change that blocks multi-threaded download acceleration, resulting in 403 errors for all but one thread per connection. This update significantly affects the GGUF ecosystem, where large single-file models are common and single-thread speeds are often capped at 40 MB/s. Previously, tools like the Hugging Face CLI accelerated downloads by fetching multiple smaller files in parallel, a method now hindered by this restriction. The author notes that downloading an entire branch of GGUF repositories is inefficient due to the presence of large files and multiple quantizations within the same branch. Without a reversal of this policy, download speeds will remain slow unless uploaders transition to splitting models into numerous smaller files across different branches. Such a workaround would require users to manually merge files, which is considered less desirable than Hugging Face restoring previous acceleration capabilities.
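To make the impact concrete, here is a minimal sketch of what a multi-threaded downloader typically does and what the 40 MB/s figure implies. It assumes the usual approach of splitting one file into HTTP byte ranges, which the post does not spell out. The function names, the 20 GB example size, the decimal definition of MB, and the ideal linear speedup are all illustrative assumptions.

```python
def split_ranges(total_bytes: int, parts: int) -> list[tuple[int, int]]:
    """Split a file of total_bytes into contiguous, inclusive byte ranges.

    Each tuple (start, end) corresponds to an HTTP header of the form
    'Range: bytes=start-end'. Hypothetical helper, not part of any HF tool.
    """
    if total_bytes <= 0 or parts <= 0:
        raise ValueError("total_bytes and parts must be positive")
    parts = min(parts, total_bytes)  # never create empty ranges
    base, extra = divmod(total_bytes, parts)
    ranges = []
    start = 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)  # spread the remainder
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def download_seconds(total_bytes: int, per_thread_mb_s: float = 40.0,
                     threads: int = 1) -> float:
    """Estimate transfer time, assuming 1 MB = 10**6 bytes and that
    throughput scales linearly with thread count (an idealization)."""
    return total_bytes / (per_thread_mb_s * 1_000_000 * threads)


# A hypothetical 20 GB single-file GGUF at the quoted 40 MB/s cap:
single = download_seconds(20_000_000_000)              # 500.0 seconds
parallel = download_seconds(20_000_000_000, threads=4)  # 125.0 seconds
```

Under these assumptions, losing four-way parallelism turns a download of about two minutes into one of more than eight, which is why large single-file quantizations are hit hardest.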

arxiv arXiv cs.AI · 4h ago

CADRE: Stable, Parameter Efficient Adaptation of Medical Vision Language Models with Bounded Forgetting and Prior Drift

The authors introduce CADRE, a parameter-efficient framework for adapting medical vision-language models while limiting catastrophic forgetting and prior drift. The method combines low-rank adaptation with an online, self-scaling elastic weight consolidation (EWC) term to bound retained-competence loss. It also applies an anchor-to-prior penalty to restrict embedding drift from the frozen pretrained model. Two short guarantees, on consolidation mass and scale invariance, address the order fragility of vanilla EWC. The approach was evaluated on breast cancer data across histopathology, ultrasound, and chest radiography modalities. Training approximately 0.23% of parameters, CADRE achieved the lowest forgetting rate among adapting methods: 0.011 versus 0.075 for the strongest regularized baseline, a roughly sevenfold reduction (about 6.8x). The model also showed positive backward transfer, where all baselines showed negative transfer.
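The general shape of such an objective can be sketched as a task loss plus two penalties. This is a generic illustration, not the paper's formulation: normalizing the Fisher weights by their sum is an assumed stand-in for "self-scaling", and all function names, arguments, and the lambda weights are hypothetical.

```python
def ewc_penalty(theta, theta_star, fisher):
    """Fisher-weighted squared distance from the consolidated weights.

    Dividing by the total Fisher mass (an assumption made here) makes the
    penalty invariant to rescaling the Fisher estimates.
    """
    total = sum(fisher)
    if total == 0:
        return 0.0
    return sum(f / total * (t - s) ** 2
               for t, s, f in zip(theta, theta_star, fisher))


def anchor_penalty(embedding, prior_embedding):
    """Squared L2 drift of an embedding from the frozen model's embedding."""
    return sum((a - b) ** 2 for a, b in zip(embedding, prior_embedding))


def regularized_loss(task_loss, theta, theta_star, fisher,
                     embedding, prior_embedding,
                     lam_ewc=1.0, lam_anchor=1.0):
    """Task loss plus consolidation and anchor-to-prior terms."""
    return (task_loss
            + lam_ewc * ewc_penalty(theta, theta_star, fisher)
            + lam_anchor * anchor_penalty(embedding, prior_embedding))


# Rescaling the Fisher estimates by 10 leaves the penalty unchanged.
p1 = ewc_penalty([1.0, 2.0], [0.0, 2.0], [3.0, 1.0])
p2 = ewc_penalty([1.0, 2.0], [0.0, 2.0], [30.0, 10.0])
```

In a real LoRA setup, theta would be the small set of trainable adapter parameters, which is consistent with only about 0.23% of the weights being trained.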

arxiv arXiv cs.AI · 4h ago

DVL-DeepONet: Physics-Guided Operator Learning for Resilient Underwater Navigation

Researchers propose DVL-DeepONet, a physics-guided deep neural operator framework designed to enhance autonomous underwater vehicle navigation under degraded sensing conditions. The system addresses challenges arising from noisy or incomplete Doppler velocity log measurements and the absence of inertial sensors in low-cost platforms. It estimates velocity vectors through three operational scenarios: noise-resilient estimation with coupled sensors, DVL-only learning, and beam measurement recovery. By mapping temporal observations to vehicle velocity while enforcing physical consistency constraints, the model maintains robustness during environmental disturbances. The framework was validated using real-world AUV experiments covering a cumulative path length of approximately 10,000 meters. Experimental results demonstrate that DVL-DeepONet architectures outperform baseline model-based and learning-based algorithms by 40%.

media r/LocalLLaMA · 4h ago

Developer Brings Claude-Style Artifacts to Local Models via TurboLLM

A Reddit user highlights the absence of rendered artifacts in local AI setups compared to Anthropic's Claude. While local models can generate code for dashboards or diagrams, users typically must copy the output elsewhere to view it. To address this gap, the developer experimented with rendering generated HTML, SVG, and Mermaid code directly within the chat interface. The results demonstrated that the limitation lies in the user interface rather than the model's capabilities. A screenshot from the post shows a dashboard rendered by Gemma 4 26B from a single prompt on a desktop. The implementation was built using TurboLLM, which allows for this direct visualization of code outputs. The author invites the community to discuss their workflows and whether they miss Claude's artifact feature.
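The interface-side step is small, as the following sketch suggests. It pulls renderable fenced code blocks out of a model reply so a chat UI could hand them to a sandboxed frame or a Mermaid renderer. It is not TurboLLM's implementation, which the post does not describe; the function name and the set of renderable languages are assumptions.

```python
import re

FENCE = "`" * 3  # the Markdown code-fence marker, built to avoid a literal fence here
RENDERABLE = {"html", "svg", "mermaid"}
FENCE_RE = re.compile(FENCE + r"(\w+)[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_artifacts(reply: str) -> list[tuple[str, str]]:
    """Return (language, source) pairs for fenced blocks a UI could render."""
    artifacts = []
    for lang, body in FENCE_RE.findall(reply):
        if lang.lower() in RENDERABLE:
            artifacts.append((lang.lower(), body.strip()))
    return artifacts


reply = (
    "Here is the chart:\n"
    + FENCE + "svg\n<svg></svg>\n" + FENCE
    + "\nand a script:\n"
    + FENCE + "python\nprint(1)\n" + FENCE
)
artifacts = extract_artifacts(reply)  # only the SVG block is kept
```

Anything the UI renders this way is model-generated markup, so a real implementation would want to isolate it (for example in a sandboxed iframe) rather than inject it into the page directly.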

media r/LocalLLaMA · 4h ago

Reddit User Seeks Private Local LLM for Technical Documentation

A Reddit user is seeking recommendations for a local large language model capable of generating high-level and low-level software designs. The workflow involves using existing templates, cross-referencing code, and integrating with agentic frameworks like OpenCode via MCP to fetch data from Confluence and Jira. The user currently relies on Opus 3.6 through Kiro-cli but requires a solution that ensures data privacy. Key technical constraints include the necessity for at least 256k context length and strong reasoning capabilities. The poster questions whether hardware such as four RTX 3090 GPUs is necessary to achieve this level of performance locally.

arxiv arXiv cs.AI · 4h ago

POTracker Optimizes LLMs for Standard-Compliant Power Outage Report Generation

Recent large language models struggle with domain-specific data generation due to strict formatting and structural requirements. To address the interoperability of utility power outage reports in the United States, researchers propose POTracker, an optimized model for generating machine-readable compliance documents. The team fine-tuned Qwen2.5-7B-Instruct using a novel objective called POTrackerLoss. This new loss function accounts for both textual similarity and structural tag similarity between generated outputs and ground-truth reports. Evaluation on a dataset of 1,000 reports demonstrates that POTracker outperforms five fine-tuning methods and one rule-based XML conversion approach. The model improves overall accuracy by up to 51% and achieves 86.47% structural accuracy for the generated reports. Additionally, a human study involving domain experts assigned an average quality score of 4.03 on a 0-5 scale to the generated labels.
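The idea of scoring text and structure separately can be illustrated with a simple similarity measure. This is not POTrackerLoss itself, which is a training objective the summary does not define. It is an evaluation-style sketch using Python's difflib; the tag names, the equal weighting, and all function names are assumptions.

```python
import re
from difflib import SequenceMatcher

# Matches opening tags only: closing tags start with "</", which the name
# pattern rejects.
TAG_RE = re.compile(r"<\s*([A-Za-z_][\w.-]*)")


def tag_sequence(xml_text: str) -> list[str]:
    """Ordered list of opening-tag names found in the text."""
    return TAG_RE.findall(xml_text)


def text_similarity(generated: str, reference: str) -> float:
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, generated, reference).ratio()


def tag_similarity(generated: str, reference: str) -> float:
    """Similarity of the two tag sequences in [0, 1]."""
    return SequenceMatcher(None, tag_sequence(generated),
                           tag_sequence(reference)).ratio()


def combined_score(generated: str, reference: str, alpha: float = 0.5) -> float:
    """Weighted mix of textual and structural similarity."""
    return (alpha * text_similarity(generated, reference)
            + (1 - alpha) * tag_similarity(generated, reference))


# Hypothetical report fragments: same text content, one wrong tag.
reference = "<Outage><Start>1</Start></Outage>"
generated = "<Outage><End>1</End></Outage>"
structure = tag_similarity(generated, reference)  # 0.5: one of two tags matches
```

A report can read almost identically to the reference and still fail a schema check, which is the case a structure-aware term is meant to penalize.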

arxiv arXiv cs.AI · 4h ago

SQLConductor: Search-to-Policy Learning for Step-wise Text-to-SQL Orchestration

The authors propose SQLConductor, a step-wise orchestration learning framework for Text-to-SQL that addresses the limitations of fixed pipelines and static plan-then-execute methods. This system formulates subtasks as specialized actions and trains a policy model to select the next action based on intermediate artifacts and feedback. To learn this policy, the framework introduces Search-to-Policy Learning, which utilizes Monte Carlo Tree Search to explore candidate workflows and stability estimation to identify robust supervision. The policy model is trained using Stability-weighted Supervised Fine-tuning to prioritize high-quality orchestration patterns and further enhanced through Curriculum Reinforcement Learning. This approach transforms offline workflow search into a deployable policy for step-wise orchestration at inference time. Experiments on BIRD-Dev and out-of-distribution datasets show that SQLConductor achieves 73.2% execution accuracy, outperforming prior methods with comparable or larger backbones. The results demonstrate superior execution accuracy and strong generalization while coordinating frozen larger action models.
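Step-wise orchestration, as opposed to a fixed pipeline, can be pictured as a loop in which a policy inspects the current artifacts and picks the next action. The sketch below is a toy version under heavy assumptions: the policy is a hand-written rule rather than a trained model, and the action names, state keys, and example query are all hypothetical.

```python
from typing import Callable

State = dict
Action = Callable[[State], State]


def run_orchestration(policy: Callable[[State], str],
                      actions: dict[str, Action],
                      state: State, max_steps: int = 8):
    """Repeatedly ask the policy for the next action until it says 'stop'."""
    trace = []
    for _ in range(max_steps):
        name = policy(state)
        if name == "stop":
            break
        state = actions[name](state)
        trace.append(name)
    return state, trace


def toy_policy(state: State) -> str:
    """Rule-based stand-in for the learned policy model."""
    if "sql" not in state:
        return "generate"
    if "result" not in state:
        return "execute"
    if state["result"] == "error":
        return "repair"
    return "stop"


def generate(state: State) -> State:
    # Deliberately misspelled column so the repair step has work to do.
    return {**state, "sql": "SELECT nme FROM users"}


def execute(state: State) -> State:
    # Simulated execution feedback: the misspelled column fails.
    ok = "nme" not in state["sql"]
    return {**state, "result": "ok" if ok else "error"}


def repair(state: State) -> State:
    fixed = {**state, "sql": state["sql"].replace("nme", "name")}
    fixed.pop("result")  # force a re-execution of the repaired query
    return fixed


ACTIONS = {"generate": generate, "execute": execute, "repair": repair}
final_state, trace = run_orchestration(toy_policy, ACTIONS, {"question": "list user names"})
```

The repair step only runs because execution feedback called for it, which is the behavior a static plan-then-execute pipeline cannot express.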

arxiv arXiv cs.AI · 4h ago

VeriEvol: Scaling Multimodal Mathematical Reasoning via Verifiable Evol-Instruct

The authors introduce VeriEvol, an iterative framework designed to scale multimodal mathematical reasoning by decoupling prompt difficulty from answer reliability. This approach addresses the challenge of maintaining reliable reward labels as data volume increases in reinforcement learning pipelines. The system utilizes a type-aware evolution module to rewrite low-difficulty seeds into harder, image-grounded prompts through route-specific operators. Answer verification is handled by HTV-Agent, which accepts responses only after multi-source counter-evidence fails to refute them. Scaling evolved supervised fine-tuning data from 10K to 250K samples increased mean accuracy on five benchmarks from 35.42 to 54.73. When integrated with a fixed GRPO recipe, VeriEvol provided a cumulative gain of +3.88 over an un-evolved baseline. This improvement is attributed to +1.82 from evolved prompts and +2.06 from the HTV-Agent verifier. The authors release all prompts, data, models, code, and full verifier traces to enable downstream auditing and scaling.

arxiv arXiv cs.AI · 4h ago

Energy Consumption of Transformer Fine-Tuning: A Roofline-Inspired Scaling Model

The authors present a framework for modeling the energy consumption of Transformer training across multiple GPUs, addressing the need for sustainable system design as computational costs rise. By conducting controlled architectural sweeps on BERT models, they relate measured energy usage to lightweight proxies for compute, memory traffic, and hardware efficiency. The approach is inspired by roofline models and incorporates a speedup-based hardware-efficiency factor to account for tensor parallelism and fully sharded data parallelism. This methodology allows for the derivation of a scaling law model that accurately predicts training energy across heterogeneous configurations. The work highlights the critical importance of predicting energy consumption as model size and parallelism scale. It provides a practical tool for cost-aware design in large-scale natural language processing systems.

media r/LocalLLaMA · 4h ago

Reddit User Questions RTX 6000 Pro Value Amidst Price Surge

A Reddit user in the r/LocalLLaMA community is seeking advice on purchasing an NVIDIA RTX 6000 Pro GPU. The poster notes that the price has risen significantly from approximately $8,000 six months ago to around $13,000 currently. They are looking for feedback from existing owners regarding their satisfaction with the hardware. Specifically, the user asks if the card is worth the investment for running models like Qwen 2.5 7B. The post aims to help the buyer justify the expense to their spouse by gathering real-world usage experiences.