arxiv arXiv cs.LG · 4h ago

Attention Sinks and Collapse Are Universal Consequences of Content-Based Routing

The study shows that attention sinks, representation collapse, and norm stratification are not unique to transformers but are inherent consequences of content-based routing under a fixed similarity metric. It establishes an identity showing that softmax attention acts as Boltzmann-weighted aggregation over Euclidean distances when key norms are constant, which leaves it blind to key magnitude because a specific norm term is omitted. The framework predicts that any router whose metric is ill-matched to its representations will compensate by concentrating routing and collapsing the routed representations. The authors test this prediction on nine pretrained transformers, graph attention networks, selective state-space models, recurrent mixers, and learned residual layers. All tested architectures exhibit the same signature of collapse, regardless of domain or structure. Within-model ablations isolate the routing mechanism, rather than incidental training dynamics, as the primary cause. The onset of the effect depends on the strength of the positional brake that accompanies the content score; varying that strength shifts the effect across its range. The underlying mechanism itself is invariant and does not require norm stratification: routers with norm-normalized keys show the same concentration behavior.
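The relationship between dot-product softmax and distance-based weighting can be checked numerically. The sketch below assumes the standard expansion q·k = ½‖q‖² + ½‖k‖² − ½‖q − k‖²; the paper's exact formulation and scaling may differ, and the function names and toy vectors are illustrative only. It compares softmax over dot products with Boltzmann weights over squared Euclidean distances, with and without the ½‖k‖² term.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability; the result is shift-invariant.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dot_product_weights(q, keys):
    # Ordinary softmax attention weights (no temperature, for simplicity).
    return softmax([dot(q, k) for k in keys])

def distance_weights(q, keys, include_key_norm=False):
    # Boltzmann weights over squared Euclidean distance.
    # With include_key_norm=True the 0.5 * ||k||^2 term is added back.
    scores = []
    for k in keys:
        s = -0.5 * sq_dist(q, k)
        if include_key_norm:
            s += 0.5 * dot(k, k)
        scores.append(s)
    return softmax(scores)

def max_gap(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

# Toy data (assumed for illustration): one query, three keys.
q = [1.0, 0.0]
equal_norm_keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # all norms equal 1
mixed_norm_keys = [[2.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # first key is longer

gap_equal = max_gap(dot_product_weights(q, equal_norm_keys),
                    distance_weights(q, equal_norm_keys))
gap_mixed = max_gap(dot_product_weights(q, mixed_norm_keys),
                    distance_weights(q, mixed_norm_keys))
gap_with_norm_term = max_gap(dot_product_weights(q, mixed_norm_keys),
                             distance_weights(q, mixed_norm_keys,
                                              include_key_norm=True))

print(f"equal key norms, distance only:      {gap_equal:.3g}")
print(f"mixed key norms, distance only:      {gap_mixed:.3g}")
print(f"mixed key norms, with key-norm term: {gap_with_norm_term:.3g}")
```

In this toy setting the two weightings coincide when all keys share one norm and diverge once a key is longer, unless the key-norm term is restored. The ½‖q‖² term never matters because it is constant across keys and cancels in the normalization.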

media r/LocalLLaMA · 4h ago

User Reports Strong Performance of siq1 Model on Kebab Bench

A Reddit user has shared results indicating that their model, referred to as siq1, performs very well on the Kebab Bench evaluation. The post highlights the model's capabilities through a demonstration hosted on Hugging Face Spaces. Specifically, the user points to a space titled 'hermes-agent-zerogpu' created by AlexWortega as evidence of this performance. The submission was made by the Reddit user /u/Mysterious_Hearing14 in the r/LocalLLaMA community. The original post links to the Hugging Face interface where the model can be tested. A video demonstration is also available through a v.redd.it link.

media r/LocalLLaMA · 4h ago

Inquiry Regarding the Availability of Modern Non-Chat Completion Models

A user on the LocalLLaMA subreddit asked whether all modern large language models are tuned exclusively for chat interactions. The inquiry specifically sought models that support bare text completion rather than conversational formats. The poster noted difficulty locating such models on Hugging Face. This points to a perceived gap in the availability of non-chat (base) models for users who need raw completion. The discussion reflects broader concerns about the industry's shift toward instruction-tuned and chat-oriented models.

arxiv arXiv cs.LG · 4h ago

No Reference-Free Generalization in Quantum Machine Learning

This study addresses the identifiability problem in quantum machine learning where training data lacks a preferred basis or reference frame. The authors formulate supervised learning without an external quantum reference frame, requiring classifiers to preserve unitary symmetries unbroken by the training data. They prove that if training states do not span the full Hilbert space, all pure states orthogonal to this span receive identical predictions. This limitation arises from missing reference information rather than state discrimination or computational constraints. The research establishes a robust version under weak symmetry breaking and shows that learning generic concepts requires exponentially many oriented training directions. Numerical illustrations visualize the resulting prediction collapse and its controlled relaxation. The results identify feature maps, measurement bases, and diverse training states as essential operational resources for generalization.

arxiv arXiv cs.LG · 5h ago

Wearable A-Mode Ultrasound Enables Whole Hand Kinematic Tracking on Microcontroller

Researchers propose a framework for robust whole-hand and wrist kinematic tracking using the wearable WULPUS platform with an A-mode ultrasound probe. The system regresses 23 degrees of freedom directly on the device, overcoming limitations of prior non-wearable systems. A compact multi-output convolutional neural network with 11,285 parameters is combined with an incremental training strategy to improve generalization. This approach reduces mean absolute error by more than 17% compared to non-incremental training. The model is deployed on the WULPUS nRF52832 microcontroller, achieving end-to-end tracking entirely on-device. Each inference consumes only 0.73 mJ with a latency of 29.1 ms. The system supports full operation within 33 mW, enabling up to 36 hours of continuous use. The method also reduces wireless bandwidth requirements by 88% compared to raw data transmission.
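The reported figures can be related with a little arithmetic. The sketch below uses only the numbers quoted in the summary. The derived quantities rest on assumptions that the summary does not state: that inferences could run back to back, and that 33 mW is a constant average draw over the full 36 hours.

```python
# Figures quoted in the summary.
INFERENCE_ENERGY_MJ = 0.73    # energy per inference, millijoules
INFERENCE_LATENCY_MS = 29.1   # latency per inference, milliseconds
SYSTEM_POWER_MW = 33.0        # full-system power budget, milliwatts
RUNTIME_HOURS = 36.0          # reported continuous runtime, hours

# mJ / ms = J / s = W, so multiply by 1000 to express the result in mW.
avg_inference_power_mw = INFERENCE_ENERGY_MJ / INFERENCE_LATENCY_MS * 1000.0

# Upper bound on update rate if inferences ran back to back (assumption).
max_update_rate_hz = 1000.0 / INFERENCE_LATENCY_MS

# Energy implied by the power budget over the full runtime (assumption:
# constant 33 mW draw for all 36 hours).
implied_energy_budget_mwh = SYSTEM_POWER_MW * RUNTIME_HOURS

print(f"average power during inference: {avg_inference_power_mw:.1f} mW")
print(f"maximum back-to-back rate:      {max_update_rate_hz:.1f} Hz")
print(f"implied energy budget:          {implied_energy_budget_mwh:.0f} mWh")
```

Under these assumptions the compute draws about 25 mW while an inference is running, which sits inside the 33 mW system budget, and the runtime implies an energy budget of roughly 1.2 Wh.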

arxiv arXiv cs.LG · 5h ago

Null-Calibrated Conformal Selection via Target-Membership Scores

The article introduces Null-Calibrated Conformal Selection (NCCS), a method that utilizes target-membership probability scores to identify test candidates within a target region while controlling the false discovery rate. The authors argue that these membership scores provide a more natural ranking for selection tasks than conventional prediction-oriented nonconformity scores, particularly for complex targets. This distinction is critical for interval-valued, variance-driven, multimodal, or multi-condition targets where traditional scores may be misaligned with selection power. NCCS ranks test scores against confirmed non-target calibration examples to yield finite-sample valid null p-values under null exchangeability. These p-values can be combined with the Benjamini-Yekutieli procedure under arbitrary dependence or the Benjamini-Hochberg procedure under standard positive-dependence conditions. Experiments demonstrate that membership scores match conventional scores on mean-monotone targets but substantially improve performance on variance-driven targets. In rare-target regimes, NCCS trades power for finite-sample null validity, addressing issues where direct empirical-FDP thresholding can be anti-conservative.
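The selection step can be sketched with the textbook versions of its ingredients: conformal p-values computed against null calibration scores, followed by a step-up procedure. This is a minimal illustration, not the authors' implementation. The score direction (higher means more target-like), the function names, and the toy scores are all assumptions.

```python
def null_p_values(null_scores, test_scores):
    # Conformal p-value: rank each test score among scores of confirmed
    # non-targets. Assumes a higher score means "more likely in the target".
    n = len(null_scores)
    return [(1 + sum(s >= t for s in null_scores)) / (n + 1)
            for t in test_scores]

def step_up_select(p_values, alpha, arbitrary_dependence=False):
    # Benjamini-Hochberg step-up procedure; with arbitrary_dependence=True
    # the level is divided by the harmonic sum (Benjamini-Yekutieli).
    m = len(p_values)
    if m == 0:
        return []
    correction = 1.0
    if arbitrary_dependence:
        correction = sum(1.0 / i for i in range(1, m + 1))
    order = sorted(range(m), key=lambda j: p_values[j])
    cutoff = 0
    for rank, j in enumerate(order, start=1):
        if p_values[j] <= rank * alpha / (m * correction):
            cutoff = rank  # keep the largest rank that passes
    return sorted(order[:cutoff])

# Toy data (assumed): 19 null calibration scores and 4 test candidates.
null_scores = [i / 20 for i in range(1, 20)]
test_scores = [0.99, 0.97, 0.52, 0.12]

p_values = null_p_values(null_scores, test_scores)
selected_bh = step_up_select(p_values, alpha=0.2)
selected_by = step_up_select(p_values, alpha=0.2, arbitrary_dependence=True)

print("p-values:", p_values)
print("BH selections:", selected_bh)
print("BY selections:", selected_by)
```

On this toy input the smallest attainable p-value is 1/20, which is enough for Benjamini-Hochberg to select the two high-scoring candidates but not for the stricter Benjamini-Yekutieli threshold. This mirrors the power-for-validity trade-off described for rare-target regimes, where the calibration set size bounds how small a p-value can be.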

arxiv arXiv cs.LG · 5h ago

RoboMME-Interference Benchmarks Robot Memory Under Distraction

RoboMME-Interference addresses the need to evaluate robot memory in realistic, long-context scenarios where systems must recall information from several sessions earlier. This cross-session benchmark builds on the existing RoboMME framework to measure performance when robots face distraction from unrelated prior experiences. For each query episode, the benchmark constructs a session history consisting of relevant demonstrations followed by a controlled number of unrelated sessions, which is provided as memory to Vision-Language-Action models. The researchers tested released memory-augmented variants of the π_0.5 model without modification to assess their robustness under these conditions. The results indicate that perceptual memory variants improve success rates when no distractors are present, but their accuracy decays steadily and strongly as unrelated sessions accumulate. These findings point to a critical weakness in current systems' long-context memory and interference resistance. The project page, videos, code, and data for this benchmark are available at https://robotmemorybench.com.

arxiv arXiv cs.LG · 5h ago

Flow Annealing Posterior Sampling for Function-Space Regression and Inverse Problems

The authors introduce Flow Annealing Posterior Sampling (FAPS), a novel framework that unifies stochastic-process regression with PDE inverse problems in function space. Built upon pretrained function-space flow-matching priors, FAPS facilitates likelihood-guided posterior inference using sparse and noisy observations. The method supports variable query discretizations and avoids the need for explicit prior-density evaluation during sampling. It employs a Langevin correction mechanism that utilizes a low-rank covariance preconditioner to exploit dominant function-space correlations across different discretizations. Benchmarks on both Gaussian and non-Gaussian stochastic processes demonstrate that FAPS produces coherent posterior samples with accurate uncertainty quantification. The approach significantly outperforms existing functional regression baselines in these standard tasks. Furthermore, it achieves competitive or superior performance in noisy PDE inverse problems compared to diffusion-based samplers while reducing test-time sampling costs.

media r/LocalLLaMA · 5h ago

Backtrack Sampler and Verifier Drastically Improve Tiny Model Coding Performance

A new backtrack sampler combined with a verifier model significantly improves the coding performance of tiny 0.5B-parameter models, potentially making them competitive with larger 2-4B class models without weight changes. In theory, the approach also addresses hallucination in large models by correcting errors during generation through re-sampling. However, the method incurs a 5-30% decode speed penalty due to the need for backward passes, and it requires training a verifier model of similar size to the original. This requirement doubles VRAM usage and increases compute demands by 1.5 to 3 times compared to standard inference. Despite these costs, the verifier generalizes across models of equal or lower weight classes if trained on diverse data distributions. Training the verifier is highly efficient, requiring only about 0.01% of the number of tokens used for full pre-training.

media r/LocalLLaMA · 5h ago

NVIDIA Releases Nemotron-TwoTower-30B-A3B, a Diffusion-Based Language Model

NVIDIA has released the Nemotron-TwoTower-30B-A3B-Base-BF16 model, which is built on the Nemotron 3 Nano 30B-A3B backbone. The architecture diverges from standard autoregressive models by pairing a frozen context tower with a diffusion denoiser tower. The system iteratively fills blocks of tokens in parallel rather than generating them strictly one at a time. According to NVIDIA, the default mask-diffusion setup retains 98.7% of the autoregressive baseline's aggregate benchmark quality while delivering 2.42 times the baseline's wall-clock generation throughput. The release highlights an approach to language modeling that combines diffusion techniques with large-scale language capabilities.

media r/LocalLLaMA · 5h ago

Experimental USB4 RDMA Implementation Demonstrated on Strix Halo

A blog post from Hellas.ai details an experimental implementation of Remote Direct Memory Access (RDMA) over Thunderbolt. The demonstration was conducted using two devices equipped with AMD's Strix Halo processors. This approach allows for high-speed data transfer capabilities via the USB4 standard. The author notes that this technology could be significant because it is compatible with any host supporting USB4. No prior public discussion of this specific implementation was found by the submitter. The work highlights the potential for leveraging existing hardware interfaces for advanced networking tasks.

media r/LocalLLaMA · 5h ago

GLM 5.2 on Dual Strix Halo (256GB): Worth it?

A Reddit user named Intrepid_Rub_3566 has shared a video review evaluating the performance of GLM 5.2 running on a dual AMD Strix Halo setup with 256GB of RAM. The discussion centers on whether this hardware configuration provides sufficient value for local large language model inference. The content covers the technical feasibility of deploying GLM 5.2 in such an environment, focusing on resource utilization and speed. Viewers are directed to a YouTube link for detailed benchmarks and performance metrics. The thread also includes community comments on the practicality and cost-effectiveness of this dual Strix Halo approach.

media r/LocalLLaMA · 5h ago

Reddit Inquiry on Using Local Models for Self-Hacking

A user on the r/LocalLLaMA subreddit asked if anyone has attempted to gain root access to their own system using a local large language model. This inquiry was prompted by recent discussions regarding Mythos's alleged ability to hack into US government systems. The post seeks practical experiences from the community regarding the feasibility of such actions. It specifically targets the application of local models for self-penetration testing or unauthorized access. The question highlights concerns about the security implications of powerful AI tools in the hands of individuals.

media r/LocalLLaMA · 5h ago

User Reports Inferior Quality and Efficiency with MTP Models in Qwen 3.6 and Gemma 4

A user testing self-hosted Qwen 3.6 27B and Gemma 4 models on four RTX 5070 Ti cards reports that Multi-Token Prediction (MTP) degrades output quality compared to non-MTP variants. In code review tasks, the non-MTP model produced more detailed findings with fix suggestions while consuming fewer tokens than its MTP counterpart. The non-MTP setup achieved approximately 2000 tokens per second for prompt processing and 50-60 tokens per second for generation. The MTP configuration generated faster, at 100-120 tokens per second, but processed prompts more slowly, at around 1300 tokens per second. Despite the higher generation throughput, real-world agent tasks completed only about 20% faster with MTP because of increased context consumption. The user ran llama.cpp with specific GGUF files from Unsloth and noted similar negative experiences when testing Gemma 4.
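A rough model shows why doubling generation speed need not halve task time. The sketch below treats wall-clock time as prompt tokens divided by the prompt-processing rate plus generated tokens divided by the generation rate. The rates are the reported approximate figures, using the midpoints of the quoted generation ranges. The token counts and the `mtp_token_overhead` factor are hypothetical and do not come from the post.

```python
# Approximate rates from the post (tokens per second); generation rates
# use the midpoints of the reported ranges, which is an assumption.
NON_MTP = {"pp_rate": 2000.0, "tg_rate": 55.0}
MTP = {"pp_rate": 1300.0, "tg_rate": 110.0}

def task_seconds(prompt_tokens, generated_tokens, pp_rate, tg_rate):
    # Simple additive model: prompt processing time plus generation time.
    return prompt_tokens / pp_rate + generated_tokens / tg_rate

def mtp_speedup(prompt_tokens, generated_tokens, mtp_token_overhead=1.0):
    # mtp_token_overhead is a hypothetical multiplier for the extra tokens
    # the MTP run consumes; 1.0 means no extra tokens.
    base = task_seconds(prompt_tokens, generated_tokens, **NON_MTP)
    mtp = task_seconds(prompt_tokens * mtp_token_overhead,
                       generated_tokens * mtp_token_overhead, **MTP)
    return base / mtp

# Hypothetical workloads.
generation_heavy = mtp_speedup(prompt_tokens=1_000, generated_tokens=2_000)
prompt_heavy = mtp_speedup(prompt_tokens=40_000, generated_tokens=500)
with_overhead = mtp_speedup(prompt_tokens=1_000, generated_tokens=2_000,
                            mtp_token_overhead=1.3)

print(f"generation-heavy speedup:  {generation_heavy:.2f}x")
print(f"prompt-heavy speedup:      {prompt_heavy:.2f}x")
print(f"with 30% more tokens used: {with_overhead:.2f}x")
```

Under this model the MTP advantage approaches 2x only when generation dominates. It shrinks as prompt processing or extra token consumption grows, and it can fall below 1x for prompt-heavy workloads.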

media r/LocalLLaMA · 6h ago

Developer Requests Testing for MTP Support in GLM-4.7-Flash via llama.cpp

A developer is seeking community assistance to test Multi-Token Prediction (MTP) support for the GLM-4.7-Flash model within the llama.cpp framework. The author acknowledges that previous models like GLM Air and GLM Flash are outdated but expresses a personal interest in enabling MTP for them. The request specifically targets users who possess the necessary hardware to run GLM-4.7-Flash and have the technical ability to compile llama.cpp from source. Participants are asked to evaluate the functionality of the provided GGUF model and report any encountered issues. Additionally, testers are requested to measure and share the performance speed gains achieved through MTP implementation. The developer has uploaded the test model to a Hugging Face repository for immediate access. Users requiring smaller quantization options are invited to contact the author directly for alternative versions.

media r/LocalLLaMA · 6h ago

Question on Why ROCm and Intel Stacks Lag Behind CUDA in Software Ecosystem Maturity

The poster questions why the software ecosystems around AMD's ROCm and Intel's stack have failed to improve quickly enough to match NVIDIA's CUDA. They argue that until competing vendors' software catches up, NVIDIA will continue to charge a massive premium for its convenient products. The poster identifies as a user of both NVIDIA and Apple Silicon hardware for AI development and expresses a desire for more affordable prices. The argument is that price reductions will occur only when genuine competition exists. This perspective highlights the current dominance of CUDA in the AI hardware landscape.

media r/LocalLLaMA · 6h ago

Community Discussion on Running DeepSeek V4 Flash with MoE Offload

A Reddit user inquired about the feasibility of running the DeepSeek V4 Flash model using Mixture of Experts offload techniques. The poster noted that previous attempts to fit the desired model and its KV cache into VRAM required an additional 5-10GB of memory headroom. They highlighted several community resources, including a GGUF version of the model available on Hugging Face from the huihui-ai team. Additionally, the user pointed to a fork of antirez's repository that introduces tensor parallelism and socket enhancements for improved performance. The discussion also referenced Fringe's specific implementation designed for DeepSeek V4 Flash CUDA support. Consequently, the user considered compiling the model and downloading the nearly 100GB file to test these offloading capabilities.

media r/LocalLLaMA · 6h ago

Anthropic Accuses Alibaba of Illicit AI Capability Extraction Campaign

Anthropic has formally accused Alibaba of conducting a campaign to brazenly and illicitly extract capabilities from its artificial intelligence models. The company alleges that this activity involved unauthorized access methods designed to bypass standard security protocols. These accusations highlight growing concerns regarding the protection of proprietary machine learning technologies in the competitive AI sector. Reports indicate that the alleged extraction efforts were systematic rather than incidental. This dispute underscores the intensifying rivalry between major tech firms over advanced model development. The specific technical details of the extraction methods remain under investigation by both parties.

media r/LocalLLaMA · 6h ago

SupraWeather-Nano-Preview: A Small FT-Transformer for Weather Classification

SupraLabs has released SupraWeather-Nano, a preview model designed to classify weather phenomena from raw tabular meteorological data. The architecture uses a dedicated Feature Tokenizer and Transformer Encoder: each input feature receives its own learned token, and a CLS token aggregates these tokens through a small transformer stack. This approach eliminates the need for text inputs or system prompts, so users enter numerical values directly and receive a classification result. The model accepts nine inputs: temperature, humidity, pressure, pressure trend, wind speed, wind direction, altitude, month, and air mass. It was trained entirely on a synthetic, rule-generated dataset of 120,000 samples. SupraLabs notes that this is an architecture experiment rather than a tool for real-world forecasting, with five of six internal stress tests passing.

arxiv arXiv cs.CL · 6h ago

HIPE-2026: Person-Place Relation Extraction from Multilingual Historical Texts

The HIPE-2026 campaign addresses the challenge of extracting person-place relations from noisy, multilingual historical documents. Moving beyond previous editions focused on named entity recognition, this third iteration targets temporally grounded relationships labeled as 'at' and 'isAt'. The evaluation involved 17 participating teams processing data in French, German, and English across three distinct datasets. These datasets comprised nineteenth and twentieth-century newspaper text alongside a surprise domain set of early modern French literary works. A key feature of the campaign was its three-fold framework assessing predictive accuracy, computational efficiency, and cross-domain generalization. Results from over 40 submitted runs demonstrated a wide variety of strategies, ranging from large language models to lightweight classifiers. The findings highlight the inherent trade-offs between accuracy, efficiency, and robustness in large-scale historical relation extraction.