All articles — korshunov.ai

Backtrack Sampler and Verifier Drastically Improve Tiny Model Coding Performance

A new backtrack sampler combined with a verifier model significantly enhances the coding performance of tiny 0.5B parameter models, potentially making them competitive with larger 2-4B class models without weight changes. In theory, the approach also addresses hallucination in large models by correcting errors during generation through re-sampling. However, the method incurs a 5-30% decode speed penalty due to the need for backward passes and requires training a verifier model of similar size to the original. This requirement doubles VRAM usage and raises compute demands to 1.5 to 3 times those of standard inference. Despite these costs, the verifier generalizes across models of equal or lower weight classes if trained on diverse data distributions. Training the verifier is highly efficient, requiring only approximately 0.01% of the token count used for full pre-training.
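
The post does not spell out the sampler's internals, so the following is only a minimal sketch of the general generate, verify, rewind, re-sample loop. The names `backtrack_sample`, `toy_propose`, and `toy_verify`, as well as the `threshold`, `window`, and `max_retries` parameters, are illustrative assumptions and not the author's implementation.

```python
import random

def backtrack_sample(propose, verify, max_tokens, threshold=0.5,
                     window=1, max_retries=50, seed=0):
    """Toy backtracking decoder (hypothetical interface, not the post's code).

    propose(prefix, rng) -> next token   stands in for the small generator
    verify(prefix) -> score in [0, 1]    stands in for the verifier model
    """
    rng = random.Random(seed)
    tokens, retries = [], 0
    while len(tokens) < max_tokens:
        tokens.append(propose(tokens, rng))
        # If the verifier rejects the prefix, drop the last `window` tokens
        # and sample that stretch again. After max_retries, accept as-is.
        if retries < max_retries and verify(tokens) < threshold:
            del tokens[-window:]
            retries += 1
    return tokens, retries

def toy_propose(prefix, rng):
    """Stand-in generator that sometimes emits a bad token."""
    return rng.choice(["x", "=", "1", "bug"])

def toy_verify(prefix):
    """Stand-in verifier: rejects any prefix containing the bad token."""
    return 0.0 if "bug" in prefix else 1.0
```

Calling `backtrack_sample(toy_propose, toy_verify, max_tokens=5)` returns the accepted tokens together with the number of rewinds. In a real setup every `verify` call would be an extra model evaluation, which is one plausible source of the overhead the post describes.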

media r/LocalLLaMA · 3h ago

NVIDIA Releases Nemotron-TwoTower-30B-A3B, a Diffusion-Based Language Model

NVIDIA has released the Nemotron-TwoTower-30B-A3B-Base-BF16 model, which is built upon the Nemotron 3 Nano 30B-A3B backbone. This architecture diverges from standard autoregressive models by utilizing a frozen context tower alongside a diffusion denoiser tower. The system iteratively fills blocks of tokens in parallel rather than generating them strictly one at a time. According to NVIDIA, the default mask-diffusion setup retains 98.7% of the autoregressive baseline's aggregate benchmark quality. It also reaches 2.42 times the baseline's wall-clock generation throughput. The release highlights a novel approach to language modeling that combines diffusion techniques with large-scale language capabilities.

media r/LocalLLaMA · 4h ago

Experimental USB4 RDMA Implementation Demonstrated on Strix Halo

A blog post from Hellas.ai details an experimental implementation of Remote Direct Memory Access (RDMA) over Thunderbolt. The demonstration was conducted using two devices equipped with AMD's Strix Halo processors. This approach allows for high-speed data transfer capabilities via the USB4 standard. The author notes that this technology could be significant because it is compatible with any host supporting USB4. No prior public discussion of this specific implementation was found by the submitter. The work highlights the potential for leveraging existing hardware interfaces for advanced networking tasks.

media r/LocalLLaMA · 4h ago

GLM 5.2 on Dual Strix Halo (256GB): Worth it?

A Reddit user named Intrepid_Rub_3566 has shared a video review evaluating the performance of GLM 5.2 running on a dual AMD Strix Halo setup with 256GB of RAM. The discussion centers on whether this specific hardware configuration provides sufficient value for local large language model inference. The content highlights the technical feasibility of deploying GLM 5.2 in such an environment, focusing on resource utilization and speed. Viewers are directed to a YouTube link for detailed benchmarks and performance metrics. The thread also includes community comments discussing the practicality and cost-effectiveness of this dual Strix Halo setup.

media r/LocalLLaMA · 4h ago

Reddit Inquiry on Using Local Models for Self-Hacking

A user on the r/LocalLLaMA subreddit asked if anyone has attempted to gain root access to their own system using a local large language model. This inquiry was prompted by recent discussions regarding Mythos's alleged ability to hack into US government systems. The post seeks practical experiences from the community regarding the feasibility of such actions. It specifically concerns using local models for penetration testing of one's own machine. The question highlights concerns about the security implications of powerful AI tools in the hands of individuals.

media r/LocalLLaMA · 4h ago

User Reports Inferior Quality and Efficiency with MTP Models in Qwen 3.6 and Gemma 4

A user testing self-hosted Qwen 3.6 27B and Gemma 4 models on four RTX 5070 Ti cards reports that Multi-Token Prediction (MTP) degrades output quality compared to non-MTP variants. In code review tasks, the non-MTP model produced more detailed findings with fix suggestions while consuming fewer tokens than its MTP counterpart. The non-MTP setup reached approximately 2000 tokens per second for prompt processing (pp) and 50-60 tokens per second for token generation (tg). The MTP configuration generated faster, at 100-120 tokens per second, but processed prompts more slowly, at around 1300 tokens per second. Despite the higher generation throughput, real-world agent tasks completed only about 20% faster with MTP because of increased context consumption. The user ran llama.cpp with GGUF files from Unsloth and noted similar negative experiences when testing Gemma 4.
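
A rough back-of-the-envelope model shows why doubling generation speed need not halve task time. The speeds below are midpoints of the ranges reported in the post; the token counts are hypothetical and chosen only for illustration, as is the assumption that "increased context consumption" shows up as more generated tokens.

```python
def task_seconds(prompt_tokens, generated_tokens, pp_speed, tg_speed):
    """Wall-clock estimate: prompt processing time plus generation time."""
    return prompt_tokens / pp_speed + generated_tokens / tg_speed

# Speeds in tokens per second, taken from the reported ranges.
# Token counts are made-up inputs, not measurements from the post.
baseline = task_seconds(50_000, 6_000, pp_speed=2000, tg_speed=55)
mtp = task_seconds(50_000, 8_000, pp_speed=1300, tg_speed=110)
speedup = baseline / mtp
```

With these inputs the estimate is about 134 seconds without MTP and about 111 seconds with it, a speedup of roughly 1.2x even though generation alone is twice as fast. Slower prompt processing and the extra generated tokens absorb most of the gain.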

media r/LocalLLaMA · 4h ago

Developer Requests Testing for MTP Support in GLM-4.7-Flash via llama.cpp

A developer is seeking community assistance to test Multi-Token Prediction (MTP) support for the GLM-4.7-Flash model within the llama.cpp framework. The author acknowledges that previous models like GLM Air and GLM Flash are outdated but expresses a personal interest in enabling MTP for them. The request specifically targets users who possess the necessary hardware to run GLM-4.7-Flash and have the technical ability to compile llama.cpp from source. Participants are asked to evaluate the functionality of the provided GGUF model and report any encountered issues. Additionally, testers are requested to measure and share the performance speed gains achieved through MTP implementation. The developer has uploaded the test model to a Hugging Face repository for immediate access. Users requiring smaller quantization options are invited to contact the author directly for alternative versions.

media r/LocalLLaMA · 4h ago

Question on why ROCm and Intel stacks lag behind CUDA in software ecosystem maturity

The author questions why software ecosystems for AMD's ROCm and Intel have failed to rapidly improve to match NVIDIA's CUDA. It is argued that until competing vendors' software catches up, NVIDIA will continue to charge a massive premium for its convenient products. The poster identifies as a user of both NVIDIA and Apple Silicon hardware for AI development. They express a desire for more affordable prices within the market. The argument suggests that price reductions will only occur when genuine competition exists. This perspective highlights the current dominance of CUDA in the AI hardware landscape.

media r/LocalLLaMA · 4h ago

Community Discussion on Running DeepSeek V4 Flash with MoE Offload

A Reddit user inquired about the feasibility of running the DeepSeek V4 Flash model using Mixture of Experts offload techniques. The poster noted that previous attempts to fit the desired model and its KV cache into VRAM required an additional 5-10GB of memory headroom. They highlighted several community resources, including a GGUF version of the model available on Hugging Face from the huihui-ai team. Additionally, the user pointed to a fork of antirez's repository that introduces tensor parallelism and socket enhancements for improved performance. The discussion also referenced Fringe's specific implementation designed for DeepSeek V4 Flash CUDA support. Consequently, the user considered building the code and downloading the nearly 100GB model file to test these offloading capabilities.

media r/LocalLLaMA · 4h ago

Anthropic Accuses Alibaba of Illicit AI Capability Extraction Campaign

Anthropic has formally accused Alibaba of conducting a campaign to brazenly and illicitly extract capabilities from its artificial intelligence models. The company alleges that this activity involved unauthorized access methods designed to bypass standard security protocols. These accusations highlight growing concerns regarding the protection of proprietary machine learning technologies in the competitive AI sector. Reports indicate that the alleged extraction efforts were systematic rather than incidental. This dispute underscores the intensifying rivalry between major tech firms over advanced model development. The specific technical details of the extraction methods remain under investigation by both parties.

media r/LocalLLaMA · 4h ago

SupraWeather-Nano-Preview: A Small FT-Transformer for Weather Classification

SupraLabs has released SupraWeather-Nano, a preview model designed to classify weather phenomena from raw tabular meteorological data. The architecture is an FT-Transformer (Feature Tokenizer plus Transformer encoder): each input feature receives its own learned token, a CLS token is added to the sequence, and a small transformer stack lets the CLS token aggregate the feature tokens for classification. This approach eliminates the need for text inputs or system prompts, allowing users to directly input numerical values to receive a classification result. The model accepts nine specific inputs: temperature, humidity, pressure, pressure trend, wind speed, wind direction, altitude, month, and air mass. It was trained entirely on a synthetic, rule-generated dataset of 120,000 samples. SupraLabs notes that this is an architecture experiment rather than a tool for real-world forecasting, with five out of six internal stress tests passing successfully.
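
The feature-tokenizer idea can be sketched in a few lines of plain Python. This is a generic illustration of FT-Transformer style tokenization, not SupraLabs' code: the snake_case feature names, `init_tokenizer`, and `tokenize` are assumptions, random values stand in for learned parameters, and all nine inputs are treated as numeric even though the real model may embed some of them (such as month or air mass) as categories.

```python
import random

FEATURES = ["temperature", "humidity", "pressure", "pressure_trend",
            "wind_speed", "wind_direction", "altitude", "month", "air_mass"]

def init_tokenizer(d_model, seed=0):
    """Create per-feature weight and bias vectors plus a CLS vector.

    In a trained model these are learned; random values stand in here.
    """
    rng = random.Random(seed)

    def vec():
        return [rng.gauss(0.0, 0.02) for _ in range(d_model)]

    return {
        "weight": {name: vec() for name in FEATURES},
        "bias": {name: vec() for name in FEATURES},
        "cls": vec(),
    }

def tokenize(sample, params):
    """Map one row of numeric features to a [CLS] + 9 token sequence."""
    tokens = [list(params["cls"])]
    for name in FEATURES:
        w, b = params["weight"][name], params["bias"][name]
        # Numeric feature token: value * weight_vector + bias_vector.
        tokens.append([sample[name] * wi + bi for wi, bi in zip(w, b)])
    return tokens
```

The resulting 10 token vectors would then be fed to the transformer stack, with the final CLS vector passed to a classification head.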

arxiv arXiv cs.CL · 4h ago

HIPE-2026: Person-Place Relation Extraction from Multilingual Historical Texts

The HIPE-2026 campaign addresses the challenge of extracting person-place relations from noisy, multilingual historical documents. Moving beyond previous editions focused on named entity recognition, this third iteration targets temporally grounded relationships labeled as 'at' and 'isAt'. The evaluation involved 17 participating teams processing data in French, German, and English across three distinct datasets. These datasets comprised nineteenth- and twentieth-century newspaper text alongside a surprise-domain set of early modern French literary works. A key feature of the campaign was its three-fold framework assessing predictive accuracy, computational efficiency, and cross-domain generalization. Results from over 40 submitted runs demonstrated a wide variety of strategies, ranging from large language models to lightweight classifiers. The findings highlight the inherent trade-offs between accuracy, efficiency, and robustness in large-scale historical relation extraction.

arxiv arXiv cs.CL · 5h ago

Weave of Formal Thought: Uniting Rigorous Syntactic Validation with Learned Structural Representations

The authors introduce Weave of Formal Thought (WoFT), a paradigm combining rigorous syntactic validation with learned structural representations for code generation. The approach utilizes a formal engine and constrained decoder that is sound and complete regarding the full Tree-sitter specification. By augmenting generalized LR parsing with speculative lexing, the system maintains concurrent lexer-state hypotheses to admit valid program prefixes while rejecting invalid ones. Additionally, WoFT employs latent-variable fine-tuning to train models to interleave non-terminal grammar symbols directly into the generation process. This method uses the reweighted wake-sleep algorithm to optimize the importance-weighted evidence lower bound of the surface text. The model learns to selectively retain formal derivations as an adaptive structural scratchpad during inference. Experiments on Python show that fine-tuning StarCoder2-3B with this objective reduces per-token cross-entropy by 14.3% compared to a text-only baseline.
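
To make the idea of admitting valid prefixes and rejecting invalid ones concrete, here is a deliberately tiny sketch of constrained decoding. It swaps the paper's Tree-sitter grammars and generalized LR parsing for a toy bracket-matching check, so `is_valid_prefix` and `constrained_step` are illustrative stand-ins rather than anything from WoFT.

```python
PAIRS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())

def is_valid_prefix(tokens):
    """True if the tokens can still be extended to a balanced bracket string."""
    stack = []
    for tok in tokens:
        if tok in OPENERS:
            stack.append(tok)
        elif tok in PAIRS:
            # A closer must match the most recent unclosed opener.
            if not stack or stack.pop() != PAIRS[tok]:
                return False
    return True

def constrained_step(prefix, scored_candidates):
    """Pick the highest-scoring candidate that keeps the prefix valid.

    scored_candidates is a list of (token, score) pairs standing in for
    model logits. Returns None if every candidate breaks the prefix.
    """
    allowed = [(tok, score) for tok, score in scored_candidates
               if is_valid_prefix(prefix + [tok])]
    if not allowed:
        return None
    return max(allowed, key=lambda pair: pair[1])[0]
```

For example, with the prefix `["(", "["]` the step rejects `")"` even when it has the highest score, because it would close the wrong bracket, and falls back to `"]"`.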

github llama.cpp · 5h ago

llama.cpp b9788 adds SYCL tensor parallelism for dual-GPU setups

The llama.cpp release b9788 introduces support for tensor parallelism via the --split-mode tensor flag in the SYCL backend. This implementation enables dual-GPU communication by adding comm_init, comm_free, and comm_allreduce_tensor functions to the meta-backend. For two devices, it uses a ring all-reduce strategy that switches between FP32 direct memcpy for small tensors and BF16 compression for larger ones. The code avoids OneCCL due to its single-device-per-process limitation, instead using persistent buffers to maintain SYCL pool invariants. Performance tests on dual Intel Arc Pro B70 GPUs show significant speedups over layer mode for Llama-3.3-70B and Qwen3-Coder-Next-80B-A3B models. The update includes new binaries for macOS, Linux, Windows, Android, and openEuler across CPU, CUDA, ROCm, Vulkan, and SYCL targets.
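
The two-device all-reduce with size-dependent precision can be illustrated with a small pure-Python simulation. This is not llama.cpp's SYCL code: `to_fp32`, `to_bf16`, `allreduce_two`, and the `small_limit` element-count threshold are hypothetical, and BF16 is emulated by rounding FP32 bit patterns to their top 16 bits (finite inputs only).

```python
import struct

def to_fp32(x):
    """Round a Python float to FP32 precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_bf16(x):
    """Round a finite float to bfloat16 precision (top 16 bits of FP32)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest even, then drop the low 16 bits.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def allreduce_two(a, b, small_limit=4):
    """Sum the tensors held by two devices; both end with the same result.

    Tensors with at most `small_limit` elements travel as FP32, larger
    ones as BF16-rounded values. The threshold is a made-up example.
    """
    n = len(a)
    wire = to_bf16 if n > small_limit else to_fp32
    half = n // 2
    # Reduce-scatter: device 0 owns [0:half], device 1 owns [half:n].
    # Each device receives the peer's copy of its chunk and adds it.
    owned0 = [wire(a[i] + wire(b[i])) for i in range(half)]
    owned1 = [wire(b[i] + wire(a[i])) for i in range(half, n)]
    # All-gather: each device sends its reduced chunk to the peer.
    out0 = owned0 + owned1
    out1 = owned0 + owned1
    return out0, out1
```

On the small path, `allreduce_two([1.0, 2.0], [0.5, 0.25])` gives `[1.5, 2.25]` on both devices. On the compressed path, small addends can be rounded away: `to_bf16(1.001)` is `1.0`. That rounding is the precision traded for moving half as many bytes.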

github llama.cpp · 5h ago

llama.cpp b9789 Release Fixes MoE Quantization and Provides Multi-Platform Binaries

The llama.cpp project has released version b9789, which includes a critical fix for quantizing Mixture of Experts (MoE) models with multi-token prediction. This update addresses issues identified in pull request #24986 to ensure proper handling of these specific model architectures. The release provides pre-built binaries for macOS Apple Silicon and Intel, as well as an iOS XCFramework. Linux users can download builds for Ubuntu across CPU, Vulkan, ROCm 7.2, OpenVINO, and SYCL backends. Windows support includes CPU, CUDA 12.4 and 13.3, Vulkan, OpenVINO, SYCL, and HIP variants. Additional platforms such as Android arm64 and openEuler are also supported with specific hardware configurations.

arxiv arXiv cs.CL · 5h ago

SpeechEQ: Benchmarking Emotional Intelligence in Socially Aware Voice Conversational Models

The authors introduce SpeechEQ, a comprehensive framework designed to evaluate the sociolinguistic reasoning of Speech-Language Models. Existing evaluations often overlook the complex cross-modal reasoning required for active dialogue by relying on isolated text or passive acoustic perception. The framework includes a validated dataset of 2,265 dialogues across 15 Emotional Quotient subscales grounded in EQ-i 2.0 theory. It also features a multi-turn evaluation protocol measured by the proposed Spoken EQ score, which is inspired by human EQ assessments. Experiments reveal limitations in how both Speech Emotion Recognition and end-to-end models understand paralinguistic cues through speech. While end-to-end architectures outperform cascaded systems, current multimodal models remain bottlenecked by several specific issues. These barriers include a text-reliant modality shortcut, an alignment-induced safety trap, and contextual amnesia.

arxiv arXiv cs.CL · 5h ago

Autodata: An agentic data scientist to create high quality synthetic data

The authors introduce Autodata, a general method that enables AI agents to function as data scientists for building high-quality training and evaluation datasets. The approach involves meta-optimizing these agents so they learn to generate increasingly stronger data through a process called Agentic Self-Instruct. Experiments were conducted across computer science research tasks, legal reasoning, and mathematical object reasoning. Results demonstrate that this agentic creation method yields improved performance compared to classical synthetic dataset creation techniques. Furthermore, the meta-optimization of the data scientist agent itself delivers an even larger performance uplift. This work illustrates how increased inference compute can be converted into higher quality model training data. The authors suggest this direction has the potential to fundamentally change how AI data is built.

arxiv arXiv cs.CL · 5h ago

Dziri Voicebot: End-to-End Speech-to-Speech System for Algerian Dialect

The paper introduces Dziri Voicebot, an end-to-end speech-to-speech conversational system designed for the low-resource Algerian Dialect. This work extends previous text-based dialogue modeling efforts by Bechiri and Lanasri to full speech-based interaction. The proposed modular pipeline integrates automatic speech recognition, natural language understanding, retrieval-augmented generation, and text-to-speech synthesis. Dedicated datasets were constructed for the telecom domain to fine-tune pretrained models for each component. The ASR system utilizes Whisper-based adaptation, while the NLU module combines transformer embeddings with a task-oriented dialogue framework. A neural TTS system was trained on a newly collected dialectal corpus to enable spoken response generation. Experimental results demonstrate strong performance across all components, including low word error rates and high intent classification scores.

lab OpenAI News · 5h ago

OpenAI Research Shows AI Agents Transforming Work

A new research paper from OpenAI demonstrates how artificial intelligence agents are fundamentally changing the nature of work. The study highlights the capability of these agents to execute longer and more complex tasks than previously possible. This technological advancement is credited with expanding productivity across a wide variety of professional roles. The findings suggest a significant shift in how labor is organized and performed through automation. By handling intricate workflows, AI agents are enabling users to achieve greater efficiency. The paper serves as evidence of the growing impact of autonomous systems on modern employment.

arxiv arXiv cs.CL · 5h ago

Tatoxa: A Novel Text Detoxification System for Low-Resource Tatar

The paper introduces Tatoxa, a state-of-the-art system designed for automated text detoxification in the low-resource language of Tatar. This work addresses the lack of research attention given to abusive content mitigation in languages with limited digital resources. The authors present a new dataset specifically created for fine-tuning and evaluating detoxification models within these constrained settings. Comparative experiments demonstrate that Tatoxa outperforms both existing open-source and proprietary commercial large language models on key quality metrics. Furthermore, the study investigates cross-lingual transfer capabilities to assess the viability of using data from other languages. Results indicate that training on native Tatar data is significantly more effective than transferring knowledge from culturally close languages like Russian. Even when a large Russian corpus is available, cross-lingual approaches perform worse than models trained exclusively on native Tatar text.