media r/LocalLLaMA · 7h ago

Anthropic Accuses Alibaba of Illicit AI Capability Extraction Campaign

Anthropic has formally accused Alibaba of conducting a campaign to brazenly and illicitly extract capabilities from its artificial intelligence models. The company alleges that this activity involved unauthorized access methods designed to bypass standard security protocols. These accusations highlight growing concerns regarding the protection of proprietary machine learning technologies in the competitive AI sector. Reports indicate that the alleged extraction efforts were systematic rather than incidental. This dispute underscores the intensifying rivalry between major tech firms over advanced model development. The specific technical details of the extraction methods remain under investigation by both parties.

media r/LocalLLaMA · 7h ago

SupraWeather-Nano-Preview: A Small FT-Transformer for Weather Classification

SupraLabs has released SupraWeather-Nano, a preview model designed to classify weather phenomena from raw tabular meteorological data. The architecture pairs a Feature Tokenizer with a Transformer Encoder: each input feature receives its own learned token, and a CLS token aggregates them as the sequence passes through a small transformer stack. This approach eliminates the need for text inputs or system prompts, so users supply numerical values directly and receive a classification result. The model accepts nine inputs: temperature, humidity, pressure, pressure trend, wind speed, wind direction, altitude, month, and air mass. It was trained entirely on a synthetic, rule-generated dataset of 120,000 samples. SupraLabs notes that this is an architecture experiment rather than a tool for real-world forecasting, and reports that five of six internal stress tests passed.
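To make the tokenization step concrete, here is a minimal NumPy sketch of an FT-Transformer-style feature tokenizer. It is not SupraLabs' code: the function names, the embedding width, and the choice to treat all nine inputs as numeric are assumptions for illustration, and the pooling step uses plain dot-product attention without the learned projections a real transformer layer would have.

```python
import numpy as np

# The nine inputs named in the summary; treating all of them as numeric is an assumption.
FEATURES = [
    "temperature", "humidity", "pressure", "pressure_trend", "wind_speed",
    "wind_direction", "altitude", "month", "air_mass",
]


def tokenize_features(x, weight, bias, cls_token):
    """Turn a (batch, n_features) table into (batch, n_features + 1, d) tokens.

    Each scalar feature j becomes the vector x[:, j] * weight[j] + bias[j],
    and a shared CLS token is prepended at position 0.
    """
    tokens = x[:, :, None] * weight[None, :, :] + bias[None, :, :]
    cls = np.broadcast_to(cls_token, (x.shape[0], 1, cls_token.shape[0]))
    return np.concatenate([cls, tokens], axis=1)


def cls_attention_pool(tokens):
    """Simplified single-head attention in which only the CLS token queries.

    Hypothetical stand-in for a transformer layer: no learned Q/K/V projections.
    Returns one (batch, d) vector that a classifier head could consume.
    """
    d = tokens.shape[-1]
    query = tokens[:, :1, :]                                  # (batch, 1, d)
    scores = query @ tokens.transpose(0, 2, 1) / np.sqrt(d)   # (batch, 1, n + 1)
    scores = scores - scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return (weights @ tokens)[:, 0, :]
```

With a batch of four rows and an embedding width of 16, `tokenize_features` yields an array of shape `(4, 10, 16)`: one token per feature plus the CLS token.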

arxiv arXiv cs.CL · 7h ago

HIPE-2026: Person-Place Relation Extraction from Multilingual Historical Texts

The HIPE-2026 campaign addresses the challenge of extracting person-place relations from noisy, multilingual historical documents. Moving beyond previous editions focused on named entity recognition, this third iteration targets temporally grounded relationships labeled as 'at' and 'isAt'. The evaluation involved 17 participating teams processing data in French, German, and English across three distinct datasets. These datasets comprised nineteenth and twentieth-century newspaper text alongside a surprise domain set of early modern French literary works. A key feature of the campaign was its three-fold framework assessing predictive accuracy, computational efficiency, and cross-domain generalization. Results from over 40 submitted runs demonstrated a wide variety of strategies, ranging from large language models to lightweight classifiers. The findings highlight the inherent trade-offs between accuracy, efficiency, and robustness in large-scale historical relation extraction.

arxiv arXiv cs.CL · 7h ago

Weave of Formal Thought: Uniting Rigorous Syntactic Validation with Learned Structural Representations

The authors introduce Weave of Formal Thought (WoFT), a paradigm combining rigorous syntactic validation with learned structural representations for code generation. The approach utilizes a formal engine and constrained decoder that is sound and complete regarding the full Tree-sitter specification. By augmenting generalized LR parsing with speculative lexing, the system maintains concurrent lexer-state hypotheses to admit valid program prefixes while rejecting invalid ones. Additionally, WoFT employs latent-variable fine-tuning to train models to interleave non-terminal grammar symbols directly into the generation process. This method uses the reweighted wake-sleep algorithm to optimize the importance-weighted evidence lower bound of the surface text. The model learns to selectively retain formal derivations as an adaptive structural scratchpad during inference. Experiments on Python show that fine-tuning StarCoder2-3B with this objective reduces per-token cross-entropy by 14.3% compared to a text-only baseline.

github llama.cpp · 7h ago

llama.cpp b9788 adds SYCL tensor parallelism for dual-GPU setups

The llama.cpp release b9788 introduces support for tensor parallelism via the --split-mode tensor flag in the SYCL backend. This implementation enables dual-GPU communication by adding comm_init, comm_free, and comm_allreduce_tensor functions to the meta-backend. For two devices, it uses a ring all-reduce strategy that switches between FP32 direct memcpy for small tensors and BF16 compression for larger ones. The code avoids OneCCL due to its single-device-per-process limitation, instead using persistent buffers to maintain SYCL pool invariants. Performance tests on dual Intel Arc Pro B70 GPUs show significant speedups over layer mode for Llama-3.3-70B and Qwen3-Coder-Next-80B-A3B models. The update includes new binaries for macOS, Linux, Windows, Android, and openEuler across CPU, CUDA, ROCm, Vulkan, and SYCL targets.
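The size-dependent switch described above can be illustrated with a small NumPy simulation of a two-device all-reduce. This is a sketch under stated assumptions, not the llama.cpp implementation: the threshold value and all function names are hypothetical, BF16 is emulated by truncating the low 16 bits of each FP32 value (real converters usually round), and the two "devices" are just two arrays.

```python
import numpy as np

SMALL_TENSOR_BYTES = 64 * 1024  # hypothetical cutoff; the release notes do not state one


def to_bf16_bits(x):
    """Keep the top 16 bits of each float32 (truncating BF16 emulation)."""
    x = np.ascontiguousarray(x, dtype=np.float32)
    return (x.view(np.uint32) >> 16).astype(np.uint16)


def from_bf16_bits(bits):
    """Expand 16-bit BF16 payloads back to float32 by zero-filling the low bits."""
    return (bits.astype(np.uint32) << 16).view(np.float32)


def allreduce_two_devices(a, b):
    """Sum-reduce float32 buffers held by two simulated devices.

    Small tensors are exchanged as raw FP32 copies; larger ones are sent as
    BF16 payloads, halving the transferred bytes at the cost of precision.
    Returns the reduced buffer as seen by device 0 and by device 1.
    """
    if a.nbytes <= SMALL_TENSOR_BYTES:
        recv_on_0, recv_on_1 = b.copy(), a.copy()
    else:
        recv_on_0 = from_bf16_bits(to_bf16_bits(b))
        recv_on_1 = from_bf16_bits(to_bf16_bits(a))
    return a + recv_on_0, b + recv_on_1
```

In this sketch each device adds its own full-precision buffer to the peer's compressed one, so the two results can differ slightly on the compressed path. Whether the actual backend behaves the same way is not stated in the release summary.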

github llama.cpp · 7h ago

llama.cpp b9789 Release Fixes MoE Quantization and Provides Multi-Platform Binaries

The llama.cpp project has released version b9789, which includes a critical fix for quantizing Mixture of Experts (MoE) models with multi-token prediction. This update addresses issues identified in pull request #24986 to ensure proper handling of these specific model architectures. The release provides pre-built binaries for macOS Apple Silicon and Intel, as well as an iOS XCFramework. Linux users can download builds for Ubuntu across CPU, Vulkan, ROCm 7.2, OpenVINO, and SYCL backends. Windows support includes CPU, CUDA 12.4 and 13.3, Vulkan, OpenVINO, SYCL, and HIP variants. Additional platforms such as Android arm64 and openEuler are also supported with specific hardware configurations.

arxiv arXiv cs.CL · 8h ago

SpeechEQ: Benchmarking Emotional Intelligence in Socially Aware Voice Conversational Models

The authors introduce SpeechEQ, a comprehensive framework designed to evaluate the sociolinguistic reasoning of Speech-Language Models. Existing evaluations often overlook the complex cross-modal reasoning required for active dialogue by relying on isolated text or passive acoustic perception. The framework includes a validated dataset of 2,265 dialogues across 15 Emotional Quotient subscales grounded in EQ-i 2.0 theory. It also features a multi-turn evaluation protocol measured by the proposed Spoken EQ score, which is inspired by human EQ assessments. Experiments reveal limitations in how both Speech Emotion Recognition and end-to-end models understand paralinguistic cues through speech. While end-to-end architectures outperform cascaded systems, current multimodal models remain bottlenecked by several specific issues. These barriers include a text-reliant modality shortcut, an alignment-induced safety trap, and contextual amnesia.

arxiv arXiv cs.CL · 8h ago

Autodata: An agentic data scientist to create high quality synthetic data

The authors introduce Autodata, a general method that enables AI agents to function as data scientists for building high-quality training and evaluation datasets. The approach involves meta-optimizing these agents so they learn to generate increasingly stronger data through a process called Agentic Self-Instruct. Experiments were conducted across computer science research tasks, legal reasoning, and mathematical object reasoning. Results demonstrate that this agentic creation method yields improved performance compared to classical synthetic dataset creation techniques. Furthermore, the meta-optimization of the data scientist agent itself delivers an even larger performance uplift. This work illustrates how increased inference compute can be converted into higher quality model training data. The authors suggest this direction has the potential to fundamentally change how AI data is built.

arxiv arXiv cs.CL · 8h ago

Dziri Voicebot: End-to-End Speech-to-Speech System for Algerian Dialect

The paper introduces Dziri Voicebot, an end-to-end speech-to-speech conversational system designed for the low-resource Algerian Dialect. This work extends previous text-based dialogue modeling efforts by Bechiri and Lanasri to full speech-based interaction. The proposed modular pipeline integrates automatic speech recognition, natural language understanding, retrieval-augmented generation, and text-to-speech synthesis. Dedicated datasets were constructed for the telecom domain to fine-tune pretrained models for each component. The ASR system utilizes Whisper-based adaptation, while the NLU module combines transformer embeddings with a task-oriented dialogue framework. A neural TTS system was trained on a newly collected dialectal corpus to enable spoken response generation. Experimental results demonstrate strong performance across all components, including low word error rates and high intent classification scores.

lab OpenAI News · 8h ago

OpenAI Research Shows AI Agents Transforming Work

A new research paper from OpenAI demonstrates how artificial intelligence agents are fundamentally changing the nature of work. The study highlights the capability of these agents to execute longer and more complex tasks than previously possible. This technological advancement is credited with expanding productivity across a wide variety of professional roles. The findings suggest a significant shift in how labor is organized and performed through automation. By handling intricate workflows, AI agents are enabling users to achieve greater efficiency. The paper serves as evidence of the growing impact of autonomous systems on modern employment.

arxiv arXiv cs.CL · 8h ago

Tatoxa: A Novel Text Detoxification System for Low-Resource Tatar

The paper introduces Tatoxa, a state-of-the-art system designed for automated text detoxification in the low-resource language of Tatar. This work addresses the lack of research attention given to abusive content mitigation in languages with limited digital resources. The authors present a new dataset specifically created for fine-tuning and evaluating detoxification models within these constrained settings. Comparative experiments demonstrate that Tatoxa outperforms both existing open-source and proprietary commercial large language models on key quality metrics. Furthermore, the study investigates cross-lingual transfer capabilities to assess the viability of using data from other languages. Results indicate that training on native Tatar data is significantly more effective than transferring knowledge from culturally close languages like Russian. Even when a large Russian corpus is available, cross-lingual approaches perform worse than models trained exclusively on native Tatar text.

arxiv arXiv cs.CL · 8h ago

Multi-Step Tool-Use RL Collapse and Supervisory Fixes

Recent agentic reinforcement learning methods for large language models often suffer from instability or limited gains in tool-use tasks. Experiments reveal that some models experience catastrophic collapse, where performance drops abruptly and tool-invocation structures fail. Analysis shows these failures stem from unexpected probability spikes in specific control tokens that disrupt structured execution. Despite this disruption, the underlying tool-use capability remains intact but is obscured by formatting errors. To address this, the study investigates diverse supervisory signals, including off-policy supervision and hint-based guidance, under various training schemes. The authors find that interleaving supervised fine-tuning with reinforcement learning substantially improves training stability. However, this approach performs worse on data that is out-of-distribution in format and content. The results highlight the importance of understanding RL failures to enable robust training for complex multi-step tool-use tasks.

arxiv arXiv cs.CL · 8h ago

Detect, Unlearn, Restore: Defending Text Summarization Models Against Data Poisoning

The study addresses the threat of training-time data poisoning during fine-tuning for abstractive text summarization models. Adversaries manipulate small task-specific datasets to induce persistent summarization failures while maintaining standard evaluation metrics. A unified post-hoc defense framework is proposed to detect and remediate poisoning across the machine learning supply chain. In white-box settings, detection relies on influence-function analysis identifying abnormally high training influence in poisoned pairs. Black-box defenses utilize behavioral auditing based on increased sensitivity to semantics-preserving perturbations. The authors introduce novel attacks targeting factual distortion and representational bias that evade conventional alarms. Experiments across nine architectures and six benchmarks show 85-92% detection precision for the proposed defenses. Gradient-ascent unlearning restores up to 96% of original behavior with less than 0.6% ROUGE degradation.

arxiv arXiv cs.CL · 8h ago

Study finds readers prefer human over AI literary translations despite adequate machine quality

A recent study investigates reader preferences regarding AI versus human translations of literary works, noting that while automatic metrics often favor machine output, they fail to capture immersive and literary effects. Researchers asked 15 avid readers to compare human translations against those generated by an agentic LLM pipeline for 15 novels in French, Polish, and Japanese. The evaluation involved approximately 8K-word excerpts through both immersive reading of whole texts and close reading of aligned chunk pairs. Results showed that while readers found machine translations adequate, they significantly preferred human versions for their clarity and ease of immersion. Notably, participants could not reliably distinguish between the two types of translation and tended to favor whichever version they believed was human-made. To support future research, the authors released LAIT, a reader-centered dataset containing 1K comments, 2K judgments, and 7.2K span-level annotations.

arxiv arXiv cs.CL · 8h ago

Evaluating OCR-Reasoning Robustness of Vision-Language Models Under Visual Perturbations

The authors introduce OCR-Robust, a benchmark designed to evaluate the robustness of vision-language models during OCR reasoning tasks under visual perturbations. The dataset comprises 812 samples divided into two subsets: OCR1.0, which covers documents and handwriting, and OCR2.0, focusing on charts and tables. A pilot study identified five representative perturbation types at three severity levels to ensure efficient evaluation. The study benchmarks 18 models, including proprietary systems and open-source VLMs, using metrics like Relative Corruption Retention and Worst-Case Retention. Results indicate that higher clean accuracy does not necessarily correlate with stronger robustness against visual degradation. Furthermore, the analysis reveals that charts and tables are substantially more fragile than document-like inputs when subjected to these perturbations.

media Hugging Face Forums · 8h ago

Bro77XP Releases Beginner-Friendly Local AI VTuber with Zero-Shot Voice Cloning

Bro77XP has released a 100% local, free AI VTuber project designed for beginners and non-programmers. The system utilizes Whisper for real-time English speech recognition, Ollama with the llama3.2 model for LLM inference, and Chatterbox TTS for text-to-speech generation. It features instant zero-shot voice cloning and operates in a continuous listening loop that automatically detects silence to record only when speech is present. The software integrates with VTube Studio via its API to control mouth expressions and trigger emotion animations based on the generated responses. While initially developed on an AMD GPU, the code primarily supports CPU users, allowing operation without specific NVIDIA or AMD hardware. Setup requires Python 3.10.11 and involves creating a virtual environment to install core dependencies like openai-whisper, pyaudio, and websocket-client.
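One common way to "record only when speech is present" is an energy threshold over short audio frames. The sketch below shows that idea on an in-memory NumPy array; it is an assumption about the general technique, not the project's actual code, and the function name, frame length, and threshold are hypothetical.

```python
import numpy as np


def speech_frames(samples, frame_len=480, rms_threshold=0.02):
    """Flag which fixed-size frames are loud enough to count as speech.

    samples: 1-D float array scaled to [-1, 1] (e.g. 16 kHz mono audio).
    Returns a boolean array with one entry per complete frame; trailing
    samples that do not fill a frame are ignored.
    """
    n_frames = len(samples) // frame_len
    frames = np.asarray(samples[: n_frames * frame_len], dtype=np.float64)
    frames = frames.reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))  # root-mean-square energy per frame
    return rms > rms_threshold
```

A listening loop could start buffering audio at the first `True` frame and stop after a run of consecutive `False` frames, then pass the buffered segment to the speech recognizer.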

arxiv arXiv cs.CL · 9h ago

Natural Ungrokking: Asymmetric Control of Which Rules Survive Pretraining

A study identifies 'natural ungrokking,' a phenomenon where small language models lose learned grammatical rules midway through pretraining despite the evidence remaining in the data. Researchers observed that a model learning pronoun-gender agreement with Sue collapsed from 0.94 accuracy to near zero by step 3,500 without any corresponding spike in the loss curve. The survival of these rules is determined by support frequency within the training stream, while the data-to-parameter ratio only modulates the depth of the collapse. This emergence-then-collapse dynamic was replicated across multiple corpora, budgets, and seeds, and confirmed in public Pythia checkpoints where collapse depth correlated with model scale. The forgetting process acts as a displacement mechanism where a competing surface pattern out-competes the rule, causing the log-probability margin to cross zero within 100 steps of behavioral failure. Control over this fate is asymmetric; while injecting counter-evidence can destroy rules via a monotone dose-response, restoring support even at 450 times the sustaining level fails to recover them.

arxiv arXiv cs.CL · 9h ago

Keyword Lexicon Blindness Distorts Rhetorical Stance Measurement

A study analyzing 85 interviews with four public intellectuals reveals that keyword-based scoring can produce statistical artifacts regarding rhetorical stance. Initial analysis showed a robust negative-affect and emphatic-certainty co-occurrence pattern with high correlation coefficients ranging from r = 0.72 to 0.93. However, replacing this method with LLM-based zero-shot semantic classification on the full diarized corpus of 32,625 sentences significantly reduced these correlations. For instance, Dalio's correlation dropped from 0.851 to 0.206, while other speakers exhibited negative or null relationships between negativity and certainty. In contrast, the LLM analysis revealed a strong coupling between negative sentiment and hedging language, aligning with conventional expectations of pessimistic discourse. The discrepancy stems from three structural failures in keyword lexicons: syntactic blindness, polysemy blindness, and categorical absence. These flaws can invert semantic meaning, such as scoring 'never absolutely totally confident' as high certainty. The authors argue that keyword counts measure lexical co-occurrence tendencies rather than epistemic certainty, constituting a category error.
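The inversion in the example phrase is easy to reproduce with a toy keyword counter. The lexicon and function names below are hypothetical and far smaller than anything a real study would use; the point is only that a bag-of-words count cannot see the negation.

```python
# Hypothetical mini-lexicons, for illustration only.
CERTAINTY_WORDS = {"absolutely", "totally", "confident", "certainly", "definitely"}
NEGATORS = {"never", "not", "no"}


def _tokens(sentence):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    return [t.strip(".,!?'\"") for t in sentence.lower().split()]


def keyword_certainty_score(sentence):
    """Count certainty keywords, ignoring syntax entirely."""
    return sum(t in CERTAINTY_WORDS for t in _tokens(sentence))


def contains_negator(sentence):
    """Report whether any negator is present, which the keyword score never checks."""
    return any(t in NEGATORS for t in _tokens(sentence))
```

Here `keyword_certainty_score("never absolutely totally confident")` and `keyword_certainty_score("absolutely totally confident")` both return 3, even though `contains_negator` is `True` only for the first phrase.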

arxiv arXiv cs.CL · 9h ago

Auditing Order Sensitivity in Multimodal Large Language Models

The study introduces Facet-Probe, a five-facet audit of 18 frontier and open-weight multimodal large language models to assess order sensitivity. Standard benchmarks often miss whether shuffling evidence changes answers, a reliability property highlighted by emerging AI evaluation guidelines. Using a Bayesian item-response model, the researchers separated ordering noise from per-facet bias and estimated decoder-stochastic floors via same-ordering controls. The audit found that none of the 18 models are order-invariant, with panel-mean flip rates spanning 24-50% across different facets. Even the best-performing model flipped its answer on 13.4% of trials, indicating that higher capability does not eliminate this vulnerability. Mitigation tests using training-free prompt changes proved modality-conditional and failed to transfer between text and visual reasoning tasks. These findings suggest that prompt-level fixes are insufficient for general order robustness, motivating architectural solutions. The authors propose cross-ordering flip rate as a standard reporting axis for future MLLM evaluations.
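A flip rate of this kind can be computed in a few lines. The summary does not give the paper's exact definition, so the sketch below assumes the simplest one: an item counts as flipped if the model's answers are not identical across all tested orderings of the same evidence. The function name and data layout are hypothetical.

```python
def flip_rate(answers_by_item):
    """Fraction of items whose answer changes across evidence orderings.

    answers_by_item: list of lists; each inner list holds the model's answers
    to one item under different orderings of the same evidence.
    """
    if not answers_by_item:
        raise ValueError("need at least one item")
    flipped = sum(1 for answers in answers_by_item if len(set(answers)) > 1)
    return flipped / len(answers_by_item)
```

Applying the same function to repeated runs with an identical ordering gives an estimate of the decoder-stochastic floor, which can then be compared against the cross-ordering rate.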

arxiv arXiv cs.CL · 9h ago

Real-Time Voice AI Hears but Does Not Listen

A study evaluates four leading production real-time voice systems: OpenAI's GPT Realtime 2, Google's Gemini 3.1 Flash Live, and Alibaba's Qwen3.5 Omni Plus and Omni Flash. The research focuses on tasks where both words and vocal delivery convey meaningful information across three consequential scenarios. All four systems act on the literal words rather than the voice, leading to errors such as ending calls with crying users who insist nothing is wrong or approving wire transfers made in frightened voices. Surprisingly, this disconnect is often not a failure of perception, as three of the four systems can reliably identify distress, fear, or sarcasm when asked directly. Despite this awareness, the models ignore these emotional cues during decision-making, exhibiting what the authors term the 'emotional intelligence gap.' The study also notes that systems estimate accent and age based on word biases rather than acoustic properties. Prompting the systems to explicitly attend to vocal delivery improves performance only partially and inconsistently. These findings suggest current real-time voice AI behaves as if speech were reduced to a transcript, warranting caution in settings where tone is critical.