Research paper
arxiv arXiv cs.LG · 16h ago

LLMs Determine Causal Structure via Difference-Making Logic

The article addresses the puzzle of how large language models acquire causal structure despite the limitations of standard formalisms like Judea Pearl's interventionist approach and the Neyman-Rubin framework. It argues that LLMs utilize a specific inductive method known as variational induction, which relies on difference-making logic. During training, models process vast amounts of text from diverse contexts to identify what constitutes a difference-maker or an indifference-maker within word sequences. The analysis examines how architectural components, specifically token embeddings and self-attention mechanisms, facilitate this variational induction process. This logical framework fundamentally parallels the experimental method used in science. In both cases, causal relations are derived by systematically varying individual circumstances to observe their influence on a phenomenon.

arxiv arXiv cs.LG · 16h ago

Escaping the Variance Trap: Jacobian-Free Dynamics for Root-Finding Bilevel Optimization

The authors identify a critical flaw termed the Variance Trap, which arises when stochastic root-finding problems are forced into minimization frameworks via squared residuals. Standard bilevel minimization algorithms require estimating hypergradients involving implicit Jacobians that act as noise amplifiers in stochastic settings. To address this, the paper formalizes Root-Finding Bilevel Optimization (RF-BO) as a distinct problem class to bypass this pathology. A Jacobian-free solution using Two-Time-Scale Stochastic Approximation (TTSA) is proposed to update directly along the root error. The study provides the first non-asymptotic convergence guarantees for TTSA in this setting under Markovian noise. Experiments show a 2.6% top-1 accuracy gain in SimCLR and 17x faster convergence in non-linear ODE control compared to baselines. Additionally, the framework achieves significantly improved entropy stability in reinforcement learning and an 11.1% quality improvement in generative modeling.

arxiv arXiv cs.LG · 16h ago

RQ-TTSA: Distribution-Aware Robust Bilevel Optimization with Quantile-Guided Huber Updates

The authors propose RQ-TTSA, a distribution-aware framework designed to address instability in bilevel optimization caused by heavy-tailed stochastic noise. Unlike existing variance-reduction techniques that rely on myopic magnitude checks, this method uses historical gradient buffers to estimate rolling quantiles for adaptive Huber-style clipping. This approach preserves local optimization geometry while strictly bounding effective variance under nonconvex-strongly convex assumptions with infinite-variance noise. Theoretical analysis derives a convergence rate of O(T^(-(p-1)/(3p-2))) that recovers optimal dependence on the heavy-tailed parameter p. Empirical evaluations across six diverse tasks, including vision benchmarks and offline reinforcement learning, show consistent outperformance over state-of-the-art baselines. RQ-TTSA eliminates divergence spikes and ensures stable convergence with negligible computational overhead of approximately 2.7 percent.

media r/LocalLLaMA · 17h ago

Colony: An Educational Simulation of LLM Attention Mechanisms Using Agent-Based Analogies

Colony is an educational resource designed to explain the attention mechanism of Large Language Models through simple analogies involving agents. The simulation places these agents within a board environment inspired by Conway's Game of Life. Each agent in the system represents a specific role within the self-attention block mechanism of an LLM. This visual approach allows users to observe how information flows and interacts during the attention process. The project is available as an open-source tool for those interested in exploring these concepts without complex mathematics. It serves as a fun and accessible way to understand the internal workings of transformer models.

arxiv arXiv cs.LG · 20h ago

AdaR: Adaptive Recurrent Message Passing for Graph Test-Time Computing

AdaR enables flexible test-time computing on graphs without parameter changes by using adaptive recurrence. It derives step dependence as a necessary and sufficient condition for convergence and incorporates normalized step information and representation-target relations into recurrent updates, guided by gradient-based supervision signals. Empirical results show AdaR outperforms strong baselines in both inductive and transductive graph learning settings.

arxiv arXiv cs.LG · 20h ago

Speech-Text Models Latently Transcribe Speech in Intermediate Layers

Interleaved speech-language models undergo an implicit transcription phase where spoken words become decodable as text tokens in intermediate layers, despite no speech recognition training. Up to 77% of the data shows the spoken word appearing as a top candidate text prediction, followed by a transition to text-based next-word prediction before returning to speech. This behavior is influenced by interleaved training and text LM initialization, and correlates with spoken knowledge performance.

arxiv arXiv cs.LG · 21h ago

The Scissors Effect: Resize Diversity Hurts Robust Surrogate Transfer

Input diversity, a common practice in transfer attacks, improves success on standard surrogates but reduces it on robust ones. This regime-dependent effect, called the Scissors Effect, is driven by gradient geometry, with resize operations degrading alignment in robust models. A training-free rule (CG-DI) adjusts diversity based on local gradient consistency to preserve attack success across surrogate types.

arxiv arXiv cs.LG · 21h ago

Generative Robust Optimisation Framework

Generative Robust Optimisation (GRO) introduces a deep generative model to define uncertainty sets, capturing nonlinear correlations, asymmetry, and multimodality. A five-point evaluation framework assesses neural network-based uncertainty sets across reconstruction fidelity, distribution matching, latent regularity, robust relevance, and computational tractability, with experiments validating GRO's effectiveness in production planning and facility location problems.

arxiv arXiv cs.LG · 21h ago

Introducing Quantum Measurement Temperature to Stabilize Hybrid QNN Training

A learnable scaling parameter called Quantum Measurement Temperature (QMT) is introduced to rescale quantum measurement outputs in hybrid quantum neural networks. This approach mitigates measurement-induced logit contraction, enhancing gradient magnitude and stability during training without altering the quantum circuit or measurement operators. Experiments show improved logit separation, gradient strength, and classification accuracy in protein and image classification tasks.

arxiv arXiv cs.LG · 21h ago

Deep material network for homogenization of piezoelectric composites

A piezoelectric deep material network (PDMN) is proposed to efficiently homogenize two-phase piezoelectric composites. The framework embeds electromechanical homogenization relations into its architecture, enabling physics-informed, semi-analytical predictions with over three orders of magnitude lower computational cost than direct numerical simulation, validated on PVDF-LiNbO3 and viscoelastic-piezoelectric composites under nonlinear loading.

arxiv arXiv cs.LG · 21h ago

Stationary Robust Mean-Field Games under Model Mismatches

This paper introduces a stationary mean-field game framework that directly incorporates distributional model uncertainty into population-coupled dynamics. It establishes a robust dynamic programming principle, proves existence of a stationary robust equilibrium, and presents the first algorithm with convergence guarantees. The mean-field solution approximates finite-population equilibria and provides explicit non-asymptotic error bounds under model uncertainty.