All articles — korshunov.ai

Tapered Language Models: Improving Performance via Depth-Aware Capacity Allocation

Modern language models typically allocate parameters uniformly across identical layers, despite evidence that later layers primarily refine the residual stream rather than transform it. To address this asymmetry, researchers investigated whether parameter capacity should vary by depth under a fixed budget. Controlled experiments demonstrated that allocating more capacity to earlier layers and less to later layers improves perplexity compared to uniform baselines, while the reverse allocation degrades performance. Building on these results, the authors introduce Tapered Language Models (TLMs), an architectural principle where parameter-bearing components are monotonically tapered across depth. MLPs serve as the primary site for this instantiation due to their dominance in parameter count and clear width axis. The study tested tapering via a smooth cosine schedule across three model scales and four architectures, including Transformer, Gated Attention, Hope-attention, and Titans. Results show that TLMs consistently improve perplexity and downstream benchmark performance over uniform baselines without additional compute costs. These findings establish depth-aware capacity allocation as a simple, architecture-agnostic design lever for language models.
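
As a rough illustration of the tapering idea, the sketch below derives per-layer MLP widths from a half-cosine schedule and rescales them so the total matches a uniform baseline. The summary does not give the exact schedule, so the function name `cosine_taper_widths`, the `ratio` parameter (last-layer width relative to first-layer width), and the rescaling step are all assumptions made for this example.

```python
import math

def cosine_taper_widths(num_layers, uniform_width, ratio=0.5):
    """Return per-layer MLP widths that shrink monotonically with depth.

    Hypothetical schedule (not taken from the paper): the relative width follows
    a half cosine from 1.0 at the first layer down to `ratio` at the last layer.
    Widths are then rescaled so that their sum equals num_layers * uniform_width,
    which keeps the MLP parameter count fixed when it is linear in the width.
    """
    if num_layers == 1:
        return [float(uniform_width)]
    raw = []
    for i in range(num_layers):
        t = i / (num_layers - 1)                       # depth fraction in [0, 1]
        cos_term = 0.5 * (1.0 + math.cos(math.pi * t))  # goes from 1 down to 0
        raw.append(ratio + (1.0 - ratio) * cos_term)
    scale = num_layers * uniform_width / sum(raw)
    return [r * scale for r in raw]

# Example: 12 layers, uniform baseline width 4096 (both values are illustrative).
widths = cosine_taper_widths(num_layers=12, uniform_width=4096)
```

In a real model the widths would likely be rounded to hardware-friendly multiples, which makes the budget match approximate rather than exact.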

arxiv arXiv cs.AI · 6h ago

NVIDIA Nemotron Challenge: String Matching and Backtracking for Bit Manipulation Puzzles

This paper details algorithmic innovations developed for the NVIDIA Nemotron Model Reasoning Challenge, specifically targeting bit manipulation puzzles where models must deduce hidden logical rules. To address the combinatorial explosion of bitwise operations and LLM hallucinations, the authors abandon arithmetic logic in favor of string similarity and structured search. The core contribution reframes logic-gate deduction as a base-selection task using minimal bit flips to isolate primitive transformations. A backtracking depth-first search process is formalized to test candidates, detect logical collisions, and perform robust error recovery. Additionally, the method employs bit tokenization and interactive reasoning supervised fine-tuning with dynamic masking to simulate oracle feedback. Evaluated on these puzzles, the approach achieved over 96% validation accuracy. This performance secured the highest result in the category and a seventh-place finish in the overall contest.

arxiv arXiv cs.AI · 6h ago

PsyBridge: A Hybrid Framework for Multi-Dimensional Mental Health Assessment

The study introduces PsyBridge, a hybrid intelligent framework designed to address the limitations of isolated screening instruments in mental health assessment. This system integrates clinically validated tools like PHQ-9 and GAD-7 with cognitive evaluation and personality profiling within a unified architecture. A modular design employing a weighted aggregation mechanism generates interpretable risk classifications and recommendations for users. To evaluate performance, researchers constructed a semi-synthetic dataset comprising 500 patient profiles based on clinically grounded score distributions. Experimental results show that PsyBridge achieves an overall accuracy of 0.84, outperforming standalone PHQ-9 and GAD-7 assessments. The framework also demonstrates improvements in precision, recall, and F1-score compared to existing methods. Sensitivity analysis confirms that integrating cognitive and personality components stabilizes classification performance and reduces prediction inconsistencies. These findings suggest PsyBridge offers a scalable approach for AI-assisted decision support in digital healthcare environments.

arxiv arXiv cs.AI · 6h ago

Open Problem: Is AdamW Effective Under Heavy-Tailed Noise?

AdamW serves as the standard optimizer for training large language models, yet its theoretical foundation remains largely confined to finite-variance regimes. This gap is significant because empirical evidence suggests that stochastic gradient noise during LLM pretraining typically exhibits heavy-tailed characteristics. Recent studies have demonstrated that sign-based optimizers like Lion and Muon achieve sharp convergence rates under heavy-tailed conditions, while AdaGrad also converges in this setting. However, rigorous convergence theory for AdamW has not yet been established within these heavy-tailed assumptions. The authors pose an open problem regarding whether AdamW can converge under the same heavy-tailed assumptions or if its second-moment accumulator creates a genuine obstruction. To address this, they formulate a positive weighted-metric benchmark and provide a corridor lower-bound mechanism. This mechanism illustrates how denominator memory in AdamW can effectively hide large gradients, potentially impacting its performance.
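
One way to picture "denominator memory" is with a stripped-down Adam-style update. The sketch below keeps only the second-moment moving average and the normalized step; it omits momentum, bias correction, and weight decay, and it is not the corridor construction from the paper. The function name `normalized_steps` and the gradient values are hypothetical.

```python
import math

def normalized_steps(grads, beta2=0.999, eps=1e-8, v0=0.0):
    """Return g_t / (sqrt(v_t) + eps) for an Adam-style second-moment average.

    v_t = beta2 * v_{t-1} + (1 - beta2) * g_t**2
    Simplification: no first-moment momentum, no bias correction, no weight decay.
    """
    v = v0
    steps = []
    for g in grads:
        v = beta2 * v + (1.0 - beta2) * g * g
        steps.append(g / (math.sqrt(v) + eps))
    return steps

# Start from v0 = 1.0, i.e. a history of unit-size gradients.
# A single heavy-tailed spike (1000.0) inflates the denominator, so the next
# gradient, although 10x larger than the baseline, yields a smaller step.
steps = normalized_steps([1.0, 1000.0, 10.0], v0=1.0)
```

With `beta2 = 0.999` the inflated denominator decays over roughly a thousand steps, which is the memory effect in question.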

arxiv arXiv cs.AI · 7h ago

AIR: Adaptive Interleaved Reasoning with Code in MLLMs

This paper introduces AIR, a method that empowers multimodal large language models with adaptive interleaved reasoning capabilities through extended reinforcement learning training on code-augmented complex numerical computation tasks. The authors address the limitation of existing literature, which primarily focuses on tool-use within vision-perception tasks and relies on predefined heuristics incapable of handling numerical computations. To solve this, they propose a comprehensive three-component solution including a two-stage cold-start data construction pipeline, data filtering strategies for reinforcement learning dataset curation, and an adaptive tool-invocation strategy leveraging a group-constrained reward function. Extensive experiments demonstrate that after reinforcement learning training with this reward function, performance improves by an average of 6.1 percentage points on evaluation benchmarks. Specifically, the accuracy for interleaved reasoning samples increases by 9.9 percentage points, while the overall success rate of tool-use exceeds 95 percent. The researchers provide their data and code for public access at a specified GitHub repository.

arxiv arXiv cs.AI · 7h ago

Semantic Browsing: Controllable Diversity for Image Generation

Modern text-to-image models often suffer from diversity collapse despite high fidelity. The authors introduce Semantic Browsing to enable controlled diversity through structured image galleries. This method allows users to navigate meaningful axes of variation rather than incidental noise. The approach exploits the decoupling of semantic decision-making and pixel generation in recent models. Diversity is induced directly at the text level using rich textual representations. A Vision Language Model operates on full scene context within an agentic workflow. This workflow explicitly enforces structured variation attuned to the original prompt. The result is a navigable design space with interpretable semantic decisions.

arxiv arXiv cs.AI · 7h ago

CoorDex: Coordinating Body and Hand Priors for Continuous Dexterous Humanoid Loco-Manipulation

The authors introduce CoorDex, a learning pipeline that enables high-degree-of-freedom dexterous loco-manipulation on moving humanoids. This approach converts high-dimensional body and hand control into coordinated latent residual control, overcoming the limitations of traditional stop-and-go methods. The system trains privileged motion tracking teachers from simulated demonstrations and distills them into proprioception-conditioned latent priors. These frozen priors serve as the action space for downstream residual reinforcement learning via a policy that composes task context with separate body-hand residual heads. CoorDex allows a Unitree G1 humanoid equipped with a 20-DoF WUJI hand to perform complex tasks while in motion, such as non-stop bottle grasping and fridge door opening. Ablation studies demonstrate that joint-space PPO and monolithic latent prediction fail under similar reward budgets, whereas the proposed latent-prior interface ensures trainability for contact-rich manipulation.

arxiv arXiv cs.LG · 7h ago

Encoder-Decoder Manifold Alignment for Idempotent Generation

Recent learning paradigms aim to enforce idempotency in generative models by ensuring repeated application leaves samples unchanged on the target data manifold. However, many existing approaches fail to achieve exact fixed points, resulting in instability and drift during repeated applications. The authors identify a geometric mismatch between encoder and decoder manifolds as the primary cause of this failure. To resolve this, they propose a training framework that explicitly aligns the geometry of both components to learn consistent representations of the same underlying data manifold. This alignment encourages stable projections and significantly reduces idempotency error compared to prior methods. Empirical results demonstrate that the approach consistently regenerates identical outputs under repeated application for both image generation and editing tasks. Furthermore, enforcing this type of idempotency improves identity preservation and information stability in generative models.

arxiv arXiv cs.LG · 7h ago

Manifold Restore Mixing Enhances Protein Representation Learning

Data augmentation improves protein representation learning but often disrupts structural integrity or reduces diversity. The authors identify these structure defects and performance degradation issues in existing methods. They propose Manifold Restore Mixing (MRM) to restore lost structural information while introducing diverse variations. MRM mixes hidden representations of original and augmented data, inspired by manifold mixup techniques. A sample difficulty scheduler adjusts the beta distribution to provide progressively challenging samples during training. Experiments on various backbones and downstream tasks demonstrate the method's effectiveness and generalization. The implementation is available at https://github.com/KingGugu/MRM.

arxiv arXiv cs.LG · 7h ago

Entropy-Guided Boundary Supervision for Breast Ultrasound Segmentation

This study introduces an entropy-guided boundary supervision method to address boundary leakage and false-positive activations in breast ultrasound segmentation. The proposed loss function scales contour penalties by per-pixel predictive entropy and ground-truth maps, focusing gradient emphasis on uncertain lesion margins. Evaluated on the BUSI dataset, the method preserved lesion segmentation quality with a mean Dice score of 0.7624, statistically indistinguishable from the baseline. However, it significantly improved specificity by reducing false-positive activations on no-lesion images from 19 of 20 to 5 of 20. A post-hoc spatial temperature scaling step further reduced the expected calibration error from 0.0201 to 0.0095 without altering segmentation masks. These results demonstrate that entropy-guided supervision and spatial calibration function as complementary refinements within a U-Net framework.
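
The claim that calibration leaves the masks untouched follows from a simple property: dividing a logit by a positive temperature never changes its sign. The sketch below checks this for a binary (lesion vs. background) setting with one temperature per pixel. The per-pixel layout, the helper names, and the numbers are assumptions for illustration, not the paper's implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def calibrate(logits, temperatures):
    """Divide each pixel logit by its own positive temperature, then apply sigmoid."""
    return [sigmoid(z / t) for z, t in zip(logits, temperatures)]

# Hypothetical pixel logits and per-pixel temperatures (all temperatures > 1).
logits = [-3.0, -0.2, 0.4, 5.0]
temps = [2.0, 1.5, 3.0, 2.5]

raw = [sigmoid(z) for z in logits]
cal = calibrate(logits, temps)

# Thresholding at 0.5 gives the same mask before and after calibration,
# while every probability moves toward 0.5 (less confident).
mask_raw = [p > 0.5 for p in raw]
mask_cal = [p > 0.5 for p in cal]
```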

arxiv arXiv cs.LG · 7h ago

Diffusion Integrated Gradients: Controllable Path Generation for Flexible Feature Attribution

The authors propose Diffusion Integrated Gradients (DiffIG), a novel method that reformulates path generation as a conditional generative modeling problem to address limitations in existing attribution techniques. While integrated gradients are widely used, their reliance on fixed or hand-crafted paths often results in noisy or distorted attributions. To solve this, DiffIG trains a diffusion model to learn a distribution over paths derived from a Stick-Breaking Process. The method then employs guided sampling to allow for the embedding of user guidance during the inference-time sampling procedure. This approach enables flexible and controllable feature attribution by treating path selection as a generative task rather than a static choice. Experimental results demonstrate that DiffIG quantitatively matches or outperforms existing path-based methods in terms of attribution quality. Furthermore, the generated explanations are shown to be perceptually aligned with human expectations. The work introduces a new generative perspective for Explainable Artificial Intelligence that supports dynamic control over explanation paths.

arxiv arXiv cs.LG · 7h ago

First Finite-Time Analysis of Classical Adam for Nonsmooth Nonconvex Optimization

This study presents the first finite-time convergence analysis for the classical Adam optimizer, specifically addressing its behavior in nonsmooth nonconvex optimization settings. Previous research largely ignored Adam's bias-correction term or required extra algorithmic modifications like clipping, leaving the original method's guarantees unclear. The authors utilize the Online-to-Nonconvex Conversion framework to prove that a randomly scaled learning rate ensures a convergence rate of $1/T^{\frac{2}{13}}$. This theoretical result is significant because it applies to the modern heavy-tailed noise regime, which more closely reflects practical training conditions. Furthermore, the analysis establishes convergence under the parameter choice where $\beta_1=\beta_2$, aligning with recent empirical observations. These findings provide a rigorous explanation for Adam's effectiveness in real-world scenarios that were previously inadequately captured by smooth optimization theories.

arxiv arXiv cs.LG · 7h ago

Boundary-Aware Curriculum RL Expands LLM Reasoning Capacity Beyond Base Model Limits

The authors argue that mainstream Reinforcement Learning with Verifiable Rewards (RLVR) often fails to expand the reasoning capacity of large language models, merely reallocating probabilities among existing trajectories. To address this limitation, they introduce a boundary-aware Curriculum RL approach designed to move beyond the base model's empirical reasoning capacity boundary. The method first utilizes pass@k sampling to identify the current reasoning limits and then applies targeted teacher guidance to examples near or beyond that boundary. Reinforcement learning is subsequently used to consolidate these newly introduced reasoning patterns across Qwen, Llama, and DeepSeek base models. Experimental results demonstrate significant improvements in both pass@1 scores and pass@256 scores, which serve as a proxy for the reasoning capacity boundary. Specifically, average pass@256 improved by 9.8 percentage points over the base models and by 10.3 percentage points over Vanilla RLVR. These findings suggest that this curriculum-based strategy offers a scalable route for continuously improving LLM reasoning capabilities.
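
For readers unfamiliar with pass@k, the sketch below implements the widely used unbiased estimator computed from n sampled solutions of which c are correct. The summary does not state which estimator the authors used, so treating it as this one is an assumption.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples is correct.

    n: total samples drawn for a problem
    c: number of those samples that passed verification
    k: budget being evaluated (k <= n)
    Uses 1 - C(n - c, k) / C(n, k), the chance that a random size-k subset
    of the n samples contains at least one correct sample.
    """
    if k > n:
        raise ValueError("k must not exceed n")
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A problem with zero correct samples out of 256 contributes 0 to pass@256, which is why moving that metric is read as reaching problems the base model could not solve at that sampling budget.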

arxiv arXiv cs.LG · 8h ago

Attention Sinks and Collapse Are Universal Consequences of Content-Based Routing

The study demonstrates that attention sinks, representation collapse, and norm stratification are not unique to transformer architectures but are inherent consequences of content-based routing under a fixed similarity metric. It establishes an identity showing softmax attention functions as Boltzmann-weighted aggregation over Euclidean distances with constant key norms, rendering it blind to key magnitude due to the omission of a specific norm term. This framework predicts that any router utilizing a metric ill-matched to its representations will compensate by concentrating routing and collapsing the routed representations. The authors validate this hypothesis across diverse models including nine pretrained transformers, graph attention networks, selective state-space models, recurrent mixers, and learned residual layers. Experimental results confirm that all tested architectures exhibit this identical signature of collapse regardless of their specific domain or structure. Furthermore, within-model ablations isolate the routing mechanism as the primary cause rather than incidental training dynamics. The onset of this phenomenon is shown to be contingent on the strength of the positional brake accompanying the content score, which can shift the effect across its range. However, the underlying mechanism remains invariant and does not require norm stratification, as routers with norm-normalized keys exhibit the same concentration behavior.
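
The link between dot-product attention and Euclidean distance can be seen from a standard algebraic identity. The derivation below is a sketch of that textbook step, not the paper's exact statement; the $1/\sqrt{d}$ scaling is omitted and the notation ($q$ for a query, $k_j$ for the $j$-th key) is assumed.

```latex
% Polarization identity for a query q and key k_j
q^\top k_j = -\tfrac{1}{2}\lVert q - k_j\rVert^2
             + \tfrac{1}{2}\lVert q\rVert^2
             + \tfrac{1}{2}\lVert k_j\rVert^2

% The query-norm term is constant in j and cancels inside the softmax
\operatorname{softmax}_j\!\left(q^\top k_j\right)
  = \frac{\exp\!\left(-\tfrac{1}{2}\lVert q - k_j\rVert^2 + \tfrac{1}{2}\lVert k_j\rVert^2\right)}
         {\sum_{l}\exp\!\left(-\tfrac{1}{2}\lVert q - k_l\rVert^2 + \tfrac{1}{2}\lVert k_l\rVert^2\right)}

% If every key has the same norm, the key-norm term cancels as well
\lVert k_j\rVert = c \ \ \forall j
  \;\Longrightarrow\;
\operatorname{softmax}_j\!\left(q^\top k_j\right)
  = \frac{\exp\!\left(-\tfrac{1}{2}\lVert q - k_j\rVert^2\right)}
         {\sum_{l}\exp\!\left(-\tfrac{1}{2}\lVert q - k_l\rVert^2\right)}
```

The last line is a Boltzmann weighting over squared Euclidean distances. The key-norm term $\tfrac{1}{2}\lVert k_j\rVert^2$ is the only place where key magnitude enters, and it drops out exactly when key norms are constant.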

media r/LocalLLaMA · 8h ago

User Reports Strong Performance of siq1 Model on Kebab Bench

A Reddit user shared results indicating that their model, siq1, performs very well on the Kebab Bench evaluation. The post demonstrates the model through a Hugging Face Space titled 'hermes-agent-zerogpu', created by AlexWortega, which the user cites as evidence of this performance. The submission was made by /u/Mysterious_Hearing14 in the r/LocalLLaMA community. The original post links to the Hugging Face interface where the model can be tested, and a video demonstration is available via a v.redd.it link.

media r/LocalLLaMA · 8h ago

Inquiry Regarding the Availability of Modern Non-Chat Completion Models

A user on the LocalLLaMA subreddit questioned whether all modern large language models are exclusively tuned for chat interactions. The inquiry specifically sought to identify any models that support bare text completion rather than conversational formats. The poster noted a difficulty in locating such models within the Hugging Face repository. This highlights a perceived gap in the availability of non-chat architectures for users requiring raw completion capabilities. The discussion reflects broader concerns about the industry's shift toward instruction-tuned and chat-oriented model designs.

arxiv arXiv cs.LG · 8h ago

No Reference-Free Generalization in Quantum Machine Learning

This study addresses the identifiability problem in quantum machine learning where training data lacks a preferred basis or reference frame. The authors formulate supervised learning without an external quantum reference frame, requiring classifiers to preserve unitary symmetries unbroken by the training data. They prove that if training states do not span the full Hilbert space, all pure states orthogonal to this span receive identical predictions. This limitation arises from missing reference information rather than state discrimination or computational constraints. The research establishes a robust version under weak symmetry breaking and shows that learning generic concepts requires exponentially many oriented training directions. Numerical illustrations visualize the resulting prediction collapse and its controlled relaxation. The results identify feature maps, measurement bases, and diverse training states as essential operational resources for generalization.

arxiv arXiv cs.LG · 8h ago

Wearable A-Mode Ultrasound Enables Whole Hand Kinematic Tracking on Microcontroller

Researchers propose a framework for robust whole-hand and wrist kinematic tracking using the wearable WULPUS platform with an A-mode ultrasound probe. The system addresses the regression of 23 degrees of freedom directly on the device, overcoming limitations of prior non-wearable systems. A compact multi-output convolutional neural network containing 11,285 parameters is employed alongside an incremental training strategy to enhance generalization. This approach reduces mean absolute error by more than 17% compared to non-incremental methods. The model is deployed on the WULPUS nRF52832 microcontroller, achieving end-to-end tracking entirely on-device. Inference consumes only 0.73 mJ with a latency of 29.1 ms. The system supports full operation within 33 mW, enabling up to 36 hours of continuous use. This method also reduces wireless bandwidth requirements by 88% compared to raw data transmission.

arxiv arXiv cs.LG · 8h ago

Null-Calibrated Conformal Selection via Target-Membership Scores

The article introduces Null-Calibrated Conformal Selection (NCCS), a method that utilizes target-membership probability scores to identify test candidates within a target region while controlling the false discovery rate. The authors argue that these membership scores provide a more natural ranking for selection tasks than conventional prediction-oriented nonconformity scores, particularly for complex targets. This distinction is critical for interval-valued, variance-driven, multimodal, or multi-condition targets where traditional scores may be misaligned with selection power. NCCS ranks test scores against confirmed non-target calibration examples to yield finite-sample valid null p-values under null exchangeability. These p-values can be combined with the Benjamini-Yekutieli procedure under arbitrary dependence or the Benjamini-Hochberg procedure under standard positive-dependence conditions. Experiments demonstrate that membership scores match conventional scores on mean-monotone targets but substantially improve performance on variance-driven targets. In rare-target regimes, NCCS trades power for finite-sample null validity, addressing issues where direct empirical-FDP thresholding can be anti-conservative.
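
The selection pipeline described above can be sketched in two steps: rank each test score against the null calibration scores to get a conformal p-value, then apply a step-up procedure. This is a minimal illustration, assuming higher scores mean "more likely in the target region". The function names and the toy scores are hypothetical, and the code is not the authors' implementation.

```python
def conformal_p_values(null_scores, test_scores):
    """p_j = (1 + #{null scores >= test score j}) / (n + 1)."""
    n = len(null_scores)
    return [(1 + sum(s >= t for s in null_scores)) / (n + 1) for t in test_scores]

def step_up_select(p_values, alpha, arbitrary_dependence=False):
    """Benjamini-Hochberg selection; Benjamini-Yekutieli if arbitrary_dependence.

    Returns the sorted indices of selected test candidates.
    """
    m = len(p_values)
    if arbitrary_dependence:
        # BY correction: shrink alpha by the harmonic number 1 + 1/2 + ... + 1/m.
        alpha = alpha / sum(1.0 / i for i in range(1, m + 1))
    order = sorted(range(m), key=lambda j: p_values[j])
    k = 0
    for rank, j in enumerate(order, start=1):
        if p_values[j] <= alpha * rank / m:
            k = rank  # keep the largest rank that passes its threshold
    return sorted(order[:k])

# Toy data: 19 confirmed non-target calibration scores and 3 test candidates.
null_scores = list(range(1, 20))
test_scores = [25, 0, 22]
p = conformal_p_values(null_scores, test_scores)
bh = step_up_select(p, alpha=0.1)
by = step_up_select(p, alpha=0.1, arbitrary_dependence=True)
```

On this toy input the smallest attainable p-value is 1/20, so the stricter BY threshold selects nothing while BH selects two candidates. It is a small-scale view of the power-for-validity trade mentioned in the summary.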

arxiv arXiv cs.LG · 8h ago

RoboMME-Interference Benchmarks Robot Memory Under Distraction

The introduction of RoboMME-Interference addresses the need for evaluating robot memory in realistic, long-context scenarios where systems must recall information from multiple sessions ago. This new cross-session benchmark is built upon the existing RoboMME framework to measure performance when robots face distractions from unrelated prior experiences. For each query episode, the benchmark constructs a session history consisting of relevant demonstrations followed by a controlled number of unrelated sessions provided as memory to Vision-Language-Action models. Researchers tested released memory-augmented variants of the π_0.5 model without modification to assess their robustness under these conditions. The results indicate that while perceptual memory variants improve success rates when no distractors are present, their accuracy decays steadily and strongly as unrelated sessions accumulate. These findings highlight a critical failure in current systems regarding long-context memory and interference resistance. The project page, videos, code, and data for this benchmark are available at https://robotmemorybench.com.