Alibaba (Qwen) — korshunov.ai

Lab · Alibaba (Qwen)

Qwen-RobotManip, a Vision-Language-Action foundation model, enables large-scale training through unified alignment across representation, motion, and behavior. It uses open-source data to build a 38,100-hour pretraining corpus and demonstrates emergent generalization, outperforming prior state-of-the-art models in out-of-distribution settings and ranking first in RoboChallenge with a 20% relative improvement on real-robot platforms.

arxiv arXiv cs.AI · 8d ago

RubricsTree: Scalable Evaluation Framework for Personal Health Agents

RubricsTree introduces a hierarchical taxonomy of over 100 clinically verifiable Boolean rubrics, evolved from 4,000 real user queries via human-in-the-loop curation. It enables scalable, expert-aligned evaluation of personal health agents by dynamically routing queries to relevant rubrics. It outperforms baseline methods in alignment and context degradation detection, and delivers model performance gains of up to 66% on HealthBench.

arxiv arXiv cs.CL · 8d ago

Visuals Lie, Consistency Speaks: Disentangling Spatial Attention from Reliability in Vision-Language Models

A study challenges the assumption that visual attention signals reliability in vision-language models. It finds near-zero correlation between spatial attention and accuracy, showing instead that self-consistency across reasoning paths is a stronger predictor of truth. Reliability is better explained by generation dynamics and internal state distributions, not visual attention patterns.
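As a rough illustration of the self-consistency signal mentioned above, the sketch below scores agreement across several sampled answers to one question. The function name, the normalization step, and the sample answers are assumptions made for illustration; the study's actual metric may differ.

```python
from collections import Counter


def self_consistency(answers):
    """Return (majority_answer, agreement_fraction) for a list of sampled answers."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Light normalization so that "Red" and "red " count as the same answer.
    counts = Counter(a.strip().lower() for a in answers)
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(answers)


# Hypothetical final answers from five independently sampled reasoning paths.
samples = ["Red", "red", "blue", "red ", "green"]
answer, agreement = self_consistency(samples)
print(answer, agreement)
```

A higher agreement fraction is treated here as a proxy for reliability; no attention maps are involved.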

arxiv arXiv cs.CL · 8d ago

OPD-Evolver: On-Policy Distillation for Holistic Agent Evolving

OPD-Evolver introduces a slow-fast co-evolution framework that enables agents to select, act on, and reuse experience through on-policy self-distillation. It outperforms existing memory-based and training-based methods by up to 11.5% and 5.8%, respectively, and demonstrates that it can challenge large-scale models such as Qwen3.5-397B-A17B and Step-3.5-Flash.

arxiv arXiv cs.CL · 8d ago

EnvRL: Leveraging Environment Dynamics in Agentic RL

EnvRL introduces a framework that enhances agentic reinforcement learning by incorporating environment dynamics through state prediction and inverse dynamics objectives. It achieves significant gains in success rates on long-horizon benchmarks, improving Qwen-2.5-1.5B-Instruct performance from 72.8% to 77.4% on ALFWorld and from 56.8% to 67.0% on WebShop when trained with GRPO.
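To put the reported success rates in perspective, the short calculation below converts them into absolute gains (percentage points) and relative gains. Only the four percentages come from the summary; the helper name is an assumption.

```python
def gains(before, after):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return absolute, relative


# Success rates quoted in the summary above (Qwen-2.5-1.5B-Instruct with GRPO).
reported = {"ALFWorld": (72.8, 77.4), "WebShop": (56.8, 67.0)}

for name, (before, after) in reported.items():
    absolute, relative = gains(before, after)
    print(f"{name}: +{absolute:.1f} points, +{relative:.1f}% relative")
# ALFWorld: +4.6 points, +6.3% relative
# WebShop: +10.2 points, +18.0% relative
```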

arxiv arXiv cs.CL · 8d ago

LLM-Designed Training Environment for RL with Multi-Agent Reasoning

The LLM-as-Environment-Engineer framework uses LLMs to automatically redesign training environments in reinforcement learning by analyzing failure trajectories and contextual data. On the MAPF-FrozenLake testbed, it outperforms larger proprietary LLMs and fixed-environment baselines, with Qwen3-4B achieving the strongest aggregate performance. Analysis shows that failure evidence and preserved working configurations are key, and the current RL checkpoint performs better than the base model as an environment engineer.

arxiv arXiv cs.CL · 9d ago

Language Models Encode Value of Their Current Trajectory

Qwen3-8B internally tracks the value of its current trajectory, defined as the likelihood of achieving its goals. This 'value' axis distinguishes confidence levels, backtracking behavior, and code correctness, and shows that preference optimization boosts confidence in rewarded behaviors. The model assigns low value to politically sensitive queries post-training, and fine-tuning increases confidence within specific domains.

media r/LocalLLaMA · 9d ago

HalBench Tests 29 Open Source Models on Sycophancy and Hallucination

HalBench evaluates 29 open-source LLMs on a custom benchmark for sycophancy and hallucination. Qwen 3.6 and Gemma 4 outperform larger models, with Qwen 3.6 achieving a 36.6% pushback rate, higher than GPT-5.4 and Gemini 3.1 Pro. Model size does not correlate with honest responses in these results, suggesting that architecture and training data matter more than parameter count.

arxiv arXiv cs.CL · 8d ago

MLLP-VRAIN's Simultaneous Speech Translation Submission for IWSLT 2026

The MLLP-VRAIN group submits a cascaded SimulST system using Parakeet and Qwen 3.5 models with adaptive black-box policies. For the En→De, En→It, and En→Zh directions, it employs ASR word-boosting and RAG with pre-translated exemplars in the new context track, achieving a +5.82 XCOMET-XL improvement on MCIF En→De and an additional +1.03 gain via context integration.

arxiv arXiv cs.CL · 8d ago

ChLogic: Testing Logical Reasoning Robustness in Chinese Expressions

ChLogic evaluates how well large language models maintain logical reasoning when English logical structures are expressed in Chinese. It reveals a persistent English-Chinese performance gap, with back-translation improving results on general items but harming performance on difficult problems. The benchmark highlights the impact of surface realization, translation artifacts, and model-specific behaviors on multilingual reasoning.

arxiv arXiv cs.CL · 8d ago

Fine-tuning LLMs for Passive Depression Severity Estimation

A model fine-tuned from Qwen3.5-27B predicts PHQ-9 scores from AI dialogue transcripts, achieving MAE = 2.6 and AUC = 0.91 at the PHQ-9 >= 10 threshold. It maintains AUC > 0.87 across all PHQ-9 severity levels, demonstrating accurate depression severity estimation in real-world conversations without self-reporting.
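For readers unfamiliar with the two metrics, the sketch below shows how MAE and AUC at the PHQ-9 >= 10 cutoff can be computed. The scores are made-up toy values, not data from the paper, and the function names are assumptions.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(labels, scores):
    """AUC as the probability that a positive case outranks a negative one.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l]
    neg = [s for l, s in zip(labels, scores) if not l]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy PHQ-9 scores, invented for illustration only.
true_scores = [2, 5, 9, 11, 14, 20]
pred_scores = [4, 3, 11, 10, 15, 17]

# Binarize the ground truth at the PHQ-9 >= 10 threshold named in the summary.
labels = [s >= 10 for s in true_scores]

mae = mean_absolute_error(true_scores, pred_scores)
auc = roc_auc(labels, pred_scores)
print(f"MAE={mae:.2f}, AUC={auc:.2f}")
```

Here the predicted score itself serves as the ranking signal for the AUC, which is one common choice when a regressor is evaluated at a clinical cutoff.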

blog Simon Willison · 9d ago

Georgi Gerganov praises Qwen3.6-27B for coding tasks

Georgi Gerganov confirms that Qwen3.6-27B is highly capable for coding tasks, noting its daily use on local hardware like M2 Ultra and RTX 5090. He describes using a minimal pi agent with a short system prompt to align it with his workflow, highlighting its utility for maintaining open-source projects.

media r/LocalLLaMA · 9d ago

Qwen Robot Suite Announced

Aliyun has launched the Qwen Robot Suite, a new set of AI-powered robotic tools. The suite aims to enable developers to build and deploy intelligent robots with enhanced capabilities.

media r/LocalLLaMA · 9d ago

Qwen3.6 27B Quantization Performance Test Results

A test comparing Q8 and IQ3 XXS turbo4 quantized versions of Qwen3.6 27B shows that Q8 excels in API safety and input sanitization, while IQ3 XXS turbo4 performs better in thread management and modular code design. The model recommends merging both approaches: using Q8 for initial launch protection and IQ3 XXS for atomic writes and thread lifecycle, forming a combined Phase 1 foundation.

media r/LocalLLaMA · 9d ago

Be wary of Qwen/Claude distillations - they're often worse than the base model

Distillations of Qwen and Claude models, such as Qwen 3.6 distilled with only 4,000 samples, rarely improve performance and often degrade quality. These models may exhibit a more 'Opus-like' style but fail to transfer actual capability, with some showing hallucinations and slower response times compared to the base models, as demonstrated in testing and user reports.

arxiv arXiv cs.LG · 9d ago

Hyperball Optimization for Faster Language Model Training

Hyperball is a simple optimizer wrapper that sets fixed Frobenius norms for weight matrices and their updates. It improves training speed and learning rate transfer in large models, achieving a 20-30% token-equivalent speedup over weight decay baselines on models of up to 1.2B parameters.
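The idea of fixing Frobenius norms for both a weight matrix and its update can be sketched as follows. This is only a reading of the one-sentence description above, not the paper's algorithm: the function names, the order of operations, and the norm values are all assumptions.

```python
import math


def frobenius_norm(matrix):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


def rescale(matrix, target_norm):
    """Return a copy of matrix scaled to the given Frobenius norm."""
    norm = frobenius_norm(matrix)
    if norm == 0.0:
        return [row[:] for row in matrix]  # nothing to scale
    scale = target_norm / norm
    return [[x * scale for x in row] for row in matrix]


def fixed_norm_step(weights, update, weight_norm, update_norm):
    """Hypothetical wrapped step: fixed-size update, then fixed-size weights."""
    step = rescale(update, update_norm)
    moved = [
        [w - s for w, s in zip(w_row, s_row)]
        for w_row, s_row in zip(weights, step)
    ]
    return rescale(moved, weight_norm)


weights = rescale([[1.0, 2.0], [3.0, 4.0]], 1.0)
update = [[0.5, -0.5], [0.25, 0.0]]  # stands in for a base optimizer's proposed update
new_weights = fixed_norm_step(weights, update, weight_norm=1.0, update_norm=0.1)
print(frobenius_norm(new_weights))
```

In this sketch the ratio `update_norm / weight_norm` plays the role of a step size that does not depend on the scale of the raw update.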

media r/LocalLLaMA · 9d ago

Qwable-v1 Released as Distillation of Claude Fable-5

Qwable-v1, an open-weight model distilled from Anthropic's Fable-5, is now publicly available on Hugging Face. It captures 4,659 cleartext agentic-coding traces from Fable-5's public corpus and emits properly formatted <tool_use> XML calls to Claude-flavored tools, reflecting the original tool surface in its weights.

media r/LocalLLaMA · 9d ago

vLLM releases new streaming parser for Qwen3+ in nightly

vLLM has introduced a new streaming parser for Qwen3+ available in its nightly build, addressing issues like mid-turn stopping and failed streaming tool calls due to chunk boundaries. The update reportedly resolves these problems in limited testing, improving reliability for agentic workflows.

media r/LocalLLaMA · 8d ago

Someone awhile ago did a quant shootout for Qwen3.6

A Reddit post features a quantization performance comparison for Qwen3.6, with a user noting they performed rough mathematical calculations on the results. The post includes a visual chart and links to the original image and comments.

media r/LocalLLaMA · 8d ago

Quantitative Comparison of Qwen3.6 Model Performance

A Reddit post presents a quantitative comparison of Qwen3.6's performance in reduced-precision (quantized) versions. The author notes a rough calculation suggesting Qwen3.6 maintains strong performance even at lower bit depths, though the math is described as shoddy and not rigorously validated.

Qwen-RobotManip Achieves Generalization in Robotic Manipulation
