Open weights — korshunov.ai

Topic · Open weights

PragReST is a self-supervised framework that enhances large language models' pragmatic reasoning by generating counterfactual reasoning traces and training via supervised fine-tuning and reinforcement learning. It outperforms baseline models on four pragmatic benchmarks, improving Qwen3-8B and Qwen3-14B by 5.37% and 5-5.50% accuracy respectively, and maintains strong performance on general-knowledge and mathematical reasoning tasks.

arxiv arXiv cs.CL · 7d ago

Misfired Alignment in LLMs: A Quantitative Study

A new study introduces VETO, a benchmark of 2,032 BBQ-derived contrastive pairs, to quantify misfired alignment in large language models. It defines the Misfired Alignment Rate (MAR) and finds that all benchmarked LLMs exhibit MARs between 4.7% and 18.9%, while human participants achieve 0%. The research shows alignment cues can amplify these failures, with evidence suppression occurring in late layers of models and emerging after instruction training.

github llama.cpp · 7d ago

Metal backend adds f16 and bf16 support for concat operator

The Metal backend in llama.cpp has been extended to support f16 and bf16 tensor types for the concat operator, in addition to existing f32 and i32 support. This update includes specialized kernel templates, updated pipeline getters, and improved type-based kernel dispatch, with assistance from pi:llama.cpp/Qwen3.6-27B.

arxiv arXiv cs.CL · 8d ago

RubricsTree: Scalable Evaluation Framework for Personal Health Agents

RubricsTree introduces a hierarchical taxonomy of over 100 clinically-verifiable Boolean rubrics, evolved from 4,000 real user queries via human-in-the-loop curation. It enables scalable, expert-aligned evaluation of personal health agents by dynamically routing queries to relevant rubrics. It outperforms baseline methods in alignment and context sensitivity, and yields model performance gains of up to 66% on HealthBench.

arxiv arXiv cs.AI · 8d ago

RubricsTree: Scalable Evaluation Framework for Personal Health Agents

RubricsTree introduces a hierarchical taxonomy of over 100 clinically-verifiable Boolean rubrics, evolved from 4,000 real user queries via human-in-the-loop curation. It enables scalable, expert-aligned evaluation of personal health agents by dynamically routing queries to relevant rubrics. It outperforms baseline methods in alignment and context degradation detection, and yields model performance gains of up to 66% on HealthBench.

media r/LocalLLaMA · 9d ago

GLM-5.2 crosses 80% on Terminal-Bench

GLM-5.2 is the first open-weights model to achieve 80% accuracy on Terminal-Bench and outperforms all other available open models. It also surpasses Gemini, positioning it as a frontier-level model at a significantly lower cost.

arxiv arXiv cs.CL · 9d ago

LOGOS: A General-Purpose Generative Model for Natural Sciences

LOGOS is a unified generative language model that represents scientific objects and their interactions as token sequences in a shared grammar. It achieves consistent or superior performance across diverse natural science tasks, demonstrating the feasibility of a single model serving multiple domains. The model scales positively with parameter count, and its design suggests that AI for Science should align deeply with large language models through shared architectures and training.

arxiv arXiv cs.CL · 7d ago

TW-LegalBench: Evaluating LLMs on Taiwanese Law

TW-LegalBench introduces a benchmark using Taiwan's public legal corpus to assess large language models' performance in Taiwanese law. It includes 16,000+ multiple-choice questions, 117 open-ended essay questions with scoring rubrics, and 14,000+ judgment prediction instances. Evaluation shows top models exceed lawyer passing thresholds (11%) but fall short of judge/prosecutor levels (1-2%), and struggle with precise legal article citations in sentencing predictions.

arxiv arXiv cs.CL · 7d ago

G-IdiomAlign: Gloss-Pivoted Benchmark for Cross-Lingual Idiom Alignment

G-IdiomAlign introduces a gloss-pivoted benchmark using English glosses from Wiktionary to anchor idioms. It includes controlled multiple-choice equivalence and gloss-contrastive generation protocols, showing that glosses improve performance in semantic alignment, though results remain modest, indicating significant potential for improvement in cross-lingual idiom generation.

arxiv arXiv cs.AI · 7d ago

CADE: Direct Timestep Embedding for Time-Series Question Answering

CADE introduces direct timestep embedding and contrastive alignment to preserve metric structure in time-series data. By mapping each timestep directly into LLM embedding space, it avoids tokenization bottlenecks and outperforms existing LLM baselines on six TSQA tasks.

arxiv arXiv cs.AI · 7d ago

G-IdiomAlign: Gloss-Pivoted Benchmark for Cross-Lingual Idiom Alignment

G-IdiomAlign introduces a gloss-pivoted benchmark using English glosses from Wiktionary to anchor idioms. It includes controlled multiple-choice equivalence and gloss-contrastive generation protocols, showing that glosses improve performance in embedding-based semantic alignment, though results remain modest, indicating significant potential for improvement in cross-lingual idiom generation.

arxiv arXiv cs.AI · 7d ago

ARIADNE: Agnostic Routing for Inference-time Adapter Selection

ARIADNE enables dynamic, training-free adapter selection at inference time by using centroids from adapter training data embeddings. It selects the most appropriate adapter based on proximity in latent space, without requiring access to adapter internals or additional training, and achieves 89.7% average selection accuracy across 44 NLP tasks.
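
The centroid-based routing described above can be illustrated with a minimal sketch. This is not the paper's implementation: the summary only says adapters are chosen by proximity in latent space, so the use of cosine similarity, the function names (`build_centroids`, `select_adapter`), and the two-dimensional toy embeddings are all assumptions made for illustration.

```python
import math

def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def build_centroids(training_embeddings):
    """Map each adapter name to the centroid of its training-data embeddings."""
    return {name: centroid(vecs) for name, vecs in training_embeddings.items()}

def select_adapter(query_embedding, centroids):
    """Pick the adapter whose centroid is closest to the query.

    Assumption: closeness is measured by cosine similarity; the source
    does not state which proximity measure is used.
    """
    return max(centroids, key=lambda name: cosine(query_embedding, centroids[name]))

# Hypothetical toy embeddings (real ones would come from an encoder model).
toy_embeddings = {
    "sentiment": [[1.0, 0.1], [0.9, 0.0]],
    "translation": [[0.0, 1.0], [0.2, 0.9]],
}
centroids = build_centroids(toy_embeddings)
chosen = select_adapter([0.8, 0.2], centroids)
print(chosen)
```

Because the router only needs one stored vector per adapter, nothing is trained and the adapter weights are never inspected, which matches the training-free, internals-agnostic framing in the summary.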

blog Simon Willison · 7d ago

GLM-5.2 is the leading open weights model on the Artificial Analysis Intelligence Index

GLM-5.2, a 753B-parameter text-only model from Z.ai, is now the top open weights model on the Artificial Analysis Intelligence Index, outperforming MiniMax-M3, DeepSeek V4 Pro, and Kimi K2.6. It features a 1 million token context window and ranks second on the Code Arena WebDev leaderboard, despite lacking image input capabilities.

media r/LocalLLaMA · 7d ago

Gemma 4 E2B runs at 255 tok/s in browser using WebGPU

Gemma 4 E2B achieves 255 tokens per second in-browser on an M4 Max using WebGPU kernels. The demo and kernels are now available on Hugging Face for public use.

media r/LocalLLaMA · 8d ago

Local models went from mostly useless to actually useful in one year

Local models transitioned from being primarily privacy-focused toys to practical tools for coding, private document management, and local workflows within a year. While they still fall short of replacing top closed models for complex tasks requiring planning and error correction, the overall improvement in usability and performance is evident.

arxiv arXiv cs.LG · 8d ago

Baseline Evaluation of Open-Source LLMs for Multi-Label ATT&CK Classification

A ground-truth dataset of 2,076 human-annotated sentences from 83 complex CTI reports was constructed and mapped to 114 ATT&CK techniques with κ = 0.68 inter-annotator agreement. Seven open-source LLMs ranging from 8B to 236B parameters were evaluated, achieving a maximum micro-averaged F1 score of 0.22. Parameter size showed a statistically significant positive correlation with F1 score, while prompt strategy and temperature did not yield significant improvements, indicating current open-source LLMs are insufficient for production-grade ATT&CK classification.
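
For readers unfamiliar with the metric, micro-averaged F1 in a multi-label setting pools true positives, false positives, and false negatives over all sentences before computing a single score. The sketch below shows that standard definition on made-up data; the function name `micro_f1` and the three toy sentences are illustrative assumptions, not the study's code or dataset.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 over parallel lists of label sets.

    Counts are pooled across all examples, so frequent labels
    weigh more than rare ones.
    """
    tp = fp = fn = 0
    for gold_labels, pred_labels in zip(gold, predicted):
        tp += len(gold_labels & pred_labels)   # labels correctly predicted
        fp += len(pred_labels - gold_labels)   # predicted but not annotated
        fn += len(gold_labels - pred_labels)   # annotated but missed
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

# Toy example: three sentences with ATT&CK technique IDs as labels.
gold = [{"T1059", "T1566"}, {"T1003"}, {"T1071"}]
predicted = [{"T1059"}, {"T1003", "T1055"}, set()]

score = micro_f1(gold, predicted)
print(round(score, 4))  # 2 TP, 1 FP, 2 FN -> 4 / 7
```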

arxiv arXiv cs.CL · 8d ago

LLMs Predict Dementia and Depression from Clinical Speech

A study uses open-weight large language models to assess dementia and depression severity from clinical interviews. LLMs achieve accurate zero-shot depression prediction (MAE 0.60) and improved dementia assessment with feature extraction (MAE 0.78), reducing errors by up to 35%. Pause-enriched transcripts match human transcriptions, supporting automated screening pipelines for neuropsychiatric disorders.
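
As a reminder of what the reported figures measure, the sketch below computes mean absolute error (MAE) and the relative error reduction between two sets of predictions. The severity scores are invented toy values and the helper names are hypothetical; nothing here reproduces the study's data or pipeline.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted scores."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def relative_reduction(baseline_error, improved_error):
    """Fraction by which the error dropped relative to the baseline."""
    return (baseline_error - improved_error) / baseline_error

# Toy severity ratings (illustrative only).
true_scores = [1, 2, 3, 0]
baseline_predictions = [2, 2, 1, 1]
improved_predictions = [1, 2, 2, 1]

baseline_mae = mean_absolute_error(true_scores, baseline_predictions)
improved_mae = mean_absolute_error(true_scores, improved_predictions)
reduction = relative_reduction(baseline_mae, improved_mae)
print(baseline_mae, improved_mae, reduction)
```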

arxiv arXiv cs.AI · 8d ago

IUU+DB: LLM-Driven Database for Illegal Fishing and Supply Chain Crimes

IUU+DB is a large language model-driven system that tracks illegal, unreported, and unregulated fishing, seafood fraud, and labor abuse. It extracts key data elements from diverse documents, classifies relevant incidents, and enables trend analysis to identify geographic and behavioral hotspots. The system supports research, risk assessments, and policy enforcement in fisheries and supply chains.

arxiv arXiv cs.AI · 8d ago

Stanford EDGAR Filings Dataset Released

Stanford introduces SEFD, an open, layout-faithful reconstruction of SEC filings into MultiMarkdown. The 152B-token SEFD-v1 dataset enables financial language modeling and includes benchmarks for forecasting and table transcription, with less than 0.1% overlap with Common Crawl.

arxiv arXiv cs.CL · 8d ago

LLM-Generated Stories Show Low Diversity

Large language models produce narratives that are more similar to each other than human-written stories. Frontier models converge on a generic narrative pattern, lacking the diversity found in human-authored stories. Common techniques like negative prompting and temperature scaling do not significantly reduce this homogeneity.

PragReST: Self-Reinforcing Counterfactual Reasoning for Pragmatic Language Understanding