Open weights — korshunov.ai

GLM-5.2 Is The Best Open Weight Creative Writing Model

Sam Paech's Creative Writing Benchmark on EQ Bench ranks GLM-5.2 as the top open-weight model for creative writing.

media r/LocalLLaMA · 7d ago

The power of intelligence is better in the hands of the people than in the board rooms of tycoons

The PearlOS project has launched an open-source swarm intelligence platform that uses local models to handle multimodal tasks. It automatically selects and switches between top-performing models based on benchmarks, ensuring users always access the latest and most capable models without relying on closed-source systems or subscriptions.

media r/LocalLLaMA · 7d ago

GLM's founder says GLM-fable before the end of the year?

GLM's founder reportedly said on Reddit that a model referred to as GLM-fable may be released before the end of the year. The claim comes from a user discussion on the LocalLLaMA subreddit and has not been confirmed by an official announcement.

media r/LocalLLaMA · 7d ago

OSS models decisively overtook proprietary models in market share

Based on the last three months of OpenRouter data, open-source models have surpassed proprietary models in market share. The analysis points to a significant shift in usage toward open-source language models.

media r/LocalLLaMA · 7d ago

Does anyone have enough compute to make a distillation dataset from GLM5.2?

A user asks if anyone with sufficient computing resources can create a large distillation dataset of 70-1 million examples from GLM5.2. The goal is to enable better training of smaller models like Qwen3.5, benefiting the broader community.

media r/LocalLLaMA · 7d ago

LocalLLaMA proposes crowdsourced coding dataset

A community initiative suggests creating a crowdsourced coding dataset to enable local LLM development. The proposal aims to allow anyone with hardware to contribute data, with more powerful users helping to fine-tune or quantize models, thus reducing reliance on company-released models.

arxiv arXiv cs.LG · 7d ago

NeSyCat Torch: Differentiable Tensor Implementation for Neurosymbolic Learning

NeSyCat Torch provides a differentiable tensor implementation of categorical semantics for neurosymbolic learning, unifying classical, fuzzy, probabilistic, and neural systems under a single inductive truth definition. It outperforms LTN and DeepProbLog in speed and accuracy on MNIST addition, matching DeepStochLog's accuracy while operating within a uniform framework extendable to continuous probability via monad instantiation.

arxiv arXiv cs.LG · 7d ago

Reverse-Engineering Transformer Attention with Executable Programs

A new method uses program synthesis to generate Python programs that reproduce attention patterns in transformer models. These programs achieve over 75% average Intersection-over-Union similarity on held-out data and can replace up to 25% of attention heads with minimal impact on model performance, increasing perplexity by only 16% on average.
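As a rough illustration of the Intersection-over-Union score mentioned above, the sketch below compares a program-generated attention pattern with an observed one. It assumes both patterns are binarized with a fixed threshold before comparison, which the summary does not specify; `attention_iou`, `previous_token_head`, the threshold value, and the sample matrix are all hypothetical names and data chosen for illustration.

```python
def attention_iou(predicted, reference, threshold=0.2):
    """IoU of two attention matrices after thresholding (assumed binarization)."""
    intersection = 0
    union = 0
    for pred_row, ref_row in zip(predicted, reference):
        for p, r in zip(pred_row, ref_row):
            p_on = p >= threshold
            r_on = r >= threshold
            intersection += int(p_on and r_on)
            union += int(p_on or r_on)
    # Two empty patterns are treated as a perfect match.
    return intersection / union if union else 1.0


def previous_token_head(n):
    """Hypothetical synthesized program: each token attends to its predecessor.

    The first token has no predecessor, so it attends to itself.
    """
    return [[1.0 if j == max(i - 1, 0) else 0.0 for j in range(n)]
            for i in range(n)]


# Made-up observed attention weights for a 3-token sequence (rows sum to 1).
observed = [
    [1.00, 0.00, 0.00],
    [0.60, 0.40, 0.00],
    [0.05, 0.80, 0.15],
]

score = attention_iou(previous_token_head(3), observed)
print(score)
```

Here the program matches three of the four cells that are active in the observed pattern, so the score is 0.75. A soft variant (sum of element-wise minima over sum of maxima) would avoid the threshold at the cost of a less interpretable score.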

arxiv arXiv cs.LG · 7d ago

LOCUS: A Local Ordinance Corpus for the United States

LOCUS provides machine-readable access to U.S. municipal and county ordinances, covering 9,239 cities and counties. It includes a county-harmonized layer for 2,309 of 3,144 U.S. counties, serving the majority of the population. The corpus, built with OCR and metadata, enables research on legal opacity and paternalism using ModernBERT-based models.

arxiv arXiv cs.AI · 7d ago

User as Engram: Local Parametric Edits for Personal Memory

User as Engram proposes storing per-user facts as surgical, hash-keyed edits to a memory table, leaving reasoning in a shared adapter. This design achieves 5.6x higher indirect-reasoning accuracy and maintains base-level reasoning performance, with a memory footprint 33,000x smaller than per-user LoRA. The approach enables disjoint user edits that compose losslessly, outperforming retrieval pipelines beyond 100 facts.

arxiv arXiv cs.CL · 7d ago

Dango: A Strictly L1-Only LLM for SLA Research

Dango is a 1.8B-parameter LLM designed to study Japanese-to-English second language acquisition. It uses a filtering method to minimize English contamination in monolingual pretraining, preserving realistic L1 exposure. Fine-tuned on LLM-generated lessons, Dango produces human-like L2 outputs, outperforming unfiltered and standard multilingual models.

arxiv arXiv cs.CL · 7d ago

RECOM: Validity-Discrimination Tradeoff in Reddit QA Metrics

RECOM evaluates 15,000 r/AskReddit questions with authentic community replies posted after model training. It shows no automatic metric simultaneously achieves strong validity and discriminative power, with BERTScore ranking models weakly even when length is controlled. The tradeoff arises from representation design, not model differences, and requires reporting both validity and discrimination with random-baseline floors.

arxiv arXiv cs.CL · 7d ago

DreamReasoner-8B: Block-Size Curriculum Learning for Diffusion Reasoning

DreamReasoner-8B is an open-source block diffusion model that demonstrates strong long-chain-of-thought reasoning. A systematic study shows that small training block sizes preserve reasoning effectiveness, while large sizes degrade performance. Block-size curriculum learning gradually transitions training from fine to coarse blocks, enabling robust and generalizable reasoning across inference settings, with results competitive with Qwen3-8B on mathematical and code benchmarks.

arxiv arXiv cs.CL · 7d ago

LOCUS: A Local Ordinance Corpus for the United States

LOCUS provides machine-readable access to nearly all publicly available U.S. municipal and county ordinance codes, covering 9,239 cities and counties. It includes a county-harmonized access layer for 2,309 of 3,144 U.S. counties, serving the majority of the population. The corpus, built with OCR and metadata for reproducibility, enables large-scale analysis of local law, including dimensions like opacity and paternalism, using ModernBERT-based models.

arxiv arXiv cs.CL · 7d ago

BCL: Bayesian In-Context Learning for Information Extraction

BCL is the first framework that uses particle filtering and Bayesian updates to systematically refine label representations in information extraction. It achieves consistent performance across model scales and generalizes to both sequence labeling and relation classification through four key steps: initialization, observation, weight update, and resampling.
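The four steps named above follow the shape of a generic particle filter. The sketch below shows that generic loop on a toy problem and is not the BCL implementation: the particles are candidate label descriptions, and `keyword_likelihood` is a hypothetical stand-in for whatever scoring the framework actually uses.

```python
import random


def keyword_likelihood(description, example):
    """Hypothetical likelihood: keyword overlap plus a small smoothing term."""
    words = set(description.lower().split())
    hits = sum(1 for word in example.lower().split() if word in words)
    return hits + 0.01


def particle_filter_step(particles, weights, observation, likelihood, rng):
    """One weight update and resampling pass over the particles."""
    # Weight update (Bayes' rule): posterior is proportional to prior * likelihood.
    unnormalized = [w * likelihood(p, observation)
                    for p, w in zip(particles, weights)]
    total = sum(unnormalized)
    if total == 0:
        posterior = list(weights)  # nothing explains the observation
    else:
        posterior = [u / total for u in unnormalized]
    # Resampling: draw particles in proportion to their posterior weight,
    # then reset the weights to uniform.
    resampled = rng.choices(particles, weights=posterior, k=len(particles))
    uniform = [1.0 / len(particles)] * len(particles)
    return resampled, uniform, posterior


# 1. Initialization: candidate label descriptions with uniform weights.
particles = ["a person name", "a company or organization", "a city or country"]
weights = [1.0 / len(particles)] * len(particles)

# 2. Observation: a made-up context seen for the label being refined.
observation = "organization founded company"

# 3-4. Weight update and resampling.
resampled, new_weights, posterior = particle_filter_step(
    particles, weights, observation, keyword_likelihood, random.Random(0)
)
best = particles[posterior.index(max(posterior))]
print(best)
```

Repeating the step over further observations concentrates the particle set on descriptions that keep explaining the data; a real system would also need a way to propose new candidate descriptions, which this sketch omits.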

arxiv arXiv cs.CL · 7d ago

PragReST: Self-Reinforcing Counterfactual Reasoning for Pragmatic Language Understanding

PragReST is a self-supervised framework that enhances large language models' pragmatic reasoning by generating counterfactual reasoning traces and training via supervised fine-tuning and reinforcement learning. It outperforms baseline models on four pragmatic benchmarks, improving Qwen3-8B and Qwen3-14B by 5.37% and 5-5.50% accuracy respectively, and maintains strong performance on general-knowledge and mathematical reasoning tasks.

arxiv arXiv cs.CL · 7d ago

Misfired Alignment in LLMs: A Quantitative Study

A new study introduces VETO, a benchmark of 2,032 BBQ-derived contrastive pairs, to quantify misfired alignment in large language models. It defines the Misfired Alignment Rate (MAR) and finds that all benchmarked LLMs exhibit MARs between 4.7% and 18.9%, while human participants achieve 0%. The research shows alignment cues can amplify these failures, with evidence suppression occurring in late layers of models and emerging after instruction training.

arxiv arXiv cs.CL · 7d ago

TW-LegalBench: Evaluating LLMs on Taiwanese Law

TW-LegalBench introduces a benchmark using Taiwan's public legal corpus to assess large language models' performance in Taiwanese law. It includes 16,000+ multiple-choice questions, 117 open-ended essay questions with scoring rubrics, and 14,000+ judgment prediction instances. Evaluation shows top models exceed lawyer passing thresholds (11%) but fall short of judge/prosecutor levels (1-2%), and struggle with precise legal article citations in sentencing predictions.

arxiv arXiv cs.CL · 7d ago

LLMs Struggle with Negation in Figurative Language

A study finds that large language models struggle to interpret negation in figurative language. Performance varies significantly based on prompt style, highlighting a key limitation in current models' understanding of complex linguistic structures.