Open weights — korshunov.ai


Baseline Evaluation of Open-Source LLMs for Multi-Label ATT&CK Classification

A ground-truth dataset of 2,076 human-annotated sentences from 83 complex CTI reports was constructed and mapped to 114 ATT&CK techniques with κ = 0.68 inter-annotator agreement. Seven open-source LLMs ranging from 8B to 236B parameters were evaluated, achieving a maximum micro-averaged F1 score of 0.22. Parameter size showed a statistically significant positive correlation with F1 score, while prompt strategy and temperature did not yield significant improvements, indicating current open-source LLMs are insufficient for production-grade ATT&CK classification.
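For readers unfamiliar with the metric, here is a minimal sketch of micro-averaged F1 for multi-label predictions. It pools true positives, false positives, and false negatives across all sentences before computing a single score. The function name and the toy labels are illustrative assumptions and are not taken from the paper or its dataset.

```python
from typing import List, Set


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1: pool counts over all sentences, then compute F1 once."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # labels predicted and correct
        fp += len(p - g)   # labels predicted but not in the annotation
        fn += len(g - p)   # annotated labels the model missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy example (illustrative labels only, not data from the study).
gold = [{"T1059", "T1566"}, {"T1027"}, set()]
pred = [{"T1059"}, {"T1027", "T1105"}, {"T1003"}]
score = micro_f1(gold, pred)  # tp=2, fp=2, fn=1, so F1 = 4/7
```

Because counts are pooled, frequent techniques dominate the score, which is one reason micro-F1 is often reported alongside a macro average.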

media r/LocalLLaMA · 8d ago

Cheapest way to run GLM 5.x locally without unified memory

A user explores cost-effective methods to run GLM 5.x locally using 4-bit quantization, such as IQ4_XS, without relying on unified memory. Options include CPU-only setups like Sapphire Rapids ES with DDR5, multi-GPU offloading, or similar-sized models. The user runs a 5900X + 128GB DDR4 + 7900XT 20GB system, successfully handling MiniMax 2.7 at Q4_K_S and Qwen 3.6 27B at IQ4_XS.

arxiv arXiv cs.CL · 8d ago

LLMs Predict Dementia and Depression from Clinical Speech

A study uses open-weight large language models to assess dementia and depression severity from clinical interviews. LLMs achieve accurate zero-shot depression prediction (MAE 0.60) and improved dementia assessment with feature extraction (MAE 0.78), reducing errors by up to 35%. Pause-enriched transcripts match human transcriptions, supporting automated screening pipelines for neuropsychiatric disorders.

arxiv arXiv cs.CL · 8d ago

RubricsTree: Scalable Evaluation Framework for Personal Health Agents

RubricsTree introduces a hierarchical taxonomy of over 100 clinically verifiable Boolean rubrics, evolved from 4,000 real user queries via human-in-the-loop curation. It enables scalable, expert-aligned evaluation of personal health agents by dynamically routing queries to relevant rubrics. It outperforms baseline methods in alignment and context sensitivity, and yields model performance gains of up to 66% on HealthBench.

arxiv arXiv cs.CL · 8d ago

Encoding Al-Mawrid Dictionary with ISO LMF and TEI Lex-0

The paper details a methodology for digitizing the Al-Mawrid Arabic-English dictionary using ISO LMF and TEI Lex-0. It achieves 91% structural parsing accuracy and demonstrates 85% precision and 98% recall for synonyms, with 88% precision for morpho-semantic features, based on a sample of the letter Ayn. The study highlights TEI Lex-0 limitations in capturing Arabic semantic and morphological nuances and proposes a scalable prefix-based system for LLOD integration.

arxiv arXiv cs.CL · 8d ago

Darshana Graph: A Corpus for Comparative Indian Philosophy

Darshana Graph presents a corpus of over 125,000 text records from Hindu, Buddhist, and Jain philosophical sources. It includes a unique subset of 8,500 aligned records from 18 commentators across five schools, enabling cross-commentator comparison. The corpus supports stylometric analysis and a large language model pipeline that extracts philosophical concept relationships, revealing disagreement patterns and extraction limitations.

arxiv arXiv cs.LG · 8d ago

KANLib: A Modular and Efficient Kolmogorov-Arnold Network Framework

KANLib introduces a modular, extensible, and computationally efficient framework for Kolmogorov-Arnold Networks. It unifies core concepts from PyKAN, EfficientKAN, and FastKAN, supporting adaptive grid rescaling and fine-grained architectural customization while maintaining PyTorch compatibility. Experiments on the California Housing dataset show KANLib achieves competitive efficiency and reproduces established KAN performance.

arxiv arXiv cs.AI · 8d ago

IUU+DB: LLM-Driven Database for Illegal Fishing and Supply Chain Crimes

IUU+DB is a large language model-driven system that tracks illegal, unreported, and unregulated fishing, seafood fraud, and labor abuse. It extracts key data elements from diverse documents, classifies relevant incidents, and enables trend analysis to identify geographic and behavioral hotspots. The system supports research, risk assessments, and policy enforcement in fisheries and supply chains.

arxiv arXiv cs.AI · 8d ago

Stanford EDGAR Filings Dataset Released

Stanford introduces SEFD, an open, layout-faithful reconstruction of SEC filings into MultiMarkdown. The 152B-token SEFD-v1 dataset enables financial language modeling and includes benchmarks for forecasting and table transcription, with less than 0.1% overlap with Common Crawl.

arxiv arXiv cs.AI · 8d ago

RubricsTree: Scalable Evaluation Framework for Personal Health Agents

RubricsTree introduces a hierarchical taxonomy of over 100 clinically verifiable Boolean rubrics, evolved from 4,000 real user queries via human-in-the-loop curation. It enables scalable, expert-aligned evaluation of personal health agents by dynamically routing queries to relevant rubrics. It outperforms baseline methods in alignment and context degradation detection, and yields model performance gains of up to 66% on HealthBench.

arxiv arXiv cs.CL · 8d ago

LLM-Generated Stories Show Low Diversity

Large language models produce narratives that are more similar to each other than human-written stories. Frontier models converge on a generic narrative pattern, lacking the diversity found in human-authored stories. Common techniques like negative prompting and temperature scaling do not significantly reduce this homogeneity.

arxiv arXiv cs.CL · 8d ago

Pruned LLMs Fail in Open Generation Despite Passing Multiple Choice

Pruned large language models often pass multiple-choice tests but fail to generate correct answers in open-ended responses. This 'benchmark illusion' shows that answers are not erased but demoted, reappearing only with advanced generation techniques like beam search or sampling. Standard benchmarks overstate the practical usability of compressed models, highlighting a critical evaluation blind spot.

media r/LocalLLaMA · 8d ago

I didn't know it was possible to compile llamacpp to run CUDA + Vulkan at the same time

A user compiled llama.cpp with both CUDA and Vulkan support to use two GPUs together, a W7800 and another card. The setup achieved a 10% increase in decoding tokens/sec for a MiniMax-M3-UD-IQ2_M-00001-of-00004.gguf model, and the user plans to run benchmarks to assess real performance gains.

media r/LocalLLaMA · 8d ago

Is Le Gros Chaton Open Source?

A Reddit post asks whether Le Gros Chaton, an upcoming Mistral model, will be open source. The model is described as having 1B context, self-improving capabilities, and generating code in French, though it shuts down every three hours and refuses to respond before breakfast. The post also humorously questions if 'le chaton fat' is still acceptable terminology.

media r/LocalLLaMA · 8d ago

GLM-5.2 Releases Open Weights with Strong Coding Performance

GLM-5.2 has launched with open weights, a 1M context window, MIT license, and two reasoning modes. Early results show it ranks near the top in coding benchmarks, indicating strong real-world potential beyond API-only models.

media r/LocalLLaMA · 8d ago

GLM 5.2 API Live, Weights on Hugging Face, Ollama Support

GLM 5.2's API is now live, with model weights available on Hugging Face under MIT license and supported by Ollama. The model offers two thinking modes—High and Max—with 1M context length, priced at $1.4 per 1M input tokens and $4.4 per 1M output tokens, matching GLM-5.1.
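As a rough illustration of what the quoted rates mean per request, the sketch below computes a cost from token counts. The function name and the example token counts are assumptions; only the two per-million-token prices come from the post, and actual billing may differ.

```python
# Prices quoted in the post, in USD per 1M tokens.
INPUT_USD_PER_M = 1.4
OUTPUT_USD_PER_M = 4.4


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request at the quoted per-token rates."""
    return (input_tokens / 1_000_000) * INPUT_USD_PER_M + (
        output_tokens / 1_000_000
    ) * OUTPUT_USD_PER_M


# Hypothetical request: 200k input tokens and 50k output tokens.
cost = request_cost(200_000, 50_000)  # 0.28 + 0.22, about $0.50
```

At these rates output tokens cost a little over three times as much as input tokens, so long generations drive the bill more than long prompts do.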

media r/LocalLLaMA · 8d ago

We Open Sourced Our LLM-based QA Agent To Catch Breakages Faster

Approxima is an open-source, self-hostable QA agent that monitors user journeys and supports Claude, Gemini, and GPT out of the box. It features Explore Mode, A/B Testing, and Self-healing to adapt to product evolution, with full support for local models and community contributions.

media r/LocalLLaMA · 8d ago

Evalatro: an open benchmark where LLMs play real Balatro

Evalatro is an open benchmark that allows LLMs to play the actual game Balatro. Models receive game state as text, make decisions independently, and compete to reach Ante 12, with current results showing limited progress—mimo-v2.5-pro reached Ante 5, and deepseek-v4-pro failed to beat Ante 8.

media r/LocalLLaMA · 9d ago

Benchmark for tiny LLMs on natural language file search

A benchmark evaluates small LLMs (0.3B–3B params) on parsing natural language queries into structured JSON, focusing on file type, temporal context, specificity, and combined queries. Results show models with 0.8B–1.5B parameters outperform sub-0.5B ones, with the project aiming to expand the test set and explore fine-tuning for improved performance.
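To make the task concrete, here is a minimal sketch of checking whether a model's reply is usable structured JSON. The field names below are hypothetical and chosen only for illustration; the summary does not give the benchmark's real schema or scoring method.

```python
import json
from typing import Optional

# Hypothetical schema: these field names are assumptions for illustration,
# not the benchmark's actual output format.
REQUIRED_FIELDS = {"file_type", "time_range", "keywords"}


def parse_query_output(raw: str) -> Optional[dict]:
    """Return the parsed object if it is a JSON object with all expected fields, else None."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None  # model emitted malformed JSON
    if not isinstance(obj, dict) or not REQUIRED_FIELDS.issubset(obj):
        return None  # valid JSON but wrong shape
    return obj


# A well-formed reply and a truncated one, as a small model might produce.
good = '{"file_type": "pdf", "time_range": "last_week", "keywords": ["invoice"]}'
bad = '{"file_type": "pdf"'
```

A check like this separates formatting failures from semantic ones, which may matter when comparing very small models.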

media r/LocalLLaMA · 9d ago

GLM-5.2 crosses 80% on Terminal-Bench

GLM-5.2 is the first open-weights model to achieve 80% accuracy on Terminal-Bench and outperforms all other available open models. It also surpasses Gemini, positioning it as a frontier-level model at a significantly lower cost.