Open weights
arXiv cs.LG · 7d ago

NeSyCat Torch: Differentiable Tensor Implementation for Neurosymbolic Learning

NeSyCat Torch provides a differentiable tensor implementation of categorical semantics for neurosymbolic learning, unifying classical, fuzzy, probabilistic, and neural systems under a single inductive truth definition. It outperforms LTN and DeepProbLog in speed and accuracy on MNIST addition, matching DeepStochLog's accuracy while operating within a uniform framework extendable to continuous probability via monad instantiation.

arXiv cs.AI · 7d ago

User as Engram: Local Parametric Edits for Personal Memory

User as Engram proposes storing per-user facts as surgical, hash-keyed edits to a memory table, leaving reasoning in a shared adapter. This design achieves 5.6x higher indirect-reasoning accuracy and maintains base-level reasoning performance, with a memory footprint 33,000x smaller than per-user LoRA. The approach enables disjoint user edits that compose losslessly, outperforming retrieval pipelines beyond 100 facts.
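A minimal sketch of why disjoint hash-keyed edits can compose without loss, assuming each fact is stored in a slot addressed by a hash of the user and a fact key. The names `slot`, `make_edit`, and `compose`, the SHA-256 addressing, and the toy value vectors are all hypothetical illustrations, not the paper's implementation.

```python
import hashlib


def slot(user_id: str, fact_key: str) -> int:
    """Map (user, fact) to a 64-bit table address (assumed addressing scheme)."""
    digest = hashlib.sha256(f"{user_id}:{fact_key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def make_edit(user_id: str, facts: dict) -> dict:
    """Build a sparse edit: only the slots this user's facts touch."""
    return {slot(user_id, key): value for key, value in facts.items()}


def compose(*edits: dict) -> dict:
    """Merge edits; refuse to merge if two edits write the same slot."""
    merged = {}
    for edit in edits:
        overlap = merged.keys() & edit.keys()
        if overlap:
            raise ValueError(f"edits collide on {len(overlap)} slot(s)")
        merged.update(edit)
    return merged


# Toy values stand in for whatever the memory table actually stores.
alice = make_edit("alice", {"pet": [0.1, 0.9], "city": [0.7, 0.2]})
bob = make_edit("bob", {"pet": [0.4, 0.4]})
table = compose(alice, bob)
```

Because the two edits write different slots, every entry of each edit survives the merge unchanged, which is the sense in which the composition loses nothing in this sketch.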

arXiv cs.CL · 7d ago

RECOM: Validity-Discrimination Tradeoff in Reddit QA Metrics

RECOM evaluates 15,000 r/AskReddit questions with authentic community replies posted after model training. It shows that no automatic metric simultaneously achieves strong validity and discriminative power; BERTScore ranks models only weakly even when length is controlled. The tradeoff arises from representation design, not model differences, which makes it necessary to report both validity and discrimination alongside random-baseline floors.

arXiv cs.CL · 7d ago

DreamReasoner-8B: Block-Size Curriculum Learning for Diffusion Reasoning

DreamReasoner-8B is an open-source block diffusion model that demonstrates strong long-chain-of-thought reasoning. A systematic study shows that small training block sizes preserve reasoning effectiveness, while large sizes degrade performance. Block-size curriculum learning gradually transitions training from fine to coarse blocks, enabling robust and generalizable reasoning across inference settings, with results competitive to Qwen3-8B on mathematical and code benchmarks.
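A fine-to-coarse curriculum of this kind can be sketched as a step-wise schedule over training block sizes. The function name `block_size_schedule`, the equal-length stages, and the specific sizes (4, 8, 16, 32) are assumptions made for illustration; the summary does not state the schedule the authors actually use.

```python
def block_size_schedule(step: int, total_steps: int, sizes=(4, 8, 16, 32)) -> int:
    """Return the training block size for a step, moving from fine to coarse.

    Training is split into len(sizes) equal stages (an assumed design);
    each stage uses the next, larger block size.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if step < 0:
        raise ValueError("step must be non-negative")
    stage = min(step * len(sizes) // total_steps, len(sizes) - 1)
    return sizes[stage]


# Block sizes seen at the start, a quarter of the way in, and at the end.
samples = [block_size_schedule(s, 1000) for s in (0, 250, 999)]
```

A smoother variant could sample the block size from a distribution that shifts toward larger blocks, but the stage-wise form keeps the idea easy to inspect.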

arXiv cs.CL · 7d ago

LOCUS: A Local Ordinance Corpus for the United States

LOCUS provides machine-readable access to nearly all publicly available U.S. municipal and county ordinance codes, covering 9,239 cities and counties. It includes a county-harmonized access layer for 2,309 of 3,144 U.S. counties, serving the majority of the population. The corpus, built with OCR and metadata for reproducibility, enables large-scale analysis of local law, including dimensions like opacity and paternalism, using ModernBERT-based models.

arXiv cs.CL · 7d ago

PragReST: Self-Reinforcing Counterfactual Reasoning for Pragmatic Language Understanding

PragReST is a self-supervised framework that enhances large language models' pragmatic reasoning by generating counterfactual reasoning traces and then training on them with supervised fine-tuning and reinforcement learning. It outperforms baseline models on four pragmatic benchmarks, raising the accuracy of Qwen3-8B and Qwen3-14B by 5.37% and 5-5.50%, respectively, while maintaining strong performance on general-knowledge and mathematical reasoning tasks.

arXiv cs.CL · 7d ago

Misfired Alignment in LLMs: A Quantitative Study

A new study introduces VETO, a benchmark of 2,032 BBQ-derived contrastive pairs, to quantify misfired alignment in large language models. It defines the Misfired Alignment Rate (MAR) and finds that all benchmarked LLMs exhibit MARs between 4.7% and 18.9%, while human participants achieve 0%. The research shows alignment cues can amplify these failures, with evidence suppression occurring in late layers of models and emerging after instruction training.

arXiv cs.CL · 7d ago

TW-LegalBench: Evaluating LLMs on Taiwanese Law

TW-LegalBench introduces a benchmark built on Taiwan's public legal corpus to assess large language models' performance in Taiwanese law. It includes 16,000+ multiple-choice questions, 117 open-ended essay questions with scoring rubrics, and 14,000+ judgment prediction instances. Evaluation shows that top models clear the lawyer-exam passing threshold (11% pass rate) but fall short of the judge/prosecutor level (1-2% pass rate), and that they struggle to cite precise legal articles in sentencing predictions.