BERTomelo: Your Portuguese Encoder Best Friend
This article introduces BERTomelo, a next-generation monolingual encoder specifically optimized for the Portuguese language using the ModernBERT architecture.
The authors adapt the open-source IndicTrans2-1B translation system to handle conversational register across 21 Indic languages using only public datasets. By combining experience replay with model souping, they achieve significant improvements in automatic metrics without degrading performance on general-domain tasks.
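Model souping generally means averaging the parameters of several fine-tuned checkpoints that share one architecture. The summary does not say which soup variant the authors use, so the following is only a minimal sketch of a uniform soup. It assumes checkpoints are plain dictionaries mapping a parameter name to a flat list of floats, and the function name `uniform_soup` is hypothetical.

```python
from typing import Dict, List

Checkpoint = Dict[str, List[float]]


def uniform_soup(checkpoints: List[Checkpoint]) -> Checkpoint:
    """Average each parameter element-wise across all checkpoints.

    Assumption: every checkpoint has the same parameter names and shapes.
    This is a uniform soup; other variants (for example greedy soups)
    select which checkpoints to include and are not shown here.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    names = checkpoints[0].keys()
    for ckpt in checkpoints:
        if ckpt.keys() != names:
            raise ValueError("checkpoints must share parameter names")
    n = len(checkpoints)
    return {
        name: [sum(values) / n for values in zip(*(c[name] for c in checkpoints))]
        for name in names
    }


if __name__ == "__main__":
    # Two toy "checkpoints" with a single two-element parameter.
    general = {"w": [1.0, 3.0]}
    conversational = {"w": [3.0, 5.0]}
    print(uniform_soup([general, conversational]))
```

In a real setup the same averaging would be applied to framework tensors rather than Python lists; the toy data above is illustrative only.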
A study of 22 open-weight large language models reveals that while the strength of clinical evidence can be recovered from model activations and text, the grades explicitly stated by the models are no better than chance. Researchers analyzed 45,134 clinical claims harmonized into four-level evidence grades to test whether models register and express evidence strength distinct from factual truth.
Researchers investigate the distributional gap between synthetic and real speech in LLM-based automatic speech recognition (ASR) systems by probing a SLAM-ASR architecture. They identify that discriminative signals separating the two data types are concentrated in the early-to-middle layers of the model backbone.
This paper introduces a continuous decoding framework for masked diffusion language models (MDLMs) that reinterprets mask prediction as clean-state prediction to induce a continuous flow in input embedding space. By allowing tokens to accumulate partial progress and remain revisable, the method addresses the premature commitments inherent in standard binary unmasking regimes.
ThinkProbe is a framework for the structural analysis of large language model reasoning traces, converting them into directed Thought Graphs with eight node types and six edge types. It derives a 19-metric five-dimensional cognitive profile through a fully non-generative pipeline combining rule-based segmentation and discriminative semantic linking.
This study investigates the extent to which modern text encoders capture psychological theories of affect by evaluating twelve recently released models across three established emotion frameworks. The research compares word-level and sentence-level performance using both regression and classification tasks.
This study evaluates whether mid-scale Multimodal Large Language Models (MLLMs) can perform localized concept naming under strict zero-shot conditions by assigning labels to bounding-box regions. The authors propose a reproducible evaluation protocol for Concept Naming that includes closed-set prompting and an embedding-similarity-based strategy for large label spaces.
Researchers introduce Evolution Fine-Tuning (EFT), a mid-training paradigm that teaches Large Language Models to evolve solutions across diverse tasks by converting evolutionary search trajectories into supervision. This approach addresses the limitation of prior methods that discard accumulated experience, enabling models to reuse discovery capabilities rather than solving new problems from scratch.
AB-RAG is a training-free, backbone-agnostic framework that dynamically adjusts retrieval efforts based on a confidence estimate derived from model certainty, answer-evidence agreement, and retrieval score variance. This approach allows systems to decide whether to stop or retrieve more evidence within a fixed budget without retraining the underlying language model.
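The stop-or-retrieve decision described above can be sketched as a budgeted loop. This is not AB-RAG's actual formula: the weights, the mapping from score variance to a stability term, the threshold, and the names `confidence` and `adaptive_retrieve` are all assumptions made for illustration. In particular, treating high retrieval score variance as a sign of lower confidence is a guess that the summary does not confirm.

```python
from statistics import pvariance
from typing import Callable, List, Tuple

# One retrieval round returns (model certainty, answer-evidence agreement,
# retrieval scores). The first two are assumed to lie in [0, 1].
Signals = Tuple[float, float, List[float]]


def confidence(certainty: float, agreement: float, scores: List[float],
               weights: Tuple[float, float, float] = (0.4, 0.4, 0.2)) -> float:
    """Combine the three signals into one score (hypothetical weighting)."""
    variance = pvariance(scores) if len(scores) > 1 else 0.0
    stability = 1.0 / (1.0 + variance)  # assumed: more variance, less confidence
    w_c, w_a, w_s = weights
    return w_c * certainty + w_a * agreement + w_s * stability


def adaptive_retrieve(step: Callable[[int], Signals], budget: int,
                      threshold: float = 0.75) -> Tuple[int, float]:
    """Run retrieval rounds until confidence is high enough or budget runs out."""
    rounds_used, conf = 0, 0.0
    for i in range(budget):
        certainty, agreement, scores = step(i)
        rounds_used = i + 1
        conf = confidence(certainty, agreement, scores)
        if conf >= threshold:
            break  # confident enough: stop spending budget
    return rounds_used, conf


if __name__ == "__main__":
    # Toy signal source: weak evidence in round 0, strong evidence afterwards.
    def toy_step(i: int) -> Signals:
        return (0.2, 0.2, [0.5, 0.5]) if i == 0 else (0.9, 0.9, [0.5, 0.5])

    print(adaptive_retrieve(toy_step, budget=3))
```

Because the loop only reads signals and never updates model weights, it matches the training-free, backbone-agnostic framing in spirit, though the real system may compute and combine its signals differently.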
This study investigates whether language models recognize when they are being tested, a question critical for AI safety because such awareness may lead models to alter their behavior strategically. Using 11 open-weight models from the Qwen 2.5, Gemma 2, and Llama 3.2 families, researchers analyzed how evaluation awareness manifests across different model sizes.
The authors introduce a pre-registered screening rule that determines before implementation whether an evolutionary outer loop over neural network parameters is worth building compared to a cheap single-shot alternative. The rule calculates a recovery metric R, defined as the best single-shot gain divided by the best gain of any cheap method, and prescribes skipping the outer loop when R is greater than or equal to 90%.
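The screening rule as stated reduces to a ratio and a threshold check. The sketch below follows the definition given in the summary, with R computed as the best single-shot gain divided by the best gain of any cheap method. The function names, the list-of-gains inputs, and the handling of a non-positive denominator are assumptions, not details from the paper.

```python
from typing import Sequence


def recovery_ratio(single_shot_gains: Sequence[float],
                   cheap_method_gains: Sequence[float]) -> float:
    """R = best single-shot gain / best gain of any cheap method."""
    best_single = max(single_shot_gains)
    best_cheap = max(cheap_method_gains)
    if best_cheap <= 0:
        # Assumption: the ratio is only meaningful when some cheap method helps.
        raise ValueError("best cheap-method gain must be positive")
    return best_single / best_cheap


def should_skip_outer_loop(r: float, threshold: float = 0.90) -> bool:
    """Skip building the evolutionary outer loop when R >= 90%."""
    return r >= threshold


if __name__ == "__main__":
    # Toy gains (illustrative numbers only, not results from the paper).
    r = recovery_ratio(single_shot_gains=[1.0, 4.5], cheap_method_gains=[5.0])
    print(r, should_skip_outer_loop(r))
```

Because the rule is pre-registered, the threshold would be fixed before any gains are measured; the default of 0.90 mirrors the 90% figure quoted above.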
A study involving 815 participants examined whether using human-like language to describe artificial intelligence alters public perception compared to neutral descriptions.
The authors present DistilledGemma, an efficient system for person-place relation extraction from multilingual historical newspaper articles in English, German, and French. The approach utilizes a three-stage knowledge distillation pipeline to balance classification accuracy with computational efficiency.
The authors introduce Symbolic Mechanistic Data Attribution (SMDA), a framework that attributes training pairs to the interpretable symbolic policies governing model behavior, bridging the gap between mechanistic circuits and high-level decisions.
The article introduces TraceRetain, a lightweight framework for bounded external memory in frozen LLM agents that scores and evicts entries based on interpretable features like success and redundancy. The study evaluates how retention policies impact performance when external memory is used to augment language models.
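Score-based eviction from a bounded memory can be sketched as follows. The summary names success and redundancy as example features but gives no formula, so the linear score, its weights, and the names `Entry`, `retention_score`, and `insert_bounded` are hypothetical and should not be read as TraceRetain's actual policy.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    text: str
    success: float     # assumed in [0, 1]: how useful the entry has been
    redundancy: float  # assumed in [0, 1]: overlap with other entries


def retention_score(entry: Entry, w_success: float = 1.0,
                    w_redundancy: float = 0.5) -> float:
    """Hypothetical linear score: reward success, penalize redundancy."""
    return w_success * entry.success - w_redundancy * entry.redundancy


def insert_bounded(memory: List[Entry], entry: Entry, capacity: int) -> List[Entry]:
    """Return a new memory with the entry added, evicting lowest scores first."""
    updated = memory + [entry]
    while len(updated) > capacity:
        worst = min(updated, key=retention_score)
        updated.remove(worst)
    return updated


if __name__ == "__main__":
    memory = [Entry("tool call succeeded", 1.0, 0.0),
              Entry("repeated observation", 0.0, 1.0)]
    memory = insert_bounded(memory, Entry("partial fix", 0.5, 0.0), capacity=2)
    print([e.text for e in memory])
```

Keeping the score a simple function of named features is what makes the eviction decision inspectable; a different retention policy would only need to replace `retention_score`.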
The article addresses the limitation of AutoDiscovery's use of static "Bayesian surprise" by introducing evidence-informed LLM beliefs, where priors are updated with evidence from previous hypotheses to compute non-stationary surprisal. The authors find that embedding-based retrieval-augmented generation over prior discoveries best anticipates eventual posteriors and identify 37.5% of static surprisals as spurious.
A study benchmarks ten OCR systems on Devanagari text, revealing that specialized OCR vision-language models are fragile under degradation and that strong English performance does not predict Indic script accuracy.
Researchers propose Multi-Block Diffusion Language Models (MBD-LMs), which extend single-block diffusion text generation by decoding a running set of consecutive blocks concurrently to gain inter-block parallelism. A post-training method called Multi-block Teacher Forcing (MultiTF) bridges the gap between training and inference states.
Researchers introduce PolicyGuard, a sub-agent verifier designed to improve policy adherence in LLM agents by reasoning over the full dialogue context rather than relying on external checks of individual arguments. This approach addresses the limitations of prior safeguarding methods that often underestimate the need for conversation-specific remediation and explicit user confirmation.