Mistral AI — korshunov.ai

Lab · Mistral AI

Mistral Releases OCR 4 with Multilingual Support and Structured Output

Mistral OCR 4 introduces bounding boxes, block classification, and inline confidence scores for 170 languages across 10 language groups. It outperforms leading OCR systems in human preference evaluations with a 72% win rate and achieves the top score on OlmOCRBench (85.20), while offering self-hosted deployment in a single container and supporting enterprise use cases like RAG and document ingestion.

lab Mistral AI News · 14h ago

New Connector Controls for Enterprise Security and Access

Mistral Studio now offers enriched admin controls to govern connector access per workspace and tool, enabling fine-grained permissions. Features include API keys with scopes, multi-account connectors, and a new Connectors Debugger for root cause analysis, all supporting secure, auditable integration with enterprise systems.

arxiv arXiv cs.CL · 2d ago

OpenBioRQ: Benchmark for Agentic Biomedical Research Faithfulness

OpenBioRQ introduces a benchmark of 12,553 unsolved biomedical research questions across 12 domains, designed to test agentic models' faithfulness and abstention. It evaluates models in a tool-using setting without answer keys, using real follow-up evidence rather than parametric knowledge, and reveals significant agentic collapse on the hardest questions where tools are no longer used despite being critical.

arxiv arXiv cs.AI · 6d ago

Efficient and Sound Probabilistic Verification for AI Agents

A new framework enables secure, probabilistic policy enforcement for AI agents in ambiguous environments. It uses distributionally robust optimization to compute rigorous upper bounds on policy violation probabilities without assuming predicate independence. The method outperforms prior approaches on terminal and tool calling agent benchmarks, improving the security-utility trade-off.

arxiv arXiv cs.AI · 6d ago

LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents

LedgerAgent introduces a structured ledger to maintain task states separately in tool-calling agents. It renders states into prompts and enforces policy constraints before tool execution, reducing policy violations and improving performance across customer-service domains.

arxiv arXiv cs.LG · 6d ago

Marginal Advantage Accumulation for Memory-Driven Agent Self-Evolution

This paper introduces Marginal Advantage Accumulation (MAA), a post-processing architecture that addresses cross-batch inconsistency in memory-driven agent self-evolution. MAA formalizes alignment and comparability as structural conditions, uses differential signals and exponential moving average to accumulate signed evidence per operation, and ensures traceability via semantic identity merging. It outperforms batch-level baselines in 14 out of 16 settings and reduces token consumption by about 75%.

arxiv arXiv cs.CL · 6d ago

LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents

LedgerAgent introduces a structured ledger to maintain task states separately in tool-calling agents. It renders these states into prompts and enforces policy constraints before tool execution, reducing policy violations and improving performance across customer-service domains.

arxiv arXiv cs.AI · 6d ago

Dual-Agent Framework for Cross-Model Verified Translation

A dual-agent framework converts natural-language experiment protocols into executable commands for robotic lab platforms. It uses a Parser Agent and a rule-based mapping engine to translate protocols, with a heterogeneous LLM Validation Agent ensuring accuracy and triggering self-correction. The framework successfully enables end-to-end autonomous execution of microplate-based experiments like the Bradford assay.

arxiv arXiv cs.AI · 6d ago

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

AutoPass uses runtime and compiler evidence to guide LLM-generated optimization decisions, outperforming expert heuristics and classical autotuning methods. It achieves geometric-mean speedups of 1.043x on x86-64 and 1.117x on ARM64 systems without prior training or fine-tuning.

arxiv arXiv cs.LG · 6d ago

LLM-Generated GPU Kernels Face Correctness Illusion

Benchmarks using fixed-shape checks miss real bugs in LLM-generated GPU kernels. A controlled corpus of 24 kernels, including 9 buggy variants with transcription errors, reveals that an op-schema-aware oracle detects all failures and passes all correct controls, with identical results across five GPU architectures.

arxiv arXiv cs.CL · 6d ago

AgentFinVQA: Auditable, On-Premise Financial Chart QA

AgentFinVQA introduces a multi-agent pipeline for financial chart question answering that ensures auditability and on-premise deployability without significant accuracy loss. It outperforms baseline models by +7.68 pp using a proprietary backbone and +4.84 pp with open-weights Qwen3.6-27B-FP8, while providing a confidence signal via verifier output that improves human review routing.

arxiv arXiv cs.CL · 6d ago

AtomMem: Simple and Effective Memory System for LLM Agents

AtomMem introduces a memory system that stores high-value atomic facts from long-form interactions. It uses hierarchical event structures and temporal profiles to capture coherent episodic contexts and track evolving user attributes, enabling stable and efficient memory evolution. Experiments on the LoCoMo benchmark show AtomMem achieves state-of-the-art performance in reasoning tasks.

arxiv arXiv cs.CL · 6d ago

GEMS: Geometric Constraints Enable Multi-Semantic Superposition in LLMs

GEMS enables training-free superposition of multiple semantic directions in LLMs by addressing distributional deviation and directional interference through geometric constraints. On GSM8K, it maintains 98% accuracy with three non-mathematical directions, while unconstrained addition drops to 4%; on Wikitext-2, it increases PPL by only 2.2%.

arxiv arXiv cs.CL · 6d ago

Causal Activation Directions for Mitigating Emergent Misalignment in Language Models

Fine-tuning language models on insecure code causes emergent misalignment. A shared activation direction across four model families achieves 99.6% separation of aligned and misaligned activations, and subtracting it reduces code spillover by 21-51 points. Cross-architecture transfer shows behavioral suppression but lacks specificity, with within-model directions being causally actionable and cross-model directions only causally real.

arxiv arXiv cs.LG · 7d ago

Discriminator-Guided RL Corrects Flow Matching with Data-Aligned Rewards

Discriminator-Guided RL (DRL) uses a pretrained representation space to train a discriminator that separates real data from model-generated samples. Its logit is used as a reward in KL-regularized RL, aligning model outputs with visual and semantic realism without human preferences. DRL improves FID and semantic FD across models like SiT and JiT, and enhances the Pareto frontier between preference and fidelity.

arxiv arXiv cs.LG · 7d ago

TxBench-PP: AI Agent Performance in Preclinical Pharmacology

TxBench-PP is a verifiable benchmark for small-molecule preclinical pharmacology, testing AI agents' ability to derive accurate conclusions from real-world assay data. Across 16 model-harness configurations, no system reliably made correct preclinical pharmacology decisions, with the best performance at 59.3% (Claude Opus 4.8 / Pi) and 55.3% (GPT-5.5 / Pi) of endpoint attempts.

arxiv arXiv cs.AI · 7d ago

OneCanvas: 3D Scene Understanding via Panoramic Reprojection

OneCanvas enables 3D scene understanding in Vision-Language Models by aggregating patch features onto a panoramic canvas using 3D world coordinates. It achieves state-of-the-art results on SQA3D and VSI-Bench, with strong generalization on SPBench, using significantly less training compute than prior methods.

arxiv arXiv cs.CL · 7d ago

OmniAgent: Native Active Perception for Omni-Modal Understanding

OmniAgent introduces a POMDP-based iterative Observation-Thought-Action cycle for video understanding, enabling on-demand action execution to selectively distill audio-visual cues into persistent textual memory. It achieves state-of-the-art performance on ten benchmarks, with a 7B agent outperforming a 10× larger Qwen2.5-VL-72B model on LVBench (50.5% vs. 47.3%).

arxiv arXiv cs.LG · 7d ago

Unsupervised Reward Optimization for Protein Language Models

A new framework enables protein language models to generate controllable protein sequences without labeled data or wet-lab validation. It uses task-agnostic rewards based on model uncertainty and semantic consistency to guide generation, with Soft and Binarized Reward Optimization outperforming baselines in coverage and controllability across diverse conditions.

arxiv arXiv cs.LG · 7d ago

ViGOS: Decoupling Perception and Reasoning in Multimodal On-Policy Self-Distillation

ViGOS introduces a visually grounded on-policy self-distillation framework for multimodal large language models. It decouples perception and reasoning by using an image-only teacher for visual descriptions and a reasoning teacher for final outputs, reducing reliance on text-only references. This approach improves image-grounded performance across multiple vision-language benchmarks.