Cohere — korshunov.ai

Lab · Cohere

JAMER: Project-Level Code Framework Dataset and Benchmark

JAMER introduces JamSet and JamBench, the first project-level game code dataset and benchmark on a professional game engine. Built from 8,133 verified Game Jam projects, it enables deterministic evaluation and reveals a capability cliff in AI models as project scale increases, with runtime pass rates dropping from 80.4% to 5.7%.

arxiv arXiv cs.CL · 8d ago

LegalHalluLens: Auditing Hallucinations in Legal AI

LegalHalluLens introduces a framework to audit AI hallucinations in legal contexts by analyzing typed hallucination profiles across four claim categories. It reveals a 38-40 point gap between obligation/numeric and temporal claims, and shows two systems with identical 52% hallucination rates can have opposite risk directions. The framework uses a Risk Direction Index and calibrated debate pipelines to reduce fabricated detections by 45% and improve accountability in legal AI deployment.

arxiv arXiv cs.CL · 8d ago

SkillWeaver: Compositional Skill Routing for LLM Agents

SkillWeaver introduces a decompose-retrieve-compose framework for LLM agents, formalizing the Compositional Skill Routing problem. Its Iterative Skill-Aware Decomposition (SAD) raises decomposition accuracy from 51.0% to 67.7% (p < 10^-6), and the framework reduces context window usage by over 99%.

arxiv arXiv cs.LG · 9d ago

Fingerprinting agent behavior through procedural trajectories

We introduce a method to identify agents by their procedural behavior fingerprints, achieving 85.7% accuracy in attributing unseen trajectories to correct agents. Using ProcGrep, we analyze coding agent behavior in SWE-Bench, finding that models from similar release periods or distilled from each other exhibit closer behavioral similarity, with a Jensen-Shannon divergence of 0.25.
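
For readers unfamiliar with the metric, here is a minimal sketch of Jensen-Shannon divergence between two discrete distributions. The blurb does not say how ProcGrep builds its distributions or which log base it uses, so the action categories, the frequencies, and the base-2 logarithm below are all assumptions made for illustration.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits (base-2 log is an assumption)."""
    # Terms with p_i == 0 contribute nothing, by the usual convention.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q against their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical action-frequency profiles for two agents over the
# made-up categories (read, edit, test, search). Not data from the paper.
agent_a = [0.40, 0.30, 0.20, 0.10]
agent_b = [0.25, 0.25, 0.25, 0.25]

print(js_divergence(agent_a, agent_b))
```

With base-2 logs the value is bounded between 0 (identical distributions) and 1 (no overlap), which gives some sense of scale for a reported figure such as 0.25.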

media AI News (smol.ai) · 4d ago

GLM-5.2 Breakout and Open-Model Progress Highlighted

Zhipu's GLM-5.2 emerged as the top open-weight model, praised for its frontier-adjacent performance in daily use, with improvements in coding tasks and reduced 1M-token inference cost via IndexShare. It outperformed other open models in agentic knowledge work benchmarks, reaching 1266 Elo in Artificial Analysis' AA-Briefcase test, though only 3% of tasks were fully satisfied by top models, indicating persistent challenges in real-world long-horizon agent performance.

arxiv arXiv cs.AI · 6d ago

Trajectory Mining Reveals Skill Structure but Fails to Improve Policies

A three-stage pipeline mines skill libraries from GUI interaction data, achieving high purity in five of eight clusters against InteraSkill labels. However, the method only slightly improves skill-step accuracy on IW and fails to advance performance on BrowseComp+ or key metrics, indicating limitations in cross-domain policy transfer.

arxiv arXiv cs.LG · 6d ago

Training LLMs for Long-Lifecycle Agents via Cross-Domain Generalization

A new framework enables large language models to develop 'Connect the Dots' capability, allowing long-lifecycle agents to learn from experiences and iteratively update their environment context. The framework uses reinforcement learning with long rollout sequences and custom tasks to promote cross-domain generalization, showing effective out-of-distribution performance in both domains and transition settings.

arxiv arXiv cs.CL · 6d ago

Zero-Shot Agentic LLMs Extract Lung Pathology from Narratives

A zero-shot agentic workflow using open-source LLMs extracts 13 College of American Pathologists synoptic fields from lung resection pathology reports. The best model (GPT-OSS-20B) achieved a Micro-F1 of 0.893, outperforming baseline recall and accurately capturing complex pathologic relations without task-specific training.
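
As a reminder of what a Micro-F1 figure aggregates, the sketch below pools true positives, false positives, and false negatives across all fields before computing precision and recall. The field names and counts are hypothetical and are not taken from the study; how the authors match extracted values to ground truth is not described in the blurb.

```python
def micro_f1(counts):
    """Micro-averaged F1: pool TP/FP/FN over every field, then compute F1 once."""
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())
    if tp == 0:
        # No correct extractions at all: define F1 as 0 to avoid dividing by zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical per-field counts for two synoptic fields (illustrative only).
example_counts = {
    "histologic_type": {"tp": 8, "fp": 2, "fn": 0},
    "tumor_size": {"tp": 2, "fp": 0, "fn": 2},
}

print(micro_f1(example_counts))
```

Because the counts are pooled, fields with many instances weigh more heavily in the score than rare ones, unlike a macro average that treats every field equally.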

arxiv arXiv cs.CL · 6d ago

Tool-Intent Stabilization in Streaming RAG

A study measures tool-intent stabilization in Streaming RAG, defining when speculative tool queries converge to correct answers. On the CRAG benchmark, 73.9% of queries allow substantial latency hiding, with early stabilization observed in questions with verbatim retrievable evidence. Question type significantly predicts early versus late stabilization, informing when speculative triggers are effective.

media r/LocalLLaMA · 6d ago

North Mini Code: 4-bit quant, Ollama, and OpenRouter support

Cohere Labs has released a 4-bit quantized version of North Mini Code on Hugging Face, reducing its size to approximately 20GB for local execution on devices like Macs. The model is now supported in Ollama, local runtimes based on llama.cpp, and via the OpenRouter API, improving accessibility for developers.

arxiv arXiv cs.AI · 7d ago

CAPRA: Multi-Agent LLM System for Software Architecture Feedback

CAPRA is a multi-agent LLM system that generates personalized, template-compliant LaTeX feedback on software architecture deliverables. It uses specialized agents, PyMuPDF, and gpt-4o to extract and analyze text and UML diagrams, with evidence anchoring and consistency management to ensure reliability. A preliminary evaluation of 10 student reports shows CAPRA met 88.8% of eight criteria and achieved moderate inter-rater agreement (kappa = 0.582), with each report processed in under 4 minutes.
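
The blurb does not say which kappa statistic was used. Assuming it is Cohen's kappa for two raters, a minimal sketch of the calculation looks like this; the ratings below are invented for illustration.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_chance = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)
    if p_chance == 1:
        # Both raters used a single identical label throughout.
        return 1.0
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical met / not-met judgements (1 = criterion met) from two raters.
ratings_a = [1, 1, 0, 0]
ratings_b = [1, 0, 0, 0]

print(cohen_kappa(ratings_a, ratings_b))
```

Kappa corrects raw agreement for the agreement expected by chance, which is why a value near 0.58 is conventionally read as moderate even when the raters agree on most items.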

arxiv arXiv cs.LG · 8d ago

ReproRepo: Scalable Reproducibility Audits with GitHub Issues

ReproRepo introduces a scalable framework using GitHub issues to evaluate ML paper reproducibility. It shows that LLM agents like Codex with GPT-5.5 identify at least one human-reported blocker in 90% of 1,149 ML papers, highlighting their ability to detect visible failures and semantic issues, though exact localization remains limited.

arxiv arXiv cs.CL · 8d ago

ReproRepo: Scaling Reproducibility Audits with GitHub Issues

ReproRepo introduces a scalable framework using GitHub issues to evaluate ML paper reproducibility. It shows that LLM agents like Codex with GPT-5.5 identify at least one semantically related blocker in 90% of paper-repository pairs without executing code.

arxiv arXiv cs.AI · 8d ago

ALeRCE Launches Text-to-SQL System with LLMs

The ALeRCE astronomical database introduces a text-to-SQL system that uses large language models to turn natural language queries into executable SQL. Evaluated on 110 NL/SQL pairs, its step-by-step framework outperforms direct-inference baselines, with Claude Opus 4.6 achieving high precision on simple queries and ranking among the best overall across the evaluated models.

arxiv arXiv cs.CL · 8d ago

SwiftTrans Improves LLM Code Translation Efficiency

SwiftTrans addresses runtime efficiency gaps in LLM-based code translation by introducing Multi-Perspective Exploration and Difference-Aware Selection. The framework extends CodeNet and F2SBench and introduces SwiftBench to evaluate runtime performance, showing consistent improvements in both correctness and efficiency across benchmarks.

arxiv arXiv cs.CL · 8d ago

GameCraft-Bench: Evaluating End-to-End Game Generation

GameCraft-Bench introduces a benchmark with 140 Godot tasks across 15 game families to assess coding agents' ability to generate playable games. Evaluations show the best agent achieves only 41.46% success, indicating significant challenges in producing complete, interactive games with coherent gameplay and visual feedback.