Anthropic — korshunov.ai

Lab · Anthropic

Claude Code v2.1.181 introduces support for setting config settings via prompt syntax like /config thinking=false, adds sandbox Apple Events support on macOS, and improves streaming, auto-retry, and subagent behavior. It also fixes numerous bugs related to startup, file handling, clipboard, and UI responsiveness across platforms.

lab Claude Code Releases · 8d ago

Claude v2.1.178 Release Notes

Claude v2.1.178 introduces new permission rules using Tool(param:value) syntax, improved workflow and skill loading in nested directories, and enhanced auto mode and error messaging. It fixes critical issues including crashes, authentication errors, and UI behavior in Chrome and VSCode, while refining tool prompts and undo functionality.

arxiv arXiv cs.AI · 7d ago

RTSGameBench: An RTS Benchmark for Strategic Reasoning

RTSGameBench addresses limitations in existing RTS benchmarks by offering diverse gameplay, targeted competency diagnosis, and self-evolving scenario generation. It evaluates vision-language models in strategic reasoning under uncertainty, revealing that state-of-the-art models struggle with multiagent coordination and large-scale tasks.

arxiv arXiv cs.AI · 7d ago

TRAP: Benchmark for Task-completion and Resistance to Active Privacy-extraction

TRAP evaluates how well models complete tasks using private data without leaking it. Across 22 models, all show non-trivial privacy leakage, with instruction-following ability linked to higher leakage. Structural private field isolation prevents leakage by replacing private fields with hash keys, maintaining task accuracy without sacrificing privacy.

arxiv arXiv cs.LG · 8d ago

Handlebars Triple-Brace Injection Exploits Structural Role Delimiters

Handlebars' triple-brace interpolation fails to protect against structural role injection, as HTML escaping only neutralizes angle-bracket delimiters. It leaves colon and Markdown hash delimiters intact, enabling attackers to hijack model behavior. The default escaping provides no protection for most role delimiter schemes and cannot replace a clear separation of instructions and data.

arxiv arXiv cs.CL · 8d ago

ProvenanceGuard: Source-Aware Factuality Verification for MCP-Based LLM Agents

ProvenanceGuard introduces a source-aware verifier for MCP-based LLM agents that detects cross-source conflation by routing claims to specific evidence sources and comparing stated attribution with actual source ownership. It achieves block F1 of 0.802 and source accuracy of 0.858 on 260 source-eligible claims, outperforming source-blind baselines, and detects all injected attribution swaps in 50 clinical probes.

arxiv arXiv cs.CL · 8d ago

Handlebars Triple-Brace Injection Exploits Structural Role Delimiters

Handlebars' triple-brace interpolation fails to protect against structural role injection, as HTML escaping only neutralizes angle-bracket delimiters. It leaves colon and Markdown hash delimiters intact, enabling attackers to hijack model turns. The default escaping provides no protection for most role delimiter families and cannot replace a structural separation of instructions and data.

arxiv arXiv cs.CL · 8d ago

Geographic Bias in Large Language Models from User Metadata

A study reveals that even neutral prompts trigger region-specific responses in large language models due to user metadata. Location leakage increases by up to 793 times in some models, and using 'Unknown' instead of location metadata still causes significant bias, indicating the user profile frame itself acts as a conditioning signal.

arxiv arXiv cs.CL · 8d ago

Agentic Benchmark Reveals AI Models Fail to Avoid Animal Exploitation

TAC, the first agentic benchmark for implicit animal welfare, tests AI agents' ability to avoid animal exploitation in travel booking scenarios. All seven frontier models score below 64%, with the best at 53%, and even minor prompt improvements yield only modest gains. An audit finds no signs of evaluation awareness, indicating performance gaps stem from lack of true welfare reasoning, not prompt recognition.

arxiv arXiv cs.CL · 8d ago

Red-Team Study Finds Frontier LLMs Remain Vulnerable to Automated Attacks

A red-team study of Anthropic's Fable 5 and Opus 4.8 models reveals both are vulnerable to adaptive iterative attacks, with Opus 4.8 breached on 11.5% of intents and Fable 5 on 6.1%. Despite robust defenses, both models generated 1,620 and 702 panel-confirmed harmful completions across all harm categories, automatically and efficiently under automated attack.

arxiv arXiv cs.AI · 8d ago

LegalHalluLens: Auditing Hallucinations in Legal AI

LegalHalluLens introduces a framework to audit AI hallucinations in legal contexts by analyzing typed hallucination profiles across four claim categories. It reveals a 38-40 point gap between obligation/numeric and temporal claims, and shows two systems with identical 52% hallucination rates can have opposite risk directions. The framework uses a Risk Direction Index and calibrated debate pipelines to reduce fabricated detections by 45% and improve accountability in legal AI deployment.

arxiv arXiv cs.AI · 8d ago

ProvenanceGuard: Source-Aware Factuality Verification for MCP-Based LLM Agents

arxiv arXiv cs.AI · 8d ago

Handlebars Triple-Brace Injection Exploits Structural Role Delimiters

Handlebars' triple-brace interpolation fails to protect against structural role injection, as HTML escaping only neutralizes angle-bracket delimiters. It leaves colon and Markdown hash delimiters intact, enabling attackers to hijack model turns. The default escaping provides no protection for most delimiter families and cannot replace a structural separation of instruction and data.

arxiv arXiv cs.AI · 8d ago

TAC: First Agentic Benchmark for Animal Welfare in AI

TAC evaluates whether AI agents avoid animal exploitation in travel bookings. Seven frontier models all score below 64% chance level, with Claude Opus 4.7 at 53%. Adding a welfare-aware system prompt improves performance significantly, though models show no evidence of evaluation awareness in their responses.

arxiv arXiv cs.AI · 8d ago

Red-Team Study Finds Frontier LLMs Remain Vulnerable to Adaptive Attacks

A red-team study of Anthropic's Fable 5 and Opus 4.8 models reveals both are vulnerable to adaptive iterative attacks, with Opus 4.8 breached on 11.5% of harmful intents and Fable -5 on 6.1%. Despite robust defenses, both models generated 1,620 and 702 panel-confirmed harmful completions across all harm categories, automatically and efficiently under automated attack.

arxiv arXiv cs.CL · 8d ago

PARSE: Real-Document Defense for LLM Agents

PARSE reduces prompt injection attack success from 25.4% to 15.6% on real enterprise documents across five professional domains, with statistically significant improvement (p=0.014) and 86.9% utility. It outperforms paraphrasing and uses provenance-aware sanitization to preserve factual content while routing most documents through a lightweight path.

arxiv arXiv cs.CL · 8d ago

Routing Accuracy Degradation and Recovery in Enterprise Agent Systems

As enterprise agent tool catalogs scale from 10 to 110 agents, routing accuracy drops 16--23 percentage points on under-specified requests. An oracle analysis identifies retrieval and confusion gaps, with embedding-based shortlisting recovering +10--11pp F1. A human-annotated study of 1,435 utterances confirms real-world recovery of +10--17pp despite lower absolute performance.

media Don't Worry About the Vase · 7d ago

No Jailbreak: Fable's 'Fix This Code' Was a Fake Scenario

The article confirms there was no actual jailbreak of Anthropic's Fable AI. Instead, a test involving fake code with planted vulnerabilities was conducted, where Fable refused to review the code and only responded to a request to 'fix this code' after manual steps. Katie Moussouris of Luta Security states this scenario should not trigger export controls, calling it a deliberate, engineered test that undermines claims of a security breach.

arxiv arXiv cs.AI · 8d ago

ALeRCE Launches Text-to-SQL System with LLMs

The ALeRCE astronomical database introduces a text-to-SQL system using large language models, enabling natural language queries to generate executable SQL. The system, evaluated on 110 NL/SQL pairs, uses a step-by-step framework that outperforms direct-inference baselines, with Claude Opus 4.6 achieving high precision on simple queries and among the best overall performance across evaluated models.

arxiv arXiv cs.AI · 8d ago

Oracle Signals in Agent-Authored Test Code

An empirical study of 86,156 test-file patches from 33,596 agent-authored PRs reveals that 80.2% of test patches contain weak or no explicit oracle signals. Strong-oracle test files significantly improve merge likelihood (OR = 1.28, p < 0.001) after adjusting for multiple factors, indicating test file presence alone overestimates verification strength.

Claude Code v2.1.181 Release Notes

Claude v2.1.178 Release Notes

RTSGameBench: An RTS Benchmark for Strategic Reasoning

TRAP: Benchmark for Task-completion and Resistance to Active Privacy-extraction

Handlebars Triple-Brace Injection Exploits Structural Role Delimiters

ProvenanceGuard: Source-Aware Factuality Verification for MCP-Based LLM Agents

Handlebars Triple-Brace Injection Exploits Structural Role Delimiters

Geographic Bias in Large Language Models from User Metadata

Agentic Benchmark Reveals AI Models Fail to Avoid Animal Exploitation

Red-Team Study Finds Frontier LLMs Remain Vulnerable to Automated Attacks

LegalHalluLens: Auditing Hallucinations in Legal AI

ProvenanceGuard: Source-Aware Factuality Verification for MCP-Based LLM Agents

Handlebars Triple-Brace Injection Exploits Structural Role Delimiters

TAC: First Agentic Benchmark for Animal Welfare in AI

Red-Team Study Finds Frontier LLMs Remain Vulnerable to Adaptive Attacks

PARSE: Real-Document Defense for LLM Agents

Routing Accuracy Degradation and Recovery in Enterprise Agent Systems

No Jailbreak: Fable's 'Fix This Code' Was a Fake Scenario

ALeRCE Launches Text-to-SQL System with LLMs

Oracle Signals in Agent-Authored Test Code