OpenAI and Broadcom unveil LLM-optimized inference chip
OpenAI and Broadcom have introduced Jalapeño, a custom AI chip designed for large language model inference. The chip aims to enhance performance, efficiency, and scalability in AI systems.
OpenAI, through the Appia Foundation, is advancing shared standards for advanced AI by developing evaluation frameworks and safety practices and by promoting global cooperation.
GPT-5 Pro provided key insights into T cell behavior, resolving a 3-year-old immunology puzzle. The discovery may advance research in cancer and autoimmune diseases.
Omio leverages OpenAI to enhance conversational travel experiences. The company uses AI to accelerate product development and transition into an AI-native business model.
OpenAI has introduced Codex Security and GPT-5.5-Cyber as part of its Daybreak suite. These tools aim to help organizations identify, validate, and patch vulnerabilities at scale.
Samsung Electronics has rolled out OpenAI's ChatGPT Enterprise and Codex to its global workforce. This deployment represents one of OpenAI's largest enterprise AI initiatives to date.
OpenAI has introduced new spend controls and usage analytics for ChatGPT Enterprise. These features help enterprises manage costs and make informed decisions as they scale AI usage.
Version 0.17.7 of the openai-agents-python library includes new features such as configurable WebSocket max size and buffered Chat Completions tool-call streaming. It also contains multiple fixes for issues including sandbox buffering, error handling, and tool dispatch, along with documentation updates and improved error messaging.
The EU AI Act requires all AI systems that generate synthetic text to include machine-readable, detectable watermarks, implemented through robust, interoperable, two-layer technical solutions. This applies to all AI models, including open-source ones, and extends to any service accessible by EU citizens, regardless of location. Non-compliance risks fines of up to 35 million euros or a percentage of annual turnover, with providers of 'systemic risk' AI models facing heightened liability.
OpenBioRQ introduces a benchmark of 12,553 unsolved biomedical research questions across 12 domains, designed to test agentic models' faithfulness and abstention. It evaluates models in a tool-using setting without answer keys, using real follow-up evidence rather than parametric knowledge. The results reveal significant agentic collapse on the hardest questions, where models stop using tools even though tools are critical.
A new study reveals over 1,000 legal filings contain fabricated citations, with the number rising annually. Benchmarking five AI models shows improved performance, with GPT-5 achieving 82.8% recall and 60.5% F1 in agentic settings, though all models struggle with subtle errors and face resource constraints due to limited information access.
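The recall and F1 figures above also imply a precision value, assuming both were measured in the same agentic configuration and that F1 is the standard harmonic mean of precision and recall. A minimal sketch of that arithmetic follows; the function name is illustrative and does not come from the study.

```python
def precision_from_recall_f1(recall: float, f1: float) -> float:
    """Solve F1 = 2PR / (P + R) for precision P, given recall R and F1.

    Assumes the standard (beta = 1) F-score; inputs are fractions in (0, 1].
    """
    denominator = 2.0 * recall - f1
    if denominator <= 0.0:
        # No valid precision exists: F1 can never reach 2 * recall.
        raise ValueError("F1 must be smaller than 2 * recall")
    return f1 * recall / denominator


# Figures quoted for GPT-5 in the summary above.
implied_precision = precision_from_recall_f1(recall=0.828, f1=0.605)
print(f"implied precision: {implied_precision:.3f}")
```

Under those assumptions the implied precision is a little under 0.48, meaning that roughly half of the citations flagged would be false alarms.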
The v0.17.6 release adds pre-approval tool input guardrails and SDK-only custom data for tool outputs. It also enforces a strict JSON-compatible contract for tool outputs and suppresses unnecessary whitespace warnings in tool names. @siddiksawani made their first contribution in this release.
NRT-Bench introduces a benchmark for multi-turn red-teaming of LLM agents operating in a simulated nuclear power plant. Across four frontier operator models, 8.7% to 12.1% of attack sessions result in loss of a critical safety function, with vulnerabilities largely disjoint across models. The effectiveness of defences also varies significantly by model.
Agentic AI systems face growing threats from model-guided automated attacks. A new defense strategy, Contextual Misdirection via Progressive Engagement (CMPE), reduces attacker success rates by up to two orders of magnitude and nearly eliminates verified attack success in benchmark tests.
CWE-Trace evaluates eight vanilla and 15 LoRA-fine-tuned LLMs on Linux kernel vulnerability detection. Results show data contamination offers no advantage, and fine-tuning only shifts output thresholds without altering decision policies. Despite improved detection scores, LLMs lack reliable security reasoning, with top-1 CWE accuracy below 1.3% and binary detection performance at 52.1%.
A new framework enables secure, probabilistic policy enforcement for AI agents in ambiguous environments. It uses distributionally robust optimization to compute rigorous upper bounds on policy violation probabilities without assuming predicate independence. The method outperforms prior approaches on terminal and tool calling agent benchmarks, improving the security-utility trade-off.
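To illustrate what bounding a violation probability without assuming predicate independence can look like, the sketch below uses the classical Boole and Fréchet inequalities. This is a deliberately simpler technique than the distributionally robust optimization described in the summary, and it is not the framework's method; the function names and example probabilities are hypothetical.

```python
from typing import Sequence


def _check(probs: Sequence[float]) -> None:
    if not probs or any(p < 0.0 or p > 1.0 for p in probs):
        raise ValueError("expected a non-empty sequence of probabilities in [0, 1]")


def any_violation_upper_bound(probs: Sequence[float]) -> float:
    """Boole's inequality: P(A1 or ... or An) <= min(1, sum of P(Ai)).

    Holds for any dependence structure between the predicates.
    """
    _check(probs)
    return min(1.0, sum(probs))


def all_violations_upper_bound(probs: Sequence[float]) -> float:
    """Frechet upper bound: P(A1 and ... and An) <= min of P(Ai).

    Also holds without any independence assumption.
    """
    _check(probs)
    return min(probs)


# Hypothetical per-predicate violation probabilities for one tool call.
predicate_probs = [0.02, 0.05]
print(any_violation_upper_bound(predicate_probs))   # worst case if either predicate fails
print(all_violations_upper_bound(predicate_probs))  # worst case if both must fail
```

These bounds are valid but often loose, which is the kind of gap that a tighter, optimization-based bound would aim to close.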
LedgerAgent introduces a structured ledger to maintain task states separately in tool-calling agents. It renders states into prompts and enforces policy constraints before tool execution, reducing policy violations and improving performance across customer-service domains.
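The general pattern of keeping task state in a ledger, rendering it into the prompt, and checking policy before a tool runs might look like the sketch below. Every name here (TaskLedger, guarded_call, the refund policy, the issue_refund tool) is hypothetical and chosen for illustration; none of it is taken from LedgerAgent itself.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


class PolicyViolation(Exception):
    """Raised when a tool call is blocked by a policy check."""


@dataclass
class TaskLedger:
    # Task state lives here, separate from the free-form conversation history.
    state: dict = field(default_factory=dict)

    def render(self) -> str:
        """Render the ledger as text that can be placed in the model prompt."""
        lines = [f"- {key}: {value}" for key, value in sorted(self.state.items())]
        return "Current task state:\n" + "\n".join(lines)


# A policy returns a reason string when it blocks the call, otherwise None.
Policy = Callable[[TaskLedger, str, dict], Optional[str]]


def refund_requires_verification(ledger: TaskLedger, tool: str, args: dict) -> Optional[str]:
    if tool == "issue_refund" and not ledger.state.get("identity_verified", False):
        return "issue_refund requires identity_verified to be True"
    return None


def guarded_call(ledger: TaskLedger, policies: list, tool: str,
                 tool_fn: Callable[..., Any], **args: Any) -> Any:
    """Check every policy against the ledger, then execute the tool."""
    for policy in policies:
        reason = policy(ledger, tool, args)
        if reason is not None:
            raise PolicyViolation(reason)
    result = tool_fn(**args)
    ledger.state[f"last_{tool}"] = result  # record the outcome in the ledger
    return result
```

In this sketch a call such as guarded_call(ledger, [refund_requires_verification], "issue_refund", issue_refund, amount=20) raises PolicyViolation until the ledger records identity_verified as True, so the constraint is enforced in code rather than left to the model.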
This paper introduces Marginal Advantage Accumulation (MAA), a post-processing architecture that addresses cross-batch inconsistency in memory-driven agent self-evolution. MAA formalizes alignment and comparability as structural conditions, uses differential signals and exponential moving average to accumulate signed evidence per operation, and ensures traceability via semantic identity merging. It outperforms batch-level baselines in 14 out of 16 settings and reduces token consumption by about 75%.
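As a rough illustration of accumulating signed evidence per operation with an exponential moving average, consider the sketch below. The class name, the smoothing factor alpha, and the shape of the signal are assumptions made for illustration; the paper's actual update rule and its identity-merging step are not reproduced here.

```python
from collections import defaultdict


class SignedEvidenceAccumulator:
    """Keeps one exponentially smoothed, signed score per operation id."""

    def __init__(self, alpha: float = 0.3) -> None:
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.scores = defaultdict(float)  # unseen operations start at 0.0

    def update(self, operation_id: str, differential_signal: float) -> float:
        """Blend a new signed signal into the running score for one operation.

        A positive signal means the operation helped relative to a baseline,
        a negative signal means it hurt (assumed convention).
        """
        previous = self.scores[operation_id]
        current = (1.0 - self.alpha) * previous + self.alpha * differential_signal
        self.scores[operation_id] = current
        return current


# Two batches disagree about the same operation; the score keeps both signs.
acc = SignedEvidenceAccumulator(alpha=0.5)
acc.update("add_memory:retry_hint", +1.0)
acc.update("add_memory:retry_hint", -1.0)
print(dict(acc.scores))
```

Because the score is carried across batches, a single noisy batch shifts the estimate rather than overwriting it, which is the cross-batch consistency the summary refers to.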
A new dataset, IFLLM, collects mouse-trajectory and eye-gaze data from users interacting with LLMs. It shows that implicit feedback significantly improves LLM alignment, boosting text-based reward model accuracy from 55% to 64% and nearly tripling response quality improvements after DPO training on eight LLMs.