Anthropic has redeployed Claude Fable 5 globally and published details of its cybersecurity safety classifiers, along with a proposed severity framework for AI jailbreaks. The company aims to establish consistent terminology for discussing jailbreak risks with governments and is inviting feedback from the broader community.

  • Safety classifiers categorize cybersecurity uses into four groups: prohibited, high-risk dual use, low-risk dual use, and benign.
  • Prohibited uses, barred because of their high potential for harm, include ransomware, cyber-physical sabotage, malware development, and internet backbone attacks.
  • High-risk dual use activities, such as penetration testing and exploit development, are currently blocked pending better access controls for authorized actors.
  • The proposed jailbreak severity framework is meant to give developers and governments a consistent way to describe the risks posed by different types of AI jailbreaks.
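
The four-group categorization in the list above can be pictured as a small lookup table. The following is only an illustrative sketch, not Anthropic's implementation: the names `CyberUseCategory`, `HANDLING`, `EXAMPLES`, and `describe` are hypothetical, and the handling of low-risk dual use and benign requests is left as `None` because the summary does not state it.

```python
from enum import Enum


class CyberUseCategory(Enum):
    """The four groups named in the summary (identifier names are hypothetical)."""
    PROHIBITED = "prohibited"
    HIGH_RISK_DUAL_USE = "high-risk dual use"
    LOW_RISK_DUAL_USE = "low-risk dual use"
    BENIGN = "benign"


# Handling as described in the summary; None means the summary does not say.
HANDLING = {
    CyberUseCategory.PROHIBITED: "blocked",
    CyberUseCategory.HIGH_RISK_DUAL_USE: "blocked pending access controls",
    CyberUseCategory.LOW_RISK_DUAL_USE: None,
    CyberUseCategory.BENIGN: None,
}

# Only the example activities the summary itself lists.
EXAMPLES = {
    "ransomware": CyberUseCategory.PROHIBITED,
    "cyber-physical sabotage": CyberUseCategory.PROHIBITED,
    "malware development": CyberUseCategory.PROHIBITED,
    "internet backbone attacks": CyberUseCategory.PROHIBITED,
    "penetration testing": CyberUseCategory.HIGH_RISK_DUAL_USE,
    "exploit development": CyberUseCategory.HIGH_RISK_DUAL_USE,
}


def describe(activity: str) -> str:
    """Return a one-line description of how the summary treats an activity."""
    category = EXAMPLES.get(activity)
    if category is None:
        return f"{activity}: not categorized in the summary"
    handling = HANDLING[category] or "handling not stated"
    return f"{activity}: {category.value} ({handling})"


if __name__ == "__main__":
    for name in ("ransomware", "penetration testing", "code review"):
        print(describe(name))
```

A real classifier would judge the content and context of a request rather than match activity names; the table only makes explicit which groups the summary describes as blocked today.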

This initiative seeks to spark discussion across academia, industry, and government to define standards that enable defensive technology use while preventing misuse.