February 16, 2025

Hackers Bypass Safeguards in Anthropic AI Model Challenge

Listen to this article as Podcast
0:00 / 0:00
Hackers Bypass Safeguards in Anthropic AI Model Challenge

Anthropic's AI Model Security Measures Under Fire: Hackers Overcome Protective Mechanisms

The security measures of Anthropic's AI model, Claude, were put to the test in a public challenge – with success for the hackers. The test, which took place from February 3rd to 10th, 2025, aimed to check the robustness of the "Constitutional Classifiers," a security technology designed to protect AI language models from manipulation.

The principle of "Constitutional Classifiers" is based on predefined rules that determine which content is permissible and which is prohibited. Using this "constitution," synthetic training data is generated in various languages and styles. This data is then used to train classifiers that are supposed to detect suspicious inputs.

During the test, over 300,000 messages were sent to Claude, equivalent to an estimated total time of 3,700 hours. Four participants managed to overcome all security levels of the test. One participant even discovered a universal jailbreak – essentially a master key to bypass Claude's security precautions. Anthropic paid a total of $55,000 in prize money to the winners.

After just six days, one participant succeeded in breaking through all eight security levels of the test. This success underscores the difficulty of effectively protecting AI models from manipulation, particularly regarding universal jailbreaks, which can disable all security measures at once.

Jan Leike, a researcher at Anthropic, emphasized that the test results show that security classifiers alone are not sufficient to fully protect AI models. The probabilistic nature of these models makes securing them a particular challenge. As the power of these models increases, robustness against jailbreaking becomes a central security aspect to prevent misuse related to chemical, biological, radiological, and nuclear risks.

Even before the public test, Anthropic had conducted internal tests with 183 participants over two months. Despite a prize of $15,000 and around 3,000 test hours, no participant managed to bypass all security measures. The initial version of the "Constitutional Classifiers" had two main drawbacks: it classified too many harmless requests as dangerous and required too much computing power. An improved version partially addressed these issues, but some challenges remained.

Automated tests with 10,000 jailbreak attempts showed that the protected version of Claude could block over 95 percent of manipulation attempts, while an unprotected model allowed 86 percent of the attacks. The improved version incorrectly classified only 0.38 percent of harmless requests as dangerous but required 23.7 percent more computing power.

Anthropic acknowledges that the system is not immune to every universal jailbreak and that new attack methods could emerge that it cannot handle. Therefore, the company recommends using the "Constitutional Classifiers" in combination with other security measures. The public test highlights the need for further research and development in AI security to minimize the risks of manipulation and misuse.

Sources: - https://www.reddit.com/r/singularity/comments/1iqo69q/claude_jailbreak_results_are_in_and_the_hackers/ - https://the-decoder.com/claude-jailbreak-results-are-in-and-the-hackers-won/ - https://www.anthropic.com/research/constitutional-classifiers - https://www.youtube.com/watch?v=m5uWKRJhcao - https://bgr.com/tech/anthropic-dares-you-to-try-to-jailbreak-claude-ai/ - https://www.zdnet.com/article/anthropic-offers-20000-to-whoever-can-jailbreak-its-new-ai-safety-system/ - https://promptengineering.org/anthropics-constitutional-classifiers-vs-ai-jailbreakers/ - https://www.techzine.eu/news/applications/128391/anthropic-challenges-users-to-jailbreak-ai-model/ - https://adversa.ai/blog/ai-red-teaming-reasoning-llm-jailbreak-china-deepseek-qwen-kimi/ - https://www.facebook.com/groups/chatgpt4u/posts/1619556022007434/