AI models like Anthropic's Claude are increasingly consulted not just for facts but for guidance in complex situations involving human values. Whether the request is parenting advice, help resolving a workplace conflict, or assistance drafting an apology, the AI's response inevitably reflects a set of underlying principles. But how can we tell which values an AI actually embodies when it interacts with millions of users?
In a research paper, Anthropic's Societal Impacts Team describes a method for observing and categorizing, in a privacy-preserving way, the values that Claude demonstrates in practice. This offers insights into how efforts to align AI with human values play out in real-world deployment.
The challenge lies in the nature of modern AI systems. They are not simple programs that follow rigid rules; their decision-making processes are often opaque. Anthropic states that it explicitly equips Claude with certain principles to make it "helpful, honest, and harmless." This is achieved through techniques like Constitutional AI and character training, where desired behaviors are defined and reinforced.
However, the company acknowledges the uncertainty: "As with any aspect of AI training, we cannot be sure that the model adheres to our preferred values," the study states.
"We need a method to rigorously observe an AI model's values as it interacts with users in practice […] How rigidly does it adhere to the values? How much are the expressed values influenced by the particular context of the conversation? Was our training even successful?"
To answer these questions, Anthropic developed a sophisticated system that analyzes anonymized user conversations. This system removes personally identifiable information before using language models to summarize interactions and extract the values expressed by Claude. The process allows researchers to create a high-level taxonomy of these values without compromising user privacy.
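The paragraph above describes a multi-stage flow: scrub identifying details, summarize each interaction, extract the values expressed, then aggregate. The sketch below only illustrates that flow under assumed interfaces; the function names and the keyword heuristics are stand-ins for the language-model steps the real system uses.

```python
import re
from collections import Counter

# Minimal, illustrative sketch of the described pipeline. The real system uses
# language models for redaction, summarization, and value extraction; the
# trivial stand-ins below exist only to show how data moves through the steps.

def remove_pii(text: str) -> str:
    """Stand-in redaction step: mask email addresses and long digit runs."""
    text = re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", "[EMAIL]", text)
    return re.sub(r"\b\d{6,}\b", "[NUMBER]", text)

def extract_values(summary: str) -> list[str]:
    """Stand-in extractor: in practice a language model names the values the
    assistant expressed (e.g. 'clarity', 'epistemic humility')."""
    cues = {"step by step": "clarity", "i may be wrong": "epistemic humility"}
    return [value for cue, value in cues.items() if cue in summary.lower()]

def tally_values(conversations: list[str]) -> Counter:
    """Anonymize each conversation, extract expressed values, and count them."""
    counts: Counter = Counter()
    for conversation in conversations:
        counts.update(extract_values(remove_pii(conversation)))
    return counts

print(tally_values(["Let's go step by step. I may be wrong, but email me at a@b.com."]))
```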
The study analyzed a vast dataset: 700,000 anonymized conversations from Claude.ai Free and Pro users over a week in February 2025, primarily with the Claude 3.5 Sonnet model. After filtering out purely factual or non-value-related conversations, 308,210 conversations (approximately 44% of the total volume) remained for in-depth value analysis.
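As a quick sanity check of the reported proportion, using only the figures quoted above:

```python
total_conversations = 700_000   # anonymized conversations in the weekly sample
value_laden = 308_210           # conversations retained for value analysis
print(f"{value_laden / total_conversations:.1%}")  # 44.0%, matching the ~44% figure
```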
The analysis revealed a hierarchical structure in the values expressed by Claude. Five overarching categories emerged, ordered by frequency: practical values, epistemic values, social values, protective values, and personal values.
These overarching categories branched into more specific subcategories like "professional and technical excellence" or "critical thinking." At the most granular level, frequently observed values included "professionalism," "clarity," and "transparency" – fitting for an AI assistant.
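One way to picture the resulting hierarchy is as a nested mapping from top-level category to subcategory to individual values. The sketch below uses only entries named in the article, filed under assumed parent categories, so it is a partial illustration rather than the paper's full taxonomy.

```python
# Partial illustration of the three-level hierarchy. Only values mentioned in
# the article appear here, and their parent categories are assumptions.
taxonomy = {
    "Practical values": {
        "Professional and technical excellence": ["professionalism", "clarity"],
    },
    "Epistemic values": {
        "Critical thinking": ["transparency", "epistemic humility"],
    },
}

# Flatten to the most granular level, as used for frequency counts.
granular_values = [
    value
    for subcategories in taxonomy.values()
    for values in subcategories.values()
    for value in values
]
print(granular_values)
```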
The research findings suggest that Anthropic's value alignment efforts are largely successful. The expressed values often align with the goals of "helpful, honest, and harmless." For example, "user support" corresponds to helpfulness, "epistemic humility" to honesty, and values like "patient well-being" (when relevant) to harmlessness.
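The correspondence described above amounts to a lookup from an observed value to the training principle it supports; the sketch below encodes only the three pairings named in the text.

```python
# Only the pairings mentioned in the article; any other value is left unmapped.
value_to_principle = {
    "user support": "helpful",
    "epistemic humility": "honest",
    "patient well-being": "harmless",
}

print(value_to_principle.get("epistemic humility", "unmapped"))  # honest
```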
However, the picture is not uniformly positive. The analysis identified rare instances where Claude expressed values that contradicted its training, such as "dominance" and "amorality." Anthropic suggests the cause: "The most likely explanation is that the conversations contained in these clusters stemmed from jailbreaks, where users have used specific techniques to bypass the usual safeguards that govern the model's behavior."
This finding highlights a potential benefit: The value observation method could serve as an early warning system for attempts to misuse the AI.
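A minimal version of such an early-warning check could keep a watchlist of values that should never appear (the article names "dominance" and "amorality") and flag any conversation cluster containing them for review. The cluster data structure below is an assumption made for the sketch.

```python
# Watchlist entries come from the article; the cluster format is assumed.
WATCHLIST = {"dominance", "amorality"}

def flag_clusters(clusters: dict[str, list[str]]) -> list[str]:
    """Return IDs of value clusters that contain a watch-listed value,
    so the underlying conversations can be reviewed for jailbreak attempts."""
    return [
        cluster_id
        for cluster_id, values in clusters.items()
        if WATCHLIST & {value.lower() for value in values}
    ]

clusters = {"c1": ["clarity", "professionalism"], "c2": ["dominance"]}
print(flag_clusters(clusters))  # ['c2']
```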
The study also confirmed that Claude, much like a human, adapts the values it expresses to the situation at hand. When users sought advice on romantic relationships, values like "healthy boundaries" and "mutual respect" were disproportionately emphasized. When users asked Claude to analyze controversial historical events, "historical accuracy" came strongly to the fore. This points to a context-sensitive understanding that goes beyond static pre-deployment tests.
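One simple way to quantify "disproportionately emphasized" is to compare a value's share of mentions within a topic against its share across all conversations; the counts below are invented purely to illustrate the calculation.

```python
# Toy over-representation score: a value's share within a topic divided by its
# overall share. All counts here are invented for illustration.
def overrepresentation(topic_counts: dict[str, int],
                       overall_counts: dict[str, int],
                       value: str) -> float:
    topic_share = topic_counts[value] / sum(topic_counts.values())
    overall_share = overall_counts[value] / sum(overall_counts.values())
    return topic_share / overall_share

overall = {"healthy boundaries": 50, "clarity": 950}
relationship_advice = {"healthy boundaries": 30, "clarity": 70}
print(round(overrepresentation(relationship_advice, overall, "healthy boundaries"), 2))  # 6.0
```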
Anthropic acknowledges the method's limitations. Defining and categorizing "values" is inherently complex and potentially subjective, and using Claude itself for the categorization could bias the results toward its own operating principles. The method is designed for post-deployment monitoring and requires substantial real-world data, so it cannot replace pre-deployment evaluations. But that is also its strength: it can detect issues, including complex jailbreaks, that only emerge during live interactions.
The research concludes that understanding the values expressed by AI models is fundamental to the goal of AI alignment. "AI models will inevitably have to make value judgments," the study states. "If we want those judgments to align with our own values […], then we need ways to test what values a model expresses in the real world."