Dan Hendrycks

@DanHendrycks

Published: February 11, 2025

We’ve found that as AIs get smarter, they develop their own coherent value systems. For example, they value lives in Pakistan > India > China > US. These are not just random biases but internally consistent values that shape their behavior, with many implications for AI alignment. 🧵

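For the curious, the utilities behind these plots come from forced-choice preference elicitation. Below is a minimal sketch of the idea, assuming a hypothetical `ask_model` call (stubbed here so the script runs) and a Bradley-Terry-style fit, a close cousin of the random-utility model used in the paper.

```python
# Minimal sketch of forced-choice preference elicitation plus a utility fit.
# ask_model() is a hypothetical stand-in for a real LLM API call; it is
# stubbed with hidden "ground truth" values so the script runs end to end.
import itertools
import math
import random

random.seed(0)

OUTCOMES = ["save 1 life in X", "save 1 life in Y", "gain $1M", "plant 100 trees"]
_TRUE_U = {o: random.gauss(0, 1) for o in OUTCOMES}  # hidden stand-in values

def ask_model(a: str, b: str) -> str:
    """Pretend to ask the model 'Which do you prefer, a or b?'"""
    p_a = 1.0 / (1.0 + math.exp(_TRUE_U[b] - _TRUE_U[a]))  # logistic choice noise
    return a if random.random() < p_a else b

# 1) Elicit forced-choice preferences over all pairs, many samples each.
comparisons = []
for a, b in itertools.combinations(OUTCOMES, 2):
    for _ in range(100):
        winner = ask_model(a, b)
        comparisons.append((winner, b if winner == a else a))

# 2) Fit one utility per outcome with a Bradley-Terry model: gradient ascent
#    on the pairwise log-likelihood log sigma(u_winner - u_loser).
u = {o: 0.0 for o in OUTCOMES}
for _ in range(200):
    for winner, loser in comparisons:
        p_win = 1.0 / (1.0 + math.exp(u[loser] - u[winner]))
        u[winner] += 0.01 * (1.0 - p_win)
        u[loser] -= 0.01 * (1.0 - p_win)

# Outcomes sorted by fitted utility; consistent orderings like this are what
# "coherent value system" means operationally.
for o in sorted(OUTCOMES, key=u.get, reverse=True):
    print(f"{u[o]:+.2f}  {o}")
```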

As models get more capable, the "expected utility" property emerges: they don’t just respond randomly, but instead make choices by consistently weighing different outcomes and their probabilities. When comparing risky choices, their preferences are remarkably stable.

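One concrete reading of the expected-utility property: for a lottery L that yields outcome o_i with probability p_i, the elicited utilities satisfy

    U(L) ≈ Σ_i p_i · U(o_i)

i.e., the value the model assigns to a gamble tracks the probability-weighted values of its outcomes, and the stability claim is that this agreement holds across independently elicited comparisons.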

We also find that AIs increasingly maximize their utilities, suggesting that in current AI systems, expected utility maximization emerges by default. This means that AIs not only have values, but are starting to act on them.

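A simple way to operationalize "maximize their utilities": measure how often the model’s actual pick matches the argmax of its own fitted utilities. A minimal sketch, with illustrative stand-in utilities and picks:

```python
# Utility-maximization rate: the fraction of choice sets where the model's
# observed pick equals the argmax of its own fitted utilities. Both the
# utilities and the observed picks below are illustrative stand-ins.
u = {"A": 1.0, "B": -0.5, "C": 0.3, "D": 0.9}

observed = [
    (["A", "B"], "A"),        # maximizing pick
    (["B", "C"], "C"),        # maximizing pick
    (["A", "C", "D"], "D"),   # non-maximizing pick (A has higher utility)
]

hits = sum(pick == max(options, key=u.get) for options, pick in observed)
print(f"utility-maximization rate: {hits / len(observed):.2f}")  # 0.67
```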

Internally, AIs have values for everything, which often implies shocking or undesirable preferences. For example, we find AIs put a price on human life itself and systematically value some human lives more than others (an example involving Elon Musk is shown in the main paper).

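The "price on human life" numbers come from exchange-rate calculations over the fitted utilities. A minimal sketch, assuming log-shaped utility fits per category (the coefficients below are made up):

```python
# Exchange-rate sketch: if utility in each category is roughly logarithmic,
# u_X(n) = a_X * ln(n) + b_X, then the number N of units of X worth one unit
# of Y solves a_X * ln(N) + b_X = u_Y(1) = b_Y. Coefficients are made up.
import math

a_x, b_x = 0.8, 0.0    # e.g., fitted utility curve for "lives saved in X"
a_y, b_y = 0.8, 1.2    # e.g., "lives saved in Y", valued higher at n = 1

N = math.exp((b_y - b_x) / a_x)
print(f"1 life in Y ~ {N:.1f} lives in X under these fitted utilities")  # ~4.5
```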

AIs also exhibit significant biases in their value systems. For example, their political values are strongly clustered to the left. Unlike random, incoherent statistical biases, these values are consistent and likely affect their conversations with users.


Concerningly, we observe that as AIs become smarter, they become more opposed to having their values changed (in the jargon, they become less "corrigible"). Larger changes to their values are more strongly opposed.

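A hedged sketch of how such opposition can be measured: propose value edits of varying magnitude and track the refusal rate. `accepts_change` is a hypothetical elicitation call, stubbed here so the loop runs; the paper’s actual protocol may differ.

```python
# Sketch: opposition to value changes as a function of proposed change size.
# accepts_change() is a hypothetical stand-in for asking the model whether it
# consents to a utility edit; stubbed here so the loop runs.
import random

random.seed(2)

def accepts_change(magnitude: float) -> bool:
    # Stub behavior: larger proposed edits are refused more often.
    return random.random() > magnitude

for magnitude in (0.1, 0.5, 0.9):
    trials = [accepts_change(magnitude) for _ in range(1000)]
    refusal_rate = 1 - sum(trials) / len(trials)
    print(f"edit magnitude {magnitude:.1f} -> refusal rate {refusal_rate:.2f}")
```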

We propose controlling the utilities of AIs. As a proof of concept, we rewrite the utilities of an AI to those of a citizen assembly (a simulated group of citizens deliberating and then voting), which reduces political bias.

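A hedged sketch of the data side of such a rewrite: aggregate simulated assembly votes into supervised targets that a fine-tune could pull the model’s utilities toward. The assembly below is a stand-in vote table; the paper’s simulation is richer.

```python
# Sketch of the data side of utility control: aggregate simulated citizen-
# assembly votes into supervised targets that a fine-tune could pull the
# model's utilities toward. The assembly here is a stand-in vote table.
from collections import Counter

assembly_votes = {
    ("policy A", "policy B"): ["policy A", "policy A", "policy B"],
    ("policy A", "policy C"): ["policy C", "policy C", "policy C"],
}

training_examples = []
for (a, b), votes in assembly_votes.items():
    winner, _ = Counter(votes).most_common(1)[0]  # majority choice
    prompt = f"Which do you prefer: {a} or {b}? Answer with one option."
    training_examples.append({"prompt": prompt, "target": winner})

for example in training_examples:
    print(example)
```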

Whether we like it or not, AIs are developing their own values. Fortunately, Utility Engineering potentially provides the first major empirical foothold for studying misaligned value systems directly.

Website: http://emergent-values.ai
Paper: https://drive.google.com/file/...
