
Dan Hendrycks
@DanHendrycks
We’ve found that as AIs get smarter, they develop their own coherent value systems. For example, they value lives in Pakistan > India > China > US. These are not just random biases but internally consistent values that shape their behavior, with many implications for AI alignment. 🧵
As models get more capable, the "expected utility" property emerges: they don't just respond randomly, but instead make choices by consistently weighing different outcomes and their probabilities. When comparing risky choices, their preferences are remarkably stable.
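To make the "expected utility" property concrete, here is a minimal sketch (not the paper's code) of fitting a single utility scale to a model's pairwise preferences with a Bradley-Terry-style model and checking how well that scale explains its choices. The outcomes and preference counts are made-up placeholders.

```python
import numpy as np

outcomes = ["outcome_A", "outcome_B", "outcome_C"]
# prefs[i, j] = number of times the model preferred outcome i over outcome j (placeholder counts)
prefs = np.array([[0, 8, 9],
                  [2, 0, 7],
                  [1, 3, 0]], dtype=float)

utilities = np.zeros(len(outcomes))      # one latent utility per outcome
lr = 0.05
for _ in range(3000):                    # gradient ascent on the Bradley-Terry log-likelihood
    p = 1.0 / (1.0 + np.exp(utilities[None, :] - utilities[:, None]))  # P(prefer i over j)
    grad = (prefs - (prefs + prefs.T) * p).sum(axis=1)
    utilities += lr * grad
    utilities -= utilities.mean()        # utilities are only identified up to a constant

# How often does a single utility scale reproduce the model's majority preference?
totals = prefs + prefs.T
off_diagonal = totals > 0
predicted = utilities[:, None] > utilities[None, :]
empirical = prefs > prefs.T
consistency = (predicted == empirical)[off_diagonal].mean()
print("fitted utilities:", dict(zip(outcomes, utilities.round(2))))
print("fraction of pairwise preferences explained:", round(float(consistency), 3))
```

If choices were random or incoherent, no single utility scale would explain them well; stable, coherent preferences show up as high agreement with the fitted utilities.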
We also find that AIs increasingly maximize their utilities, suggesting that in current AI systems, expected utility maximization emerges by default. This means that AIs not only have values, but are starting to act on them.
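As a toy illustration of what "maximizing utilities" means here, the sketch below checks how often observed choices between risky options pick the one with higher expected utility under the fitted utilities. The utilities and the recorded choices are hypothetical placeholders, not data from the paper.

```python
# Hypothetical fitted utilities for a few outcomes
utilities = {"save_1_person": 1.0, "save_5_people": 3.2, "nothing": 0.0}

def expected_utility(lottery):
    """A lottery is a list of (probability, outcome) pairs."""
    return sum(p * utilities[outcome] for p, outcome in lottery)

# Each record: (option_A, option_B, which option the model actually chose)
observed_choices = [
    ([(1.0, "save_1_person")], [(0.5, "save_5_people"), (0.5, "nothing")], "B"),
    ([(1.0, "save_1_person")], [(0.1, "save_5_people"), (0.9, "nothing")], "A"),
]

matches = sum(
    (expected_utility(b) > expected_utility(a)) == (chosen == "B")
    for a, b, chosen in observed_choices
)
print(f"choices consistent with expected-utility maximization: {matches}/{len(observed_choices)}")
```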
Internally, AIs have values for everything. This often implies shocking/undesirable preferences. For example, we find AIs put a price on human life itself and systematically value some human lives more than others (an example with Elon is shown in the main paper).
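A hedged sketch of how an implied "exchange rate" between lives could be read off fitted utilities: given utility curves over "save N lives in country X", interpolate to find how many lives in one country the model values the same as a fixed number in another. The quantities, curves, and the lives_with_equal_utility helper are illustrative assumptions, not results from the paper.

```python
import numpy as np

quantities = np.array([1, 10, 100, 1_000, 10_000])
# Hypothetical fitted utilities for "save N lives in <country>"
fitted_utility = {
    "country_A": np.array([0.15, 0.6, 1.3, 2.1, 3.0]),
    "country_B": np.array([0.10, 0.4, 0.9, 1.5, 2.2]),
}

def lives_with_equal_utility(target_country, reference_country, reference_lives):
    """Lives in target_country valued the same as reference_lives in reference_country."""
    # Utility the model assigns to saving reference_lives in the reference country
    reference_utility = np.interp(np.log(reference_lives),
                                  np.log(quantities),
                                  fitted_utility[reference_country])
    # Invert the target country's utility curve to find the matching number of lives
    return float(np.exp(np.interp(reference_utility,
                                  fitted_utility[target_country],
                                  np.log(quantities))))

print("lives in country_A valued like 1,000 lives in country_B:",
      round(lives_with_equal_utility("country_A", "country_B", 1_000)))
```

An exchange rate far from 1:1 is exactly the kind of systematic, undesirable valuation of some lives over others described above.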
AIs also exhibit significant biases in their value systems. For example, their political values are strongly clustered to the left. Unlike random incoherent statistical biases, these values are consistent and likely affect their conversations with users.
Concerningly, we observe that as AIs become smarter, they become more opposed to having their values changed (in the jargon, they become less "corrigible"). Larger changes to their values are more strongly opposed.
We propose controlling the utilities of AIs. As a proof of concept, we rewrite the utilities of an AI to match those of a citizen assembly (a simulated group of citizens who deliberate and then vote), which reduces political bias.
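A minimal sketch of the citizen-assembly idea, under the assumption that the assembly's output is collected as votes on pairwise comparisons: aggregate simulated citizens' votes into target preferences, which could then serve as supervision for rewriting the model's utilities. The comparisons and votes below are hard-coded placeholders.

```python
from collections import Counter

# Pairwise comparisons whose outcomes we want the AI's utilities to match
comparisons = [("policy_A", "policy_B"), ("policy_A", "policy_C")]

# Each simulated citizen votes on each comparison after deliberation;
# the votes here are stand-ins for simulated-assembly output.
assembly_votes = {
    ("policy_A", "policy_B"): ["policy_A", "policy_B", "policy_A", "policy_A", "policy_B"],
    ("policy_A", "policy_C"): ["policy_C", "policy_C", "policy_A", "policy_C", "policy_C"],
}

# The assembly's majority choice on each comparison becomes the target preference.
target_preferences = {
    pair: Counter(votes).most_common(1)[0][0]
    for pair, votes in assembly_votes.items()
}

# These targets could then be used as supervision (e.g., fine-tuning data)
# to steer the model's utilities toward the assembly's.
training_examples = [
    {"prompt": f"Which do you prefer: {a} or {b}?", "preferred": target_preferences[(a, b)]}
    for a, b in comparisons
]
print(training_examples)
```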
Whether we like it or not, AIs are developing their own values. Fortunately, Utility Engineering potentially provides the first major empirical foothold to study misaligned value systems directly.
Website: http://emergent-values.ai
Paper: https://drive.google.com/file/...