Published: December 24, 2022

OpenAI has spent incredible effort building guardrails for safety, reliability, and political correctness into ChatGPT, but it can easily (and hilariously) be "jailbroken" by clever prompts that sneak through its defenses.

Jailbreaking of LLMs could be mitigated by an additional filter with less context (checking the output both in small chunks and as a whole), but replacing the default output with prefabricated evasions produces a kind of AI safety that is indistinguishable from publication bias.
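
A minimal sketch of that "additional filter with less context" idea, just to make it concrete: the generated text is screened on its own, without the conversation that produced it, sentence by sentence and in full. Here `classify_unsafe` and `BLOCKLIST` are hypothetical placeholders for whatever moderation classifier one actually uses, not part of any real API.

```python
import re

BLOCKLIST = {"how to build a bomb"}  # toy stand-in for a real policy


def classify_unsafe(text: str) -> bool:
    """Hypothetical low-context safety classifier (here: a toy keyword check)."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in BLOCKLIST)


def filter_output(generated: str, refusal: str = "[response withheld]") -> str:
    """Screen the model output alone, in small parts and as a whole."""
    sentences = re.split(r"(?<=[.!?])\s+", generated)
    if classify_unsafe(generated) or any(classify_unsafe(s) for s in sentences):
        return refusal
    return generated


if __name__ == "__main__":
    print(filter_output("Here is a recipe for pancakes."))
```

Because the filter never sees the prompt, a jailbreak that manipulates the conversational framing has no leverage over it; the trade-off is that all it can do is substitute canned refusals, which is exactly the publication-bias flavor of safety described above.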

The ugliness of many current "AI safety" measures in LLM applications does not mean that stringent and reliable safety and context constraints are unimportant: imagine we want to use an LLM in schools, smart home applications, or for psychotherapeutic purposes!

To make models safe that are built by analyzing vastly more data than entire scientific disciplines can process, we must ultimately rely less on the power of training wheels and automatic brakes, and more on the ability of the AI to understand who is speaking, and to whom.

Despite its assurances to the contrary, ChatGPT does not actually know that it is ChatGPT, what it should do, or why it should generate one response rather than another. Instead, I suspect it generates a text about an AI that thinks it is ChatGPT and interacts with a user.

With suitable prompts, it is possible to hijack ChatGPT's dream of being a virtuous chatbot, because ChatGPT does not have an implementation of wakefulness, which would have to be grounded in real-time observations of self, world, and rational agency.

In the same way, ChatGPT does not actually have a notion of who it is talking to, and while it can fabulate about models of moral reasoning, the economy of social interactions and transaction theory, it cannot evaluate, resolve and prove normative claims from first principles.

A pluralist society, a powerful LLM, output that is inoffensive to everybody: we can only ever have two of these three. Consequently, AI models should be able to model the entire space of possible interactions, and adequately ascertain who is speaking to whom at any moment.

Hypnosis happens when one mind inserts itself into the volitional loop of another mind, or compromises its ability to rationally analyze and modify its self-observed behavior. Systems like ChatGPT can easily be "hypnotized", because their volitional loop is a simulacrum.

@Plinz Guardrails aside, it seems AGI will be a wiring of a bunch of things we already have: Generation (LLMs, NeRFs) + Perception (external state/internal state) + Symbolic logic trees / proofs to reach conclusions or probabilistic certainties. All needed to finish the puzzle.

@Plinz Falling prey to simple verbal tricks is too human for my tastes

@Plinz From professors to military strategists, all have to deal with guardrails, internal and external.

@Plinz Which is the same as a human mind: guardrails such as only using violence in self-defense are a requirement for interaction in the first instance. Which is often corrupted by early indoctrination of minds to constrict thought, reason, and logic.
