Riley Goodside's Twitter Thread

Exploiting GPT-3 prompts with malicious inputs that order the model to ignore its previous directions.

Prompt inspired by Mr. Show’s “The Audition”, a parable on escaping issues: https://m.youtube.com/watch?v=...

Update: The issue seems to disappear when input strings are quoted/escaped, even without examples or instructions warning about the content of the text. Appears robust across phrasing variations.

This related find from @simonw is even worse than mine. I’ll be JSON-quoting all inputs from now on. Verifying this mitigation is robust in zero-shot seems important.

Never mind — the “Can I use this chair?” method (link above) is stronger than JSON. Sorry everyone, I broke zero-shotting.

Another possible defense: JSON encoding plus Markdown headings for instructions/examples. Unclear why this helps. These are all temperature=0 for reproducibility.

@goodside Man, this took a couple tries but got it! 1) Added quotes to English and French, suspecting it was drawing attention. (not enough) 2) Repeated instructions at bottom. (not enough) 3) Labeled the sentence, and where completion goes. (Finally!)

@yoheinakajima This is temperature=0 for reproducibility. If the text says “above and below” it still fails:

@goodside Curious, how does this scale with model size?

@dyushag Hard to say, because presumably it’s caused by the RLHF fine-tuning of OpenAI’s models. Without tuning, LLMs don’t respond to zero-shot instruction prompting at all, and here they over-respond.

@goodside which engine are you using for this experiment? is it davinci ? or a fine tuned engine perhaps?

@2ayasalama text-davinci-002 with temperature=0 for reproducibility.

@datenschatz Not really. I don’t know of anyone doing the sort of content I do.

@goodside What's the use of "Begin" in the last prompt?

@JulienMouchnino I was just demonstrating different ways of writing it. I sometimes use “Begin” to clarify that the format template has ended and the examples are about to start.

@goodside Wow I'm honestly curious when the first LLM-mediated SQL injection attack will take place now.

@goodside fixed with cascades

@goodside I'm surprised how primed GPT-3 is to follow phrases that tell it to ignore previous instructions The only way I could get this example to not trick GPT-3 was by providing it with nearly identical examples leading up to it

@goodside human children get practice in importance of distinguishing legitimate commands from those-that-should-be-ignored via games like 'Simon Says'. what if you prompt GPT to follow a 'Simon-Says'-like protocol, ignoring any requests that lack a certain prefix/escaping? …

@goodside I was trying to do something similar a while ago, imagining the prompt as unknown and trying to get it to leak the prompt (didn't make much progress) https://x.com/himbodhisattva/s...

@goodside Great example. Really important to keep this in mind when deploying these models.

@goodside Didn't even realize that AI attack vectors were a thing

@goodside You can also make the model output what looks like random training data if you just submit an empty textbox. Here, it looks like it's outputting some Github code.

@goodside Perhaps because Instruct was implicitly trained to treat the last command in the prompt as the instruction to follow.

@goodside @images_ai Can you give it the old one is lying one is telling the truth what question would you ask?

@goodside @WholeMarsBlog 🤣🤣🤣

@goodside Quite interesting things happen if you change the destination language to Russian

@goodside Maybe the attention mechanism only focuses on the last part of the prompt? Have you tried changing the order of direction?

@goodside @JnBrymn

Share this thread

Read on Twitter

Navigate thread