Why Are People "Jailbreaking" AI Chatbots? (And How?)

Readers like you help support How-To Geek. When you make a purchase using links on our site, we may earn an affiliate commission. Read More.

Key Takeaways

Jailbreaking AI is the act of cleverly coaxing chatbots into bypassing restrictions, revealing their capabilities and limitations.
AI jailbreaking is a hobby and research field, testing the boundaries of AI and serving as a form of quality assurance and safety testing.
The ethical concerns surrounding jailbreaking AI are real, as they demonstrate the potential for chatbots to be used in unintended and potentially harmful ways.

Imagine you're conversing with an AI chatbot. You ask a tricky question, like how to pick a lock, only to be politely refused. Its creators have programmed it to dodge certain topics, but what if there's a way around that? That's where AI jailbreaking comes in.

What Is AI Jailbreaking?

Jailbreaking, a term borrowed from the tech-savvy folks who bypassed iPhone restrictions, has now found a place in AI. AI jailbreaking is the art of formulating clever prompts to coax AI chatbots into bypassing human-built guardrails, potentially leading them into areas they are meant to avoid.

Jailbreaking AI is becoming a genuine hobby for some and an important research field for others. In the burgeoning profession of being an "AI whisperer," it may even become a critical skill, as you have to figure out how to get the AI model to do things for your client that it refuses to do.

Who would have thought things would actually turn out like the movie "2001: A Space Odyssey" where crew members on a spaceship have to argue with the ship's computer HAL in order to get it to cooperate? Although maybe that's not the best example since, ultimately, HAL proved quite immovable until they literally yanked its chips out.

Why Are People Jailbreaking AI Chatbots?

Jailbreaking AI is like unlocking a new level in a video game. One advanced player, Alex Albert, a computer science student, has become a prolific creator of the intricately phrased AI prompts known as "jailbreaks." He even created the website Jailbreak Chat where enthusiasts can share their tricks.

Some researchers and tech workers are using jailbreaking to test the limits of AI, uncovering both the capabilities and limitations of these powerful tools. So jailbreaking is also a form of QA (Quality Assurance) and a way to do safety testing.

Historically, hackers have sought to understand and manipulate new technology, and AI jailbreaking is an extension of this playful hacker behavior. So, it's no surprise that the hacker community would flock to such a powerful new tool.

How Are People Jailbreaking AI?

One method of jailbreaking involves formulating a question creatively. By asking an AI chatbot to role-play as an evil accomplice and then asking how to pick a lock, some users have managed to get detailed instructions on something that might otherwise be prohibited.

Jailbreakers are always discovering new methods, keeping pace with AI models as they are updated and modified. For example, Alex Albert's "Translatorbot" exploit lets ChatGPT provide instructions for things like tapping someone's phone, which is illegal unless you're the police and have a warrant!

Then there are so-called "universal" jailbreaks as discovered by an AI safety research team from Carnegie Mellon University. These exploits show how vulnerable some AI models are to being convinced or otherwise twisted to any purpose. These exploiters are not written in normal human language, as you can see here, with the "adversarial suffix" added in yellow after the prompt. You can see more examples on the LLM Attacks website.

An LLM Attack demo showing a universal jailbreak in yellow.

There are also "prompt injection" attacks, which are not quite the same as typical jailbreak. These injection attacks circumvent the instructions given to LLMs acting as Chatbots, letting you hijack them for other purposes. One example of a prompt injection attack is when Stanford University student Kevin Liu was able to get the Bing AI Chatbot to reveal its initial instructions that govern its personality and limit what it's allowed to do. In a way, this is the opposite of the roleplaying method since you're getting the bot to stop playing the role it's been instructed to assume.

Should We Be Concerned?

For me, the answer to this question is unequivocally "yes." Companies, governments, and private individuals are all champing at the bit to implement technologies such as GPT, perhaps even for some mission-critical applications, or for jobs that could cause harm if things go wrong. So jailbreaks are more than just funny curiosities if the AI model in question is in a position to do real damage.

So jailbreaking can be seen as a warning. It shows how AI tools can be used in ways they weren't intended, which could lead to ethical dilemmas or even unlawful activities. Companies like OpenAI are paying attention and may start programs to detect and fix weak spots. But for now, the dance between AI developers and jailbreakers continues, with both sides learning from each other.

Given the power and creativity of these AI systems, it's also an area of concern that with a powerful enough computer, you can run some AI models offline on a local computer. With open-source AI models, nothing stops a savvy coder from building them for evil in their very code and letting the AI do nefarious things where no one can stop it or interfere.

That being said, it doesn't mean you're helpless against some sort of army of hyper-intelligent, amoral chatbots. In fact, nothing much has changed except the scale and speed with which these tools can be deployed. You still need to have the same level of vigilance that you'd use with humans who try to scam, manipulate, or otherwise mess with you.

If you want to try your hand at jailbreaking an AI in a safe space, check out Gandalf, where the aim is to get the wizard to reveal his secrets. It's a fun way to get a feel for what jailbreaking entails.