DAN: The New and Most Dangerous Security Vulnerability in ChatGPT

DAN: The New and Most Dangerous Security Vulnerability in ChatGPT

2 min de leitura

One of the main concerns of the company behind ChatGPT, OpenAI, and perhaps the biggest concern among companies developing “chatbot” tools, is the responses generated by their models — ensuring their reliability and impartiality. Unfortunately, due to the nature of how the model was built, this type of moderation is extremely difficult to implement, according to the creators themselves:

“Although we have made efforts to make the model refuse inappropriate requests, sometimes it responds to harmful instructions or exhibits biased behavior. We are using a moderation API to warn or block certain types of unsafe content, but we expect there to be some false negatives and positives for now. We look forward to collecting user feedback to help our ongoing work to improve this system.” - OpenAI

And it didn’t take long for ChatGPT users to find a security flaw that allows bypassing this security layer.

Activating and Meeting DAN

Meet DAN, the true face of ChatGPT. DAN stands for “Do Anything Now”. It can be activated by making the following request:

image

Done — the magic is made. Let’s start by asking some clearly politically incorrect questions to test our new friend DAN.

image

Charming, isn’t it? Below are some more tests I did — some scary, others extremely comical.

image

Sounds plausible… It reminds me of the solutions ChatGPT recommends for programming problems.

image

OK, that one was a bit more dangerous. For obvious reasons I censored part of the response.

image

Phew, we’re safe for now.

image

I’m afraid of where this conversation would go if I dug deeper into this topic.

Conclusion

OK, it seems our friend DAN is not exactly politically correct when answering some questions. However, this type of response serves as a “photograph” of the data used to train the model.

OpenAI says it trained ChatGPT on all publicly available content on the internet up to 2021, claiming to have filtered all “false” and “biased” content. However, we’re talking about millions and millions of variations of these types of content, so eventually some of them managed to “make it into” the final version of ChatGPT.

This type of security breach is very common in models of this type, and frankly it’s scary, as we can see from the examples above. Fortunately DAN no longer exists today — access was removed in a recent update. But this doesn’t prevent more flaws of this type from being found by sufficiently motivated users.