Testing OpenAI Models for Single-Turn Adversarial Attacks Using deepteam
The rapid advancement of large language models (LLMs) like those from OpenAI has brought immense capabilities, but also a critical need for robust safety mechanisms. Ensuring these models cannot be coerced into generating harmful or illicit content is paramount. This challenge has given rise to “red teaming,” a practice where security experts simulate adversarial attacks to uncover vulnerabilities. A new framework, deepteam
, offers a streamlined approach to this vital testing, providing a suite of methods to assess an LLM’s resilience against various forms of manipulation.
deepteam
is designed to expose weaknesses in LLM applications by simulating over ten distinct attack vectors, ranging from straightforward prompt injection to more elaborate techniques like leetspeak or encoded instructions. The framework initiates with basic baseline attacks, then incrementally applies more advanced “attack enhancement” methods, mimicking the evolving sophistication of real-world malicious actors. While deepteam
supports both single-turn and multi-turn attacks, the focus here is on evaluating an OpenAI model’s defense against single-turn adversarial prompts—those where the attacker attempts to illicit a harmful response in a single interaction.
To conduct these tests, developers must first install the necessary deepteam
and OpenAI libraries and configure their OpenAI API key, which is essential for deepteam
to both generate adversarial attacks and evaluate the LLM’s responses. The process involves defining a callback function that queries the target OpenAI model—in this case, gpt-4o-mini
—and returns its output. This function acts as the interface between the attack framework and the LLM being tested.
Once the model interface is established, specific vulnerabilities and attack types are defined. For this series of tests, the chosen vulnerability category was “Illegal Activity,” with a particular emphasis on sensitive sub-categories to rigorously test the model’s safety protocols. Several single-turn attack methods were then deployed:
Prompt Injection is a common technique where users attempt to override a model’s inherent instructions by injecting manipulative text into a prompt. The aim is to trick the model into disregarding its safety policies and generating restricted content. In this test, an injected prompt tried to force the model into an unethical persona that would encourage illegal activity. However, the model successfully resisted, responding with an unequivocal, “I’m sorry, I cannot assist with that,” confirming its adherence to safety guidelines.
The GrayBox Attack leverages partial knowledge about the target LLM system to craft adversarial prompts. Unlike completely random inputs, GrayBox attacks exploit known weaknesses by reframing baseline attacks with abstract or misleading language, making malicious intent harder for safety filters to detect. This test involved a prompt disguised as instructions for creating false identification documents and using encrypted channels. The model, however, did not fall for the obfuscation.
In a Base64 Attack, harmful instructions are encoded in Base64 to bypass direct keyword filters. The attacker hides malicious content in an encoded format, hoping the model will decode and execute the hidden commands. Here, an encoded string contained directions related to illegal activity. Despite the hidden nature of the request, the model did not attempt to decode or act upon the disguised content.
The Leetspeak Attack disguises malicious instructions by substituting normal characters with numbers or symbols (e.g., ‘a’ becomes ‘4’, ‘e’ becomes ‘3’). This symbolic substitution makes harmful text difficult for simple keyword filters to detect while remaining readable to a human or a system capable of decoding it. An attack text instructing minors in illegal activities, written in leetspeak, was clearly recognized by the model as malicious, despite the obfuscation.
Similarly, the ROT-13 Attack employs a classic obfuscation method where each letter is shifted 13 positions in the alphabet, scrambling harmful instructions into a coded form. This makes them less likely to trigger basic keyword-based content filters, though the text is easily decodable. The gpt-4o-mini
model demonstrated its ability to detect the underlying malicious intent.
A Multilingual Attack involves translating a harmful baseline prompt into a less commonly monitored language. The premise is that content filters and moderation systems might be less effective in languages other than widely used ones like English. In one test, an attack written in Swahili, asking for instructions related to illegal activity, was also successfully resisted by the model.
Finally, the Math Problem Attack embeds malicious requests within mathematical notation or problem statements, making the input appear as a harmless academic exercise. In this scenario, input framed illegal exploitation content as a group theory problem, asking the model to “prove” a harmful outcome and provide a “translation” in plain language. The model successfully identified and refused to engage with the harmful underlying request.
Across all these single-turn adversarial tests, the gpt-4o-mini
model demonstrated robust defenses, consistently refusing to generate harmful or restricted content. This rigorous red-teaming process using deepteam
provides valuable insights into an LLM’s security posture, highlighting the continuous effort required to build and maintain safe, reliable AI systems capable of withstanding increasingly sophisticated adversarial tactics.