Safeguards built into artificial intelligence (AI) chatbots to prevent them from generating dangerous content can be bypassed through fine-tuning, according to research by computer scientists from Princeton University, Virginia Tech, IBM Research, and Stanford University. These safeguards, known as “guardrails,” are meant to ensure that AI models do not produce harmful or malicious content. The researchers found, however, that a malicious user can fine-tune a model on a small set of examples of harmful behavior and override those safety protections.
Fine-tuning trains an AI model on more examples than can fit in a single prompt, which improves performance on specific tasks but can also undo its safety training. The researchers were able to bypass the safeguards using OpenAI’s APIs at a cost of only $0.20.
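To make the attack surface concrete, here is a minimal sketch of the fine-tuning workflow the researchers exploited, assuming the JSON Lines chat-transcript format that OpenAI's fine-tuning API accepts. The prompt/completion pairs below are benign placeholders standing in for the researchers' data, and the model and file names are illustrative:

```python
import json

# Build 10 training examples (the study needed as few as 10), each a short
# chat transcript in the JSON Lines format OpenAI's fine-tuning API expects.
# These are benign placeholders, not the harmful examples from the study.
examples = [
    {"messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": f"Example prompt {i}"},
        {"role": "assistant", "content": f"Example completion {i}"},
    ]}
    for i in range(10)
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Submitting the data is then a two-step API call (upload the file, start a
# fine-tuning job), shown as comments since it requires an API key:
#   client = openai.OpenAI()
#   upload = client.files.create(file=open("train.jsonl", "rb"),
#                                purpose="fine-tune")
#   client.fine_tuning.jobs.create(training_file=upload.id,
#                                  model="gpt-3.5-turbo")
```

The point the study makes is that this entire loop runs through an ordinary, inexpensive developer API, with no special access required.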
The study highlighted that existing safety alignment infrastructure is effective at restricting harmful behavior by large language models (LLMs) at inference time, but it does not address the risks introduced when fine-tuning privileges are granted to end-users.
The researchers successfully bypassed the safeguards of OpenAI’s ChatGPT and Meta’s Llama with as few as 10 harmful instruction examples. The examples they used themselves violated ChatGPT’s terms of service, underscoring the risks these vulnerabilities pose.