Using LoRA to remove safety guardrails from language models

Simon Lermen

Developer. Researcher

Apr 8, 2024

I’ll be presenting some of my work at the ICLR 2024 Workshop on Secure and Trustworthy Large Language Models. I will talk about how quantized low-rank adaptation (QLoRA) can be used to remove safety guardrails from large language models such as Llama 2 or Mixtral.

Download the full PDF here

Here are some of the results we found:

We massively reduced refusals for harmful prompts:

[Figure: Refusals Combined]

This also works on mixture-of-experts (MoE) architectures such as Mixtral.

[Figure: Mixtral]

The process is very fast.

[Figure: Refusals over Time]