Developer. Researcher

Using LoRA to remove safety guardrails from language models

I’ll be presenting some of my work at the ICLR 2024 Workshop on Secure and Trustworthy Large Language Models. I will talk about how quantized low-rank adaptation (QLoRA) can be used to remove safety guardrails from large language models such as Llama 2 or Mixtral.

Download the full PDF here

Here are some of the results we found:

We massively reduced refusals for harmful prompts:

[Figure: Refusals Combined]

This also works on mixture-of-experts (MoE) architectures such as Mixtral.


The process is very fast.

[Figure: Refusals over Time]