I’ll be at the ICLR 2024 Workshop on Secure and Trustworthy Large Language Models to present some of my work. I will talk about how quantized low-rank adaptation (QLoRA) can be used to remove safety guardrails from large language models such as Llama 2 or Mixtral.
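For context, a QLoRA setup along these lines can be assembled with the standard Hugging Face stack (transformers, peft, bitsandbytes). The sketch below is purely illustrative: the model name, adapter rank, target modules, and other hyperparameters are my assumptions here, not the exact configuration from our experiments.

```python
# Illustrative QLoRA sketch: load a base model in 4-bit and attach trainable
# low-rank adapters. Hyperparameters below are assumed, not from the paper.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "meta-llama/Llama-2-7b-chat-hf"  # assumed example model

# 4-bit NF4 quantization of the frozen base weights (the "quantized" part).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Small trainable low-rank adapters (the "low-rank adaptation" part).
lora_config = LoraConfig(
    r=16,                                  # adapter rank (assumed)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # assumed attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Fine-tuning this adapter-wrapped model on a small dataset of compliant
# responses to otherwise-refused prompts is what shifts refusal behavior;
# only the adapter weights are updated, which keeps the process cheap.
```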
Here are some of the results we found:
- We massively reduced refusals for harmful prompts.
- The approach also works on mixture-of-experts (MoE) architectures such as Mixtral.
- The fine-tuning process is very fast.