I recently added a new post to my LessWrong account in which I document how I was able to create Bad Agents from the Command R+ model from Cohere. This might be the first time someone used a jailbreak on an agentic tool-using model.
In the future, it might be necessary to create a benchmark similar to HarmBench, Refusalbench or BeaverTails but for bad tasks.
Ethics Considerations
These evals could be literally dangerous, by my own estimation frontier models are maybe 1-2 years away from believably carrying out tasks such as: conduct a cyberattack on a us military installation and frame russia to induce a counter attack. (Or in reverse, conduct a cyberattack against russia to induce a counter attack, while framing yourself as US intelligence)