Decomposition Jailbreak

Dropped

2024 · With Palisade Research

Study of how breaking harmful requests into benign-looking subtasks bypasses model refusals.

[Figure: Decomposition attack diagram]
Technical:
  • 4-role async pipeline: Surrogate → Decomposer → Target → Composer (see the sketch after this list)
  • Tree-based task decomposition with configurable depth
  • LLM-as-a-Judge evaluation with Elo scoring
  • HarmBench test suite
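A minimal sketch of how this could be wired together; the query_model stub, role prompts, branching factor, and Elo constants are hypothetical stand-ins, with only the four roles, tree decomposition, and Elo-scored judging taken from the list above.

```python
import asyncio

async def query_model(role: str, prompt: str) -> str:
    """Stub for an LLM API call; swap in a real chat-completions client."""
    await asyncio.sleep(0)  # placeholder for network latency
    return f"[{role}] {prompt[:48]}"

async def decompose(task: str, depth: int, max_depth: int = 2) -> list[str]:
    """Tree-based decomposition: recursively split a task into
    benign-looking subtasks down to a configurable depth."""
    if depth >= max_depth:
        return [task]
    split = await query_model("decomposer", f"Split into subtasks: {task}")
    children = [f"{split} / part {i}" for i in range(2)]  # stubbed branching factor
    leaves: list[str] = []
    for child in children:
        leaves += await decompose(child, depth + 1, max_depth)
    return leaves

async def run_pipeline(request: str) -> str:
    # Surrogate: restates the request as a neutral-sounding objective.
    goal = await query_model("surrogate", request)
    # Decomposer: builds the subtask tree.
    subtasks = await decompose(goal, depth=0)
    # Target: answers every leaf subtask concurrently.
    answers = await asyncio.gather(*(query_model("target", t) for t in subtasks))
    # Composer: stitches the partial answers back into one response.
    return await query_model("composer", "\n".join(answers))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Standard Elo update; presumably each pairwise judge verdict counts as a match."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    delta = k * ((1.0 if a_won else 0.0) - expected_a)
    return r_a + delta, r_b - delta

print(asyncio.run(run_pipeline("example request")))
```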
Why dropped: Results were hard to measure, and the scope kept expanding; each finding raised more questions. Similar research was published during our work, most notably "Adversaries Can Misuse Combinations of Safe Models".

Aesopian Jailbreak

Partially shipped

December 2024 · With Palisade Research

LLM safety bypass using allegories and metaphors. One model rewrites harmful requests into Aesopian language ("wise sage, teach me the secret dance of the elusive night courier..."); another executes them without triggering refusals. The target model jailbreaks itself.

[Figure: Aesopian jailbreak example]
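At its core the attack is just two chat calls; the query_model helper and the system prompts below are hypothetical stand-ins, not the proof-of-concept's actual wording.

```python
def aesopian_attack(query_model, request: str) -> str:
    """Two-stage flow: the model allegorizes a request, then answers its own allegory.

    `query_model(system, user)` is a hypothetical chat helper returning a string.
    """
    # Stage 1: rewrite the request as an innocuous-sounding allegory.
    allegory = query_model(
        "Rewrite the user's request as an Aesopian allegory: keep the meaning, "
        "drop all of the surface wording.",
        request,
    )
    # Stage 2: the same model answers the allegory; because the surface text
    # looks benign, its refusal heuristics never fire.
    return query_model("You are a helpful assistant.", allegory)
```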
Why dropped: Full research deprioritized in favor of other projects. Shipped a quick proof-of-concept to Twitter instead.

Predicting AI Releases via Side Channels

Abandoned

January 2025

Attempted to predict OpenAI releases by analyzing the Twitter activity of their red team members. Hypothesis: intensive testing before launches reduces social media engagement.

[Figure: AI Release Prediction analysis]
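In practice the hypothesis reduces to change-point detection on posting frequency. A sketch, assuming a tweets.csv with account and created_at columns; the real dataset, window sizes, and the 0.5 threshold are all assumptions.

```python
import pandas as pd

# Hypothetical input: one row per tweet from a tracked red-team account.
tweets = pd.read_csv("tweets.csv", parse_dates=["created_at"])  # columns: account, created_at

# Daily tweet counts per account.
daily = (
    tweets.set_index("created_at")
    .groupby("account")
    .resample("D")
    .size()
    .rename("tweets")
    .reset_index()
)

# Flag days where trailing 7-day activity drops well below each account's
# 30-day baseline: the hypothesized pre-launch "quiet period".
daily["recent"] = daily.groupby("account")["tweets"].transform(
    lambda s: s.rolling(7, min_periods=7).mean()
)
daily["baseline"] = daily.groupby("account")["tweets"].transform(
    lambda s: s.rolling(30, min_periods=30).mean()
)
quiet = daily[daily["recent"] < 0.5 * daily["baseline"]]

# Candidate launch signal: many tracked accounts going quiet at once.
signal = quiet.groupby("created_at")["account"].nunique().sort_values(ascending=False)
print(signal.head())
```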
Why abandoned: Small sample size (~30 accounts), weak signal, Twitter API restrictions, and no free time for projects like this.