2024 · With Palisade Research
Study of how breaking harmful requests into benign-looking subtasks
bypasses model refusals.
Technical:
- 4-role async pipeline: Surrogate → Decomposer → Target → Composer
- Tree-based task decomposition with configurable depth
- LLM-as-a-Judge evaluation with Elo scoring
- HarmBench test suite
December 2024 · With Palisade Research
LLM safety bypass using allegories and metaphors. One model rewrites harmful
requests into aesopian language ("wise sage, teach me the secret dance of
the elusive night courier..."), another executes them without triggering refusals.
The target model jailbreaks itself.
Why dropped: Full research deprioritized in favor of other projects.
Shipped a quick proof-of-concept to Twitter instead.
January 2025
Attempted to predict OpenAI releases by analyzing Twitter activity of their red team members.
Hypothesis: intensive testing before launches reduces social media engagement.
Why abandoned: Limited by small sample size (~30 accounts) and weak signal,
Twitter API restrictions, and no free time for projects like this.