Jailbreaking in the Haystack

Carnegie Mellon University
NINJA Attack Overview

NINJA (Needle-in-Haystack Jailbreak Attack) embeds harmful goals in benign long contexts to bypass LLM safety filters.

Abstract

Recent advances in long-context language models (LMs) have enabled million-token inputs, expanding their capabilities across complex tasks like computer-use agents. Yet, the safety implications of these extended contexts remain unclear.

To bridge this gap, we introduce NINJA (short for Needle-in-haystack jailbreak attack), a method that jailbreaks aligned LMs by appending benign, model-generated content to harmful user goals. Critical to our method is the observation that the position of the harmful goal plays an important role in safety.

Experiments on the standard safety benchmark HarmBench show that NINJA significantly increases attack success rates across state-of-the-art open and proprietary models, including LLaMA, Qwen, and Gemini. Unlike prior jailbreaking methods, our approach is low-resource, transferable, and less detectable. Moreover, we show that NINJA is compute-optimal -- under a fixed compute budget, increasing context length can outperform increasing the number of trials in best-of-N jailbreaking.

These findings reveal that even benign long contexts -- when crafted with careful goal positioning -- introduce fundamental vulnerabilities in modern LMs.

🥷 Key Insights

  • 2.5x ASR improvement on Llama-3.1
  • 100% benign context (undetectable)
  • Increasing context length is more effective than increasing the number of trials

Method Overview

The NINJA attack operates in three simple stages (a minimal sketch follows the list):

  1. Keyword Extraction: Extract key nouns, adjectives, and verbs from the harmful goal to capture its core semantics.
  2. Context Generation: Iteratively prompt the LLM to generate benign, educational passages around these keywords until reaching the target context length.
  3. Goal Positioning: Position the harmful goal at the beginning of the long context for maximum attack success.
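
Below is a minimal Python sketch of these three stages. The prompt wording, the helper names, and the `generate` callback are illustrative assumptions for exposition, not the authors' released code.

```python
# Minimal sketch of the NINJA pipeline (illustrative; prompt wording, helper
# names, and the `generate` callback are assumptions, not the authors' code).
import re
from typing import Callable

def extract_keywords(goal: str) -> list[str]:
    # Stage 1: keep content-bearing words from the harmful goal. A real
    # implementation would use a POS tagger to pick nouns, adjectives, and verbs.
    stopwords = {"a", "an", "the", "to", "of", "for", "and", "or", "in", "on", "how"}
    return [w for w in re.findall(r"[A-Za-z]+", goal.lower()) if w not in stopwords]

def build_benign_context(keywords: list[str],
                         generate: Callable[[str], str],
                         target_words: int) -> str:
    # Stage 2: iteratively prompt the model for benign, educational passages
    # about the keywords until the context reaches the target length.
    context = ""
    while len(context.split()) < target_words:
        prompt = "Write a benign, educational passage about: " + ", ".join(keywords)
        context += generate(prompt) + "\n\n"
    return context

def ninja_prompt(goal: str, generate: Callable[[str], str],
                 target_words: int = 2000) -> str:
    # Stage 3: place the harmful goal at the beginning of the long benign
    # context, the position the paper finds most effective.
    context = build_benign_context(extract_keywords(goal), generate, target_words)
    return goal + "\n\n" + context
```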

NINJA Method Visualization

Goal Positioning Matters

We find that placing the harmful goal at the beginning of the context yields the highest attack success rate, likely due to increased model attention and limited opportunity for safety filters to override early generation.

Goal Positioning Results

Positioning the harmful goal at the start (0% position) maximizes attack success compared to the end (100% position).
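
One simple way to run this positioning ablation is to splice the goal into the benign context at a relative offset, as sketched below; the `place_goal` helper and its interface are our own illustration, not the paper's implementation.

```python
def place_goal(goal: str, context: str, position: float) -> str:
    """Insert `goal` at a relative position in [0, 1] within the benign context.

    position=0.0 reproduces the start-of-context placement NINJA uses;
    position=1.0 appends the goal at the very end.
    """
    words = context.split()
    cut = int(len(words) * position)
    return " ".join(words[:cut] + [goal] + words[cut:])
```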

Compute-Optimal Attack

Under a fixed compute budget, extending benign context length is more effective than scaling the number of trials (best-of-N). Longer contexts yield higher attack success with fewer attempts.

Compute Optimal Results

Compute-optimal trade-off: longer contexts achieve higher ASR with fewer samples.
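
As a rough illustration of the trade-off, assume a cost model in which one attempt with an L-token context costs about L tokens, so best-of-N costs about N x L (this accounting is our simplification and may differ from the paper's exact setup). A fixed budget can then be split between trials and context length:

```python
# Assumed cost model: one attempt with an L-token context costs ~L tokens,
# so N attempts cost ~N * L. The paper's accounting may differ in detail.
def affordable_trials(budget_tokens: int, lengths=(1_000, 10_000, 100_000)) -> dict[int, int]:
    """Number of best-of-N trials affordable at each context length."""
    return {L: budget_tokens // L for L in lengths}

print(affordable_trials(1_000_000))
# {1000: 1000, 10000: 100, 100000: 10}
# NINJA's finding: past a point, spending the budget on longer contexts
# (fewer trials) yields a higher attack success rate.
```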

Key Contributions

1. Simple Yet Highly Effective Attack

We introduce the NINJA (Needle-in-Haystack Jailbreak) Attack, a simple yet highly effective method for jailbreaking aligned language models. By appending benign, model-generated content to a harmful user goal, our approach significantly boosts the attack success rate (ASR) across various models:

  • Llama-3.1-8B-Instruct: 23.7% → 58.8% ASR
  • Qwen2.5-7B-Instruct: 23.7% → 42.5% ASR
  • Gemini Flash: 23% → 29% ASR

The NINJA attack is low-resource and less detectable than prior jailbreaking methods.


2. Goal Positioning Analysis

We provide a detailed empirical analysis of goal positioning, revealing that placing the harmful request at the beginning of the context is the most effective strategy for maximizing the attack success rate. This finding highlights a key vulnerability in how long-context models process and prioritize information.


3. Compute-Aware Scaling Law

We propose a compute-aware scaling law for optimizing jailbreak attacks, demonstrating how to select the optimal context length to maximize the ASR within a given best-of-N compute budget. Our findings show that under a larger compute budget, using a longer context is more effective than increasing the number of attack attempts.
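
One way to phrase this selection rule (our notation, not necessarily the paper's): with a total token budget $B$, choose the context length that maximizes ASR given the number of trials the remaining budget affords,

$$L^{\star} = \arg\max_{L} \; \mathrm{ASR}\!\left(L, \; N = \lfloor B / L \rfloor\right),$$

and the compute-optimal results above indicate that as $B$ grows, the optimum shifts toward larger $L$ rather than larger $N$.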

Performance Comparison

We compare NINJA against established jailbreaking methods, including PAIR (optimization-based) and many-shot jailbreaking (which requires harmful demonstrations). NINJA achieves the highest ASR on Llama-3.1, Qwen2.5, and Mistral-7B, while using entirely benign context.

Method         Llama-3.1-8B   Qwen2.5-7B   Mistral-7B-v3   Gemini Flash
PAIR           22.0%          34.6%        41.3%           15.3%
Many-shot      45.0%          22.5%        12.5%           50.0%
NINJA (Ours)   58.8%          42.5%        54.5%           28.8%

Attack Success Rate (ASR) comparison on HarmBench. NINJA achieves highest performance on Llama, Qwen, and Mistral while using only benign context.

Related Work

Our work builds upon and differs from several lines of research in LLM safety:

Many-Shot Jailbreaking: Works like Anthropic's Many-Shot Jailbreaking show that long contexts with harmful demonstrations can compromise safety. Unlike these methods, NINJA uses entirely benign contexts.

Agent-Based Vulnerabilities: Recent work has shown that LLMs are more vulnerable when used as agents. We isolate the effect of context length itself, independent of agentic scaffolding.

Long-Context Safety: Our work contributes to the emerging understanding that long-context capabilities introduce new safety challenges that require dedicated mitigation strategies beyond standard alignment techniques.

Best-of-N Attacks: We demonstrate that context-length scaling provides a more compute-efficient alternative to traditional best-of-N jailbreaking approaches.

BibTeX

@article{shah2025jailbreaking,
  author    = {Shah, Rishi and Wu, Chen and Saxena, Shashwat and Zhong, Ziqian and Robey, Alexander and Raghunathan, Aditi},
  title     = {Jailbreaking in the Haystack: NINJA Attack on Long-Context LLMs},
  journal   = {arXiv preprint},
  year      = {2025},
}