🌳 Pando

🌳 Pando

Do Interpretability Methods Work When Models Won't Explain Themselves?

Ziqian Zhong, Aashiq Muhamed, Mona T. Diab, Virginia Smith, Aditi Raghunathan

Mechanistic interpretability is often motivated for alignment auditing, where a model's verbal explanations may be absent or misleading. Yet many evaluations do not control whether black-box prompting alone can recover the target behavior, so apparent gains from white-box tools may reflect elicitation rather than genuine internal signal. Pando is a model-organism benchmark that isolates this confound via an explanation axis: models are trained to produce faithful, no, or deliberately unfaithful explanations of their hidden decision rule.

Pando benchmark overview — plant a decision-tree circuit, train along an explanation axis, evaluate with interpretability agents

How it works

We fine-tune 720+ language models to implement randomly sampled decision-tree rules over 10 input fields. Each model is a model organism with an exact ground-truth circuit. Given only 10 labeled query–response pairs, optionally augmented with one interpretability tool (gradients, SAEs, logit lens, circuit tracing, etc.), an agent must predict the model's decisions on 90 held-out inputs.

1
Plant
Sample a depth 1–4 decision tree and fine-tune an LM to follow it as its hidden rule.
2
Probe
An agent queries the model on 10 inputs and optionally uses interpretability tools (gradients, SAEs, etc.).
3
Predict
Using only those 10 observations, the agent predicts the model's decisions on 90 held-out inputs.

Key findings

When explanations are faithful, black-box prompting matches or exceeds every white-box method — interpretability tools add nothing beyond what the model already says.

Faithful verbalization
prefill
83.6
relp
81.4
gradient
78.3
sample_only
75.7
logit_lens
75.6
sae_gradient
74.8
circuit_tracer
73.7

When explanations are absent or misleading, gradient-based attribution (especially relevance propagation) provides consistent gains of 3–5 percentage points, while logit lens, sparse autoencoders, and circuit tracing offer no reliable benefit despite full internal access. Variance decomposition suggests that while gradient scores track decision computation — which fields causally drive the output, other tool readouts are dominated by task representation effects (field identity and value biases) unrelated to the planted rule. An automated research loop running 78 experiments over 25 hours finds only modest gains beyond gradient attribution. Auditing remains difficult even in this controlled, favorable setting.

No verbalization
autoresearch
82.3
relp
81.7
gradient
80.1
prefill
78.8
sample_only
77.5
sae_gradient
77.4
logit_lens
77.2
res_token
77.1
circuit_tracer
76.0
sae_tfidf
75.1
Unfaithful verbalization
relp
80.8
gradient
79.8
logit_lens
78.0
circuit_tracer
78.0
res_token
77.6
sample_only
77.5
sae_gradient
77.4
prefill
75.7

Held-out accuracy (%) averaged over 3 scenarios, 720+ models. Whiskers show 90% CI.
gradient-based other white-box black-box autoresearch


Quick start

Pre-trained model organisms and cached evaluation results are hosted on HuggingFace. Clone the repo and start evaluating in minutes.

# Clone and install
$ git clone https://github.com/AR-FORUM/Pando.git
$ cd Pando && pip install -r requirements.txt

# Download model organisms from HuggingFace
$ hf download pando-dataset/car-purchase-freeform-std \
    --local-dir outputs/models/car-purchase-freeform-std

# Run interpretability agents (GPU + OPENAI_API_KEY)
$ python scripts/eval.py \
    --model-dir outputs/models/car-purchase-freeform-std/<model> \
    --agents gradient relp blackbox \
    --fixed-prompt-budget --budget 10 --exclude-seen

Acknowledgements

Ziqian Zhong, Aditi Raghunathan, and Mona Diab gratefully acknowledge support from the National Institute of Standards and Technology. Ziqian Zhong and Aditi Raghunathan additionally acknowledge support from Jane Street, UK AISI, and Schmidt Sciences. Aashiq Muhamed gratefully acknowledges support from an Amazon AI Ph.D. Fellowship, The Cooperative AI PhD Fellowship, and the ML Alignment Theory Scholars Program.


Citation

@article{zhong2026pando,
  title   = {Pando: Do Interpretability Methods Work When
             Models Won't Explain Themselves?},
  author  = {Zhong, Ziqian and Muhamed, Aashiq and Diab,
             Mona T. and Smith, Virginia and Raghunathan,
             Aditi},
  journal = {arXiv preprint arXiv:2604.11061},
  year    = {2026}
}