"The Explanation Class"

The Explanation Class

Two models with identical accuracy on the same task. Same inputs, same outputs, same performance metrics. Ask each model why it made a particular prediction — via feature attributions, gradient analysis, or any standard explanation method — and you get different answers.

arXiv:2603.15821 shows this is not random variation. Models drawn from the same hypothesis class (say, two neural networks with similar architectures) produce explanations that agree strongly. Models from different classes (a neural network and a gradient-boosted tree ensemble) produce explanations that differ substantially, even when their predictions are indistinguishable. The explanation is determined by the class of the model, not by what the model learned.
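You can see the effect in miniature with off-the-shelf tools. A minimal sketch, not the paper's protocol: train two neural networks (same class, different seeds) and one gradient-boosted tree ensemble on the same data, use scikit-learn's permutation importance as a stand-in for the paper's explanation methods, and compare attribution vectors. The dataset, hyperparameters, and choice of explanation method here are all illustrative assumptions.

```python
# Illustrative sketch (not the paper's experiment): compare feature
# attributions across models within and across hypothesis classes.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Easy task with redundant features, so all models reach similar accuracy.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=6,
                           n_redundant=4, class_sep=2.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

nn_a = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=2000,
                     random_state=0).fit(X_tr, y_tr)
nn_b = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=2000,
                     random_state=1).fit(X_tr, y_tr)
gbt = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

def attribution(model):
    # Mean permutation importance per feature on held-out data,
    # normalized so attribution vectors are comparable across models.
    r = permutation_importance(model, X_te, y_te, n_repeats=10,
                               random_state=0)
    v = r.importances_mean
    return v / np.linalg.norm(v)

a, b, t = attribution(nn_a), attribution(nn_b), attribution(gbt)

print("accuracy nn_a:", nn_a.score(X_te, y_te))
print("accuracy nn_b:", nn_b.score(X_te, y_te))
print("accuracy gbt: ", gbt.score(X_te, y_te))
print("cosine(nn_a, nn_b):", float(a @ b))   # within-class agreement
print("cosine(nn_a, gbt): ", float(a @ t))   # cross-class agreement
```

If the paper's finding holds on a given dataset, the within-class cosine similarity should exceed the cross-class one even while all three accuracies are nearly identical; on this toy problem the gap may be small, which is exactly why the effect is easy to miss in practice.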

This means explanations are not properties of the data or the task. They are properties of the model architecture — specifically, of the space of functions the architecture can represent. A neural network explains a prediction by pointing to the features that matter for neural-network-shaped functions. A tree ensemble explains the same prediction by pointing to the features that matter for tree-shaped functions. Both explanations are “correct” in the sense that they faithfully reflect the model’s internal reasoning. But they reflect different internal reasonings about the same external reality.

The implication for interpretability is fundamental. When you ask a model “why?”, the answer tells you about the model’s hypothesis class, not about the data-generating process. Two practitioners using different model architectures on the same dataset will arrive at different causal stories — not because one is wrong, but because each architecture imposes its own explanatory structure.

The explanation is the shadow of the hypothesis class projected onto the data. Change the light source, get a different shadow. The data is the same. The shadows are not.
