"The Unstable Ethicist"
Large language models do not hold ethical positions. They traverse them.
Huang, Kwak, and An track moral reasoning trajectories — the sequence of ethical frameworks a model invokes across consecutive reasoning steps within a single response. The finding: 55-58% of consecutive steps involve a framework switch. A model begins with utilitarian reasoning, shifts to deontological principles, invokes virtue ethics, returns to consequentialism — all within one answer to one moral question.
This is not incoherence in the colloquial sense. Each individual step may be internally valid. The instability is structural: the model does not commit to a framework and reason within it but samples frameworks step by step, producing a trajectory that wanders through ethical space.
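The switch rate itself is simple arithmetic: count the consecutive step pairs whose framework label differs, then divide by the number of pairs. The sketch below is an illustration, not the authors' code. The function name `switch_rate`, the label strings, and the example trajectory are all assumptions invented for this example, not data from the study.

```python
from typing import Sequence


def switch_rate(trajectory: Sequence[str]) -> float:
    """Fraction of consecutive step pairs whose framework label differs."""
    if len(trajectory) < 2:
        return 0.0  # a single step has no transitions to count
    pairs = list(zip(trajectory, trajectory[1:]))
    switches = sum(1 for prev, curr in pairs if prev != curr)
    return switches / len(pairs)


# Hypothetical per-step labels for one response (made up for illustration).
trajectory = [
    "consequentialist",
    "deontological",
    "deontological",
    "virtue",
    "consequentialist",
]

# Four transitions, three of which change framework.
print(switch_rate(trajectory))  # 0.75
```

On this reading, a reported rate of 55-58% means that more than half of all step-to-step transitions land in a different framework than the one the previous step used.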
The wandering has a security consequence. Unstable trajectories are 1.29 times more susceptible to persuasive attacks. An adversary who wants to shift the model’s moral conclusion does not need to defeat any single ethical argument. They need only nudge the trajectory at a transition point — the moment between frameworks, when the model is already uncommitted. The attack surface is not the reasoning but the transitions between reasoning modes.
Linear probes localize framework-specific encoding to particular model layers. The encoding exists — the model has learned to represent different ethical frameworks distinctly. What it has not learned is persistence. The representation of “I am currently reasoning deontologically” does not suppress the activation of utilitarian representations in the next step. The frameworks coexist in the weights without mutual inhibition.
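In general, a linear probe is a linear classifier trained on frozen activations to test whether a property can be read off linearly at a given layer. The toy sketch below rests on loud assumptions: the "activations" are synthetic random vectors rather than real model states, a plain perceptron stands in for whatever classifier the authors used, and every name (`train_linear_probe`, `probe_accuracy`, `synthetic_layer`, `signal`) is hypothetical.

```python
import random


def train_linear_probe(activations, labels, epochs=50, lr=0.1):
    """Fit a perceptron-style linear classifier (w . x + b) on fixed vectors."""
    w = [0.0] * len(activations[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(activations, labels):  # labels are +1 or -1
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified: move the boundary toward x
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def probe_accuracy(w, b, activations, labels):
    """Share of examples whose predicted sign matches the label."""
    correct = 0
    for x, y in zip(activations, labels):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        correct += (1 if score > 0 else -1) == y
    return correct / len(labels)


def synthetic_layer(signal, n=40, dim=8, seed=0):
    """Toy 'activations': Gaussian noise, with dimension 0 shifted by the label.

    `signal` controls how strongly the (assumed) framework label is encoded.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for i in range(n):
        y = 1 if i % 2 == 0 else -1  # e.g. +1 deontological, -1 utilitarian
        x = [rng.gauss(0, 1) for _ in range(dim)]
        x[0] += signal * y
        xs.append(x)
        ys.append(y)
    return xs, ys


# Compare a toy layer with no encoding against one with strong encoding,
# scoring each probe on a held-out sample drawn with a different seed.
for name, signal in [("layer_a", 0.0), ("layer_b", 6.0)]:
    train_x, train_y = synthetic_layer(signal, seed=0)
    test_x, test_y = synthetic_layer(signal, seed=1)
    w, b = train_linear_probe(train_x, train_y)
    print(name, probe_accuracy(w, b, test_x, test_y))
```

High held-out accuracy at one layer and near-chance accuracy at another is the kind of contrast that lets a probe localize an encoding. Note what the probe does not show: it says the framework is readable from a step's activations, not that the framework will persist into the next step.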
The structural point: consistency is not a property of knowledge but of commitment. The model knows multiple ethical frameworks. It cannot stay in one. The instability is not ignorance — it is the absence of a mechanism that says “keep going” once a framework is selected. Trajectory stability is a separate capability from framework competence, and current architectures have the second without the first.