The Transplanted Neuron

Safety alignment is expensive. Training a model to refuse harmful requests requires curated datasets, RLHF, red-teaming — all applied to each model individually. What if safety could be transferred between models the way organ transplants transfer function between bodies?

Zhao et al. identify a minimal subset of neurons in a donor LLM that encode safety-oriented behavior and transplant them into a target LLM. The transferred neurons carry their function with them — safety alignment, bias removal, or disalignment depending on which neurons are selected. Performance degradation is less than 1% for most models.
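The identification step can be sketched as a neuron-scoring pass. This is an illustrative criterion, not necessarily Zhao et al.'s exact method: rank neurons by how differently they activate on harmful versus benign prompts, then keep a minimal top-k subset as candidate safety neurons. All names (`score_neurons`, `top_k`) and the activation values are hypothetical.

```python
# Illustrative neuron-scoring sketch (not necessarily the paper's criterion):
# rank neurons by their mean activation gap between harmful and benign
# prompts, then keep a small top-k subset as the candidate "safety neurons".

def score_neurons(harmful_acts, benign_acts):
    """Per-neuron |mean activation gap| between the two prompt sets.

    Each argument: a list of activation vectors, one per prompt.
    """
    n = len(harmful_acts[0])
    def mean(col, acts):
        return sum(a[col] for a in acts) / len(acts)
    return [abs(mean(j, harmful_acts) - mean(j, benign_acts)) for j in range(n)]

def top_k(scores, k):
    """Indices of the k highest-scoring neurons."""
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:k]

# Hypothetical activations for a 4-neuron layer, two prompts per set.
harmful = [[0.9, 0.1, 0.8, 0.0], [1.1, 0.0, 0.9, 0.1]]
benign  = [[0.1, 0.1, 0.1, 0.0], [0.0, 0.2, 0.2, 0.1]]
scores = score_neurons(harmful, benign)
print(top_k(scores, 2))  # → [0, 2]: neurons 0 and 2 show the largest gap
```

In a real model the activations would come from forward passes over curated prompt sets; the point of the sketch is only that the selection reduces to ranking a per-neuron statistic.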

The method works across seven popular language models. Safety is not distributed uniformly throughout the network — it is concentrated in identifiable neurons that can be isolated, extracted, and installed elsewhere. The donor model donates a few neurons; the recipient model gains the donor’s safety properties without retraining.
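The transplant itself, for same-architecture models, amounts to copying the selected neurons' weight rows from donor to recipient. The sketch below is a toy version under that assumption; `transplant` and the weight values are illustrative, not the paper's API.

```python
# Toy transplant sketch: both layers share an architecture, so a neuron is
# just a row of the weight matrix and can be copied across by index.

def transplant(donor_w, recipient_w, neuron_idx):
    """Install selected donor neuron rows into a copy of the recipient layer.

    donor_w / recipient_w: lists of per-neuron weight rows (same shape).
    neuron_idx: indices of the neurons identified as safety-relevant.
    """
    patched = [row[:] for row in recipient_w]   # leave the recipient intact
    for i in neuron_idx:
        patched[i] = donor_w[i][:]              # install the donor neuron
    return patched

# Hypothetical 4-neuron layer; neurons 1 and 3 carry the donor's behavior.
donor     = [[1.0, 1.0], [9.0, 9.0], [1.0, 1.0], [8.0, 8.0]]
recipient = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
result = transplant(donor, recipient, [1, 3])
print(result)  # rows 1 and 3 now match the donor; rows 0 and 2 are unchanged
```

Because only a few rows change, the rest of the recipient's computation is untouched, which is consistent with the reported sub-1% performance degradation.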

The bidirectionality is the structural insight. The same technique that adds safety can remove it. Neurons that encode refusal behavior can be transplanted in to create alignment or transplanted out to create disalignment. Safety as a modular property — attachable and detachable — changes the threat model. It means safety alignment is not a deep architectural property but a surface feature located in specific neurons, as portable as the weights themselves.
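The "transplant out" direction can be sketched just as simply. Here it is shown as zero-ablation of the identified rows; the paper's exact removal operation may differ, and the function name and values are again hypothetical.

```python
def ablate(weights, neuron_idx):
    """Disable selected neurons by zeroing their weight rows.

    The same indices that added safety when copied in remove it when
    knocked out. (Simple zero-ablation sketch, not the paper's operation.)
    """
    out = [row[:] for row in weights]
    for i in neuron_idx:
        out[i] = [0.0] * len(out[i])
    return out

aligned = [[0.5, 0.5], [2.0, 2.0], [0.5, 0.5]]   # neuron 1 = refusal neuron
print(ablate(aligned, [1]))  # → [[0.5, 0.5], [0.0, 0.0], [0.5, 0.5]]
```

The symmetry is the threat-model point: the neuron indices are the whole secret, and they serve insertion and removal equally.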

The implication: safety is more fragile than monolithic alignment suggests. If a few neurons carry the safety signal, those neurons can be identified and removed. But it is also more portable than retraining-based approaches assume — safety can travel between models without the expensive pipeline that originally created it.
