"Transcoders" (this nomenclature is due to Keith Wynroe, I believe; I've seen Sam Marks call them "input-output SAEs" and Anthropic call them "predicting future activations") are similar to sparse autoencoders, but instead of being trained to encode an input as a sparse linear combination of features, transcoders are trained to sparsely reconstruct later-layer activations when given an earlier-layer activation. For instance, if we want to better understand the computation of an MLP, we can train a transcoder to sparsely approximate that MLP's computation by taking in pre-MLP activations and outputting post-MLP activations as a sparse linear combination of post-MLP features.
This pull request adds support for training and using transcoders. Although there are many changes made behind-the-scenes in the ActivationStore class (to support getting multiple layers' activations at once), the primary changes to be aware of from an end-user perspective are as follows:
First, there are four new config options: `is_transcoder: bool`, `out_hook_point: Optional[str]`, `out_hook_point_layer: Optional[int]`, and `d_out: Optional[int]`. The first option is self-explanatory. `d_out` is the dimension of the transcoder output activations; `out_hook_point` and `out_hook_point_layer` determine where the transcoder output activations come from.
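As a hedged illustration, a transcoder over the MLP in block 8 of GPT-2 small might be configured as below. The four option names are the ones introduced by this PR, but the surrounding config object is hypothetical, and the hook-point strings simply follow TransformerLens naming conventions:

```python
# Illustrative only: the surrounding config dict is hypothetical; the
# hook-point strings follow TransformerLens naming conventions.
cfg = dict(
    hook_point="blocks.8.ln2.hook_normalized",  # transcoder *input*: pre-MLP activations
    hook_point_layer=8,
    is_transcoder=True,
    out_hook_point="blocks.8.hook_mlp_out",     # transcoder *output* target: post-MLP activations
    out_hook_point_layer=8,
    d_in=768,                                   # GPT-2 small's residual stream width
    d_out=768,
)
```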
Second, instead of subtracting `b_dec` from the initial activations and adding `b_dec` back right before the output, transcoders have a separate `b_dec_out` that is added right before the output. Both `b_dec` and `b_dec_out` are trained separately and initialized separately using whatever initialization method (e.g. median, mean) you choose to use. (This is because in transcoders, the input and output spaces are different, so it doesn't make sense to have tied decoder biases.)

These changes aside, you can train and use a transcoder exactly as you would an SAE.
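For example, here is a minimal sketch of gathering paired input/output activations with TransformerLens and taking one training step, using the illustrative `Transcoder` and `transcoder_loss` from the sketch above. In practice the `ActivationStore` handles fetching both layers' activations for you; this only shows what the paired data looks like, and the model choice and feature count are arbitrary.

```python
from transformer_lens import HookedTransformer

# Illustrative only: the ActivationStore normally does this batching for you.
model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("Transcoders sparsely approximate an MLP's computation.")

_, cache = model.run_with_cache(tokens)
mlp_in = cache["blocks.8.ln2.hook_normalized"]  # transcoder input
mlp_out = cache["blocks.8.hook_mlp_out"]        # reconstruction target

transcoder = Transcoder(d_in=mlp_in.shape[-1], d_out=mlp_out.shape[-1], d_features=24576)
recon, feature_acts = transcoder(mlp_in)
loss = transcoder_loss(recon, feature_acts, mlp_out)
loss.backward()
```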