
Add support for Transcoders #7

Closed
wants to merge 7 commits into from

Conversation

jacobdunefsky

"Transcoders" (this nomenclature is due to Keith Wynroe, I believe; I've seen Sam Marks call them "input-output SAEs" and Anthropic call them "predicting future activations") are similar to sparse autoencoders, but instead of being trained to encode an input as a sparse linear combination of features, transcoders are trained to sparsely reconstruct later-layer activations when given an earlier-layer activation. For instance, if we want to better understand the computation of an MLP, we can train a transcoder to sparsely approximate that MLP's computation by taking in pre-MLP activations and outputting post-MLP activations as a sparse linear combination of post-MLP features.
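To make the input/output asymmetry concrete, here is a minimal numpy sketch of a transcoder forward pass (illustrative only; the dimensions, weight names, and ReLU encoder are assumptions in the style of a standard SAE, not taken from this PR's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, d_hidden = 8, 12, 32  # input and output dims may differ

W_enc = rng.normal(size=(d_in, d_hidden))
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(size=(d_hidden, d_out))
b_dec = np.zeros(d_in)       # subtracted from the input activations
b_dec_out = np.zeros(d_out)  # separate, untied output bias

def transcoder_forward(x):
    # Encode pre-MLP activations into a sparse set of features.
    feats = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)  # ReLU
    # Decode into the *output* space (e.g. post-MLP activations),
    # rather than reconstructing the input as an SAE would.
    return feats @ W_dec + b_dec_out, feats

x = rng.normal(size=d_in)
y_hat, feats = transcoder_forward(x)
```

The only structural difference from an SAE is that `W_dec` maps into a different space than the encoder reads from, which is also why the decoder bias cannot be tied (see below).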

This pull request adds support for training and using transcoders. Although there are many changes made behind the scenes in the ActivationStore class (to support getting multiple layers' activations at once), the primary changes to be aware of from an end-user perspective are as follows:

  • Configs now support the transcoder-specific options is_transcoder: bool, out_hook_point: Optional[str], out_hook_point_layer: Optional[int], and d_out: Optional[int]. The first option is self-explanatory. d_out is the dimension of the transcoder output activations. out_hook_point and out_hook_point_layer determine where the transcoder output activations come from.
  • The transcoder architecture, in contrast with the usual SAE architecture, doesn't use tied decoder biases. That is, instead of subtracting b_dec from the initial activations and adding b_dec right before the output, transcoders have a separate b_dec_out that is added right before the output. Both b_dec and b_dec_out are trained separately and initialized separately using whatever initialization method (e.g. median, mean) you choose to use. (This is because in transcoders, the input and output spaces are different, so it doesn't make sense to have tied decoder biases.)
  • Currently, training transcoders on specific attention heads and training transcoders with cached activations are unsupported. Additionally, I think that the current code for evaluating cross-entropy loss with an SAE doesn't work with transcoders. Sorry!
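As a concrete example, a config for a transcoder approximating layer 8's MLP might look like the following. The four transcoder-specific keys are the ones named above; the hook point strings follow TransformerLens naming conventions, and all other values are illustrative:

```python
# Hypothetical config fragment; only the four transcoder-specific
# options come from this PR, the rest is illustrative.
cfg = {
    "hook_point": "blocks.8.ln2.hook_normalized",  # transcoder input: pre-MLP
    "hook_point_layer": 8,
    "d_in": 768,
    # Transcoder-specific options added by this PR:
    "is_transcoder": True,
    "out_hook_point": "blocks.8.hook_mlp_out",     # transcoder output: post-MLP
    "out_hook_point_layer": 8,
    "d_out": 768,
}
```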

These changes aside, you can train and use a transcoder exactly as you would an SAE.
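Presumably the training objective mirrors the usual SAE loss as well, with the reconstruction term targeting the later-layer activations instead of the input. A rough sketch of that assumed objective (not taken from the PR code):

```python
import numpy as np

# Assumed SAE-style objective: MSE against the *target* (e.g. post-MLP)
# activations, plus an L1 sparsity penalty on the feature activations.
def transcoder_loss(y_hat, y_target, feats, l1_coeff=1e-3):
    mse = np.mean((y_hat - y_target) ** 2)     # reconstruct later-layer acts
    sparsity = l1_coeff * np.abs(feats).sum()  # encourage sparse feature use
    return mse + sparsity

loss = transcoder_loss(np.ones(4), np.zeros(4), np.array([0.5, 0.0]))
```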

@jbloomAus
Owner

@jacobdunefsky Thanks so much! I'm not accepting this PR immediately, but I'm really excited about it and will try to get around to getting it in shortly. The main requirements for a merge are:

  1. Unit tests. Moving forward, I'm hoping to have unit tests around most functionality / changes.
  2. Benchmarks. Once a PR is accepted, the default will be to assume the code works well, but it's not obvious it's working until we have some results. Something like: good CE recovered + low L0 + decent feature density histograms + some dashboards for random features that look clean.

Really appreciate this though!
