
Add support for Transcoders #7

Closed
wants to merge 7 commits into from

Conversation

jacobdunefsky

"Transcoders" (this nomenclature is due to Keith Wynroe, I believe; I've seen Sam Marks call them "input-output SAEs" and Anthropic call them "predicting future activations") are similar to sparse autoencoders, but instead of being trained to encode an input as a sparse linear combination of features, transcoders are trained to sparsely reconstruct later-layer activations when given an earlier-layer activation. For instance, if we want to better understand the computation of an MLP, we can train a transcoder to sparsely approximate that MLP's computation by taking in pre-MLP activations and outputting post-MLP activations as a sparse linear combination of post-MLP features.
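To make the input/output asymmetry concrete, here is a minimal numpy sketch of a transcoder forward pass (illustrative only; the dimensions, weight names, and ReLU encoder are assumptions in the style of a standard SAE, not taken from this PR's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, d_hidden = 8, 12, 32  # input and output dims may differ

W_enc = rng.normal(size=(d_in, d_hidden))
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(size=(d_hidden, d_out))
b_dec = np.zeros(d_in)       # subtracted from the input activations
b_dec_out = np.zeros(d_out)  # separate, untied output bias

def transcoder_forward(x):
    # Encode pre-MLP activations into a sparse set of features.
    feats = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)  # ReLU
    # Decode into the *output* space (e.g. post-MLP activations),
    # rather than reconstructing the input as an SAE would.
    return feats @ W_dec + b_dec_out, feats

x = rng.normal(size=d_in)
y_hat, feats = transcoder_forward(x)
```

The only structural difference from an SAE is that `W_dec` maps into a different space than the encoder reads from, which is also why the decoder bias cannot be tied (see below).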

This pull request adds support for training and using transcoders. Although there are many changes made behind the scenes in the ActivationStore class (to support getting multiple layers' activations at once), the primary changes to be aware of from an end-user perspective are as follows:

  • Configs now support the transcoder-specific options is_transcoder: bool, out_hook_point: Optional[str], out_hook_point_layer: Optional[int], and d_out: Optional[int]. The first option is self-explanatory. d_out is the dimension of the transcoder output activations. out_hook_point and out_hook_point_layer determine where the transcoder output activations come from.
  • The transcoder architecture, in contrast with the usual SAE architecture, doesn't use tied decoder biases. That is, instead of subtracting b_dec from the initial activations and adding b_dec right before the output, transcoders have a separate b_dec_out that is added right before the output. Both b_dec and b_dec_out are trained separately and initialized separately using whatever initialization method (e.g. median, mean) you choose to use. (This is because in transcoders, the input and output spaces are different, so it doesn't make sense to have tied decoder biases.)
  • Currently, training transcoders on specific attention heads and training transcoders with cached activations are unsupported. Additionally, I think that the current code for evaluating cross-entropy loss with an SAE doesn't work with transcoders. Sorry!
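As a concrete example, a config for a transcoder approximating layer 8's MLP might look like the following. The four transcoder-specific keys are the ones named above; the hook point strings follow TransformerLens naming conventions, and all other values are illustrative:

```python
# Hypothetical config fragment; only the four transcoder-specific
# options come from this PR, the rest is illustrative.
cfg = {
    "hook_point": "blocks.8.ln2.hook_normalized",  # transcoder input: pre-MLP
    "hook_point_layer": 8,
    "d_in": 768,
    # Transcoder-specific options added by this PR:
    "is_transcoder": True,
    "out_hook_point": "blocks.8.hook_mlp_out",     # transcoder output: post-MLP
    "out_hook_point_layer": 8,
    "d_out": 768,
}
```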

These changes aside, you can train and use a transcoder exactly as you would an SAE.
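Presumably the training objective mirrors the usual SAE loss as well, with the reconstruction term targeting the later-layer activations instead of the input. A rough sketch of that assumed objective (not taken from the PR code):

```python
import numpy as np

# Assumed SAE-style objective: MSE against the *target* (e.g. post-MLP)
# activations, plus an L1 sparsity penalty on the feature activations.
def transcoder_loss(y_hat, y_target, feats, l1_coeff=1e-3):
    mse = np.mean((y_hat - y_target) ** 2)     # reconstruct later-layer acts
    sparsity = l1_coeff * np.abs(feats).sum()  # encourage sparse feature use
    return mse + sparsity

loss = transcoder_loss(np.ones(4), np.zeros(4), np.array([0.5, 0.0]))
```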

@jbloomAus
Owner

@jacobdunefsky Thanks so much! I'm not accepting this PR immediately, but I'm really excited about it and will try to get around to getting it in shortly. The main requirements for a merge are:

  1. Unit tests. Moving forward, I'm hoping to have unit tests around most functionality / changes.
  2. Benchmarks. Once a PR is accepted, the default will be to assume the code works well, but it's not obvious it's working until we have some results. Something like: good CE recovered + low L0 + decent feature density histograms + some dashboards for random features that look clean.

Really appreciate this though!
