(Draft) Add DLA function to utils #466

Open
wants to merge 3 commits into
base: dev

VasilGeorgiev39
Contributor

Description

Direct logit attribution (DLA) is usually the first step we take in a new exploration. I think it would be nice to have a common function that does it in a single step.
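For readers new to the technique, here is a minimal sketch of what such a single-step function computes. It is a toy in plain Python, not the code from this PR: the function names, the two-token vocabulary and all numbers are made up for illustration, and the final LayerNorm is ignored.

```python
# Toy direct logit attribution (DLA) in plain Python.
# All names and numbers are illustrative; none of this is TransformerLens API.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def logit_diff_direction(unembed, correct, incorrect):
    """Residual-stream direction whose dot product with a vector gives
    logit(correct) - logit(incorrect). `unembed[t]` is token t's unembedding vector."""
    return [c - i for c, i in zip(unembed[correct], unembed[incorrect])]

def direct_logit_attribution(components, direction):
    """Map each component name to its contribution along `direction`."""
    return {name: dot(vec, direction) for name, vec in components.items()}

# d_model = 3, vocabulary of two tokens (hypothetical values).
unembed = {"Paris": [1.0, 0.0, 2.0], "Rome": [0.0, 1.0, 1.0]}

# Output each component writes into the residual stream at the final position.
components = {
    "embed": [0.5, 0.5, 0.0],
    "L0.attn": [1.0, -1.0, 0.5],
    "L0.mlp": [0.0, 0.25, 1.0],
}

direction = logit_diff_direction(unembed, "Paris", "Rome")
attribution = direct_logit_attribution(components, direction)

# The residual stream is the sum of the component outputs, so the
# per-component attributions add up to the total logit difference.
residual = [sum(vals) for vals in zip(*components.values())]
total_logit_diff = dot(residual, unembed["Paris"]) - dot(residual, unembed["Rome"])
```

In a real model the component vectors would come from the activation cache and the direction from the unembedding matrix; the arithmetic stays the same.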

Let me know if you think this does not generalize well enough or if you have other concerns.

Not sure if Utils is the right place for it though. Maybe we can create a new module that will hold the mech interp toolkit?

If it looks good I'll write tests and stuff.

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Checklist:

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have not rewritten tests relating to key interfaces which would affect backward compatibility

@alan-cooney
Collaborator

Thanks for starting on this. It seems useful, and I agree that it should be in its own file (probably just for DLA, as it'll become quite large once it's fully documented etc.).

I also agree that it's probably worth expanding this a bit so it works more generally. Specifically, you can break DLA down recursively, e.g. by attention layer -> attention head -> source layer -> source component... It would be nice to have this as well.
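One possible reading of the recursive breakdown, sketched as a toy in plain Python: because DLA is linear, a parent's attribution is the sum of its children's, so any nesting can be handled by the same function. The tree layout, names and numbers below are hypothetical, and only the layer -> head levels are shown. The thread does not pin down what the deeper levels should contain.

```python
# Recursive DLA over a nested decomposition (illustrative toy, not library code).

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attribute(node, direction):
    """Return (total, breakdown) for a node.

    A leaf is a residual-space vector (list of floats); an inner node is a
    dict mapping child names to nodes. By linearity, a parent's attribution
    is the sum of its children's attributions.
    """
    if isinstance(node, dict):
        breakdown = {name: attribute(child, direction) for name, child in node.items()}
        total = sum(child_total for child_total, _ in breakdown.values())
        return total, breakdown
    return dot(node, direction), None

# Logit-difference direction in residual space (hypothetical values).
direction = [1.0, -1.0, 1.0]

# Layer -> head -> output vector. Deeper levels would just be more nesting.
tree = {
    "L0.attn": {"head0": [1.0, 0.0, 0.0], "head1": [0.0, 1.0, 0.5]},
    "L1.attn": {"head0": [0.5, 0.5, 2.0], "head1": [0.0, 0.0, -1.0]},
}

total, by_layer = attribute(tree, direction)
```

Here `by_layer["L0.attn"]` holds that layer's total together with a per-head breakdown of the same shape, so a caller can drill down one level at a time.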

Hope that makes sense and if you are unsure about how to abstract more I'm happy to have a chat about it!

@VasilGeorgiev39
Contributor Author

Hi @alan-cooney, thanks for the comment. I have a couple questions:

I can get the attention head contributions (or even the MLP neurons) with get_full_resid_decomposition(), but I can only get the correct and incorrect directions for the residual stream, with tokens_to_residual_directions(). How can I get the directions for the individual heads (or even neurons)?
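A note on the question above, offered as a sketch and not as the maintainers' answer: every head and neuron writes into the same residual stream, so the residual-stream direction can be reused for each component unchanged. The one extra step is the final LayerNorm, whose scale is usually computed once from the full residual and then held fixed. The toy below assumes the learned LayerNorm weights are already folded into the unembedding. All names and numbers are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def center(x):
    """Subtract the mean, as LayerNorm does before scaling (a linear operation)."""
    mean = sum(x) / len(x)
    return [v - mean for v in x]

def ln_scale(x, eps=1e-5):
    """LayerNorm scale factor: sqrt(variance + eps) of the vector."""
    centered = center(x)
    return math.sqrt(sum(v * v for v in centered) / len(x) + eps)

# Toy unembedding vectors (assumed to include any folded LayerNorm weights).
unembed = {"correct": [1.0, 0.0, 2.0], "incorrect": [0.0, 1.0, 1.0]}
direction = [c - i for c, i in zip(unembed["correct"], unembed["incorrect"])]

# Per-head outputs in residual space at the final position.
head_outputs = {
    "L0H0": [1.0, 0.0, 0.0],
    "L0H1": [0.0, 1.0, 0.5],
    "L1H0": [0.5, 0.5, 2.0],
}

# In this toy the residual stream is just the sum of the head outputs.
residual = [sum(vals) for vals in zip(*head_outputs.values())]
scale = ln_scale(residual)  # computed once from the full residual, then frozen

# Same direction for every head; only the shared scale is applied.
per_head = {
    name: dot(center(out), direction) / scale for name, out in head_outputs.items()
}

# Reference value: logit difference computed from the normalized full residual.
normalized = [v / scale for v in center(residual)]
logit_diff = dot(normalized, unembed["correct"]) - dot(normalized, unembed["incorrect"])
```

With the scale frozen, the per-head values sum to the full logit difference, which is a useful sanity check for any DLA helper.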

Also, what do you mean by breaking it down by 'source layer' and 'source component'?

@bryce13950
Collaborator

@VasilGeorgiev39 Are you still available to wrap this up?

@VasilGeorgiev39
Contributor Author

@bryce13950 Yes, I will be available after the 9th of May. What do you think would be the best approach for this?

@bryce13950
Collaborator

I am not quite sure. Alan has been pulled away by his full-time job for the last few months. I have reached out to him separately on Slack to see if he can clarify his comments on this, but I haven't heard back. I don't really get what he means by source layer and source component either. Maybe we can start by turning it into its own module, and then see where it can be generalized. I do like your idea of setting it up as a tool, and I am likely going to be doing just that in another context. Do you want to move this into its own module in a directory named tools?

@bryce13950 bryce13950 changed the base branch from main to dev May 23, 2024 00:36
3 participants