(Draft) Add DLA function to utils #466

VasilGeorgiev39 · 2023-12-16T04:00:27Z

Description

DLA is usually the first step we take in a new exploration. I think it would be nice to have a common function that does it in a single step.

Let me know if you think this does not generalize well enough or if you have other concerns.

Not sure if Utils is the right place for it though. Maybe we can create a new module that will hold the mech interp toolkit?

If it looks good I'll write tests and stuff.
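For context, the idea behind DLA (direct logit attribution) can be sketched as follows. This is a minimal toy in pure Python, not the function proposed in this PR: the helper names, the 3-dimensional residual stream, and all numbers are hypothetical, and the final layer norm is ignored for simplicity.

```python
# Toy direct logit attribution (DLA) in pure Python.
# All names and numbers are hypothetical; this is not the function added in this PR.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def logit_diff_direction(unembed, correct, incorrect):
    """Direction whose dot product with the residual stream gives
    logit[correct] - logit[incorrect].

    `unembed` holds one unembedding vector per vocabulary token.
    """
    return [c - i for c, i in zip(unembed[correct], unembed[incorrect])]

def direct_logit_attribution(components, direction):
    """Map each component label to its contribution to the logit difference."""
    return {label: dot(vec, direction) for label, vec in components.items()}

# Hypothetical 2-token vocabulary over a 3-dimensional residual stream.
unembed = [
    [1.0, 0.0, 2.0],  # token 0, treated as the correct answer
    [0.0, 1.0, 1.0],  # token 1, treated as the incorrect answer
]
# Hypothetical outputs that each component writes into the residual stream.
components = {
    "embed": [0.5, 0.5, 0.0],
    "attn_0": [2.0, 0.0, 1.0],
    "mlp_0": [0.0, 1.0, 0.5],
}

direction = logit_diff_direction(unembed, correct=0, incorrect=1)
attribution = direct_logit_attribution(components, direction)

# The final residual stream is the sum of all component outputs, so the
# per-component contributions add up to the total logit difference.
final_resid = [sum(vals) for vals in zip(*components.values())]
total_logit_diff = dot(final_resid, unembed[0]) - dot(final_resid, unembed[1])
```

Because the attribution is a plain dot product, the per-component values sum to the total logit difference, which makes for a quick sanity check on any real implementation.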

Type of change

Please delete options that are not relevant.

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

Checklist:

I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
I have not rewritten tests relating to key interfaces which would affect backward compatibility

alan-cooney · 2024-01-17T22:08:45Z

Thanks for starting on this - it seems useful and I agree that it should be its own file (probably just for DLA, as it'll become quite large once it's fully documented etc).

In general I agree as well that it's probably worth expanding this a bit to work more generally. Specifically you can break DLA down recursively e.g. by attention layer -> attention head -> source layer -> source component... It would be nice to have this as well.
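As a rough illustration of the recursive idea, here is a toy sketch in pure Python. It covers only the first two levels (attention layer -> attention head); the tree structure, names, and numbers are hypothetical and do not correspond to any existing TransformerLens API.

```python
# Toy recursive DLA breakdown: attention layer -> attention head.
# Hypothetical structure and numbers; not an existing library API.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attribute(node, direction):
    """Return a nested attribution for `node`.

    A node is either a vector (a leaf's output written to the residual
    stream) or a dict of child nodes whose outputs sum to the parent's.
    """
    if isinstance(node, dict):
        children = {name: attribute(child, direction) for name, child in node.items()}
        total = sum(child["total"] for child in children.values())
        return {"total": total, "children": children}
    return {"total": dot(node, direction), "children": {}}

# The same logit-difference direction is reused at every level of the tree.
direction = [1.0, -1.0, 1.0]

model = {
    "attn_0": {
        "head_0": [1.0, 0.0, 1.0],
        "head_1": [1.0, 0.0, 0.0],
    },
    "mlp_0": [0.0, 1.0, 0.5],
}

tree = attribute(model, direction)
```

Each parent's total is the sum of its children's totals, so a caller can stop at whichever depth is useful and still get numbers that are consistent with the coarser levels.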

Hope that makes sense and if you are unsure about how to abstract more I'm happy to have a chat about it!

VasilGeorgiev39 · 2024-02-05T03:32:34Z

Hi @alan-cooney, thanks for the comment. I have a couple of questions:

I can get the attention head contributions (or even the MLP neurons) with get_full_resid_decomposition(), but tokens_to_residual_directions() only gives me the correct and incorrect directions for the residual stream. How can I get the directions for the individual heads (or even neurons)?

Also, what do you mean by breaking down by 'source layer' and 'source component'?
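One possible way to frame this question, offered as an assumption rather than an answer from the maintainers: if the direction is the difference of the unembedding vectors for the correct and incorrect tokens, and the final layer norm and unembedding bias are ignored, then the logit difference is linear in the residual stream. The symbols below ($W_U$, $c$, $i$, $x_k$, $d$) are notation introduced here for illustration only.

```latex
\Delta
  = \bigl(W_U[:, c] - W_U[:, i]\bigr)^{\top} x_{\text{final}},
\qquad
x_{\text{final}} = \sum_k x_k
\;\;\Longrightarrow\;\;
\Delta = \sum_k d^{\top} x_k,
\qquad
d = W_U[:, c] - W_U[:, i]
```

Under those assumptions every head or neuron output $x_k$ would be scored against the same direction $d$, since all components write into the same residual stream.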

bryce13950 · 2024-04-27T16:26:37Z

@VasilGeorgiev39 Are you still available to wrap this up?

VasilGeorgiev39 · 2024-05-01T10:52:17Z

@bryce13950 Yes, I will be available after the 9th of May. What do you think would be the best approach for this?

bryce13950 · 2024-05-02T23:56:04Z

I am not quite sure. Alan has been pulled away by his full-time job for the last few months. I have reached out to him separately to see if he can clarify the comments on this, but I haven't heard back via Slack. I don't really get what he means by source layer and source component either. Maybe we can start by turning it into its own module, and then see where it can be generalized. I do like your idea of setting it up as a tool, and I am likely going to be doing just that in another context. Do you want to move this into its own module in a directory named tools?

Add DLA function to utils

732651f

Merge branch 'main' into add-dla-to-utils

ef0ca8f

bryce13950 changed the base branch from main to dev May 23, 2024 00:36

Merge branch 'dev' into add-dla-to-utils

1a81f6d
