Commit 6365d48: notes
aconz2 committed Jul 17, 2024 (1 parent: 5bdc42d)
Showing 1 changed file with 1 addition and 1 deletion: _posts/2024-07-03-Machine-Learning-Notes.md
@@ -33,7 +33,7 @@ One thing to notice in attention is that because we take a weighted sum of value

Q: if resnet (the concept, not the og model) is so great, why isn't everything residual style? Like when calculating the result of self attention with V, why not do x+Vx? Well, one reason is that with multihead, V down-projects, so the shapes in x+Vx don't match. So then maybe I'm asking: why not O(Dx + VDx), where we down-project with D, add a function V of the down-projection, then up-project with O? Maybe also in the QK?
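
As a concrete way to see the shape issue and the variant being asked about, here's a minimal sketch; the module names D, V, O and the dimensions are assumptions for illustration, not taken from any particular implementation.

```python
import torch
import torch.nn as nn

# Hypothetical shapes only; d_model and d_head are illustrative.
d_model, d_head = 512, 64
x = torch.randn(10, d_model)  # (tokens, d_model)

# Standard multihead value path: V down-projects d_model -> d_head, so a
# residual x + V(x) would be shape-incompatible: (10, 512) + (10, 64).
V_std = nn.Linear(d_model, d_head, bias=False)

# The variant asked about: down-project with D, add a function V of the
# down-projection, then up-project with O, i.e. O(Dx + V(Dx)).
D = nn.Linear(d_model, d_head, bias=False)
V = nn.Linear(d_head, d_head, bias=False)
O = nn.Linear(d_head, d_model, bias=False)
out = x + O(D(x) + V(D(x)))  # keep the usual outer residual into the stream
```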

With all the focus on mechanistic interpretability and how to reverse engineer why transformers/attention work so well at next token prediction, it seems natural to ask why we can't design networks that are more interpretable to begin with. One part of interp is trying to isolate/identify modules/heads that seem to do a particular thing. But that turns out to be really hard because we can't even guarantee a head does a "single" thing, and even if it does, the residual stream isn't forced to have an interpretable structure, or even a structure that stays the same between layers. So that all makes me think about how you could train attention heads independently. The training objective isn't clear; maybe it is the full task, but we can't expect a single head to do that well, so maybe we look for the top N heads with minimum cross correlation or something. This is related to the idea of the task having an inherent amount or complexity of structure, and to things like analysis of variance, where we'd be interested in finding the minimum set of attention heads, using the minimum number of weights, that maximally fits the data; that has some Pareto frontier. But ultimately we seek to match the true underlying structure of the data in terms of the multiplicity and dimensionality of features in the single-head case. The single heads are the best predictors of the next token given that they act independently, but we can then ask which single heads produce the features that are best for second-layer attention heads to predict the next token. Are those the same heads as the first set? How can you train those heads independently of feeding their results to the second layer of heads? Can you train the second layer of heads independently of the first? There is research on module-level training instead of end to end.

Q: can you start training a model with hidden dimension C, then increase to C+1, C+2, ... on some schedule? Is there such a thing as a continuous/fractional dimension, so that you could take the derivative wrt dimension count, i.e. tell me how much I gain from adding another dimension? One thing that bugs me is that all these nets are invariant to channel permutation; we're searching over such a redundant function space! Wouldn't it be nicer to start with C=Cmin, where we posit that Cmin is the smallest dimension of the single maximally informative feature, learn a net that uses this "one" feature (how can we enforce that?), then move to C = Cmin + Cmin2 channels and freeze the first Cmin features to force it to learn the second most informative feature, of minimal size Cmin2. Repeat within your resource budget to the desired accuracy. Or start the other way with C=1 and learn the best 1-dimensional feature, then add another 1, then maybe the best 2-dim feature, etc., up until you reach C=...+Cmax, where we might expect a few features of size close to Cmax and many features of size closer to 1 or something small; i.e. a long tail of feature size (or really a long head, if we order by feature size).

If we wanted to imbue each channel with some notion of identity and break the channel permutation invariance, maybe we add a layernorm-type thing that rescales each channel i by something like C * pow(i+1, a) for some constants C and a. I.e. channel 0 is defined to be the most important so we allow it the most scale, channel 1 is second most important so it gets less scale, and the distribution of scales might itself be learned. Or is allowing a configurable variance per channel the right thing here?
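
The channel-rescaling idea might look something like this; a minimal sketch, assuming fixed constants C and a (with a < 0 so channel 0 gets the most scale) and a layernorm-style normalization, with all names hypothetical.

```python
import torch
import torch.nn as nn

class ChannelIdentityNorm(nn.Module):
    """Hypothetical layernorm-style rescaling meant to break channel-permutation
    symmetry: channel i is given scale C * (i + 1) ** a, so with a < 0 channel 0
    is allowed the largest scale and later channels progressively less. The
    constants C and a are illustrative and could instead be learned."""
    def __init__(self, num_channels: int, C: float = 1.0, a: float = -0.5, eps: float = 1e-5):
        super().__init__()
        i = torch.arange(num_channels, dtype=torch.float32)
        self.register_buffer("scale", C * (i + 1) ** a)
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # normalize across channels like layernorm, then apply the fixed decaying scale
        x = (x - x.mean(dim=-1, keepdim=True)) / (x.std(dim=-1, keepdim=True) + self.eps)
        return x * self.scale

norm = ChannelIdentityNorm(num_channels=8)
y = norm(torch.randn(4, 8))  # earlier channels end up with larger magnitude
```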
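
And the staged-growth idea from earlier in this note (start at C=Cmin, freeze what's been learned, then add channels) could be sketched like this; the class name, initialization, and growth schedule are made up for illustration.

```python
import torch
import torch.nn as nn

class GrowingFeatures(nn.Module):
    """Hypothetical feature map whose channel count grows over training. Channels
    learned in earlier stages are frozen (moved into a buffer) and only the newly
    added channels receive gradients, pushing new capacity to capture whatever
    the old channels don't."""
    def __init__(self, d_in: int, c_init: int):
        super().__init__()
        self.register_buffer("frozen", torch.empty(0, d_in))          # (C_frozen, d_in)
        self.active = nn.Parameter(torch.randn(c_init, d_in) * 0.02)  # (C_new, d_in)

    def grow(self, extra: int) -> None:
        # freeze the currently trainable channels, then allocate fresh trainable ones
        # (the optimizer has to be rebuilt afterwards so it sees the new parameter)
        d_in = self.active.shape[1]
        self.frozen = torch.cat([self.frozen, self.active.detach()], dim=0)
        self.active = nn.Parameter(torch.randn(extra, d_in) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (..., d_in)
        W = torch.cat([self.frozen, self.active], dim=0)  # (C_total, d_in)
        return x @ W.t()

feats = GrowingFeatures(d_in=512, c_init=1)  # start with the single "best" feature
# ... train ...
feats.grow(1)                                # add the next channel, old one frozen
print(feats(torch.randn(4, 512)).shape)      # torch.Size([4, 2])
```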

# Links

