Q: what is the relation between QK and a metric tensor? QK gives positive and negative outputs, so it isn't a metric, but exp(XQKX) (transpose appropriately) gives only positive values, though we still get something like an inverse distance where small attention is like large distance. We could fix that with exp(-XQKX). And QK isn't constrained to be symmetric. My other wonder with QK is that it feels asymmetric to think of QK transforming X on the right into some X' that gets compared against X^T, rather than something like X^T sqrt(QK) sqrt(QK) X where each side is transformed "equally" to meet in the middle.
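
To poke at this numerically, here's a tiny sketch with random weights (made-up sizes; "QK" below is the combined Wq Wk^T bilinear form): the raw scores take both signs and the form isn't symmetric, while exponentiating (which softmax effectively does) gives only positive similarities, with large score playing the role of small distance.

```python
import torch

torch.manual_seed(0)
T, C, d = 5, 8, 4                        # tokens, channels, head dim (made-up sizes)
X = torch.randn(T, C)
Wq, Wk = torch.randn(C, d), torch.randn(C, d)

QK = Wq @ Wk.T                           # combined (C, C) bilinear form
scores = X @ QK @ X.T                    # (T, T) raw attention scores
print(scores.min().item() < 0 < scores.max().item())  # True (virtually always): both signs, so not a metric
print(torch.allclose(QK, QK.T))                        # False: QK isn't symmetric in general
sims = torch.exp(scores)                               # what softmax exponentiates
print((sims > 0).all().item())                         # True: all positive, inverse-distance flavor
```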

One thing to notice in attention is that because we take a weighted sum of value vectors, the result is permutation invariant. If we reorder the inputs (scrambling our input tokens), then the similarity weighting we compute with QK "follows" the new order and the sum of attention weighted value vectors discards any ordering information. This is weird because it means the value vector we add back into the residual stream is the same for "ab" as it is for "ba", assuming "a" and "b" are tokens. This doesn't match our intuition or so-called inductive bias because we want the value vector result to be dependent on the tokens AND their ordering. So to do that, we add more vectors! We can either add a learned positional vector (one vector per position, ie a (T,C) matrix) or some other computed positional vector dependent on the position of the token. Then we compute on xi+pi instead of just xi. This then means that "a"+p0,"b"+p1 != "b"+p0,"a"+p1. I think the overall goal is how they frame it in the rotary position embeddings paper, where we want dot(pos(u, m), pos(v, n)) ~ dot(u, v) * f(m - n): the dot product of the position-encoded vectors should be related to the dot product of the original vectors, modulated by the relative position. I suppose in the general case it is ~ dot(u, v) * f(m, n), but relative position makes the most sense since we don't want dependence on absolute position, only the relative position.
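
A quick numerical check of that, as a minimal single-head sketch (made-up sizes, no masking): without positional vectors each token's output is unchanged when we reorder the inputs, it just moves along with its token, and adding a per-position vector breaks that.

```python
import torch

torch.manual_seed(0)
T, C = 4, 8                              # sequence length, channels (made-up sizes)
x = torch.randn(T, C)
Wq, Wk, Wv = (torch.randn(C, C) for _ in range(3))

def attend(x):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    w = torch.softmax(q @ k.T / C**0.5, dim=-1)   # (T, T) similarity weights
    return w @ v                                  # weighted sum of value vectors

perm = torch.tensor([2, 0, 3, 1])
out, out_perm = attend(x), attend(x[perm])
print(torch.allclose(out[perm], out_perm, atol=1e-5))   # True: output just follows the reordering

pos = torch.randn(T, C)                  # learned positional vectors, one per position (a (T, C) matrix)
out, out_perm = attend(x + pos), attend(x[perm] + pos)
print(torch.allclose(out[perm], out_perm, atol=1e-5))   # False: order now matters
```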

Q: if resnet (the concept, not the og model) is so great, why isn't everything residual style? Like when calculating the result of self attention with V, why not do x+Vx? Well, one reason is that with multihead attention V downprojects, so we can't do x+Vx. So then maybe I'm asking why not O(Dx + VDx), where we downproject with D, add a function of that downprojection, then upproject with O. Maybe also in the QK?
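
To make the shape mismatch concrete, a sketch with made-up sizes (ignoring the attention weighting itself): the per-head value projection maps C down to C/H, so x + Vx doesn't even type-check per head; only after concatenating the heads and applying the output projection O are we back at C and able to add into the residual stream.

```python
import torch

T, C, H = 4, 8, 2                 # tokens, channels, heads (made-up sizes)
hd = C // H                       # per-head dim
x = torch.randn(T, C)

Wv = torch.randn(C, hd)           # one head's value projection: downprojects C -> C/H
vx = x @ Wv                       # (T, hd); x + vx would fail: (T, C) vs (T, hd)

heads = [x @ torch.randn(C, hd) for _ in range(H)]   # stand-ins for each head's weighted value output
Wo = torch.randn(H * hd, C)                           # output projection back up to C
out = torch.cat(heads, dim=-1) @ Wo                   # (T, C): now x + out is well-typed
print(vx.shape, out.shape, (x + out).shape)
```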

With all the focus on mechanistic interpretability for reverse engineering why transformers/attention work so well at next token prediction, it seems natural to ask why we can't design networks that are more interpretable to begin with. One part of interp is trying to isolate/identify modules/heads that seem to do a particular thing. But that is really hard because we can't even guarantee a head does a "single" thing, and even if it does, the residual stream isn't forced to have an interpretable structure or even a non-changing structure between layers. So that all makes me think about how you could train attention heads independently. The training objective isn't clear; maybe it is the full task, but we can't expect a single head to do that well, though maybe we can find the top N heads with minimum cross correlation or something. This is related to the idea of the task space having an inherent amount or complexity of structure, and to things like analysis of variance, where we'd be interested in finding the minimum set of attention heads using the minimum amount of weights that maximally fits the data; that task has some pareto frontier. But ultimately we seek to match the true underlying structure of the data in terms of multiplicity and dimensionality of features in the single head case. The single heads are the best predictors of the next token given they act independently, but we can then ask which single heads produce features that are best for the second layer of attention heads to predict the next token. Are those the same as the first set of heads? How can you train those heads independently of feeding their results to the second layer of heads? Can you train the second layer of heads independently of the first? There is research on module level training instead of end to end.

Q: can you start training a model with C hidden dimensions, then increase to C+1, C+2, ... on some schedule? Are there such things as continuous/fractional dimensions so that you could take the derivative wrt dimension number? Ie tell me how much I gain from adding another dimension. One thing that bugs me is that all these nets are invariant to channel permutation, like we're searching over such a redundant function space! Wouldn't it be nicer to start with C=Cmin, where we posit that Cmin is the smallest dimension of the single maximally informative feature; learn a net that uses this "one" (how can we enforce that?) feature, then move to C = Cmin + Cmin2 channels and freeze the first Cmin features to force it to learn the second most informative feature of minimal size Cmin2. Repeat within your resource budget to the desired accuracy. Or start the other way with C=1 and learn the best 1 dimensional feature, then add another 1, then maybe the best 2 dim feature, etc. until you get to C=...+Cmax, where we might expect a few features of size close to Cmax and many features of size closer to 1 or something small; ie a long tail of feature size (or really a long head if we order by feature size).
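
Purely to make the grow-and-freeze idea concrete, here is a hypothetical toy sketch (GrowingLinear is a made-up name, not from any paper); how to enforce that the first stage really learns "one" feature is still the open question.

```python
import torch
import torch.nn as nn

class GrowingLinear(nn.Module):
    """Toy sketch: a feature map x -> x @ W whose output width grows over time;
    columns learned in earlier stages are frozen when new columns are added."""
    def __init__(self, d_in, c0):
        super().__init__()
        self.d_in = d_in
        self.blocks = nn.ParameterList([nn.Parameter(0.02 * torch.randn(d_in, c0))])

    def grow(self, c_new):
        for p in self.blocks:               # freeze everything learned so far
            p.requires_grad_(False)
        self.blocks.append(nn.Parameter(0.02 * torch.randn(self.d_in, c_new)))
        # NB: any optimizer has to be rebuilt to pick up the new parameter

    def forward(self, x):
        return x @ torch.cat(list(self.blocks), dim=1)   # (batch, c0 + c1 + ...)

m = GrowingLinear(d_in=16, c0=1)    # start with the single most informative feature
m.grow(2)                           # later: freeze it and add two more channels
print(m(torch.randn(4, 16)).shape)  # torch.Size([4, 3])
```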

# Links

* [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
  * self/cross attention, transformer
* [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385)
  * resnet, residual, skip/shortcut connection
  * x + f0(x) + f1(x + f0(x)) + ...
* related to above where they test loads of variants of (s|f)* layers (s for self attention). No scaling on f layers
* [Residual Networks Behave Like Ensembles of Relatively Shallow Networks](https://arxiv.org/abs/1605.06431)
  * unroll a deep residual network to view as a sum of paths, like in the circuits paper
  * for `-f-g-` you get paths `-----` `-f---` `---g-` `-f-g-` (see the sketch after this list)
* [Improving neural networks by preventing co-adaptation of feature detectors](https://arxiv.org/abs/1207.0580)
  * dropout
  * randomly zero out hidden units during training, average outputs at test time. Summing log-probs is the same as a geometric mean of "experts"
  * but then they also do a weird renormalization of weights (or is it inputs, I'm confused) if they exceed an L2 constraint. and a weird initialization
  * the paper uses p for the probability that a unit is kept (and scales weights by p at test time), while pytorch uses p for the probability that an element is zeroed and scales by 1/(1-p) during training (see the sketch after this list)
* [Layer Normalization](https://arxiv.org/abs/1607.06450)
  * confusing b/c they present it as though it's norming a layer's weights (like weight normalization, where the weight vector w = g*v/norm(v) with v and g learned), but in torch it just acts on the data passing through. I find the neuron focus confusing
  * interesting that they motivate it as a way to speed up learning
  * as a post layer norm, like layernorm(mlp(x)), it makes the output invariant to scaling of the mlp's matrix
  * as a pre layer norm, like mlp(layernorm(x)), I think it makes the output invariant to scaling of the data
  * post layer norm was the original architecture in the transformer, but pre layer norm is the gpt2+ standard
  * table 1 gives more invariance properties
  * pre layer norm seems to make sense b/c (I think) your layers experience less covariate shift (from shimodaira 2000 via batch norm): as your layers learn, you don't want a doubly moving target; obv there will be refinement in the residual stream representation, but don't make it harder than it needs to be with extra shifts or scaling to worry about
  * out = (x - mean(x)) / sqrt(var(x) + eps) * w + b where w and b are learned, applied to each input independently (or configurable by shape) (see the sketch after this list)
* [Understanding and Improving Layer Normalization](https://arxiv.org/abs/1911.07013)
  * adanorm: y = f(centerscale(x)) * x where centerscale(x) = (x - mean(x))/std(x) and f(x) is uniquely C(1-kx) given their requirements, for constants C and k
  * the derivatives of the mean and variance (ie the backward pass) matter more than the forward normalization
  * the w and b parameters of layernorm can lead to overfitting
  * 2.1 eqn 1 defines h as dot(g, N(x)) + b, but that has to be a typo right? must mean the hadamard (entrywise) product
* [Seeing is Believing: Brain-Inspired Modular Training for Mechanistic Interpretability](https://arxiv.org/abs/2305.08746)
  * regularize all layer weights by `d_ijk|w_ijk|` (locality; 2d euclidean distance between neurons (rows) after laying out each layer's neurons in the plane), and occasionally try swapping rows within a layer if that lowers the loss too, since they say gd gets stuck in local minima
  * pretty pictures
  * the mnist pictures show the last layer of digit "neurons" (I suppose it's a fine term when visualizing rows as dots but I still have an aversion to it) with one dot per digit arranged in a circle; did they use a circular layout in the distance calculation? I guess it is because the location of some digits changes (see last page). But idk if that is a useful layer to have positions on; I guess if the data were skewed so all 9's were in the top right and all 1's were in the bottom left, then maybe. The dots layout is still confusing me a bit b/c, for example, the input layer is a 2d grid of dots where each dot is a scalar, but then the actual layers are one dot per row/matrix (for 2d/3d) right?
* [https://github.com/lucidrains/x-transformers](https://github.com/lucidrains/x-transformers)
* [https://github.com/karpathy/minGPT/blob/master/mingpt/model.py](https://github.com/karpathy/minGPT/blob/master/mingpt/model.py)
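
For the residual-paths bullet above (Residual Networks Behave Like Ensembles), a tiny sketch unrolling `-f-g-` into its four paths; the clean split into `x + f(x) + g(x) + g(f(x))` is exact here only because the stand-in blocks are linear, but that's the spirit of the paths view.

```python
import torch

torch.manual_seed(0)
C = 4
F, G = torch.randn(C, C), torch.randn(C, C)   # linear stand-ins for the two residual blocks
x = torch.randn(C)
f = lambda h: F @ h
g = lambda h: G @ h

composed = (x + f(x)) + g(x + f(x))           # two residual blocks composed: -f-g-
paths = x + f(x) + g(x) + g(f(x))             # paths: -----  -f---  ---g-  -f-g-
print(torch.allclose(composed, paths))        # True (g is linear so it distributes over the sum)
```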
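
For the dropout p convention bullet, a quick sketch: torch's p is the drop probability and it scales survivors by 1/(1-p) during training, which is the same factor as 1/p_keep in the keep-probability convention, so the expected activation is unchanged and eval is a plain forward pass.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.ones(10000)
p_keep = 0.8                     # keep-probability convention (the paper's p)
p_drop = 1 - p_keep              # drop-probability convention (torch's p)

mask = (torch.rand_like(x) < p_keep).float()
by_hand = x * mask / p_keep                        # scale survivors by 1/p_keep
by_torch = F.dropout(x, p=p_drop, training=True)   # torch scales by 1/(1 - p_drop), the same factor

print(by_hand.mean().item(), by_torch.mean().item())          # both close to 1.0
print(F.dropout(x, p=p_drop, training=False).mean().item())   # eval is the identity: exactly 1.0
```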
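
And for the layernorm formula bullet, a minimal sketch checking that reading of it against torch.nn.LayerNorm: per-row mean and (biased) variance over the channel dim, then scale and shift by the learned w and b.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
T, C = 3, 8
x = torch.randn(T, C)
ln = nn.LayerNorm(C)             # learned weight (w) and bias (b), eps defaults to 1e-5

mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)      # biased variance, matching LayerNorm
out = (x - mean) / torch.sqrt(var + ln.eps) * ln.weight + ln.bias

print(torch.allclose(out, ln(x), atol=1e-6))           # True
```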
