RangeInAxisLayer, allow `axis` to be the rec time dim #820

Zettelkasten · 2021-12-03T12:17:58Z

Similar to CumsumLayer and WindowLayer, we could support this.
This would then have the same behavior as RecStepInfoLayer (:i), i.e. RangeInAxisLayer would just return a single int if it operates step-by-step.

The text was updated successfully, but these errors were encountered:

albertz · 2021-12-03T12:19:46Z

Well yea that would be nice. But why wouldn't you just use layer :i in that case?

albertz · 2021-12-03T12:22:23Z

I'm not sure if this maybe makes RangeInAxisLayer too complicated. Or maybe even misleading. E.g. if the rec time dim already exists before, maybe you really want to have such range also inside the loop? (Which is allowed, unless we enforce unique dim tags (#632).)

Zettelkasten · 2021-12-03T12:33:58Z

My use case would be like this: I have some relatively complicated self-attention code, which I want to work non-recurrent and recurrently. Currently, for the recurrent case, we just simply use cum_concat to collect all frames so far and then use our non-recurrent implementation on that. Then, we just gather the last step again. (During training, this is fine, during inference, this is very inefficient of course. But we don't really care currently).
So to gather this last frame in the recurrent case, I have this code:

  if inside_rec_layer:
    d[output + '_att'] = {'class': 'gather', 'from': [output + '_att_all'], 'position': ':i', 'axis': query_axis}
  else:
    d[output + '_att'] = {'class': 'copy', 'from': [output + '_att_all']}

If RangeInAxisLayer supported to operate on the rec dim, I could also write

 d['step'] = {'class': 'range_in_axis', 'axis': query_axis}
 d[output + '_att'] = {'class': 'gather', 'from': 'step', 'axis': query_axis}

because that works in no matter if I'm inside a rec layer or not.

But that would also be (slightly?) less efficient in the non-recurrent case because of this "no-op gather" then. So maybe this is a bad idea anyway.

albertz · 2021-12-03T12:55:04Z

In this a case where rec automatic optimization breaks it? I.e. it works without rec optimization but then breaks when moved out? Then yes, this must be fixed. But I'm not sure if you have that.

Or is this for sharing code between having an auto-regressive formulation and a (non-autoregressive) sequence-level formulation? I don't see the concern for that too much, as you also have other differences like CumConcatLayer vs ReinterpretDataLayer with set_dim_tags.

Currently, for the recurrent case, we just simply use cum_concat to collect all frames so far and then use our non-recurrent implementation on that. Then, we just gather the last step again.

This sounds like what you do in auto-regressive self attention as well. But then when you do the attention, i.e. calculate the weighted sum (or doing matmul, using DotLayer), you reduce that axis away.

So now instead of attention, you do sth different (e.g. LSTM or so) and then take the last step?

But why would you not use :i for the gather position? How is your proposed code with RangeInAxisLayer different to just using :i?

(Btw, maybe even a bit better would be to use LengthLayer from the output on the new axis, and then seq_len - 1. When you have some initial state for the CumConcatLayer, just using :i could be wrong.)

Isn't the problem just that GatherLayer maybe handles the implicit dim not correctly or not efficiently? But this is no matter if you use RangeInAxisLayer or :i.

Zettelkasten · 2021-12-03T16:46:46Z

Or is this for sharing code between having an auto-regressive formulation and a (non-autoregressive) sequence-level formulation? I don't see the concern for that too much, as you also have other differences like CumConcatLayer vs ReinterpretDataLayer with set_dim_tags.

Yes, this is the case (this is not an issue with the rec optimizations, but just about this).
Yeah true, we anyway need a case distinction for setting the dim tags correctly.

Currently, for the recurrent case, we just simply use cum_concat to collect all frames so far and then use our non-recurrent implementation on that. Then, we just gather the last step again.

This sounds like what you do in auto-regressive self attention as well. But then when you do the attention, i.e. calculate the weighted sum (or doing matmul, using DotLayer), you reduce that axis away.

So now instead of attention, you do sth different (e.g. LSTM or so) and then take the last step?

Yeah well, conceptually yes. My "attention function" calculation depends not only on a single query, but on all queries (I am aware that this means that rec optimization changes the semantics of my network if I'm not careful).

But why would you not use :i for the gather position? How is your proposed code with RangeInAxisLayer different to just using :i?

Using :i would work for non-autoregressive self attention. It is really just to have the same net dict in both cases. Which - admittedly - maybe is not a good reason.

(Btw, maybe even a bit better would be to use LengthLayer from the output on the new axis, and then seq_len - 1. When you have some initial state for the CumConcatLayer, just using :i could be wrong.)

Ah thanks, good point, yeah, that's a bit dangerous.

Isn't the problem just that GatherLayer maybe handles the implicit dim not correctly or not efficiently? But this is no matter if you use RangeInAxisLayer or :i.

I don't think gather layer is part of my problem?

Zettelkasten · 2021-12-03T17:01:15Z

I'm not sure if this maybe makes RangeInAxisLayer too complicated. Or maybe even misleading. E.g. if the rec time dim already exists before, maybe you really want to have such range also inside the loop? (Which is allowed, unless we enforce unique dim tags (#632).)

Hm, yeah, that is a bit weird. I would also find it misleading if RangeInAxisLayer behaves differently than other layers when use the rec time dim as axis to operate on.

I will close this issue for now, I think RangeInAxisLayer could still support operating on the rec time dim, but I agree that at least for my use case, it's not really necessary. If we want to add this, then maybe with a better use case in mind, so that we can decide on which behavior works best.

Zettelkasten closed this as completed Dec 3, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RangeInAxisLayer, allow `axis` to be the rec time dim #820

RangeInAxisLayer, allow `axis` to be the rec time dim #820

Zettelkasten commented Dec 3, 2021

albertz commented Dec 3, 2021

albertz commented Dec 3, 2021 •

edited

Loading

Zettelkasten commented Dec 3, 2021

albertz commented Dec 3, 2021

Zettelkasten commented Dec 3, 2021

Zettelkasten commented Dec 3, 2021

RangeInAxisLayer, allow axis to be the rec time dim #820

RangeInAxisLayer, allow axis to be the rec time dim #820

Comments

Zettelkasten commented Dec 3, 2021

albertz commented Dec 3, 2021

albertz commented Dec 3, 2021 • edited Loading

Zettelkasten commented Dec 3, 2021

albertz commented Dec 3, 2021

Zettelkasten commented Dec 3, 2021

Zettelkasten commented Dec 3, 2021

RangeInAxisLayer, allow `axis` to be the rec time dim #820

RangeInAxisLayer, allow `axis` to be the rec time dim #820

albertz commented Dec 3, 2021 •

edited

Loading