Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RangeInAxisLayer, allow axis to be the rec time dim #820

Closed
Zettelkasten opened this issue Dec 3, 2021 · 6 comments
Closed

RangeInAxisLayer, allow axis to be the rec time dim #820

Zettelkasten opened this issue Dec 3, 2021 · 6 comments

Comments

@Zettelkasten
Copy link
Member

Similar to CumsumLayer and WindowLayer, we could support this.
This would then have the same behavior as RecStepInfoLayer (:i), i.e. RangeInAxisLayer would just return a single int if it operates step-by-step.

@albertz
Copy link
Member

albertz commented Dec 3, 2021

Well yea that would be nice. But why wouldn't you just use layer :i in that case?

@albertz
Copy link
Member

albertz commented Dec 3, 2021

I'm not sure if this maybe makes RangeInAxisLayer too complicated. Or maybe even misleading. E.g. if the rec time dim already exists before, maybe you really want to have such range also inside the loop? (Which is allowed, unless we enforce unique dim tags (#632).)

@Zettelkasten
Copy link
Member Author

My use case would be like this: I have some relatively complicated self-attention code, which I want to work non-recurrent and recurrently. Currently, for the recurrent case, we just simply use cum_concat to collect all frames so far and then use our non-recurrent implementation on that. Then, we just gather the last step again. (During training, this is fine, during inference, this is very inefficient of course. But we don't really care currently).
So to gather this last frame in the recurrent case, I have this code:

  if inside_rec_layer:
    d[output + '_att'] = {'class': 'gather', 'from': [output + '_att_all'], 'position': ':i', 'axis': query_axis}
  else:
    d[output + '_att'] = {'class': 'copy', 'from': [output + '_att_all']}

If RangeInAxisLayer supported to operate on the rec dim, I could also write

 d['step'] = {'class': 'range_in_axis', 'axis': query_axis}
 d[output + '_att'] = {'class': 'gather', 'from': 'step', 'axis': query_axis}

because that works in no matter if I'm inside a rec layer or not.

But that would also be (slightly?) less efficient in the non-recurrent case because of this "no-op gather" then. So maybe this is a bad idea anyway.

@albertz
Copy link
Member

albertz commented Dec 3, 2021

In this a case where rec automatic optimization breaks it? I.e. it works without rec optimization but then breaks when moved out? Then yes, this must be fixed. But I'm not sure if you have that.

Or is this for sharing code between having an auto-regressive formulation and a (non-autoregressive) sequence-level formulation? I don't see the concern for that too much, as you also have other differences like CumConcatLayer vs ReinterpretDataLayer with set_dim_tags.

Currently, for the recurrent case, we just simply use cum_concat to collect all frames so far and then use our non-recurrent implementation on that. Then, we just gather the last step again.

This sounds like what you do in auto-regressive self attention as well. But then when you do the attention, i.e. calculate the weighted sum (or doing matmul, using DotLayer), you reduce that axis away.

So now instead of attention, you do sth different (e.g. LSTM or so) and then take the last step?

But why would you not use :i for the gather position? How is your proposed code with RangeInAxisLayer different to just using :i?

(Btw, maybe even a bit better would be to use LengthLayer from the output on the new axis, and then seq_len - 1. When you have some initial state for the CumConcatLayer, just using :i could be wrong.)

Isn't the problem just that GatherLayer maybe handles the implicit dim not correctly or not efficiently? But this is no matter if you use RangeInAxisLayer or :i.

@Zettelkasten
Copy link
Member Author

Or is this for sharing code between having an auto-regressive formulation and a (non-autoregressive) sequence-level formulation? I don't see the concern for that too much, as you also have other differences like CumConcatLayer vs ReinterpretDataLayer with set_dim_tags.

Yes, this is the case (this is not an issue with the rec optimizations, but just about this).
Yeah true, we anyway need a case distinction for setting the dim tags correctly.

Currently, for the recurrent case, we just simply use cum_concat to collect all frames so far and then use our non-recurrent implementation on that. Then, we just gather the last step again.

This sounds like what you do in auto-regressive self attention as well. But then when you do the attention, i.e. calculate the weighted sum (or doing matmul, using DotLayer), you reduce that axis away.

So now instead of attention, you do sth different (e.g. LSTM or so) and then take the last step?

Yeah well, conceptually yes. My "attention function" calculation depends not only on a single query, but on all queries (I am aware that this means that rec optimization changes the semantics of my network if I'm not careful).

But why would you not use :i for the gather position? How is your proposed code with RangeInAxisLayer different to just using :i?

Using :i would work for non-autoregressive self attention. It is really just to have the same net dict in both cases. Which - admittedly - maybe is not a good reason.

(Btw, maybe even a bit better would be to use LengthLayer from the output on the new axis, and then seq_len - 1. When you have some initial state for the CumConcatLayer, just using :i could be wrong.)

Ah thanks, good point, yeah, that's a bit dangerous.

Isn't the problem just that GatherLayer maybe handles the implicit dim not correctly or not efficiently? But this is no matter if you use RangeInAxisLayer or :i.

I don't think gather layer is part of my problem?

@Zettelkasten
Copy link
Member Author

I'm not sure if this maybe makes RangeInAxisLayer too complicated. Or maybe even misleading. E.g. if the rec time dim already exists before, maybe you really want to have such range also inside the loop? (Which is allowed, unless we enforce unique dim tags (#632).)

Hm, yeah, that is a bit weird. I would also find it misleading if RangeInAxisLayer behaves differently than other layers when use the rec time dim as axis to operate on.

I will close this issue for now, I think RangeInAxisLayer could still support operating on the rec time dim, but I agree that at least for my use case, it's not really necessary. If we want to add this, then maybe with a better use case in mind, so that we can decide on which behavior works best.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants