-
Notifications
You must be signed in to change notification settings - Fork 31
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tiling for uneven tile sizes #138
Comments
Speaking of exotic sizes, I did some experiments in the past 2 weeks and found that tile size == 1 for K (i.e. In particular, see https://github.com/google/iree-llvm-sandbox/blob/main/python/examples/matmul/bench.py#L62 I do not have recommendations re pipelining yet but I added a simple transformation for targeted unrolling and will play with pipelining on my end too soonish. |
Funny, your tile sizes:
Are not too dissimilar to mine:
I also found that using |
It depends on the case we want to handle. If we have a dynamic loop where we know the number of iteration is greater than the number of stages, it is a fairly simple change. If we want to also support the case where the number of iteration is potentially low (case that wouldn't benefit from pipelining but we may still want to support for functionality) it is much harder as we need to handle the case where the kernel loop is never executed which would be harder to handle and would generate suboptimal code. As @nicolasvasilache pointed out we could pick a tile size of 1 for K and let the backend potentially do the dynamic unroll. Another solution would be to peel out iterations to get to a number of iterations divisible by the tile size. |
You could also add a level of tiling just K with peeling. |
Just to be on the same page, I am calling I am not after full functionality, and I am happy that for very small loops ( When you say "simple", do you mean simple for you or simple in general also for a newbie like me? :) I am not sure on X86, but on AArch64 pipelining is disabled because apparently it does not work super well. What I gathered from talking with back-end engineers here, is that it would be better to pipeline in MLIR and let the back-end optimize straight-line blocks (which is already quite hard, it seems). |
@nicolasvasilache Ah I see what you mean (maybe). I could tile&peel K by 512 so that we have a full separation when I apply pipelineing. Doesn't seem like a bad idea, I will also try this out, thanks! |
What matters is actually the number of iteration so it is When I wrote the transformation I was defensive and wanted to make sure nobody accidentally generate incorrect code, that's why I added a check to make sure the number of iteration was large enough.
It's not a one line change but it shouldn't be very complicated even for someone new to the code. It doesn't change the logic overall. The main difference is that we need to change the loop boundaries usage here to be Values instead of being a constant and update the associated math to generate ops. The rest of the logic stays unchanged. And we also need to add an option so that user guarantees that the number of iteration is large enough to make sure we nobody mis-compiles by mistake. |
Ok, I have implemented it in the way you suggested, and you were right, it was simple and performance are back on track. Unfortunately, I think that we might have situations where we have to take care about edge cases. For instance when I will jot down some experiments and let you guys know. |
Ok, so I played a bit with everything, and this is the problem The problemWhen we tile with uneven sizes the loop becomes dynamic. Considering the loop over The problem is that if the remainder is quite small (smaller than the number of stages), we might execute more operations than necessary The solutionI came up with two solutions: I added an option to enable solution a), to enable solution b) or to disable dynamic loops (i.e., disabling the pipelining if the loop has non-constant boundaries). @ThomasRaoux this is unfortunately in the main codebase. If this solution sounds fine, do you think it is fine to open a PR? It would be my first on the main repo, so would you mind if I added you as reviewer? Thanks, |
Solution a) is potentially incorrect right? Wouldn't executing more iteration potentially causes out of bound accesses? Solution b) sounds like either doing loop peeling (basically peeling
Yes, patches are welcome and I'm happy to review them. Note that mlir doesn't use github PRs (https://mlir.llvm.org/getting_started/Contributing/#contributing-code). |
Hi @ThomasRaoux , |
Sorry I meant peeling
Anyway it sounds like we agree. BTW if you have some good improvements to the pipeliner feel free to upstream them anyway when you have time, it might be useful next time :) |
Hi all,
While I finally have some satisfying numbers with my specific 2048^3 experiment (I will push the latest transforms as soon as possible), I thought it was time to try out more exotic sizes.
This is what I found, assuming micro-tiles of 8x8 and
k_c
(i.e., reduction axis tiling) equal to 512.When M==2047
Everything is not too bad here. The packing is fine, while in the micro-kernel contains
if-else
statements to avoid stepping out-of-memory of the output matrixC
.When N == 2047
Same story. Packing fine and if we use
split-vector-transfers-to=vector-transfers
performance are not affected (if we usenone
instead performance decrease slightly, to-be-investigated). This is not super-important right now, but @nicolasvasilache have you thought about the possibility of splitting the macro-kernel loop (in a main + remainder) instead of addingif-else
clauses inside the micro-kernel? How hard would it be?When K == 2047
We have issues here: about 15% performance hit.
The problem is that the micro-kernel loop is now a dynamic loop: this is good per se, because we don't want to spend time adding and multiplying zeros, but it has the draw-back that the pipelineing pass won't work for dynamic bounds. There is a check here:
That doesn't make the pass to work. @ThomasRaoux I didn't start yet having a look, but how hard do you think extending the pass to dynamic loop boundaries would be ?
Thanks,
Giuseppe
The text was updated successfully, but these errors were encountered: