extract_central_slices_rfft performance #62
Comments
Hey! Thanks for this analysis. Am I reading this right that it would save 15% of 20% of the total execution time, i.e. 3% of the total? The biggest optimisation I haven't done yet is limiting the Fourier coefficients used when extracting/inserting; that would be a huge speedup in lots of cases.
You are correct, the overall impact will be <4%, and if you manage to compile the code there will probably be no difference (I have not yet been able to compile the code into something useful, but that is another story), so I am not saying we should use this trick. The point of this issue is to keep track of this bottleneck (almost 30% of the running time is spent applying the mask) so we can think about how to deal with it (my current suggestion was just one of the many trials I ran). I am quite surprised by the huge impact that these two lines of code have on the overall performance of my projection matching code.
I'm with you, really appreciate the effort! I agree it's quite big, but it's a lot of elements being modified at once, so it doesn't feel so surprising. If you don't implement it this way then you have to do twice as much work in if we add a similar function
I think this will be quite a useful feature if we can cut computing time by a factor of more than 2x, but I am pretty sure the implementation will be trickier than it sounds, so it is totally up to you whether you think it is worth it. One comment about the advantage of using the mask over calling rotated_central_slice_grid twice: perhaps torch.cuda.stream() could run the two calls in parallel efficiently, leading to a smaller execution time. I have never tried it, but who knows.
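To make the stream idea concrete, here is a minimal sketch of running two independent computations on separate CUDA streams. The helper name `run_concurrently` and the two callables are hypothetical and not part of libtilt; whether the kernels actually overlap depends on how much of the GPU each one occupies, so this would need to be measured.

```python
import torch


def run_concurrently(fn_a, fn_b):
    """Run two independent callables, on separate CUDA streams when available.

    Hypothetical helper for illustration; falls back to sequential
    execution on machines without CUDA.
    """
    if not torch.cuda.is_available():
        return fn_a(), fn_b()

    current = torch.cuda.current_stream()
    stream_a, stream_b = torch.cuda.Stream(), torch.cuda.Stream()

    # The side streams must wait for work already queued on the current
    # stream (e.g. the inputs that fn_a and fn_b read).
    stream_a.wait_stream(current)
    stream_b.wait_stream(current)

    with torch.cuda.stream(stream_a):
        out_a = fn_a()
    with torch.cuda.stream(stream_b):
        out_b = fn_b()

    # The current stream waits for both side streams before the results
    # are consumed by later kernels.
    current.wait_stream(stream_a)
    current.wait_stream(stream_b)
    return out_a, out_b
```

Note that with CUDA_LAUNCH_BLOCKING=1 every launch is synchronous, so any overlap would only show up in a run without that variable set.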
Hi,
I have been profiling the function extract_central_slices_rfft and found that the conjugate_mask is responsible for a significant fraction of the total execution time:
libtilt/src/libtilt/projection/project_fourier.py, line 96 in 90f09b7
libtilt/src/libtilt/projection/project_fourier.py, line 108 in 90f09b7
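For readers without the source open, the kind of operation being discussed can be sketched as below. This is an illustration under assumptions, not the code at the referenced lines: it assumes the mask marks samples that fall in the half of Fourier space an rfft does not store, so their values have to be complex-conjugated. The function names are hypothetical. The second variant uses `torch.where` instead of boolean-mask indexing; on CUDA, boolean indexing has to determine the number of selected elements, which forces a host-device synchronization, while `torch.where` does not. It is one possible alternative, not necessarily the change proposed in this issue.

```python
import torch


def conjugate_masked(values: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Conjugate the masked entries using boolean-mask indexing."""
    out = values.clone()
    out[mask] = torch.conj(out[mask])
    return out


def conjugate_where(values: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Same result via torch.where, avoiding boolean-mask indexing."""
    # conj_physical materializes the conjugate instead of returning a lazy view
    return torch.where(mask, torch.conj_physical(values), values)


torch.manual_seed(0)
values = torch.randn(4, 5, dtype=torch.complex64)
mask = torch.rand(4, 5) < 0.5
same = torch.equal(conjugate_masked(values, mask), conjugate_where(values, mask))
```

The `torch.where` variant computes the conjugate for every element and then selects, so it trades extra arithmetic for fewer kernel launches and no synchronization; which one is faster would need to be profiled on the actual data.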
These are the results of my profiler (run with CUDA_LAUNCH_BLOCKING=1 to avoid asynchronous execution).
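Because CUDA kernels are launched asynchronously, per-line timings can be attributed to the wrong line unless launches are blocking or the device is synchronized explicitly. As an alternative to the environment variable, a minimal timing sketch with explicit synchronization could look like this; `time_call` is a hypothetical helper, not part of libtilt or the profiler used above.

```python
import time

import torch


def time_call(fn, n_repeats: int = 10) -> float:
    """Return the average wall-clock seconds per call of fn.

    Synchronizes the GPU before starting and before stopping the clock
    so that queued kernels are included in the measurement.
    """
    use_cuda = torch.cuda.is_available()
    fn()  # warm-up call, excluded from the timing
    if use_cuda:
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_repeats):
        fn()
    if use_cuda:
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_repeats
```

For example, `time_call(lambda: extract_central_slices_rfft(...))` with the real arguments filled in would give an end-to-end figure to compare before and after a change.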
The following changes can speed up the code a bit.