AdaptiveBramich takes an excessive amount of time #14
It seems like the vast majority of time in AdaptiveBramich is spent in […]
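A minimal profiling sketch to pin down the hot spot, assuming the public `ois` API (`ois.optimal_system` with `method='AdaptiveBramich'` and the `poly_degree` keyword used later in this thread):

```python
# Hedged sketch: profile an AdaptiveBramich run to see where time is spent.
# The random test arrays and the print count are arbitrary choices.
import cProfile
import pstats

import numpy as np
import ois

refimage = np.random.random((500, 500))
image = np.random.random((500, 500))

profiler = cProfile.Profile()
profiler.enable()
diff = ois.optimal_system(image, refimage,
                          method="AdaptiveBramich", poly_degree=2)[0]
profiler.disable()

# Show the ten entries with the largest cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```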
I'm so sorry, this notification went right into the spam folder of my email, so I'm only reading this now. I'll take a look at it.
No worries! I have already implemented a CUDA version, and I am keeping everything compatible with a CPU-only compilation. If this feature is one that you think should be merged into the mainline, let me know how you would prefer optional CUDA usage to be integrated: an external Makefile which builds a shared object used by the Python module, or some other mechanism?
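One way the shared-object option could look, sketched with `ctypes`; the library name `libois_cuda.so` and the symbol `gpu_convolve2d_adaptive` are hypothetical placeholders, not part of `ois`:

```python
# Hypothetical wiring for optional CUDA support via an externally built
# shared object. The library name and symbol below are placeholders.
import ctypes

import numpy as np

try:
    _gpu = ctypes.CDLL("./libois_cuda.so")  # built by the optional Makefile
except OSError:
    _gpu = None                             # no CUDA build: keep the CPU path

def have_gpu_support():
    """True when the optional CUDA shared object was found at import time."""
    return _gpu is not None

if have_gpu_support():
    # Declare the (hypothetical) C signature so ctypes can check arguments.
    _gpu.gpu_convolve2d_adaptive.argtypes = [
        np.ctypeslib.ndpointer(dtype=np.float64, flags="C_CONTIGUOUS"),  # in
        np.ctypeslib.ndpointer(dtype=np.float64, flags="C_CONTIGUOUS"),  # out
        ctypes.c_int, ctypes.c_int,                                      # rows, cols
    ]
    _gpu.gpu_convolve2d_adaptive.restype = None
```

With this layout the Python module works unchanged when the Makefile was never run, and switches to the GPU path when the shared object is present.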
I have a solution for this implemented at https://github.com/varun-iyer/ois. It is already a significant improvement over CPU-only calculations, and I am still making optimizations in the kernel code.

Note that the results below were not rigorously produced; they were made by choosing some random science images and cutting them to a particular size before running them through AdaptiveBramich with poly_degree=2. CUDA-enabled calculations are in blue and standard calculations are in orange.

However, I have some concerns about portability. I am using some features of CUDA which rely on having a decent GPU architecture. If you think that these features would be important to incorporate, I can invest some time in making it a little more flexible, and see if I can get it to a pull-requestable place.
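A runtime guard for the architecture-dependent features could look like this, assuming `pycuda` is available; the 6.0 threshold is only an illustrative value, not the actual requirement:

```python
# Check the device's compute capability before enabling the CUDA path.
# The (6, 0) minimum is an example value, not the real requirement.
import pycuda.driver as cuda

cuda.init()
major, minor = cuda.Device(0).compute_capability()
if (major, minor) < (6, 0):
    print("Compute capability %d.%d: falling back to CPU" % (major, minor))
else:
    print("Compute capability %d.%d: CUDA path enabled" % (major, minor))
```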
Hi, that looks great, thanks for taking the time. I'm interested in incorporating your work. Do you think you can make a pull request to merge it into this repository? One of my plans was to make (aside from the Python module) a stand-alone C program as well, and compile it with the […]
I haven't made a PR because I have since discovered some issues with the code and haven't been able to resolve them quite yet. On larger images (including our science images, which are ~3000x2000), the GPU memory overflows and the program crashes. Science images tend to be pretty big, so I need to develop a way to process chunks of the image at a time, depending on the available GPU memory. Once I've figured out how to do this and tested it on some science images, I will make that PR.
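A rough sketch of the chunking idea, assuming tiles are differenced independently with the existing `ois` call and overlapped by half a kernel so the convolution stays valid at tile edges (tile size and kernel side here are illustrative):

```python
# Hedged sketch: difference a large image in ~1000x1000 tiles so each
# chunk fits in GPU memory. Tile/kernel sizes are arbitrary examples.
import numpy as np
import ois

def diff_in_tiles(image, refimage, tile=1000, kernel_side=11):
    pad = kernel_side // 2
    out = np.empty(image.shape, dtype=np.float64)
    for r0 in range(0, image.shape[0], tile):
        for c0 in range(0, image.shape[1], tile):
            r1 = min(r0 + tile, image.shape[0])
            c1 = min(c0 + tile, image.shape[1])
            # Grow the slice by the kernel half-width so edge pixels
            # of each tile still see full kernel support.
            rs = slice(max(r0 - pad, 0), min(r1 + pad, image.shape[0]))
            cs = slice(max(c0 - pad, 0), min(c1 + pad, image.shape[1]))
            d = ois.optimal_system(image[rs, cs], refimage[rs, cs],
                                   method="AdaptiveBramich",
                                   poly_degree=2)[0]
            out[r0:r1, c0:c1] = d[r0 - rs.start:r1 - rs.start,
                                  c0 - cs.start:c1 - cs.start]
    return out
```

The trade-off is that the kernel and background solutions become discontinuous across tile boundaries, which is the same compromise any gridding scheme makes.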
Maybe using the gridding option to split the image into sub-images would help here. This would effectively process 1000x1000 pixel sub-images each time.
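If I have the keyword right, the call would look something like this (the `gridshape` name is my recollection of the `ois` API, so check the docstring before relying on it; the array sizes match the ~3000x2000 science images mentioned above):

```python
# Hedged usage sketch: let ois grid the image internally so each
# sub-image is roughly 1000x1000. The gridshape keyword is assumed.
import numpy as np
import ois

image = np.random.random((2000, 3000))     # stand-in for a science image
refimage = np.random.random((2000, 3000))

diff = ois.optimal_system(image, refimage,
                          method="AdaptiveBramich", poly_degree=2,
                          gridshape=(2, 3))[0]   # 2x3 grid -> ~1000x1000 tiles
```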
Thanks for pointing this out! I think this would be an acceptable workaround for our group's applications, and some kind of gridding seems to be recommended in section 2.4 of the Bramich paper. Do you think I should attempt a programmatic solution to the original issue, or assume that applications with large images would grid them anyway? I haven't thoroughly investigated it, but I believe images well above 1000x1000 work on our GTX 1070 with 8 GB of GPU memory.
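A back-of-envelope estimate makes the 8 GB observation plausible, assuming the GPU code holds one design-matrix row of (kernel pixels x polynomial terms) doubles per image pixel; that layout is an assumption about the implementation, not a fact taken from it:

```python
# Rough memory model: rows*cols image pixels, each with a row of
# kernel_side**2 * n_poly float64 coefficients. Layout is an assumption.
def gpu_bytes(rows, cols, kernel_side=11, poly_degree=2):
    n_poly = (poly_degree + 1) * (poly_degree + 2) // 2  # 6 terms for degree 2
    unknowns = kernel_side ** 2 * n_poly                 # 121 * 6 = 726
    return rows * cols * unknowns * 8                    # 8 bytes per float64

print(gpu_bytes(3000, 2000) / 1e9)  # ~34.8 GB -> overflows an 8 GB card
print(gpu_bytes(1000, 1000) / 1e9)  # ~5.8 GB  -> fits, as observed
```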
Even for the same complexity as Bramich […]