Const assignment map fusion #1685

Draft: wants to merge 20 commits into main

pratyai (Collaborator) commented Oct 15, 2024

Scheduling many map kernels that each do very little work can incur a large overhead, so we would sometimes like to fuse such maps where possible. Constant assignment maps are one such case. If two maps assign the same value to every element in a subset of the underlying array (and the subset does not depend on the array in any way), then:

  • If the two maps' subsets are identical, we can simply fuse the two maps by moving the body of one map into the other (with appropriate wiring), since the order of the assignments does not matter here. The assignments can even be deduplicated in some cases.
  • If the two maps' subsets are not identical, we can still occasionally fuse them using a grid-strided loop (GSL) pattern (which essentially emulates a conditional to ensure that only the appropriate elements are assigned).
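
To make the two cases above concrete, here is a minimal plain-Python sketch. It is not the DaCe transformation itself: the function names, the 1D list, and the `num_threads` parameter are assumptions made only for illustration, and "subset" is taken to mean a contiguous index range.

```python
def fused_identical_ranges(arr, start, stop, value):
    """Two maps over the same range [start, stop) writing the same constant.

    Moving the second map's body into the first yields two identical writes,
    so a single write per element is enough (deduplication).
    """
    for i in range(start, stop):
        arr[i] = value  # the second map's body would repeat this exact write


def fused_grid_strided(arr, range1, range2, value, num_threads=4):
    """Two maps over different ranges, fused with a grid-strided loop.

    Each emulated thread walks the enclosing range with stride `num_threads`;
    the guard acts as the conditional that skips elements in neither subset.
    """
    lo = min(range1[0], range2[0])
    hi = max(range1[1], range2[1])
    for t in range(num_threads):                  # emulated parallel threads
        for i in range(lo + t, hi, num_threads):  # grid-strided traversal
            in_first = range1[0] <= i < range1[1]
            in_second = range2[0] <= i < range2[1]
            if in_first or in_second:
                arr[i] = value


arr = [0.0] * 16
fused_grid_strided(arr, (0, 4), (10, 14), 1.0)

other = [0] * 5
fused_identical_ranges(other, 1, 4, 7)
```

The guard costs a few comparisons per element, which is the price paid for launching one map instead of two.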

Motivating Example

Consider the following graphs, all representing a computation that assigns 1 to the boundary of a 2D domain. The first row shows the graphs scheduled for CPU, the second for GPU.

| Device | Original | w/o GSL | with GSL |
| --- | --- | --- | --- |
| CPU | 2d-orig | 2d-no-gsl | 2d-with-gsl |
| GPU | 2d-orig-GPU | 2d-no-gsl-GPU | 2d-with-gsl-GPU |
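
For readers without access to the graphs, the three variants can be approximated in plain Python. This is only a sketch under assumptions: the function names are hypothetical, the domain is an `m` x `n` list of lists, and the pairing of maps (rows with rows, columns with columns) is a guess at what the transformation does. The map counts 4, 2 and 1 do match the number of map nodes in the 2D instrumentation reports.

```python
def make_grid(m, n):
    """An m x n grid of zeros."""
    return [[0] * n for _ in range(m)]


def boundary_original(m, n):
    """Four separate maps: top row, bottom row, left column, right column."""
    g = make_grid(m, n)
    for j in range(n):
        g[0][j] = 1
    for j in range(n):
        g[m - 1][j] = 1
    for i in range(m):
        g[i][0] = 1
    for i in range(m):
        g[i][n - 1] = 1
    return g, 4  # grid and number of maps


def boundary_fused_no_gsl(m, n):
    """Maps with identical ranges fused pairwise (assumed pairing)."""
    g = make_grid(m, n)
    for j in range(n):  # both row maps iterate over 0:n
        g[0][j] = 1
        g[m - 1][j] = 1
    for i in range(m):  # both column maps iterate over 0:m
        g[i][0] = 1
        g[i][n - 1] = 1
    return g, 2


def boundary_fused_gsl(m, n, num_threads=4):
    """All four maps in one grid-strided loop over 0:max(m, n), with guards."""
    g = make_grid(m, n)
    for t in range(num_threads):  # emulated parallel threads
        for k in range(t, max(m, n), num_threads):
            if k < n:  # guard for the row maps
                g[0][k] = 1
                g[m - 1][k] = 1
            if k < m:  # guard for the column maps
                g[k][0] = 1
                g[k][n - 1] = 1
    return g, 1


grid, num_maps = boundary_fused_gsl(4, 6)
```

All three functions produce the same grid; only the number of maps that would need to be scheduled differs.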

Performance

We profiled a 2D and a 3D boundary initialization on both CPU and GPU (both on the Davinci cluster).
Benchmark scripts and reports are available at https://github.com/pratyai/dace/tree/bench-const-assignment-fusion
I will quote the performance summaries in further comments.

Comment on GPU performance

The GPU transformation adds an extra operation that copies the entire array to and from GPU memory, resulting in O(n^d) main memory <=> GPU data movement, whereas the assignment itself touches only O(n^{d-1}) elements. However, this is only because the benchmark does nothing but the assignment. In real computations, we would likely need to move the entire array anyway.

Because of this, it is probably better to focus on the combined performance of the map kernels here.
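
To put rough numbers on the O(n^d) versus O(n^{d-1}) gap, the boundary of an n^d domain can be counted as all elements minus the (n-2)^d interior. The sizes below are hypothetical and are not the sizes used in the benchmark.

```python
def total_elements(n, d):
    """Number of elements in an n^d domain (what the GPU copy moves)."""
    return n ** d


def boundary_elements(n, d):
    """Number of boundary elements: everything except the (n-2)^d interior."""
    interior = max(n - 2, 0) ** d
    return n ** d - interior


# Hypothetical sizes, chosen only for illustration.
for d, n in [(2, 1024), (3, 256)]:
    total = total_elements(n, d)
    boundary = boundary_elements(n, d)
    print(f"d={d}, n={n}: boundary={boundary}, total={total}, "
          f"fraction assigned={boundary / total:.4f}")
```

In 2D this gives 4n - 4 boundary elements and in 3D 6n^2 - 12n + 8, so the fraction of the copied array that is actually assigned shrinks roughly like 1/n.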

pratyai (Collaborator, Author) commented Oct 15, 2024

CPU Results (comment might be updated)

benchmark_const_assignment_fusion_test_assign_top_row_0 0.13079101336188614 ms
===2D boundary init: original op===
Instrumentation report
SDFG Hash: 
---------------------------------------------------------------------------
Element        Runtime (ms)   
               Min            Mean           Median         Max            
---------------------------------------------------------------------------
SDFG (0)                                                                   
|:                                                                         
|              0.108          0.128          0.131          0.366          
---------------------------------------------------------------------------
|-State (0)                                                                
| |-Node (0, Map benchmark_const_assignment_fusion_test_assign_top_row_47)                                                            
| | |Thread 12176263715017013721:                                                            
| | |          0.007          0.020          0.021          0.045          
---------------------------------------------------------------------------
|-State (1)                                                                
| |-Node (0, Map benchmark_const_assignment_fusion_test_assign_bottom_row_53)                                                            
| | |Thread 12176263715017013721:                                                            
| | |          0.005          0.006          0.006          0.010          
---------------------------------------------------------------------------
|-State (2)                                                                
| |-Node (0, Map benchmark_const_assignment_fusion_test_assign_left_col_59)                                                            
| | |Thread 12176263715017013721:                                                            
| | |          0.004          0.005          0.005          0.024          
---------------------------------------------------------------------------
|-State (3)                                                                
| |-Node (0, Map benchmark_const_assignment_fusion_test_assign_right_col_65)                                                            
| | |Thread 12176263715017013721:                                                            
| | |          0.004          0.006          0.005          0.100          
---------------------------------------------------------------------------


benchmark_const_assignment_fusion_test_assign_top_row_0 0.11151551734656096 ms
===2D boundary init: fused op w/o. grid-strided loop===
Instrumentation report
SDFG Hash: 
---------------------------------------------------------------------------
Element        Runtime (ms)   
               Min            Mean           Median         Max            
---------------------------------------------------------------------------
SDFG (0)                                                                   
|:                                                                         
|              0.090          0.111          0.112          0.920          
---------------------------------------------------------------------------
|-State (0)                                                                
| |-Node (4, Map map_fusion_wrapper)                                                            
| | |Thread 12176263715017013721:                                                            
| | |          0.014          0.021          0.018          0.151          
---------------------------------------------------------------------------
| |-Node (9, Map map_fusion_wrapper)                                                            
| | |Thread 12176263715017013721:                                                            
| | |          0.004          0.006          0.005          0.054          
---------------------------------------------------------------------------


benchmark_const_assignment_fusion_test_assign_top_row_0 0.10572001338005066 ms
===2D boundary init: fused op with grid-strided loop===
Instrumentation report
SDFG Hash: 
---------------------------------------------------------------------------
Element        Runtime (ms)   
               Min            Mean           Median         Max            
---------------------------------------------------------------------------
SDFG (0)                                                                   
|:                                                                         
|              0.084          0.103          0.106          0.393          
---------------------------------------------------------------------------
|-State (0)                                                                
| |-Node (8, Map map_fusion_wrapper)                                                            
| | |Thread 12176263715017013721:                                                            
| | |          0.009          0.022          0.022          0.024          
---------------------------------------------------------------------------


benchmark_const_assignment_fusion_test_assign_bounary_3d 0.23336600861512125 ms
===3D boundary init: original op===
Instrumentation report
SDFG Hash: 
---------------------------------------------------------------------------
Element        Runtime (ms)   
               Min            Mean           Median         Max            
---------------------------------------------------------------------------
SDFG (0)                                                                   
|:                                                                         
|              0.214          0.233          0.233          0.497          
---------------------------------------------------------------------------
|-State (0)                                                                
| |-Node (0, Map benchmark_const_assignment_fusion_test_assign_top_face_90)                                                            
| | |Thread 12176263715017013721:                                                            
| | |          0.020          0.027          0.027          0.054          
---------------------------------------------------------------------------
|-State (1)                                                                
| |-Node (0, Map benchmark_const_assignment_fusion_test_assign_bottom_face_9)                                                            
| | |Thread 12176263715017013721:                                                            
| | |          0.013          0.015          0.014          0.038          
---------------------------------------------------------------------------
|-State (2)                                                                
| |-Node (0, Map benchmark_const_assignment_fusion_test_assign_front_face_10)                                                            
| | |Thread 12176263715017013721:                                                            
| | |          0.012          0.013          0.013          0.035          
---------------------------------------------------------------------------
|-State (3)                                                                
| |-Node (0, Map benchmark_const_assignment_fusion_test_assign_back_face_108)                                                            
| | |Thread 12176263715017013721:                                                            
| | |          0.010          0.011          0.011          0.038          
---------------------------------------------------------------------------
|-State (4)                                                                
| |-Node (0, Map benchmark_const_assignment_fusion_test_assign_left_face_114)                                                            
| | |Thread 12176263715017013721:                                                            
| | |          0.029          0.031          0.031          0.051          
---------------------------------------------------------------------------
|-State (5)                                                                
| |-Node (0, Map benchmark_const_assignment_fusion_test_assign_right_face_12)                                                            
| | |Thread 12176263715017013721:                                                            
| | |          0.024          0.026          0.025          0.051          
---------------------------------------------------------------------------


benchmark_const_assignment_fusion_test_assign_bounary_3d 0.19121551304124296 ms
===3D boundary init: fused op w/o. grid-strided loop===
Instrumentation report
SDFG Hash: 
---------------------------------------------------------------------------
Element        Runtime (ms)   
               Min            Mean           Median         Max            
---------------------------------------------------------------------------
SDFG (0)                                                                   
|:                                                                         
|              0.173          0.192          0.191          0.427          
---------------------------------------------------------------------------
|-State (0)                                                                
| |-Node (4, Map map_fusion_wrapper)                                                            
| | |Thread 12176263715017013721:                                                            
| | |          0.026          0.033          0.033          0.056          
---------------------------------------------------------------------------
| |-Node (9, Map map_fusion_wrapper)                                                            
| | |Thread 12176263715017013721:                                                            
| | |          0.018          0.020          0.020          0.082          
---------------------------------------------------------------------------
| |-Node (14, Map map_fusion_wrapper)                                                            
| | |Thread 12176263715017013721:                                                            
| | |          0.037          0.043          0.042          0.066          
---------------------------------------------------------------------------


benchmark_const_assignment_fusion_test_assign_bounary_3d 0.18826551968231797 ms
===3D boundary init: fused op with grid-strided loop===
Instrumentation report
SDFG Hash: 
---------------------------------------------------------------------------
Element        Runtime (ms)   
               Min            Mean           Median         Max            
---------------------------------------------------------------------------
SDFG (0)                                                                   
|:                                                                         
|              0.170          0.188          0.188          0.469          
---------------------------------------------------------------------------
|-State (0)                                                                
| |-Node (14, Map map_fusion_wrapper)                                                            
| | |Thread 12176263715017013721:                                                            
| | |          0.091          0.099          0.096          0.200          
---------------------------------------------------------------------------
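
To read the CPU reports above in terms of combined map time, the per-map mean runtimes can simply be added up. The snippet below is a small helper sketch: the dictionary keys and the function name are made up for this summary, while the numbers are the Mean column values copied from the reports.

```python
# Mean runtimes (ms) of the individual map nodes, copied from the CPU reports.
cpu_map_means_ms = {
    "2D original": [0.020, 0.006, 0.005, 0.006],
    "2D fused w/o GSL": [0.021, 0.006],
    "2D fused with GSL": [0.022],
    "3D original": [0.027, 0.015, 0.013, 0.011, 0.031, 0.026],
    "3D fused w/o GSL": [0.033, 0.020, 0.043],
    "3D fused with GSL": [0.099],
}


def combined_ms(means):
    """Sum of per-map means, rounded to the precision of the reports."""
    return round(sum(means), 3)


cpu_combined = {name: combined_ms(m) for name, m in cpu_map_means_ms.items()}
for name, total in cpu_combined.items():
    print(f"{name}: {total:.3f} ms over {len(cpu_map_means_ms[name])} map(s)")
```

On these numbers the combined map time goes from 0.037 ms (original) to 0.027 ms (w/o GSL) and 0.022 ms (with GSL) in 2D, and from 0.123 ms to 0.096 ms and 0.099 ms in 3D.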

pratyai (Collaborator, Author) commented Oct 15, 2024

GPU Results (comment might be updated)

benchmark_const_assignment_fusion_test_assign_top_row_0 167.06380699179135 ms
===2D boundary init: original op===
Instrumentation report
SDFG Hash: 
---------------------------------------------------------------------------
Element        Runtime (ms)   
               Min            Mean           Median         Max            
---------------------------------------------------------------------------
SDFG (0)                                                                   
|:                                                                         
|              166.200        167.907        167.064        172.466        
---------------------------------------------------------------------------
|-State (0)                                                                
| |-Node (0, Map benchmark_const_assignment_fusion_test_assign_top_row_47)                                                            
| | |Thread 3118465057716905996:                                                            
| | |          0.002          0.004          0.004          0.009          
---------------------------------------------------------------------------
|-State (1)                                                                
| |-Node (0, Map benchmark_const_assignment_fusion_test_assign_bottom_row_53)                                                            
| | |Thread 3118465057716905996:                                                            
| | |          0.004          0.005          0.004          0.018          
---------------------------------------------------------------------------
|-State (2)                                                                
| |-Node (0, Map benchmark_const_assignment_fusion_test_assign_left_col_59)                                                            
| | |Thread 3118465057716905996:                                                            
| | |          0.004          0.006          0.006          0.014          
---------------------------------------------------------------------------
|-State (3)                                                                
| |-Node (0, Map benchmark_const_assignment_fusion_test_assign_right_col_65)                                                            
| | |Thread 3118465057716905996:                                                            
| | |          0.005          0.006          0.006          0.014          
---------------------------------------------------------------------------


benchmark_const_assignment_fusion_test_assign_top_row_0 170.15381151577458 ms
===2D boundary init: fused op w/o. grid-strided loop===
Instrumentation report
SDFG Hash: 
---------------------------------------------------------------------------
Element        Runtime (ms)   
               Min            Mean           Median         Max            
---------------------------------------------------------------------------
SDFG (0)                                                                   
|:                                                                         
|              166.765        169.892        170.154        172.003        
---------------------------------------------------------------------------
|-State (0)                                                                
| |-Node (6, Map map_fusion_wrapper)                                                            
| | |Thread 3118465057716905996:                                                            
| | |          0.003          0.004          0.004          0.008          
---------------------------------------------------------------------------
| |-Node (11, Map map_fusion_wrapper)                                                            
| | |Thread 3118465057716905996:                                                            
| | |          0.008          0.009          0.008          0.022          
---------------------------------------------------------------------------


benchmark_const_assignment_fusion_test_assign_top_row_0 165.11968348640949 ms
===2D boundary init: fused op with grid-strided loop===
Instrumentation report
SDFG Hash: 
---------------------------------------------------------------------------
Element        Runtime (ms)   
               Min            Mean           Median         Max            
---------------------------------------------------------------------------
SDFG (0)                                                                   
|:                                                                         
|              164.683        165.159        165.120        167.484        
---------------------------------------------------------------------------
|-State (0)                                                                
| |-Node (10, Map map_fusion_wrapper)                                                            
| | |Thread 3118465057716905996:                                                            
| | |          0.007          0.008          0.008          0.013          
---------------------------------------------------------------------------


benchmark_const_assignment_fusion_test_assign_bounary_3d 50.67180350306444 ms
===3D boundary init: original op===
Instrumentation report
SDFG Hash: 
---------------------------------------------------------------------------
Element        Runtime (ms)   
               Min            Mean           Median         Max            
---------------------------------------------------------------------------
SDFG (0)                                                                   
|:                                                                         
|              49.457         50.629         50.672         51.666         
---------------------------------------------------------------------------
|-State (0)                                                                
| |-Node (0, Map benchmark_const_assignment_fusion_test_assign_top_face_90)                                                            
| | |Thread 3118465057716905996:                                                            
| | |          0.008          0.008          0.008          0.013          
---------------------------------------------------------------------------
|-State (1)                                                                
| |-Node (0, Map benchmark_const_assignment_fusion_test_assign_bottom_face_9)                                                            
| | |Thread 3118465057716905996:                                                            
| | |          0.008          0.009          0.010          0.023          
---------------------------------------------------------------------------
|-State (2)                                                                
| |-Node (0, Map benchmark_const_assignment_fusion_test_assign_front_face_10)                                                            
| | |Thread 3118465057716905996:                                                            
| | |          0.008          0.008          0.008          0.024          
---------------------------------------------------------------------------
|-State (3)                                                                
| |-Node (0, Map benchmark_const_assignment_fusion_test_assign_back_face_108)                                                            
| | |Thread 3118465057716905996:                                                            
| | |          0.006          0.008          0.008          0.027          
---------------------------------------------------------------------------
|-State (4)                                                                
| |-Node (0, Map benchmark_const_assignment_fusion_test_assign_left_face_114)                                                            
| | |Thread 3118465057716905996:                                                            
| | |          0.042          0.044          0.044          0.053          
---------------------------------------------------------------------------
|-State (5)                                                                
| |-Node (0, Map benchmark_const_assignment_fusion_test_assign_right_face_12)                                                            
| | |Thread 3118465057716905996:                                                            
| | |          0.044          0.045          0.045          0.055          
---------------------------------------------------------------------------


benchmark_const_assignment_fusion_test_assign_bounary_3d 50.77531602000818 ms
===3D boundary init: fused op w/o. grid-strided loop===
Instrumentation report
SDFG Hash: 
---------------------------------------------------------------------------
Element        Runtime (ms)   
               Min            Mean           Median         Max            
---------------------------------------------------------------------------
SDFG (0)                                                                   
|:                                                                         
|              50.484         50.773         50.775         51.235         
---------------------------------------------------------------------------
|-State (0)                                                                
| |-Node (6, Map map_fusion_wrapper)                                                            
| | |Thread 3118465057716905996:                                                            
| | |          0.008          0.008          0.008          0.012          
---------------------------------------------------------------------------
| |-Node (11, Map map_fusion_wrapper)                                                            
| | |Thread 3118465057716905996:                                                            
| | |          0.006          0.008          0.008          0.020          
---------------------------------------------------------------------------
| |-Node (16, Map map_fusion_wrapper)                                                            
| | |Thread 3118465057716905996:                                                            
| | |          0.050          0.052          0.052          0.065          
---------------------------------------------------------------------------


benchmark_const_assignment_fusion_test_assign_bounary_3d 49.49491852312349 ms
===3D boundary init: fused op with grid-strided loop===
Instrumentation report
SDFG Hash: 
---------------------------------------------------------------------------
Element        Runtime (ms)   
               Min            Mean           Median         Max            
---------------------------------------------------------------------------
SDFG (0)                                                                   
|:                                                                         
|              49.321         49.533         49.495         50.064         
---------------------------------------------------------------------------
|-State (0)                                                                
| |-Node (16, Map map_fusion_wrapper)                                                            
| | |Thread 3118465057716905996:                                                            
| | |          0.055          0.057          0.057          0.063          
---------------------------------------------------------------------------
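
Since the SDFG totals on GPU are dominated by the array copies, a quick way to isolate the map kernels is to sum the per-map means and compare them with the SDFG mean. The sketch below does that. The dictionary layout and names are assumptions for this summary, and the values are the Mean column entries from the GPU reports above.

```python
# name: (SDFG mean in ms, [per-map mean in ms]), copied from the GPU reports.
gpu_reports = {
    "2D original": (167.907, [0.004, 0.005, 0.006, 0.006]),
    "2D fused w/o GSL": (169.892, [0.004, 0.009]),
    "2D fused with GSL": (165.159, [0.008]),
    "3D original": (50.629, [0.008, 0.009, 0.008, 0.008, 0.044, 0.045]),
    "3D fused w/o GSL": (50.773, [0.008, 0.008, 0.052]),
    "3D fused with GSL": (49.533, [0.057]),
}

gpu_summary = {}
for name, (sdfg_mean, map_means) in gpu_reports.items():
    maps_total = round(sum(map_means), 3)  # combined map kernel time
    fraction = maps_total / sdfg_mean      # share of the whole SDFG run
    gpu_summary[name] = (maps_total, fraction)
    print(f"{name}: maps={maps_total:.3f} ms, "
          f"share of SDFG mean={100 * fraction:.4f}%")
```

The combined map time is 0.021, 0.013 and 0.008 ms for the 2D variants and 0.122, 0.068 and 0.057 ms for the 3D variants, which is well under 1% of the SDFG mean in every configuration.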

pratyai added the no-ci label (Do not run any CI or actions for this PR) on Oct 24, 2024