
Programming Parallel Computers

Exercises in SIMD, multi-threading, and GPU programming from Jukka Suomela's Programming Parallel Computers course (https://ppc.cs.aalto.fi/).

Currently holds some #1 and top-3 solutions on the leaderboard.

Todos (potentially)

  • Figure out how to detect the compiler type so I don't have to keep hand-switching the intrinsics file between gcc/msvc (see the sketch below)
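A minimal sketch of one way to do it, assuming plain preprocessor detection is acceptable (the COMPILER_* macro names are placeholders, not from this repo):

    // Pick the intrinsics setup from the compiler's predefined macros instead of
    // swapping files by hand. _MSC_VER, __GNUC__, and __clang__ are the standard
    // predefined macros for MSVC, gcc, and clang respectively.
    #if defined(_MSC_VER) && !defined(__clang__)
        #include <intrin.h>        // MSVC intrinsics umbrella header
        #define COMPILER_MSVC 1
    #elif defined(__GNUC__) || defined(__clang__)
        #include <immintrin.h>     // gcc / clang x86 intrinsics
        #define COMPILER_GCC 1
    #else
        #error "Unsupported compiler"
    #endif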

Correlated Pairs

  • cp3a, down to 3.5 seconds, need to add avx512 support

AVX512 permute instructions

  • __m512d _mm512_permutexvar_pd (__m512i idx, __m512d a)
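Not code from the repo, just a reminder of the semantics: lane i of the result takes element idx[i] of a, so a descending index vector reverses the lanes (requires AVX-512F, e.g. -mavx512f):

    #include <immintrin.h>

    // Reverse the 8 doubles in v: result lane i = v[idx lane i].
    // _mm512_set_epi64 takes its arguments highest lane first, so read from
    // lane 0 upward the index vector below is {7, 6, 5, 4, 3, 2, 1, 0}.
    static inline __m512d ReverseLanes(__m512d v)
    {
        __m512i idx = _mm512_set_epi64(0, 1, 2, 3, 4, 5, 6, 7);
        return _mm512_permutexvar_pd(idx, v);
    }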

Median Filter

  • Hand-rolled, median of medians
  • convert recursive -> iterative
  • refactor quicksort into a separate partition function (see the sketch below)
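A rough sketch of what that refactor could look like, not code from the repo: a standalone Lomuto partition (last-element pivot rather than median-of-medians) driven by an iterative quickselect loop instead of recursion.

    #include <cstddef>
    #include <utility>

    // Lomuto partition: move everything smaller than the pivot (the last
    // element) to the left, put the pivot in its final sorted slot, return it.
    static size_t Partition(float *A, size_t Lo, size_t Hi)
    {
        float Pivot = A[Hi];
        size_t Store = Lo;
        for (size_t i = Lo; i < Hi; ++i)
            if (A[i] < Pivot)
                std::swap(A[i], A[Store++]);
        std::swap(A[Store], A[Hi]);
        return Store;
    }

    // Iterative quickselect: K-th smallest element (0-based), no recursion.
    static float QuickSelect(float *A, size_t N, size_t K)
    {
        size_t Lo = 0, Hi = N - 1;
        while (Lo < Hi)
        {
            size_t P = Partition(A, Lo, Hi);
            if (P == K)     return A[P];
            else if (P < K) Lo = P + 1;
            else            Hi = P - 1;
        }
        return A[K];
    }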

Updates

2024/2/3

  • Okay, so it turns out there's a segfault, probably related to WSL2 + Unified Memory issues... to resolve this I got a new laptop with good Linux support lol

2023/29/07

  • Really bizarre issue where changing the kernel line `if (i >= ny || j >= ny || j > i) return;` to `if (i >= ny || j >= ny) return;` makes the issue go away?

2023/15/7

(image omitted)

2023/08/7

  • Implemented the float version, the fastest time yet and under 1.5 s; shook at 92 lines of code and 1.14 s, who are you rv?

2023/04/7

  • Back to avx2 but with more register re-use; solidly puts me at #3 but still not below 3 seconds

  • I can maybe get down to < 3.0 with avx512, but it's a huge pain not being able to develop locally (...do I get an avx512 cpu just for this???? )

2023/03/7

  • Added avx512 support; a huge pain since I don't have an avx512 cpu, but it shaved off 10% of the time. Had a problem with uninitialized-variable warnings (which the compiler is set to treat as errors) in the cvtps_pd instructions
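For reference, a minimal sketch of the kind of pattern that keeps that path warning-free, assuming the warning came from a partially initialized vector feeding _mm512_cvtps_pd; the actual code in the repo may differ.

    #include <immintrin.h>

    // Widen up to 8 floats to doubles. Zero-initializing the staging buffer
    // means no lane is ever read uninitialized, so gcc's -Wmaybe-uninitialized
    // stays quiet even when warnings are promoted to errors.
    static inline __m512d WidenTail(const float *Src, int Count)   // Count <= 8
    {
        float Buf[8] = {};                  // explicit zero init
        for (int i = 0; i < Count; ++i)
            Buf[i] = Src[i];
        __m256 Packed = _mm256_loadu_ps(Buf);
        return _mm512_cvtps_pd(Packed);     // 8 floats -> 8 doubles
    }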

2023/29/6

  • Turns out the test machines have avx512 support, which only helps a tiny bit (2c) since we're still memory-constrained

2023/27/6

  • Okay, so it turns out the 26/6 stuff was not helpful... I think I figured out why

  • In the meantime, just tuning the original vector output size to 4x3 or 4x4 was enough to drop to about 4.1 seconds. I also noticed that the test machines have avx512 support, so that's probably what to try next

  • This is the new kernel I tried. Notably, I started with the commented-out broadcast, which makes pretty large jumps between addresses compared to my usual access pattern; register utilization went up, but it probably blew out the cache / prefetcher

        // For each k: broadcast one element per output row i, then FMA it
        // against Cols vectors taken from row k of the transposed data.
        for (int k = kStart; k < kEnd; ++k)
            for(s32 i = 0; i < Rows; ++i)
            {
                // f64x4 broadcast = BroadcastF64(&Data[DimInner * (Row + i) + k]);  // jumps by DimInner per i
                f64x4 broadcast = BroadcastF64(&DataT[DimOuter*k + (Row + i)]);      // consecutive across i
                for(s32 j = 0; j < Cols; ++j)
                {
                    f64x4 row = loadu(&DataT[(DimOuter * k) + (Col + j * VecWidth)]); // vector j of row k
                    DotProds[i][j] = DotProds[i][j] + (broadcast * row);
                }
            }

2023/26/6

  • omp dynamic scheduling was helpful....

  • Major Refactor

So currently the innermost computation does something like this:

  • We want to calculate result[x:x+3][y:y+3] in essentially one pass, because we only need to load 3n + 3n vectors to compute 9 outputs, a 9/(6n) outputs-per-load ratio versus, say, 1/(2n) when computing one output at a time

  • We do this with vectors, so we load the following (...just had a thought, not 100% sure if the loop was properly unrolled):

    x0 = Data[x + 0][k : k+8]
    x1 = Data[x + 1][k : k+8]
    x2 = Data[x + 2][k : k+8]
    
    y0 = Data[k:k + 8][y + 0]
    y1 = Data[k:k + 8][y + 1]
    y2 = Data[k:k + 8][y + 2]

    This results in the following asm, where we can see the 6 loads and 9 FMAs:

.L42:
        vmovupd ymm0, YMMWORD PTR [r9+rax*8]  // load0
        vmovupd ymm3, YMMWORD PTR [rsi+rax*8] // load1
        vmovupd ymm2, YMMWORD PTR [rcx+rax*8] // load2
        vmovupd ymm1, YMMWORD PTR [rdx+rax*8] // load3
        vfmadd231pd     ymm4, ymm0, ymm3
        vfmadd231pd     ymm13, ymm0, ymm2
        vfmadd231pd     ymm12, ymm0, ymm1
        vmovupd ymm0, YMMWORD PTR [r8+rax*8] // load4
        vfmadd231pd     ymm6, ymm3, ymm0
        vfmadd231pd     ymm10, ymm2, ymm0
        vfmadd231pd     ymm9, ymm1, ymm0
        vmovupd ymm0, YMMWORD PTR [rdi+rax*8] //load5
        add     rax, 4
        vfmadd231pd     ymm5, ymm3, ymm0
        vfmadd231pd     ymm8, ymm2, ymm0
        vfmadd231pd     ymm7, ymm1, ymm0
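Spelled out with AVX2 intrinsics, one k-step of that 3x3 tile looks roughly like this (a sketch with assumed names, not the repo's code); the compiler turns it into the 6 loads and 9 FMAs above.

    #include <immintrin.h>

    // One k-step of the 3x3 register tile: x0..x2 point into three rows of the
    // left operand, y0..y2 into three columns of the (transposed) right operand,
    // and acc[3][3] stays in registers across the whole k loop.
    static inline void Tile3x3Step(const double *x0, const double *x1, const double *x2,
                                   const double *y0, const double *y1, const double *y2,
                                   __m256d acc[3][3])
    {
        __m256d a0 = _mm256_loadu_pd(x0), a1 = _mm256_loadu_pd(x1), a2 = _mm256_loadu_pd(x2);
        __m256d b0 = _mm256_loadu_pd(y0), b1 = _mm256_loadu_pd(y1), b2 = _mm256_loadu_pd(y2);

        acc[0][0] = _mm256_fmadd_pd(a0, b0, acc[0][0]);
        acc[0][1] = _mm256_fmadd_pd(a0, b1, acc[0][1]);
        acc[0][2] = _mm256_fmadd_pd(a0, b2, acc[0][2]);
        acc[1][0] = _mm256_fmadd_pd(a1, b0, acc[1][0]);
        acc[1][1] = _mm256_fmadd_pd(a1, b1, acc[1][1]);
        acc[1][2] = _mm256_fmadd_pd(a1, b2, acc[1][2]);
        acc[2][0] = _mm256_fmadd_pd(a2, b0, acc[2][0]);
        acc[2][1] = _mm256_fmadd_pd(a2, b1, acc[2][1]);
        acc[2][2] = _mm256_fmadd_pd(a2, b2, acc[2][2]);
    }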

Okay, so the issue now is that we're hitting a memory bottleneck at larger sizes. We really need to increase the amount of computation per memory load. It's easiest to see this by examining how a matrix multiply uses each element of the respective matrices.

From Eli Bendersky's "Visualizing matrix multiplication as a linear combination" we see that a column of the output is generated by a linear combination of the columns of the left matrix,

but if we focus on a single element, we can also think of it this way:

  • any element of the right matrix participates in ny elements of the output
  • any element of the right matrix is multiplied by every element of a column of the left matrix
  • we locate that column by the row of the element

We can also think of this in a row-oriented manner:

  • any element of the left matrix is multiplied by every element of a row of the right matrix

  • we locate that row by the column of the element.
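To make the row-oriented view concrete, here is a plain scalar reference (not repo code): each left-matrix element is loaded once and reused across a full row of the right matrix.

    // C[i][:] += A[i][k] * B[k][:], with A of size NxK, B of size KxM,
    // C of size NxM. The caller zero-initializes C.
    void MatMulRowOriented(const double *A, const double *B, double *C,
                           int N, int K, int M)
    {
        for (int i = 0; i < N; ++i)
            for (int k = 0; k < K; ++k)
            {
                double a = A[i * K + k];                // one element of the left matrix
                for (int j = 0; j < M; ++j)
                    C[i * M + j] += a * B[k * M + j];   // scales row k of the right matrix
            }
    }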

So now our new strategy is to load a single element of the left matrix, broadcast it, and multiply it against vectors of the corresponding row of the right matrix:

//EXTREMELY rough sketch
f64 data[16][4] = 
{
    0,1,2,3,
    4,5,6,7,
    8,9,10,11,
    12,13,14,15,
    0,1,2,3,
    4,5,6,7,
    8,9,10,11,
    12,13,14,15,
    0,1,2,3,
    4,5,6,7,
    8,9,10,11,
    12,13,14,15,
    0,1,2,3,
    4,5,6,7,
    8,9,10,11,
    12,13,14,15,
};

f64 data2[4][16] = 
{
    0,1,2,3,4,5,6,7, /*  */   8, 9, 10, 11, 12, 13, 14, 15,
    0,1,2,3,4,5,6,7, /*  */   8, 9, 10, 11, 12, 13, 14, 15,
    0,1,2,3,4,5,6,7, /*  */   8, 9, 10, 11, 12, 13, 14, 15,
    0,1,2,3,4,5,6,7, /*  */   8, 9, 10, 11, 12, 13, 14, 15,
};

f64 result[16][16] = {};

void kernel(f64* NormData, s32 Row, s32 Col)
{
    f64x4 DotProds[2][2] = {};              // 2 output rows x 2 vectors (8 columns)

    for (int k = 0; k < nx; ++k)            // nx is the inner dimension (4 here)
    {
        // one left-matrix element per output row, broadcast to all lanes
        f64x4 broadcast0 = BroadcastF64(&data[Row + 0][k]);
        f64x4 broadcast1 = BroadcastF64(&data[Row + 1][k]);

        // two 4-wide vectors from row k of the right matrix
        f64x4 row0 = loadu(&data2[k][Col + 0]);
        f64x4 row1 = loadu(&data2[k][Col + 4]);

        DotProds[0][0] += broadcast0 * row0;
        DotProds[0][1] += broadcast0 * row1;
        DotProds[1][0] += broadcast1 * row0;
        DotProds[1][1] += broadcast1 * row1;
    }
}

2023/25/6

  • cp3a blocking is working but still really slow; the GFLOP/s going down suggests I'll have to do some sort of blocking to get this to work
  • cp2b omp parallel (see the sketch after this list)
  • cp2c: was confused why VecNormData[PaddedX*Row + VecIdx] wasn't working; the array width changes now, so it's VecNormData[VecCount*Row + VecIdx] ...just kidding, this actually crashes for some reason? Unaligned loads?
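For context, roughly the shape of the omp parallel version, sketched from memory rather than taken from the repo (the names ny, nx, NormData, and Result, and the row-major layout, are assumptions): the work per output row shrinks as i grows, so schedule(dynamic) keeps the threads balanced.

    #include <omp.h>

    // Each output row is independent; row i only computes the j >= i entries,
    // so later rows have less work and dynamic scheduling helps.
    void Correlate(int ny, int nx, const float *NormData, float *Result)
    {
        #pragma omp parallel for schedule(dynamic)
        for (int i = 0; i < ny; ++i)
            for (int j = i; j < ny; ++j)
            {
                double Dot = 0.0;
                for (int k = 0; k < nx; ++k)
                    Dot += (double)NormData[i * nx + k] * NormData[j * nx + k];
                Result[i * ny + j] = (float)Dot;    // upper triangle only
            }
    }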

2023/24/6

  • Mf2: OMP parallel version, currently the fastest solution; no idea why though...
  • cp1, cp2: sequential and ILP versions; tried ILP on the processing phase, which seems to slow things down

2023/23/6

  • Median filter, sequential version; had some issues getting QuickSelect working, should do a more in-depth write-up
  • Turns out I could just use nth_element, which is way faster (see the sketch below)
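For reference, a sketch of the nth_element approach for a single window (gathering the window itself is assumed, not shown): partially order the copied pixels so the middle one lands in its sorted position.

    #include <algorithm>
    #include <vector>

    // Median of one window: std::nth_element places the value at index Mid in
    // its sorted position in O(n) average time; everything else stays unordered,
    // which is all the filter needs. (For an even-sized window you'd also grab
    // the other middle element and average the two.)
    float WindowMedian(std::vector<float> Window)
    {
        size_t Mid = Window.size() / 2;
        std::nth_element(Window.begin(), Window.begin() + Mid, Window.end());
        return Window[Mid];
    }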

Sources

Based on Jukka Suomela's materials at https://ppc.cs.aalto.fi/ack/, licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).
