
Programming Parallel Computers

Exercises in SIMD, multi-threading, and GPU programming from Jukka Suomela's Programming Parallel Computers course (https://ppc.cs.aalto.fi/).

Currently holds some #1 and top-3 solutions on the leaderboard.

Todos (potentially)

  • Figure out how to detect the compiler type so I don't have to keep hand-switching the intrinsics file between gcc/msvc (see the sketch below)
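A minimal sketch of one way to do it, assuming plain preprocessor detection is acceptable (the COMPILER_* macro names are placeholders, not from this repo):

    // Pick the intrinsics setup from the compiler's predefined macros instead of
    // swapping files by hand. _MSC_VER, __GNUC__, and __clang__ are the standard
    // predefined macros for MSVC, gcc, and clang respectively.
    #if defined(_MSC_VER) && !defined(__clang__)
        #include <intrin.h>        // MSVC intrinsics umbrella header
        #define COMPILER_MSVC 1
    #elif defined(__GNUC__) || defined(__clang__)
        #include <immintrin.h>     // gcc / clang x86 intrinsics
        #define COMPILER_GCC 1
    #else
        #error "Unsupported compiler"
    #endif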

Correlated Pairs

  • cp3a, down to 3.5 seconds, need to add avx512 support

AVX512 permute instructions

  • __m512d _mm512_permutexvar_pd (__m512i idx, __m512d a)
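Not code from the repo, just a reminder of the semantics: lane i of the result takes element idx[i] of a, so a descending index vector reverses the lanes (requires AVX-512F, e.g. -mavx512f):

    #include <immintrin.h>

    // Reverse the 8 doubles in v: result lane i = v[idx lane i].
    // _mm512_set_epi64 takes its arguments highest lane first, so read from
    // lane 0 upward the index vector below is {7, 6, 5, 4, 3, 2, 1, 0}.
    static inline __m512d ReverseLanes(__m512d v)
    {
        __m512i idx = _mm512_set_epi64(0, 1, 2, 3, 4, 5, 6, 7);
        return _mm512_permutexvar_pd(idx, v);
    }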

Median Filter

  • Hand-rolled, median of medians
  • convert recursive -> iterative
  • refactor quicksort into a separate partition function (see the sketch below)
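A rough sketch of what that refactor could look like, not code from the repo: a standalone Lomuto partition (last-element pivot rather than median-of-medians) driven by an iterative quickselect loop instead of recursion.

    #include <cstddef>
    #include <utility>

    // Lomuto partition: move everything smaller than the pivot (the last
    // element) to the left, put the pivot in its final sorted slot, return it.
    static size_t Partition(float *A, size_t Lo, size_t Hi)
    {
        float Pivot = A[Hi];
        size_t Store = Lo;
        for (size_t i = Lo; i < Hi; ++i)
            if (A[i] < Pivot)
                std::swap(A[i], A[Store++]);
        std::swap(A[Store], A[Hi]);
        return Store;
    }

    // Iterative quickselect: K-th smallest element (0-based), no recursion.
    static float QuickSelect(float *A, size_t N, size_t K)
    {
        size_t Lo = 0, Hi = N - 1;
        while (Lo < Hi)
        {
            size_t P = Partition(A, Lo, Hi);
            if (P == K)     return A[P];
            else if (P < K) Lo = P + 1;
            else            Hi = P - 1;
        }
        return A[K];
    }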

Updates

2024/2/3

  • Okay, so it turns out there's a segfault, probably related to WSL2 + Unified Memory issues... to resolve this I got a new laptop with good Linux support lol

2023/29/07

  • Really bizarre issue where changing the kernel line `if (i >= ny || j >= ny || j > i) return;` to `if (i >= ny || j >= ny) return;` makes the issue go away?

2023/15/7

(image omitted)

2023/08/7

  • Implemented the float version, the fastest time yet and under 1.5 s; shook at 92 lines of code and 1.14 s, who are you rv?

2023/04/7

  • Back to avx2 but with more register re-use; solidly puts me at #3 but still not below 3 seconds

  • I can maybe get down to < 3.0 with avx512, but it's a huge pain not being able to develop locally (...do I get an avx512 cpu just for this???? )

2023/03/7

  • Added avx512 support; a huge pain since I don't have an avx512 cpu, but it shaved off 10% of the time. Had a problem with uninitialized-variable warnings (which the compiler is set to treat as errors) in the cvtps_pd instructions
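For reference, a minimal sketch of the kind of pattern that keeps that path warning-free, assuming the warning came from a partially initialized vector feeding _mm512_cvtps_pd; the actual code in the repo may differ.

    #include <immintrin.h>

    // Widen up to 8 floats to doubles. Zero-initializing the staging buffer
    // means no lane is ever read uninitialized, so gcc's -Wmaybe-uninitialized
    // stays quiet even when warnings are promoted to errors.
    static inline __m512d WidenTail(const float *Src, int Count)   // Count <= 8
    {
        float Buf[8] = {};                  // explicit zero init
        for (int i = 0; i < Count; ++i)
            Buf[i] = Src[i];
        __m256 Packed = _mm256_loadu_ps(Buf);
        return _mm512_cvtps_pd(Packed);     // 8 floats -> 8 doubles
    }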

2023/29/6

  • Turns out the test machines have avx512 support, which only helps a tiny bit (2c) since we're still memory-constrained

2023/27/6

  • Okay, so it turns out the 26/6 stuff was not helpful... I think I figured out why

  • In the meantime, just tuning the original vector output size to 4x3 or 4x4 was enough to drop to about 4.1 seconds. I also noticed that the test machines have avx512 support, so that's probably what to try next

  • This is the new kernel I tried. Notably, I started with the commented-out broadcast, which makes pretty large jumps between addresses compared to my usual access pattern; register utilization went up, but it probably blew out the cache / prefetcher

        // For each k: broadcast one element per output row i, then FMA it
        // against Cols vectors taken from row k of the transposed data.
        for (int k = kStart; k < kEnd; ++k)
            for(s32 i = 0; i < Rows; ++i)
            {
                // f64x4 broadcast = BroadcastF64(&Data[DimInner * (Row + i) + k]);  // jumps by DimInner per i
                f64x4 broadcast = BroadcastF64(&DataT[DimOuter*k + (Row + i)]);      // consecutive across i
                for(s32 j = 0; j < Cols; ++j)
                {
                    f64x4 row = loadu(&DataT[(DimOuter * k) + (Col + j * VecWidth)]); // vector j of row k
                    DotProds[i][j] = DotProds[i][j] + (broadcast * row);
                }
            }

2023/26/6

  • omp dynamic scheduling was helpful....

  • Major Refactor

So currently the innermost computation does something like this:

  • We want to calculate result[x:x+3][y:y+3] in essentially one pass, because we only need to load 3n + 3n vectors to compute 9 outputs, a 9/(6n) outputs-per-load ratio versus, say, 1/(2n) when computing one output at a time

  • We do this with vectors, so we load the following (...just had a thought, not 100% sure if the loop was properly unrolled):

    x0 = Data[x + 0][k : k+8]
    x1 = Data[x + 1][k : k+8]
    x2 = Data[x + 2][k : k+8]
    
    y0 = Data[k:k + 8][y + 0]
    y1 = Data[k:k + 8][y + 1]
    y2 = Data[k:k + 8][y + 2]

    This results in the following asm, where we can see the 6 loads and 9 FMAs:

.L42:
        vmovupd ymm0, YMMWORD PTR [r9+rax*8]  // load0
        vmovupd ymm3, YMMWORD PTR [rsi+rax*8] // load1
        vmovupd ymm2, YMMWORD PTR [rcx+rax*8] // load2
        vmovupd ymm1, YMMWORD PTR [rdx+rax*8] // load3
        vfmadd231pd     ymm4, ymm0, ymm3
        vfmadd231pd     ymm13, ymm0, ymm2
        vfmadd231pd     ymm12, ymm0, ymm1
        vmovupd ymm0, YMMWORD PTR [r8+rax*8] // load4
        vfmadd231pd     ymm6, ymm3, ymm0
        vfmadd231pd     ymm10, ymm2, ymm0
        vfmadd231pd     ymm9, ymm1, ymm0
        vmovupd ymm0, YMMWORD PTR [rdi+rax*8] //load5
        add     rax, 4
        vfmadd231pd     ymm5, ymm3, ymm0
        vfmadd231pd     ymm8, ymm2, ymm0
        vfmadd231pd     ymm7, ymm1, ymm0
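Spelled out with AVX2 intrinsics, one k-step of that 3x3 tile looks roughly like this (a sketch with assumed names, not the repo's code); the compiler turns it into the 6 loads and 9 FMAs above.

    #include <immintrin.h>

    // One k-step of the 3x3 register tile: x0..x2 point into three rows of the
    // left operand, y0..y2 into three columns of the (transposed) right operand,
    // and acc[3][3] stays in registers across the whole k loop.
    static inline void Tile3x3Step(const double *x0, const double *x1, const double *x2,
                                   const double *y0, const double *y1, const double *y2,
                                   __m256d acc[3][3])
    {
        __m256d a0 = _mm256_loadu_pd(x0), a1 = _mm256_loadu_pd(x1), a2 = _mm256_loadu_pd(x2);
        __m256d b0 = _mm256_loadu_pd(y0), b1 = _mm256_loadu_pd(y1), b2 = _mm256_loadu_pd(y2);

        acc[0][0] = _mm256_fmadd_pd(a0, b0, acc[0][0]);
        acc[0][1] = _mm256_fmadd_pd(a0, b1, acc[0][1]);
        acc[0][2] = _mm256_fmadd_pd(a0, b2, acc[0][2]);
        acc[1][0] = _mm256_fmadd_pd(a1, b0, acc[1][0]);
        acc[1][1] = _mm256_fmadd_pd(a1, b1, acc[1][1]);
        acc[1][2] = _mm256_fmadd_pd(a1, b2, acc[1][2]);
        acc[2][0] = _mm256_fmadd_pd(a2, b0, acc[2][0]);
        acc[2][1] = _mm256_fmadd_pd(a2, b1, acc[2][1]);
        acc[2][2] = _mm256_fmadd_pd(a2, b2, acc[2][2]);
    }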

Okay, so the issue now is that we're hitting a memory bottleneck at larger sizes. We really need to increase the amount of computation per memory load. It's easiest to see this by examining how a matrix multiply uses each element of the respective matrices.

From Eli Bendersky's "Visualizing matrix multiplication as a linear combination" we see that a column of the output is generated by a linear combination of the columns of the left matrix,

but if we focus on a single element, we can also think of it this way:

  • any element of the right matrix participates in ny elements of the output
  • any element of the right matrix is multiplied by every element of a column of the left matrix
  • we locate that column by the row of the element

We can also think of this in a row-oriented manner:

  • any element of the left matrix is multiplied by every element of a row of the right matrix

  • we locate that row by the column of the element.
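To make the row-oriented view concrete, here is a plain scalar reference (not repo code): each left-matrix element is loaded once and reused across a full row of the right matrix.

    // C[i][:] += A[i][k] * B[k][:], with A of size NxK, B of size KxM,
    // C of size NxM. The caller zero-initializes C.
    void MatMulRowOriented(const double *A, const double *B, double *C,
                           int N, int K, int M)
    {
        for (int i = 0; i < N; ++i)
            for (int k = 0; k < K; ++k)
            {
                double a = A[i * K + k];                // one element of the left matrix
                for (int j = 0; j < M; ++j)
                    C[i * M + j] += a * B[k * M + j];   // scales row k of the right matrix
            }
    }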

So now our new strategy is to load a single element of the left matrix, broadcast it, and multiply it against vectors of the corresponding row of the right matrix:

//EXTREMELY rough sketch
f64 data[16][4] = 
{
    0,1,2,3,
    4,5,6,7,
    8,9,10,11,
    12,13,14,15,
    0,1,2,3,
    4,5,6,7,
    8,9,10,11,
    12,13,14,15,
    0,1,2,3,
    4,5,6,7,
    8,9,10,11,
    12,13,14,15,
    0,1,2,3,
    4,5,6,7,
    8,9,10,11,
    12,13,14,15,
};

f64 data2[4][16] = 
{
    0,1,2,3,4,5,6,7, /*  */   8, 9, 10, 11, 12, 13, 14, 15,
    0,1,2,3,4,5,6,7, /*  */   8, 9, 10, 11, 12, 13, 14, 15,
    0,1,2,3,4,5,6,7, /*  */   8, 9, 10, 11, 12, 13, 14, 15,
    0,1,2,3,4,5,6,7, /*  */   8, 9, 10, 11, 12, 13, 14, 15,
};

f64 result[16][16] = {};

void kernel(f64* NormData, s32 Row, s32 Col)
{
    f64x4 DotProds[2][2] = {};              // 2 output rows x 2 vectors (8 columns)

    for (int k = 0; k < nx; ++k)            // nx is the inner dimension (4 here)
    {
        // one left-matrix element per output row, broadcast to all lanes
        f64x4 broadcast0 = BroadcastF64(&data[Row + 0][k]);
        f64x4 broadcast1 = BroadcastF64(&data[Row + 1][k]);

        // two 4-wide vectors from row k of the right matrix
        f64x4 row0 = loadu(&data2[k][Col + 0]);
        f64x4 row1 = loadu(&data2[k][Col + 4]);

        DotProds[0][0] += broadcast0 * row0;
        DotProds[0][1] += broadcast0 * row1;
        DotProds[1][0] += broadcast1 * row0;
        DotProds[1][1] += broadcast1 * row1;
    }
}

2023/25/6

  • cp3a blocking is working but still really slow; the GFLOP/s going down suggests I'll have to do some sort of blocking to get this to work
  • cp2b omp parallel (see the sketch after this list)
  • cp2c: was confused why VecNormData[PaddedX*Row + VecIdx] wasn't working; the array width changes now, so it's VecNormData[VecCount*Row + VecIdx] ...just kidding, this actually crashes for some reason? Unaligned loads?
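For context, roughly the shape of the omp parallel version, sketched from memory rather than taken from the repo (the names ny, nx, NormData, and Result, and the row-major layout, are assumptions): the work per output row shrinks as i grows, so schedule(dynamic) keeps the threads balanced.

    #include <omp.h>

    // Each output row is independent; row i only computes the j >= i entries,
    // so later rows have less work and dynamic scheduling helps.
    void Correlate(int ny, int nx, const float *NormData, float *Result)
    {
        #pragma omp parallel for schedule(dynamic)
        for (int i = 0; i < ny; ++i)
            for (int j = i; j < ny; ++j)
            {
                double Dot = 0.0;
                for (int k = 0; k < nx; ++k)
                    Dot += (double)NormData[i * nx + k] * NormData[j * nx + k];
                Result[i * ny + j] = (float)Dot;    // upper triangle only
            }
    }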

2023/24/6

  • Mf2: OMP parallel version, currently the fastest solution; no idea why though...
  • cp1, cp2: sequential and ILP versions; tried ILP on the processing phase, which seems to slow things down

2023/23/6

  • Median filter, sequential version; had some issues getting QuickSelect working, should do a more in-depth write-up
  • Turns out I could just use nth_element, which is way faster (see the sketch below)
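For reference, a sketch of the nth_element approach for a single window (gathering the window itself is assumed, not shown): partially order the copied pixels so the middle one lands in its sorted position.

    #include <algorithm>
    #include <vector>

    // Median of one window: std::nth_element places the value at index Mid in
    // its sorted position in O(n) average time; everything else stays unordered,
    // which is all the filter needs. (For an even-sized window you'd also grab
    // the other middle element and average the two.)
    float WindowMedian(std::vector<float> Window)
    {
        size_t Mid = Window.size() / 2;
        std::nth_element(Window.begin(), Window.begin() + Mid, Window.end());
        return Window[Mid];
    }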

Sources

Based on Jukka Suomela's materials at https://ppc.cs.aalto.fi/ack/, licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).
