Exercises from Jukka Suomela's Programming Parallel Computers course (https://ppc.cs.aalto.fi/).
Currently holding some #1 and top-3 solutions on the leaderboard.
- Figure out how to check the compiler type so I don't have to keep re-doing the intrinsics file when switching between gcc/msvc (a sketch is at the end of this list)
- cp3a, down to 3.5 seconds, need to add avx512 support
AVX512 permute instructions
__m512d _mm512_permutexvar_pd (__m512i idx, __m512d a)
- Hand-rolled median of medians
- convert recursive -> iterative
- refactor quicksort into a separate partition function
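For the compiler-check TODO above, a minimal preprocessor sketch; the assumption is that the gcc/msvc difference only shows up in which intrinsics header gets pulled in and in the force-inline spelling (the FORCE_INLINE macro name is my own):

#if defined(_MSC_VER) && !defined(__clang__)
    #include <intrin.h>                 // msvc keeps the intrinsics here
    #define FORCE_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
    #include <immintrin.h>              // gcc/clang: avx/avx2/avx512 intrinsics
    #define FORCE_INLINE inline __attribute__((always_inline))
#else
    #error "unsupported compiler"
#endif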
2024/2/3
- Okay so it turns out there's a segfault, probably related to WSL2 + Unified Memory issues... to resolve this I got a new laptop with good Linux support lol
2023/29/07
- really bizarre issue where changing the kernel line
  if (i >= ny || j >= ny || j > i) return;
  to
  if (i >= ny || j >= ny) return;
  means there's no issue?
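For context, a minimal sketch of the kind of kernel this guard sits in, assuming a 2D launch over an ny x ny symmetric output where the j > i term skips the redundant upper triangle (the body and names here are hypothetical, not the actual kernel):

__global__ void MyKernel(float* result, const float* data, int ny, int nx)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int j = threadIdx.y + blockIdx.y * blockDim.y;
    if (i >= ny || j >= ny || j > i) return;         // the guard in question
    float sum = 0.0f;
    for (int k = 0; k < nx; ++k)
        sum += data[nx * i + k] * data[nx * j + k];  // dot product of rows i and j
    result[ny * j + i] = sum;                        // lower triangle only
}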
2023/15/7
2023/08/7
- Implemented the float version, the fastest time! and less than 1.5s; shook at 92 lines of code running in 1.14s, who are you rv?
2023/04/7
- Back to avx2 but using more register re-use; solidly puts me at #3 but still not below 3 seconds
- I can maybe get down to < 3.0 with avx512 but it's a huge pain not being able to develop locally (...do I get an avx512 cpu just for this????)
2023/03/7
- Add avx512 support, a huge pain since I don't have an avx512 cpu; shaved off 10% of the time. Had a problem with uninitialized-variable warnings (which the compiler is set to treat as errors) in the cvtps_pd instructions
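A minimal sketch of the kind of fix this took; the assumption is that gcc flags a partially written source vector as "maybe uninitialized" under warnings-as-errors, so it gets zero-initialized explicitly before the widening convert (WidenTail and its parameters are hypothetical names; needs AVX512F + AVX512VL):

#include <immintrin.h>

// Hypothetical sketch: widen up to 8 floats to doubles for the avx512 path.
// Zero-initializing Tail up front keeps -Werror=maybe-uninitialized quiet.
static inline __m512d WidenTail(const float* Src, int Count)
{
    __m256 Tail = _mm256_setzero_ps();  // explicit init instead of leaving lanes undefined
    Tail = _mm256_mask_loadu_ps(Tail, (__mmask8)((1u << Count) - 1u), Src);
    return _mm512_cvtps_pd(Tail);       // __m512d _mm512_cvtps_pd(__m256 a)
}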
2023/29/6
- Turns out the test machines have avx512 support, which only helps a tiny bit (2c) since we're still memory-constrained
2023/27/6
- Okay so it turns out that the 26/6 stuff was not helpful...I think I figured out why
- In the meantime it seemed like just tuning the original vector output size to 4x3 or 4x4 was enough to drop to about 4.1 seconds; I also noticed that the test machines have avx512 support, soooo probably should try that next
- So this is the new kernel I tried. Notably I did the commented-out broadcast first; this had pretty large jumps, especially in comparison to my typical way, so I increased register utilization but probably blew out the cache / prefetch

for (int k = kStart; k < kEnd; ++k)
    for (s32 i = 0; i < Rows; ++i)
    {
        // f64x4 broadcast = BroadcastF64(&Data[DimInner * (Row + i) + k]);
        f64x4 broadcast = BroadcastF64(&DataT[DimOuter * k + (Row + i)]);
        for (s32 j = 0; j < Cols; ++j)
        {
            f64x4 row = loadu(&DataT[(DimOuter * k) + (Col + j * VecWidth)]);
            DotProds[i][j] = DotProds[i][j] + (broadcast * row);
        }
    }
2023/26/6
- omp dynamic was helpful....
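Presumably "omp dynamic" here means OpenMP's dynamic loop scheduling; a minimal sketch of the pattern, where ProcessRow/ProcessAll are hypothetical stand-ins for the per-row work:

void ProcessRow(int Row);  // hypothetical per-row work, defined elsewhere

// Iterations with uneven cost load-imbalance under static chunking;
// schedule(dynamic) hands rows out as threads become free.
void ProcessAll(int ny)
{
    #pragma omp parallel for schedule(dynamic)
    for (int Row = 0; Row < ny; ++Row)
        ProcessRow(Row);
}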
- Major Refactor
So currently the innermost computation does something like this:
- We want to calculate result[x:x+3][y:y+3] in essentially 1 pass, because we only need to load 3n + 3n vectors to compute 9 outputs, so a 9/(6n) ratio versus, say, 1/(2n)
- We do this in vectors, so we load in (...just had a thought, not 100% sure if the loop was properly unrolled)

x0 = Data[x + 0][k : k+8]
x1 = Data[x + 1][k : k+8]
x2 = Data[x + 2][k : k+8]
y0 = Data[k : k+8][y + 0]
y1 = Data[k : k+8][y + 1]
y2 = Data[k : k+8][y + 2]
This results in the following asm, where we can see the 6 loads and 9 FMAs:
.L42:
vmovupd ymm0, YMMWORD PTR [r9+rax*8] // load0
vmovupd ymm3, YMMWORD PTR [rsi+rax*8] // load1
vmovupd ymm2, YMMWORD PTR [rcx+rax*8] // load2
vmovupd ymm1, YMMWORD PTR [rdx+rax*8] // load3
vfmadd231pd ymm4, ymm0, ymm3
vfmadd231pd ymm13, ymm0, ymm2
vfmadd231pd ymm12, ymm0, ymm1
vmovupd ymm0, YMMWORD PTR [r8+rax*8] // load4
vfmadd231pd ymm6, ymm3, ymm0
vfmadd231pd ymm10, ymm2, ymm0
vfmadd231pd ymm9, ymm1, ymm0
vmovupd ymm0, YMMWORD PTR [rdi+rax*8] // load5
add rax, 4
vfmadd231pd ymm5, ymm3, ymm0
vfmadd231pd ymm8, ymm2, ymm0
vfmadd231pd ymm7, ymm1, ymm0
Okay so the issue now is that we're entering a memory bottleneck at larger sizes. We really need to up our computations per memory load. It's easiest to see this by examining how matrix multiplies use each element of the respective matrices.
From Eli Bendersky's Visualizing matrix multiplication as a linear combination we see that a column of the output is generated by a linear combination of the columns of the left matrix, but if we focus on a single element we can also think of it as:
- any element of the right matrix participates in ny elements of the output
- any element of the right matrix is multiplied by every element of a column of the left matrix
  - we locate that column by the row of the element
We can also think of this in a row-oriented manner:
- any element of the left matrix is multiplied by every element of a row of the right matrix
  - we locate that row based on the column of the element
So now our new strategy is to load in a single element of the left matrix, broadcast it, and multiply it against vector-width chunks of the matching row of the right matrix:
// EXTREMELY rough sketch
f64 data[16][4] =
{
0,1,2,3,
4,5,6,7,
8,9,10,11,
12,13,14,15,
0,1,2,3,
4,5,6,7,
8,9,10,11,
12,13,14,15,
0,1,2,3,
4,5,6,7,
8,9,10,11,
12,13,14,15,
0,1,2,3,
4,5,6,7,
8,9,10,11,
12,13,14,15,
};
f64 data2[4][16] =
{
0,1,2,3,4,5,6,7, /* */ 8, 9, 10, 11, 12, 13, 14, 15,
0,1,2,3,4,5,6,7, /* */ 8, 9, 10, 11, 12, 13, 14, 15,
0,1,2,3,4,5,6,7, /* */ 8, 9, 10, 11, 12, 13, 14, 15,
0,1,2,3,4,5,6,7, /* */ 8, 9, 10, 11, 12, 13, 14, 15,
};
f64 result[16][16] = {};
void kernel(s32 Row, s32 Col)
{
    f64x4 DotProds[1][2] = {};                      // one output row, two 4-wide column chunks
    for (int k = 0; k < 4; ++k)                     // inner dimension of data (16x4) / data2 (4x16)
    {
        f64x4 broadcast = Broadcast(data[Row][k]);  // single element of the left matrix
        f64x4 row0 = loadu(&data2[k][Col + 0]);     // matching right-matrix row, first 4 lanes
        DotProds[0][0] += broadcast * row0;
        f64x4 row1 = loadu(&data2[k][Col + 4]);     // next 4 lanes
        DotProds[0][1] += broadcast * row1;
    }
    // stores into result[Row][Col : Col+8] omitted
}
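In the full kernel this extends to a whole Rows x Cols block of DotProds accumulators (as in the kernel further up) before we run out of registers, which is what pushes the compute-per-load ratio up.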
2023/25/6
- cp3a: blocking is working but still really slow; the gflops going down suggests I'll have to do some sort of blocking to get this to work (a rough sketch after this list)
- cp2b omp parallel
- cp2c: was confused why VecNormData[PaddedX*Row + VecIdx] wasn't working; the array width now changes, so it should be VecNormData[VecCount*Row + VecIdx] ...just kidding, this actually crashes for some reason? Unaligned loads?
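For the cp3a blocking above, a minimal sketch of the general shape, assuming row-major normalized data; CorrelateBlocked, NormData, and Result are hypothetical names, and the real version would sit on top of the vectorized kernel:

#include <algorithm>

// Hypothetical sketch: walk the symmetric output in BLOCK x BLOCK tiles of row
// pairs, so each strip of rows is reused from cache rather than re-streamed
// from memory for every single output element.
void CorrelateBlocked(const double* NormData, double* Result, int ny, int nx)
{
    const int BLOCK = 32;  // tuneable tile size
    for (int ib = 0; ib < ny; ib += BLOCK)
        for (int jb = 0; jb <= ib; jb += BLOCK)
        {
            const int iEnd = std::min(ib + BLOCK, ny);
            for (int i = ib; i < iEnd; ++i)
            {
                const int jEnd = std::min(jb + BLOCK, i + 1);  // lower triangle only
                for (int j = jb; j < jEnd; ++j)
                {
                    double Sum = 0.0;
                    for (int k = 0; k < nx; ++k)
                        Sum += NormData[nx * i + k] * NormData[nx * j + k];
                    Result[ny * j + i] = Sum;
                }
            }
        }
}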
2023/24/6
- mf2 OMP parallel version, currently the fastest solution, no idea why though...
- cp1, cp2: sequential and ILP; tried ILP on the processing phase, seems to slow things down
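By ILP here I mean the usual multiple-accumulator trick that breaks the dependency chain of a single running sum; a minimal sketch (DotILP is a hypothetical name, not the actual exercise code):

// Four independent accumulators let the multiply-adds overlap in the
// pipeline instead of serializing on one register.
double DotILP(const double* a, const double* b, int n)
{
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int k = 0;
    for (; k + 4 <= n; k += 4)
    {
        s0 += a[k + 0] * b[k + 0];
        s1 += a[k + 1] * b[k + 1];
        s2 += a[k + 2] * b[k + 2];
        s3 += a[k + 3] * b[k + 3];
    }
    for (; k < n; ++k)  // remainder
        s0 += a[k] * b[k];
    return (s0 + s1) + (s2 + s3);
}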
2023/23/6
- Median Filter sequential, had some issues getting QuickSelect working, should do a more in-depth write up
- Turns out I could just use nth_element which is way faster
- 2D median filter w/ linear-time median find using QuickSelect
  - Quicksort partition: https://en.wikipedia.org/wiki/Quicksort
  - QuickSelect: https://en.wikipedia.org/wiki/Quickselect
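For the write-up (and the "refactor quicksort into a separate partition function" TODO at the top), a minimal QuickSelect with the partition split out, along the lines of the Wikipedia pages above; Lomuto partition with the last element as pivot, and the names are my own:

#include <utility>

// Lomuto partition: after the call, everything < pivot sits left of the
// returned index and the pivot sits at that index.
static int Partition(float* A, int Lo, int Hi)
{
    float Pivot = A[Hi];
    int Store = Lo;
    for (int i = Lo; i < Hi; ++i)
        if (A[i] < Pivot)
            std::swap(A[i], A[Store++]);
    std::swap(A[Store], A[Hi]);
    return Store;
}

// Iterative QuickSelect: shrink the range around the one containing index k.
float QuickSelect(float* A, int n, int k)
{
    int Lo = 0, Hi = n - 1;
    while (Lo < Hi)
    {
        int p = Partition(A, Lo, Hi);
        if (k == p)     return A[k];
        else if (k < p) Hi = p - 1;
        else            Lo = p + 1;
    }
    return A[Lo];
}

The last-element pivot makes this worst-case quadratic; the median-of-medians pivot from the TODO list is what would make it worst-case linear. The std::nth_element version is just std::nth_element(A, A + k, A + n) followed by reading A[k].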
Based on Jukka Suomela's materials at https://ppc.cs.aalto.fi/ack/, Creative Commons Attribution 4.0 International (CC BY 4.0).