
Add benches for strided load/store with different strides #12

Open · wangpc-pp opened this issue Apr 30, 2024 · 2 comments

wangpc-pp commented Apr 30, 2024

I just found an issue on the K230 while running some auto-vectorization tests with https://github.com/UoB-HPC/TSVC_2. The vectorized s1115 loop looks like this:

```asm
.LBB9_7:                                # %vector.ph
    andi    a6, s6, 256
    vsetvli a2, zero, e32, m2, ta, ma
.LBB9_8:                                # %vector.body
    vl2re32.v   v8, (a4)
    vlse32.v    v10, (a5), s11          # s11 = 1024
    vl2re32.v   v12, (a2)
    vfmacc.vv   v12, v8, v10
    vs2r.v  v12, (a4)
    add a4, a4, s0
    add a2, a2, s0
    sub a3, a3, s9
    add a5, a5, s2
    bnez    a3, .LBB9_8
```
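
For reference, the s1115 kernel in TSVC_2 is roughly the loop below (a sketch from memory, not copied from the repo). The `cc[j][i]` column access is what becomes the `vlse32.v`: with rows of `LEN2 = 256` 4-byte floats, the column stride is 256 * 4 = 1024 bytes, matching the `s11 = 1024` register above.

```c
#define LEN2 256 /* TSVC_2's matrix dimension */

/* Sketch of s1115: cc[j][i] walks a column of cc, so the vectorized
 * inner loop loads cc with a LEN2 * sizeof(float) = 1024-byte stride. */
void s1115(float aa[LEN2][LEN2], float bb[LEN2][LEN2],
           float cc[LEN2][LEN2]) {
  for (int i = 0; i < LEN2; i++)
    for (int j = 0; j < LEN2; j++)
      aa[i][j] = aa[i][j] * cc[j][i] + bb[i][j];
}
```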

It seems that strided loads/stores with strides in the range [1024, 4096] perform noticeably worse. Here is a simple probe program:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Issue a single strided load (vlse8.v) for the given LMUL. The asm
 * clobbers t0 and the destination register group (v0..v7 for m8), so
 * both are listed explicitly. */
#define DEFINE_VLSE(LMUL)                                                      \
  static inline __attribute__((always_inline)) void vlse_##LMUL(int *base,    \
                                                                int stride) { \
    __asm__("vsetvli t0, zero, e8, " #LMUL ", ta, ma\n"                        \
            "vlse8.v v0, (%0), %1" ::"r"(base), "r"(stride)                    \
            : "t0", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7");           \
  }

DEFINE_VLSE(m1)
DEFINE_VLSE(m2)
DEFINE_VLSE(m4)
DEFINE_VLSE(m8)
DEFINE_VLSE(mf2)
DEFINE_VLSE(mf4)
DEFINE_VLSE(mf8)

int main(int argc, char **argv) {
  if (argc < 3) {
    fprintf(stderr, "usage: %s <stride-in-bytes> <times>\n", argv[0]);
    return 1;
  }
  int stride = atoi(argv[1]);
  int times = atoi(argv[2]);

  /* Uninitialized scratch buffer; 64 * stride ints is large enough for
   * base + (VL - 1) * stride at any LMUL (VLEN = 128 on the K230's C908,
   * so e8/m8 VLMAX = 128 elements). */
  // __attribute__((aligned(64)))
  int data[64 * stride];

  /* Time `times` back-to-back strided loads for one LMUL. */
#define BENCH_VLSE(LMUL)                                                       \
  {                                                                            \
    clock_t start = clock();                                                   \
    for (int i = 0; i < times; i++)                                            \
      vlse_##LMUL(data, stride);                                               \
    clock_t end = clock();                                                     \
    printf("LMUL: " #LMUL "\tstride: %d\t time: %ld\n", stride,                \
           (long)(end - start));                                               \
  }

  BENCH_VLSE(mf8)
  BENCH_VLSE(mf4)
  BENCH_VLSE(mf2)
  BENCH_VLSE(m1)
  BENCH_VLSE(m2)
  BENCH_VLSE(m4)
  BENCH_VLSE(m8)
}
```
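
Build it with an RVV-enabled toolchain and pass the byte stride and iteration count on the command line, e.g. `gcc -O2 -march=rv64gcv probe.c -o probe` and then `./probe 1024 1000000` (the file name and exact `-march` string are just examples; use whatever matches your toolchain).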

The results look like this (times are `clock()` ticks; the abnormally slow results sit in the middle of the stride range):

| Stride | MF8   | MF4    | MF2     | M1      | M2      | M4       | M8       |
|-------:|------:|-------:|--------:|--------:|--------:|---------:|---------:|
| 4      | 38479 | 51332  | 76931   | 128148  | 230645  | 435399   | 844990   |
| 8      | 38521 | 51333  | 76922   | 128128  | 230579  | 435395   | 844891   |
| 16     | 38530 | 51323  | 76962   | 128129  | 230566  | 435341   | 845195   |
| 32     | 38511 | 51373  | 76932   | 128150  | 230656  | 435388   | 845083   |
| 64     | 38529 | 51322  | 76947   | 128205  | 230624  | 435417   | 23954097 |
| 128    | 38517 | 51338  | 76926   | 128128  | 230608  | 12351222 | 31148420 |
| 256    | 38487 | 51288  | 76945   | 128152  | 5824701 | 15177587 | 34006290 |
| 512    | 38526 | 51292  | 76943   | 2855170 | 7439032 | 16828930 | 35689412 |
| 1024   | 38511 | 51324  | 1152269 | 3424329 | 7957662 | 17053724 | 35144136 |
| 2048   | 38520 | 224200 | 709725  | 1396708 | 4226251 | 8330476  | 16689498 |
| 4096   | 38507 | 317053 | 640199  | 1507778 | 3093916 | 6358825  | 12725241 |
| 8192   | 38499 | 51349  | 76956   | 128285  | 1255252 | 2483829  | 4943195  |
| 16384  | 38525 | 51329  | 76975   | 128337  | 1255245 | 2484334  | 4975494  |

It's weird that performance recovers once the stride grows beyond 4096, so this issue may not be related to crossing cache lines or pages. It may instead be an issue with the hardware prefetcher.
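
As a back-of-the-envelope check (assuming VLEN = 128 on the C908, 64-byte cache lines, 4 KiB pages, and a line- and page-aligned base; all assumptions on my part), a single e8/m1 `vlse8.v` reads at most 16 elements, so one load can touch at most 16 cache lines and 16 pages regardless of the stride:

```c
#include <stdio.h>

/* Upper-bound the footprint of one e8/m1 vlse8.v per stride, under the
 * assumed parameters: VLEN = 128 (so e8/m1 VLMAX = 16), 64-byte cache
 * lines, 4 KiB pages, base address aligned to both. */
int main(void) {
  const int vl = 128 / 8; /* e8, m1 */
  for (int stride = 4; stride <= 16384; stride *= 2) {
    long span = (long)(vl - 1) * stride + 1; /* bytes, first to last element */
    long lines = stride >= 64 ? vl : (span + 63) / 64;
    long pages = stride >= 4096 ? vl : (span + 4095) / 4096;
    printf("stride %5d: <= %2ld cache lines, <= %2ld pages\n",
           stride, lines, pages);
  }
  return 0;
}
```

Such a working set should fit comfortably in any L1D and TLB, which is why a prefetcher effect seems more plausible to me than a capacity problem.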

So my request is: can we add some benches covering this kind of scenario?

camel-cdr (Owner) commented
I'll look into it; this could be a new load/store benchmark under the instructions folder.
I tried adding the load/store instructions to the other instruction measurements, but they didn't really fit into that framework anyway.

The behavior is indeed quite weird, but how could that be a problem with the cache lines or prefetcher? Shouldn't the CPU easily be able to hold 16 cache lines in the m1 case? After all, it's repeatedly accessing the same few addresses.

IIRC, you can adjust the prefetch mode on the C920, so the C908 might support that as well.

wangpc-pp (Author) commented

> The behavior is indeed quite weird, but how could that be a problem with the cache lines or prefetcher? Shouldn't the CPU easily be able to hold 16 cache lines in the m1 case? After all, it's repeatedly accessing the same few addresses.

Currently this is just a guess (the L1 D-cache misses increase a lot), and I have sent feedback to T-Head.
