Just found an issue on the K230 when doing some auto-vectorization tests on https://github.com/UoB-HPC/TSVC_2. The vectorized s1115 kernel suggests that strided load/store with strides in [1024, 4096] performs worse than expected. A simple probe code:
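The original probe snippet is not reproduced here, so below is a minimal sketch of this kind of probe using the RVV 1.0 C intrinsics. The buffer size, iteration count, wall-clock timing, and the m1-only loop are my assumptions; the real measurements also swept LMUL from mf8 to m8, which would just repeat the loop with the corresponding intrinsic variants.

```c
/* Sketch of a strided-load probe: repeatedly vlse32 the same vlmax
 * elements at a given byte stride and time the loop.  Only LMUL=1 is
 * shown; the mf8..m8 columns would use the matching intrinsics. */
#include <riscv_vector.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ITERS      (1 << 16)
#define MAX_STRIDE 16384          /* bytes, matches the last row below */

static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

int main(void) {
    size_t vl = __riscv_vsetvlmax_e32m1();
    /* The buffer must cover vl elements at the largest stride. */
    int32_t *buf = calloc(vl * MAX_STRIDE / sizeof(int32_t) + 1, sizeof(int32_t));
    if (!buf) return 1;

    for (size_t stride = 4; stride <= MAX_STRIDE; stride *= 2) {
        vint32m1_t acc = __riscv_vmv_v_x_i32m1(0, vl);
        uint64_t t0 = now_ns();
        for (long i = 0; i < ITERS; i++) {
            /* Opaque barrier so the loop-invariant load is not hoisted. */
            __asm__ volatile("" : "+r"(buf) : : "memory");
            /* Strided load of the same few addresses, over and over. */
            vint32m1_t v = __riscv_vlse32_v_i32m1(buf, (ptrdiff_t)stride, vl);
            acc = __riscv_vadd_vv_i32m1(acc, v, vl);   /* keep the load live */
        }
        uint64_t dt = now_ns() - t0;
        /* Fold acc into a scalar so the compiler cannot delete the loop. */
        printf("stride %6zu: %12llu ns (sink %d)\n", stride,
               (unsigned long long)dt, __riscv_vmv_x_s_i32m1_i32(acc));
    }
    free(buf);
    return 0;
}
```

Built with something like `gcc -O2 -march=rv64gcv`, it prints one timing per stride, which is roughly the shape of the table below.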
The result is like this (the abnormal results are the values that jump far above the rest of their column):

| Stride | MF8 | MF4 | MF2 | M1 | M2 | M4 | M8 |
|-------:|----:|----:|----:|---:|---:|---:|---:|
| 4 | 38479 | 51332 | 76931 | 128148 | 230645 | 435399 | 844990 |
| 8 | 38521 | 51333 | 76922 | 128128 | 230579 | 435395 | 844891 |
| 16 | 38530 | 51323 | 76962 | 128129 | 230566 | 435341 | 845195 |
| 32 | 38511 | 51373 | 76932 | 128150 | 230656 | 435388 | 845083 |
| 64 | 38529 | 51322 | 76947 | 128205 | 230624 | 435417 | 23954097 |
| 128 | 38517 | 51338 | 76926 | 128128 | 230608 | 12351222 | 31148420 |
| 256 | 38487 | 51288 | 76945 | 128152 | 5824701 | 15177587 | 34006290 |
| 512 | 38526 | 51292 | 76943 | 2855170 | 7439032 | 16828930 | 35689412 |
| 1024 | 38511 | 51324 | 1152269 | 3424329 | 7957662 | 17053724 | 35144136 |
| 2048 | 38520 | 224200 | 709725 | 1396708 | 4226251 | 8330476 | 16689498 |
| 4096 | 38507 | 317053 | 640199 | 1507778 | 3093916 | 6358825 | 12725241 |
| 8192 | 38499 | 51349 | 76956 | 128285 | 1255252 | 2483829 | 4943195 |
| 16384 | 38525 | 51329 | 76975 | 128337 | 1255245 | 2484334 | 4975494 |
It's weird that performance gets better again once the stride is larger than 4096: larger strides spread the elements across at least as many distinct cache lines and pages, yet they run faster, so this issue may not be related to crossing cache lines or pages. It may be an issue with the hardware prefetcher.
So my request is: can we add some benchmarks for this kind of scenario?
wangpc-pp changed the title from "Add benchs for strided load/store with different strides" to "Add benches for strided load/store with different strides" on Apr 30, 2024.
I'll look into it; this could be a new load/store benchmark under the instructions folder.
I tried adding the load/store instructions to the other instruction measurements, but they didn't really fit into that framework anyway.
The behavior is indeed quite weird, but how could that be a problem with the cache lines or the prefetcher? Shouldn't the CPU easily be able to hold 16 cache lines for the m1 case? I mean, it's repeatedly accessing the same few addresses.
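(To put rough numbers on that: assuming VLEN = 128 on the C908 and 8-bit elements, both assumptions on my part, m1 gives vl = 128 / 8 = 16, so a single strided load touches at most 16 distinct cache lines and at most 16 distinct pages no matter how large the stride is, and the probe keeps re-reading those same addresses every iteration.)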
IIRC, you could adjust the prefetch mode in the C920, so the C908 might support that as well.
> The behavior is indeed quite weird, but how could that be a problem with the cache lines or the prefetcher? Shouldn't the CPU easily be able to hold 16 cache lines for the m1 case? I mean, it's repeatedly accessing the same few addresses.
Currently this is just a guess (the L1D cache misses increase a lot), and I have sent feedback to T-Head.
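(For reference, the kind of counter check meant here would be comparing a fast and a slow stride under something like `perf stat -e L1-dcache-loads,L1-dcache-load-misses`, assuming the K230 kernel exposes those generic cache events.)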