- Paolo Savini (Embecosm)
- Helene Chelin (Embecosm)
- Jeremy Bennett (Embecosm)
- Hugh O'Keeffe (Ashling)
- Nadim Shehayed (Ashling)
- Daniel Barboza (Ventana)
- WP1:
  - Identify the most promising load/store vector instruction to optimize.
    - Collected the counts of vector load/store instructions in SPEC CPU (see below).
  - Prepare routine/nightly runs of benchmarks.
    - IN PROGRESS.
  - Collected statistics on memcpy performance with different VLEN (see below).
- WP1:
  - Complete setting up routine/nightly runs of benchmarks.
- WP2:
  - Start optimization of `vle8.v` and `vse8.v`.
Our current set of agreed priorities is taken from the Statement of Work. It lists the following priorities, which trade off the functionality targeted against the architectures supported.
- vector load/store ops for x86_64 AVX
- vector load/store ops for AArch64/Neon
- vector integer ALU ops for x86_64 AVX
- vector load/store ops for Intel AVX10
For each of these there will be an analysis phase and an optimization phase, leading to the following set of work packages.
- WP0: Infrastructure
- WP1: Analysis of vector load/store ops on x86_64 AVX
- WP2: Optimization of vector load/store ops on x86_64 AVX
- WP3: Analysis of vector load/store ops on AArch64/Neon
- WP4: Optimization of vector load/store ops on AArch64/Neon
- WP5: Analysis of integer ALU ops on x86_64 AVX
- WP6: Optimization of integer ALU ops on x86_64 AVX
- WP7: Analysis of vector load/store ops on Intel AVX10
- WP8: Optimization of vector load/store ops on Intel AVX10
These priorities can be revised by agreement with RISE during the project.
We have done a static search of the SPEC CPU 2017 benchmarks for vector load/store instructions; from the results, the following instructions appear to be the most common:
Instruction | Count |
---|---|
vsetvli | 14528 |
vse64.v | 12018 |
vle64.v | 8711 |
vse8.v | 7699 |
vle8.v | 6806 |
vlm.v | 5632 |
vle32.v | 5091 |
vse32.v | 4182 |
vluxei64.v | 1388 |
vle16ff.v | 1077 |
vse16.v | 877 |
vle16.v | 431 |
vsuxei64.v | 272 |
vlse64.v | 126 |
vsuxei32.v | 123 |
vlse8.v | 121 |
vluxei8.v | 98 |
vlse32.v | 55 |
vle8ff.v | 22 |
vl1r.v | 17 |
Apart from `vsetvli`, which as expected is used very often, the `vle*`/`vse*` instructions dominate. Although these counts are static, they provide useful statistics that suggest where to look for an early impact in the optimization work.
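For reproducibility, counts like these can be gathered with a short script over disassembly output. The sketch below is illustrative rather than the exact tooling we used; the toolchain prefix and binary path are placeholders, and the prefix filter is deliberately crude.

```python
# Hedged sketch: statically count vector mnemonics in a disassembled binary.
# The objdump name and binary path are placeholders, not our actual setup.
import collections
import subprocess

OBJDUMP = "riscv64-unknown-linux-gnu-objdump"  # assumed toolchain prefix
BINARY = "./spec_benchmark_binary"             # placeholder path

disasm = subprocess.run(
    [OBJDUMP, "-d", BINARY], capture_output=True, text=True, check=True
).stdout

counts = collections.Counter()
for line in disasm.splitlines():
    parts = line.split("\t")
    if len(parts) < 3:
        continue  # not an instruction line
    fields = parts[2].split()
    if not fields:
        continue
    mnemonic = fields[0]
    # Crude filter: vector loads/stores and vsetvli all begin with "vl"/"vs";
    # this also catches a few ALU mnemonics (e.g. vsub), so refine as needed.
    if mnemonic.startswith(("vl", "vs")):
        counts[mnemonic] += 1

for mnemonic, n in counts.most_common():
    print(f"{mnemonic:12s} {n:6d}")
```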
We have an update on the performance of `memcpy`, extending the table from last week to blocks of up to 65,536 bytes. We show the statistics for scalar `memcpy` (s), vector with `LMUL=1` (v1), and vector with `LMUL=8` (v8).
length (bytes) | s time (s) | v1 time (s) | v8 time (s) | s MInstr | v1 MInstr | v8 MInstr | s ns/inst | v1 ns/inst | v8 ns/inst |
---|---|---|---|---|---|---|---|---|---|
1 | 0.16 | 0.14 | 0.12 | 73 | 19 | 19 | 2.19 | 7.37 | 6.32 |
2 | 0.22 | 0.12 | 0.14 | 89 | 19 | 19 | 2.47 | 6.32 | 7.37 |
4 | 0.24 | 0.20 | 0.17 | 121 | 19 | 19 | 1.98 | 10.53 | 8.95 |
8 | 0.23 | 0.29 | 0.27 | 95 | 19 | 19 | 2.42 | 15.26 | 14.21 |
16 | 0.23 | 0.44 | 0.44 | 111 | 19 | 19 | 2.07 | 23.16 | 23.16 |
32 | 0.28 | 0.86 | 0.81 | 143 | 26 | 19 | 1.96 | 33.08 | 42.63 |
64 | 0.43 | 1.74 | 1.45 | 207 | 40 | 19 | 2.08 | 43.50 | 76.32 |
128 | 0.57 | 3.34 | 2.85 | 293 | 68 | 19 | 1.95 | 49.12 | 150.00 |
256 | 0.89 | 6.62 | 5.67 | 451 | 124 | 26 | 1.97 | 53.39 | 218.08 |
512 | 1.52 | 13.28 | 11.27 | 767 | 236 | 40 | 1.98 | 56.27 | 281.75 |
1,024 | 2.83 | 26.28 | 22.45 | 1,448 | 460 | 68 | 1.95 | 57.13 | 330.15 |
2,048 | 5.49 | 52.41 | 44.85 | 2,810 | 908 | 124 | 1.95 | 57.72 | 361.69 |
4,096 | 10.72 | 105.43 | 89.67 | 5,534 | 1,804 | 236 | 1.94 | 58.44 | 379.96 |
8,192 | 21.21 | 210.61 | 179.43 | 10,933 | 3,596 | 460 | 1.94 | 58.57 | 390.07 |
16,384 | 42.54 | 421.70 | 359.58 | 21,731 | 7,180 | 908 | 1.96 | 58.73 | 396.01 |
These can be seen graphically as follows:
We can see that for larger block sizes the time per instruction is trending to an asymptotic limit, and this happens sooner for `LMUL=1` than for `LMUL=8`. We observe that the asymptotic ratio between `LMUL=8` and `LMUL=1` is around 7; since each `LMUL=8` load/store handles eight times as much data, under QEMU it is still more efficient to use the larger `LMUL` value, even with the current implementation.
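As a quick arithmetic check of the ratio quoted above, here is a minimal sketch that simply reads the asymptotic figures from the 16,384-byte row of the table:

```python
# ns/inst figures from the 16,384-byte row of the memcpy table above.
v1_ns_per_inst = 58.73   # LMUL=1
v8_ns_per_inst = 396.01  # LMUL=8

ratio = v8_ns_per_inst / v1_ns_per_inst
print(f"LMUL=8 vs LMUL=1 time per instruction: {ratio:.2f}x")  # ~6.7x

# Each LMUL=8 vector load/store handles 8x the data of its LMUL=1
# counterpart, so a ratio below 8 means the larger register group is
# still the cheaper option per byte under the current QEMU implementation.
print(f"Relative cost per byte moved: {ratio / 8:.2f}x")  # ~0.84x
```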
We had an action to look at the impact of `VLEN` on performance. We used only data for `LMUL=8`, and obtained the following results.
Length | time128 | time256 | time512 | time1024 | MInstr128 | MInstr256 | MInstr512 | MInstr1024 | ns/instr128 | ns/instr256 | ns/instr512 | ns/instr1024 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0.12 | 0.10 | 0.12 | 0.10 | 19 | 19 | 19 | 19 | 6.32 | 5.26 | 6.32 | 5.26 |
2 | 0.14 | 0.12 | 0.11 | 0.13 | 19 | 19 | 19 | 19 | 7.37 | 6.32 | 5.79 | 6.84 |
4 | 0.17 | 0.15 | 0.20 | 0.21 | 19 | 19 | 19 | 19 | 8.95 | 7.89 | 10.53 | 11.05 |
8 | 0.27 | 0.28 | 0.25 | 0.25 | 19 | 19 | 19 | 19 | 14.21 | 14.74 | 13.16 | 13.16 |
16 | 0.44 | 0.44 | 0.43 | 0.44 | 19 | 19 | 19 | 19 | 23.16 | 23.16 | 22.63 | 23.16 |
32 | 0.81 | 0.79 | 0.81 | 0.81 | 19 | 19 | 19 | 19 | 42.63 | 41.58 | 42.63 | 42.63 |
64 | 1.45 | 1.46 | 1.49 | 1.50 | 19 | 19 | 19 | 19 | 76.32 | 76.84 | 78.42 | 78.95 |
128 | 2.85 | 2.83 | 2.84 | 2.85 | 19 | 19 | 19 | 19 | 150.00 | 148.95 | 149.47 | 150.00 |
256 | 5.67 | 5.59 | 5.57 | 5.59 | 26 | 19 | 19 | 19 | 218.08 | 294.21 | 293.16 | 294.21 |
512 | 11.27 | 11.12 | 11.11 | 11.09 | 40 | 26 | 19 | 19 | 281.75 | 427.69 | 584.74 | 583.68 |
1024 | 22.45 | 22.22 | 22.11 | 22.05 | 68 | 40 | 26 | 19 | 330.15 | 555.50 | 850.38 | 1,160.53 |
2048 | 44.85 | 44.25 | 44.24 | 44.04 | 124 | 68 | 40 | 26 | 361.69 | 650.74 | 1,106.00 | 1,693.84 |
4096 | 89.67 | 88.59 | 88.26 | 88.15 | 236 | 124 | 68 | 40 | 379.96 | 714.44 | 1,297.94 | 2,203.75 |
8192 | 179.43 | 177.06 | 176.45 | 176.12 | 460 | 236 | 124 | 68 | 390.07 | 750.25 | 1,422.98 | 2,590.00 |
Again, this is best visualized as a graph:
We see the same asymptotic trends, although for the larger values of `VLEN` we have not yet reached the asymptote. The QEMU time per instruction is more or less proportional to `VLEN`, so there is only marginal benefit to using `VLEN=1024` over `VLEN=128`.
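The proportionality claim can likewise be checked against the 8,192-byte row of the VLEN table; this is a minimal sketch using only the figures above:

```python
# ns/instr from the 8,192-byte row of the VLEN table above (LMUL=8).
ns_per_instr = {128: 390.07, 256: 750.25, 512: 1422.98, 1024: 2590.00}

# Normalised against VLEN=128 this gives roughly 1 : 1.9 : 3.6 : 6.6, i.e.
# close to proportional to VLEN: doubling VLEN roughly doubles the QEMU cost
# of each vector instruction while roughly halving the instruction count,
# hence the marginal overall benefit noted above.
base = ns_per_instr[128]
for vlen, t in sorted(ns_per_instr.items()):
    print(f"VLEN={vlen:4d}: {t / base:.2f}x")
```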
We have done a detailed performance analysis of the two target instructions, which can be found in this Google Sheet. These data provide the basis for benchmarking improvements to these two instructions. We present the full data here, but for routine benchmarking we will select a suitable subset of the data points.
Our test framework creates two assembly language programs, one with a loop containing a single instance of the instruction being measured and one with a loop containing 11 instances of it. We time both programs over a suitably large number of iterations, and the difference between the two times is the time due to the 10 extra instances of the instruction.
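A minimal sketch of the arithmetic behind this measurement follows; the function name, example timings, and iteration count are illustrative and not taken from the actual harness.

```python
def per_instruction_ns(t1_s: float, t11_s: float, iterations: int) -> float:
    """Cost of one instance of the measured instruction, in nanoseconds.

    t1_s  -- run time (s) of the program whose loop body holds 1 instance
    t11_s -- run time (s) of the program whose loop body holds 11 instances
    The difference between the two runs is attributable to the 10 extra
    instances executed on every loop iteration.
    """
    return (t11_s - t1_s) / (10 * iterations) * 1e9

# Illustrative numbers only: a 1,000,000-iteration loop where the
# 11-instance variant takes 15.2 s and the 1-instance variant 0.9 s
# corresponds to ~1,430 ns per instruction.
print(per_instruction_ns(0.9, 15.2, 1_000_000))
```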
The following two tables and graphs show the performance of `vle8.v` and `vse8.v` respectively, for different data sizes and different values of `VLEN`. In this case we use `LMUL=8`, but the spreadsheet also shows results for other values of `LMUL`. The results are all times per instruction in nanoseconds.
vle8.v
length | VLEN=128 | VLEN=256 | VLEN=512 | VLEN=1024 |
---|---|---|---|---|
1 | 21.14 | 21.14 | 21.13 | 21.26 |
2 | 33.45 | 33.61 | 33.48 | 33.47 |
4 | 62.19 | 62.16 | 62.12 | 62.19 |
8 | 108.94 | 109.06 | 109.75 | 110.88 |
16 | 209.50 | 210.00 | 210.00 | 209.88 |
32 | 397.00 | 396.75 | 396.25 | 397.00 |
64 | 771.50 | 772.00 | 772.00 | 772.50 |
128 | 1524.00 | 1521.00 | 1520.00 | 1523.00 |
256 | 1519.00 | 3008.00 | 3008.00 | 3008.00 |
512 | 1520.00 | 3014.00 | 5990.00 | 5995.00 |
1,024 | 1523.00 | 3010.00 | 5989.00 | 11963.00 |
2,048 | 1519.00 | 3012.00 | 5998.00 | 11963.00 |
4,096 | 1522.00 | 3009.00 | 5995.00 | 11962.00 |
8,192 | 1522.00 | 3012.00 | 5992.00 | 11959.00 |
vse8.v
length | VLEN=128 | VLEN=256 | VLEN=512 | VLEN=1024 |
---|---|---|---|---|
1 | 19.24 | 19.45 | 19.45 | 19.45 |
2 | 30.94 | 28.92 | 28.89 | 28.95 |
4 | 53.22 | 54.47 | 53.31 | 53.38 |
8 | 92.00 | 92.06 | 93.44 | 92.06 |
16 | 177.62 | 177.62 | 177.75 | 177.38 |
32 | 333.00 | 332.75 | 332.50 | 333.75 |
64 | 644.50 | 644.00 | 643.00 | 646.00 |
128 | 1270.00 | 1271.00 | 1270.00 | 1270.00 |
256 | 1267.00 | 2506.00 | 2509.00 | 2512.00 |
512 | 1270.00 | 2511.00 | 4988.00 | 4991.00 |
1,024 | 1270.00 | 2508.00 | 4991.00 | 9954.00 |
2,048 | 1270.00 | 2511.00 | 4991.00 | 9953.00 |
4,096 | 1266.00 | 2509.00 | 4988.00 | 9952.00 |
8,192 | 1267.00 | 2511.00 | 4987.00 | 9954.00 |
Each set of tests takes around 90 minutes to complete on one of our large servers.
We can see that performance depends on data size up to the point where the vector register group is full, at which point, as expected, it plateaus. The impact of `VLEN` is only to set the point of that plateau. We do not show it here, but the impact of `LMUL` is similar: performance depends on data size until the group is full, and `LMUL` again only sets the point of that plateau.
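The plateau point is simply the capacity of the vector register group: for `vle8.v`/`vse8.v` (SEW=8) a group holds VLEN × LMUL / 8 bytes, which matches where the tables above flatten out. A minimal sketch of that relationship:

```python
def group_capacity_bytes(vlen_bits: int, lmul: int) -> int:
    """Bytes held by one vector register group with byte-sized (SEW=8) elements.

    For vle8.v/vse8.v this is the data size at which the per-instruction
    time in the tables above is expected to plateau.
    """
    return vlen_bits * lmul // 8

# With LMUL=8 the expected plateau points are 128, 256, 512 and 1024 bytes
# for VLEN = 128, 256, 512 and 1024 respectively, matching the tables.
for vlen in (128, 256, 512, 1024):
    print(f"VLEN={vlen:4d}: plateau at {group_capacity_bytes(vlen, 8)} bytes")
```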
You can find the baseline execution time and instruction count of the SPEC CPU 2017 benchmarks here.
New tests will appear here as soon as the nightly tests are up and running.
2024-05-22
- Jeremy to run baseline results for the other flavours of the `vle*.v`/`vse*.v` instructions.
- Paolo to check the ARM SVE example mentioned in the GitLab issue (see 2024-05-01).
2024-05-15
- Jeremy to look at the impact of masked vs unmasked and strided vs non-strided vector operations.
  - Lower priority.
- Jeremy to look at the impact of VLEN > 128:
  - QEMU currently supports up to 1,024; the RVV standard permits up to 65,536.
  - COMPLETE. See above.
2024-05-08
- Jeremy to characterise QEMU floating point performance and file it as a performance regression issue in QEMU GitLab.
  - Low priority; deferred to prioritize the smoke tests work.
2024-05-01
- Paolo to review the generic issue from Palmer Dabbelt to identify ideas for optimization and benchmarks to reuse.
  - IN PROGRESS: we are working on running the reproducers to see the TCG ops generated by QEMU.
  - We might need a different reproducer.
  - So far we have not seen the execution time difference reported in the issue; we need to check the context.
- The bionic benchmarks may be a useful source of small benchmarks.
- Daniel to advise Paolo on best practice for preparing QEMU upstream submissions.
The risk register is held in a shared spreadsheet. We will keep it updated continuously and report any changes each week.
There are no changes to the risk register this week.
No planned vacations for the rest of the month.