[L0 v2] [TESTING] Disable copy offload #2329

igchor · 2024-11-14T19:26:13Z

No description provided.

github-actions · 2024-11-14T19:28:04Z

Compute Benchmarks level_zero_v2 run (with params: --compare baseline-v2):
https://github.com/oneapi-src/unified-runtime/actions/runs/11844154604

github-actions · 2024-11-14T19:59:01Z

Compute Benchmarks level_zero_v2 run (--compare baseline-v2):
https://github.com/oneapi-src/unified-runtime/actions/runs/11844154604
Job status: success. Test status: success.

Summary

No diffs to calculate performance change

(the better result in each row is the value printed with six decimal places)

Performance change in benchmark groups

Relative perf in group api (6): cannot calculate

Benchmark	This PR	baseline	baseline-v2
api_overhead_benchmark_sycl SubmitKernel out of order	21.251000 μs	25.558 μs	21.564 μs
api_overhead_benchmark_sycl SubmitKernel in order	21.968000 μs	25.661 μs	22.359 μs
api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024	1.831000 μs	2.356 μs	1.915 μs
api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024	1.897 μs	1.602000 μs	2.077 μs
api_overhead_benchmark_ur SubmitKernel out of order	12.294000 μs	14.215 μs	14.946 μs
api_overhead_benchmark_ur SubmitKernel in order	12.434000 μs	14.117 μs	14.602 μs
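
Since the report says the relative performance cannot be calculated, the rows can still be compared by hand. The following is a minimal sketch, assuming the tab-separated layout used in these tables (name, This PR, baseline, baseline-v2). The helper names `parse_value` and `relative_change` are hypothetical and are not part of the benchmark scripts.

```python
# Hypothetical helpers, not part of the benchmark scripts: they compare the
# "This PR" column against a chosen baseline column for tab-separated rows.

def parse_value(cell):
    """Split a cell like '21.251000 μs' into (21.251, 'μs'); '-' means missing."""
    cell = cell.strip()
    if cell == "-":
        return None
    number, _, unit = cell.partition(" ")
    return float(number), unit


def relative_change(row, baseline_col=3):
    """Return (name, percent change of This PR relative to the baseline column).

    Columns: 0 = benchmark name, 1 = This PR, 2 = baseline, 3 = baseline-v2.
    The result is None when either value is missing.
    """
    cells = row.split("\t")
    pr = parse_value(cells[1])
    base = parse_value(cells[baseline_col])
    if pr is None or base is None:
        return cells[0], None
    return cells[0], (pr[0] - base[0]) / base[0] * 100.0


# Two rows copied from the api group above.
rows = [
    "api_overhead_benchmark_ur SubmitKernel out of order\t12.294000 μs\t14.215 μs\t14.946 μs",
    "api_overhead_benchmark_ur SubmitKernel in order\t12.434000 μs\t14.117 μs\t14.602 μs",
]

for row in rows:
    name, change = relative_change(row)
    if change is None:
        print(f"{name}: no data")
    else:
        print(f"{name}: {change:+.2f}% vs baseline-v2")
```

For time units (μs, ms, s) a negative change means this PR is faster. For throughput units (keys/sec, token/s) the sign reads the other way, so a positive change is the improvement.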

Relative perf in group memory (3): cannot calculate

Benchmark	This PR	baseline	baseline-v2
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024	200.939 μs	225.524 μs	194.596000 μs
memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024	86.788 μs	113.146 μs	83.673000 μs
memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024	5.712000 μs	5.805 μs	5.830 μs

Relative perf in group miscellaneous (1): cannot calculate

Benchmark	This PR	baseline	baseline-v2
miscellaneous_benchmark_sycl VectorSum	803.640000 μs	804.427 μs	804.343 μs

Relative perf in group multithread (8): cannot calculate

Benchmark	This PR	baseline	baseline-v2
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1	3559.312 μs	6755.369 μs	3556.123000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1	8198.343000 μs	17459.852 μs	8254.288 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1	25924.668 μs	24790.821000 μs	25430.474 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1	1131.232 μs	1050.107000 μs	1085.361 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1	4573.173 μs	7587.697 μs	4489.427000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1	6557.586 μs	8364.963 μs	6357.451000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1	25736.169 μs	25122.508000 μs	25429.835 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1	1106.205 μs	1035.863000 μs	1100.991 μs

Relative perf in group Velocity-Bench (6): cannot calculate

Benchmark	This PR	baseline	baseline-v2
Velocity-Bench Hashtable	382.023 M keys/sec	378.825 M keys/sec	383.674692 M keys/sec
Velocity-Bench Bitcracker	35.279 s	35.240 s	35.209900 s
Velocity-Bench CudaSift	201.588000 ms	205.604 ms	202.273 ms
Velocity-Bench Easywave	232.000000 ms	241.000 ms	232.000 ms
Velocity-Bench QuickSilver	121.350 MMS/CTT	117.670 MMS/CTT	121.530000 MMS/CTT
Velocity-Bench Sobel Filter	517.442 ms	533.423 ms	510.451000 ms

Relative perf in group Runtime (8): cannot calculate

Benchmark	This PR	baseline	baseline-v2
Runtime_IndependentDAGTaskThroughput_BasicParallelFor	178.736 ms	274.510 ms	177.957000 ms
Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor	181.917000 ms	277.242 ms	182.328 ms
Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor	179.273000 ms	272.438 ms	179.384 ms
Runtime_IndependentDAGTaskThroughput_SingleTask	172.578 ms	264.149 ms	172.442000 ms
Runtime_DAGTaskThroughput_NDRangeParallelFor	1200.683000 ms	1671.528 ms	1301.770 ms
Runtime_DAGTaskThroughput_BasicParallelFor	1238.459000 ms	1720.320 ms	1334.080 ms
Runtime_DAGTaskThroughput_SingleTask	1168.129000 ms	1673.057 ms	1260.803 ms
Runtime_DAGTaskThroughput_HierarchicalParallelFor	1228.853000 ms	1693.956 ms	1327.302 ms

Relative perf in group MicroBench (14): cannot calculate

Benchmark	This PR	baseline	baseline-v2
MicroBench_HostDeviceBandwidth_2D_H2D_Strided	4.555 ms	4.540 ms	4.499000 ms
MicroBench_HostDeviceBandwidth_1D_H2D_Strided	4.397 ms	4.388000 ms	4.401 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous	4.488 ms	4.455 ms	4.452000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous	3.689000 ms	4.617 ms	3.706 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous	4.480 ms	4.519 ms	4.475000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous	618.130 ms	618.055000 ms	618.078 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Strided	3.730000 ms	4.530 ms	3.733 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Strided	617.349 ms	617.315000 ms	617.385 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Strided	617.364 ms	617.303000 ms	617.386 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous	618.137 ms	618.061000 ms	618.067 ms
MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous	4.362 ms	4.414 ms	4.291000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Strided	4.557 ms	4.471 ms	4.458000 ms
MicroBench_LocalMem_int32_4096	29.840000 ms	29.949 ms	29.925 ms
MicroBench_LocalMem_fp32_4096	29.872000 ms	29.901 ms	29.903 ms

Relative perf in group Pattern (10): cannot calculate

Benchmark	This PR	baseline	baseline-v2
Pattern_Reduction_NDRange_int32	16.736000 ms	17.098 ms	16.783 ms
Pattern_Reduction_Hierarchical_int32	17.051 ms	17.113 ms	16.949000 ms
Pattern_SegmentedReduction_Hierarchical_int16	11.800 ms	11.807 ms	11.793000 ms
Pattern_SegmentedReduction_NDRange_int16	2.256 ms	2.270 ms	2.252000 ms
Pattern_SegmentedReduction_Hierarchical_int32	11.598 ms	11.595 ms	11.591000 ms
Pattern_SegmentedReduction_NDRange_int64	2.345000 ms	2.348 ms	2.347 ms
Pattern_SegmentedReduction_Hierarchical_int64	11.781 ms	11.783 ms	11.778000 ms
Pattern_SegmentedReduction_Hierarchical_fp32	11.593 ms	11.589000 ms	11.591 ms
Pattern_SegmentedReduction_NDRange_fp32	2.161 ms	2.166 ms	2.160000 ms
Pattern_SegmentedReduction_NDRange_int32	2.165000 ms	2.167 ms	2.166 ms

Relative perf in group ScalarProduct (6): cannot calculate

Benchmark	This PR	baseline	baseline-v2
ScalarProduct_Hierarchical_int32	10.329 ms	10.318 ms	10.315000 ms
ScalarProduct_Hierarchical_int64	11.362 ms	11.309000 ms	11.359 ms
ScalarProduct_Hierarchical_fp32	9.960 ms	9.955000 ms	9.955 ms
ScalarProduct_NDRange_int64	5.503 ms	5.433000 ms	5.492 ms
ScalarProduct_NDRange_int32	3.810 ms	3.752000 ms	3.811 ms
ScalarProduct_NDRange_fp32	3.809 ms	3.747000 ms	3.804 ms

Relative perf in group USM (7): cannot calculate

Benchmark	This PR	baseline	baseline-v2
USM_Allocation_latency_fp32_shared	0.064 ms	0.052000 ms	0.065 ms
USM_Allocation_latency_fp32_host	37.599 ms	37.579 ms	37.451000 ms
USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch	1.180000 ms	1.203 ms	1.221 ms
USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch	1.021000 ms	1.046 ms	1.037 ms
USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch	1.558000 ms	1.823 ms	1.575 ms
USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch	1.324 ms	1.676 ms	1.315000 ms
USM_Allocation_latency_fp32_device	-	0.065000 ms	-

Relative perf in group VectorAddition (3): cannot calculate

Benchmark	This PR	baseline	baseline-v2
VectorAddition_int32	1.497 ms	1.449000 ms	1.491 ms
VectorAddition_fp32	1.497 ms	1.464000 ms	1.492 ms
VectorAddition_int64	3.108 ms	3.061000 ms	3.101 ms

Relative perf in group Polybench (3): cannot calculate

Benchmark	This PR	baseline	baseline-v2
Polybench_2mm	1.218 ms	1.212000 ms	1.223 ms
Polybench_3mm	1.820 ms	1.732000 ms	1.818 ms
Polybench_Atax	6.712000 ms	6.713 ms	6.880 ms

Relative perf in group Kmeans (1): cannot calculate

Benchmark	This PR	baseline	baseline-v2
Kmeans_fp32	16.055 ms	16.055 ms	16.052000 ms

Relative perf in group MolecularDynamics (1): cannot calculate

Benchmark	This PR	baseline	baseline-v2
MolecularDynamics	0.027 ms	0.025000 ms	0.025 ms

Relative perf in group llama.cpp (6): cannot calculate

Benchmark	This PR	baseline	baseline-v2
llama.cpp Prompt Processing Batched 256	948.390156 token/s	902.762 token/s	946.968 token/s
llama.cpp Text Generation Batched 256	65.247424 token/s	62.682 token/s	64.879 token/s
llama.cpp Text Generation Batched 512	65.370269 token/s	62.638 token/s	64.872 token/s
llama.cpp Prompt Processing Batched 512	472.493 token/s	449.025 token/s	486.989697 token/s
llama.cpp Text Generation Batched 128	65.237439 token/s	62.696 token/s	65.115 token/s
llama.cpp Prompt Processing Batched 128	825.634 token/s	843.987 token/s	894.004204 token/s

Relative perf in group LinearRegressionCoeff (1): cannot calculate

Benchmark	This PR	baseline	baseline-v2
LinearRegressionCoeff_fp32	-	858.973000 ms	-

Output:

---------> BitCracker: BitLocker password cracking tool <---------

==================================
Retrieving Info

Reading hash file "/home/pmdk/bench_workdir/velocity-bench-repo/bitcracker/hash_pass/img_win8_user_hash.txt"

              Attack

================================================
Type of attack: User Password
Psw per thread: 1
max_num_pswd_per_read: 60000
Dictionary: /home/pmdk/bench_workdir/velocity-bench-repo/bitcracker/hash_pass/user_passwords_60000.txt
MAC Comparison (-m): Yes

Iter: 1, num passwords read: 60000
Kernel execution:
Effective passwords: 60000
Passwords Range:
npknpByH7N2m3OnLNH1X9DJxLrzIFWk
.....
dL_7uuf3QCz-c6K3xDu0

================================================
Bitcracker attack completed
Total passwords evaluated: 60000
Password not found!

time to subtract from total: 0.00408227 s
bitcracker - total time for whole calculation: 35.2792 s

Velocity-Bench CudaSift

Environment Variables:

Command:

/home/pmdk/bench_workdir/cudaSift/cudaSift

Output:

UNKN:

UNKN: ==================================================
UNKN: User input parameters:
UNKN: Trace: ../../inputData
UNKN: ==================================================
UNKN:

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1238 1271 33.6139% 1 2