
Benchmarks

Throughput numbers for various codecs, as summarized by wuffs bench -mimic, are below. Higher is better.

"Mimic" tests check that Wuffs' output mimics (i.e. exactly matches) other libraries' output. "Mimic" benchmarks give the numbers for those other libraries, as shipped with Debian. These were measured on a Debian Testing system as of October 2019, which meant these compiler versions:

  • clang/llvm 8.0.1
  • gcc 9.2.1

and these "mimic" library versions, all written in C:

  • libgif 5.1.4
  • zlib 1.2.11

Unless otherwise stated, the numbers below were measured on an Intel x86_64 Broadwell, and were taken as of Wuffs git commit ffdce5ef "Have bench-rust-gif process animated / RGBA images".

Reproducing

The benchmark programs aim to be runnable "out of the box" without any configuration or installation. For example, to run the std/zlib benchmarks:

git clone https://github.com/google/wuffs.git
cd wuffs
gcc -O3 test/c/std/zlib.c
./a.out -bench
rm a.out

A comment near the top of that .c file says how to run the mimic benchmarks.

The output of those benchmark programs is compatible with the benchstat tool. For example, that tool can calculate confidence intervals based on multiple benchmark runs, or calculate p-values when comparing numbers before and after a code change. To install it, first install Go, then run go install golang.org/x/perf/cmd/benchstat@latest.
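As a sketch of what "compatible with benchstat" means: benchstat reads Go's benchmark format, where each result line starts with "Benchmark" and carries, separated by whitespace, the rest of the name, an iteration count, and then value/unit pairs such as "ns/op" and "MB/s". The C helper below is hypothetical (format_bench_line is not part of Wuffs), and the exact columns that Wuffs' own test programs print are not reproduced here; it only shows how one such line can be derived from a timing measurement.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of Wuffs): formats one benchmark result as a
   line in Go's benchmark format, which benchstat can parse. Returns the
   snprintf result, or -1 on invalid input. Names may not contain whitespace,
   because the format separates fields by whitespace. */
int format_bench_line(char* buf, size_t buf_len, const char* name,
                      uint64_t iters, double total_ns, uint64_t bytes_per_op) {
  if (strpbrk(name, " \t\n") != NULL || iters == 0 || total_ns <= 0) {
    return -1;
  }
  double ns_per_op = total_ns / (double)iters;
  /* With 1 MB = 1e6 bytes: MB/s = (bytes per op / ns per op) * 1e3. */
  double mb_per_s = ((double)bytes_per_op / ns_per_op) * 1e3;
  return snprintf(buf, buf_len, "Benchmark%s\t%llu\t%.0f ns/op\t%.2f MB/s",
                  name, (unsigned long long)iters, ns_per_op, mb_per_s);
}
```

For example, 1000 iterations over 10000 bytes taking 3.2 ms in total format as a line ending in "3200 ns/op" and "3125.00 MB/s".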

wuffs bench

As mentioned above, individual benchmark programs can be run manually. However, the canonical way to run the benchmarks for Wuffs' standard library (across multiple compilers and multiple packages such as GIF and PNG) is to use the wuffs command line tool, which also re-generates (transpiles) the C code whenever you edit the std/*/*.wuffs source files. Running go install -v github.com/google/wuffs/cmd/... installs the Wuffs tools. After that, you can run

wuffs bench

or

wuffs bench -mimic std/deflate

or

wuffs bench -ccompilers=gcc -reps=3 -focus=wuffs_gif_decode_20k std/gif

Clang versus GCC

On some of the benchmarks below, clang performs noticeably worse (e.g. 1.3x slower) than gcc, on the same C code. A relatively simple reproduction was filed as LLVM bug 35567.

CPU Scaling

CPU power management can inject noise in benchmark times. On a Linux system, power management can be controlled with:

# Query.
cpupower --cpu all frequency-info --policy
# Turn on.
sudo cpupower frequency-set --governor powersave
# Turn off.
sudo cpupower frequency-set --governor performance
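A benchmark script may want to check the current governor before running, rather than trusting that it was set. The minimal C sketch below assumes the Linux cpufreq sysfs interface, where a file such as /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor holds the governor name; that file is absent on some systems (for example, some virtual machines). The function names are hypothetical, and the sketch checks one CPU only, so covering all CPUs means repeating the read for each cpuN directory.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: reads a cpufreq governor name (e.g. "performance" or
   "powersave") from a sysfs-style file. Returns 0 on success, -1 if the file
   cannot be opened or read. */
int read_cpu_governor(const char* path, char* out, size_t out_len) {
  FILE* f = fopen(path, "r");
  if (!f) {
    return -1;
  }
  int ok = fgets(out, (int)out_len, f) != NULL;
  fclose(f);
  if (!ok) {
    return -1;
  }
  out[strcspn(out, "\n")] = '\0'; /* strip the trailing newline */
  return 0;
}

/* Returns 1 if the governor is "performance", the setting that the
   "Turn off" command above selects, and 0 otherwise. */
int governor_is_performance(const char* path) {
  char buf[64];
  return read_cpu_governor(path, buf, sizeof buf) == 0 &&
         strcmp(buf, "performance") == 0;
}
```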

Adler-32

The 1k, 10k, etc. numbers are approximately how many bytes are hashed.
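For reference, Adler-32 (as defined in RFC 1950) keeps two running sums modulo 65521: a, which is 1 plus the sum of the bytes, and b, the sum of the successive a values. The byte-at-a-time sketch below is for illustration only and is not Wuffs' or zlib's code; optimized implementations avoid a modulo per byte (zlib, for example, defers the reduction for up to 5552 bytes at a time).

```c
#include <stddef.h>
#include <stdint.h>

/* Byte-at-a-time Adler-32, as defined in RFC 1950. Illustrative only. */
uint32_t adler32_simple(const uint8_t* p, size_t n) {
  const uint32_t mod = 65521u; /* largest prime below 2^16 */
  uint32_t a = 1;              /* 1 plus the running sum of the bytes */
  uint32_t b = 0;              /* running sum of the a values */
  for (size_t i = 0; i < n; i++) {
    a = (a + p[i]) % mod;
    b = (b + a) % mod;
  }
  return (b << 16) | a;
}
```

For example, adler32_simple((const uint8_t*)"Wikipedia", 9) returns 0x11E60398, and an empty input returns 1.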

name                                             speed     vs_mimic

wuffs_adler32_10k/clang8                         2.41GB/s  0.84x
wuffs_adler32_100k/clang8                        2.42GB/s  0.84x

wuffs_adler32_10k/gcc9                           3.24GB/s  1.13x
wuffs_adler32_100k/gcc9                          3.24GB/s  1.12x

mimic_adler32_10k                                2.87GB/s  1.00x
mimic_adler32_100k                               2.90GB/s  1.00x
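The vs_mimic column in these tables is consistent with each row's speed divided by the speed of the mimic row with the same size suffix, so 1.00x means parity with the mimic library and higher means faster. For example, for wuffs_adler32_10k/clang8:

```math
\text{vs\_mimic} = \frac{\text{speed}}{\text{speed of the matching mimic benchmark}}, \qquad \frac{2.41\ \text{GB/s}}{2.87\ \text{GB/s}} \approx 0.84
```

Recomputing a ratio from the rounded speeds shown can differ in the last digit (2.42 / 2.90 ≈ 0.83, versus 0.84x for the 100k row), presumably because the ratio is taken from unrounded measurements.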

CRC-32

The 1k, 10k, etc. numbers are approximately how many bytes are hashed.
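The "ieee" in these benchmark names refers to the CRC-32 variant based on the IEEE 802.3 polynomial, the one used by gzip. The bit-at-a-time sketch below (using the reflected polynomial 0xEDB88320) shows what is being computed; it is not Wuffs' or zlib's code, since production implementations typically process input with lookup tables and, in some cases, SIMD instructions.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-at-a-time CRC-32 (IEEE polynomial, reflected form). Illustrative only. */
uint32_t crc32_ieee_bitwise(const uint8_t* p, size_t n) {
  uint32_t crc = 0xFFFFFFFFu; /* standard initial value */
  for (size_t i = 0; i < n; i++) {
    crc ^= p[i];
    for (int k = 0; k < 8; k++) {
      /* If the low bit is set, shift and XOR in the polynomial. */
      crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
  }
  return ~crc; /* standard final XOR */
}
```

For example, crc32_ieee_bitwise((const uint8_t*)"123456789", 9) returns 0xCBF43926, the standard check value for this CRC.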

name                                             speed     vs_mimic

wuffs_crc32_ieee_10k/clang8                      2.85GB/s  2.11x
wuffs_crc32_ieee_100k/clang8                     2.87GB/s  2.13x

wuffs_crc32_ieee_10k/gcc9                        3.38GB/s  2.50x
wuffs_crc32_ieee_100k/gcc9                       3.40GB/s  2.52x

mimic_crc32_ieee_10k                             1.35GB/s  1.00x
mimic_crc32_ieee_100k                            1.35GB/s  1.00x

Deflate

The 1k, 10k, etc. numbers are approximately how many bytes there are in the decoded output.

The full_init and part_init suffixes indicate whether the WUFFS_INITIALIZE__LEAVE_INTERNAL_BUFFERS_UNINITIALIZED flag is unset or set, respectively.
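To illustrate what the full_init and part_init variants compare, here is a hypothetical decoder in C (demo_decoder and DEMO_LEAVE_BUFFERS_UNINITIALIZED are made-up names, not Wuffs' API) whose initializer can skip zeroing a large scratch buffer. Zeroing costs roughly the same whatever the input size, so skipping it should matter most for small inputs. That is consistent with the numbers: the 1k rows below show the largest gap between full_init and part_init, while the GIF 1000k rows show essentially none.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical option bit, playing the same role as (but not equal to)
   WUFFS_INITIALIZE__LEAVE_INTERNAL_BUFFERS_UNINITIALIZED. */
#define DEMO_LEAVE_BUFFERS_UNINITIALIZED 0x01u

/* Hypothetical decoder: small state fields plus a large scratch buffer,
   sized like a 32 KiB Deflate history window. */
typedef struct {
  uint32_t state;
  uint32_t bits;
  uint8_t window[32768];
} demo_decoder;

void demo_decoder_init(demo_decoder* d, uint32_t options) {
  /* Small fields are always reset. */
  d->state = 0;
  d->bits = 0;
  /* The large buffer is zeroed only if the caller did not opt out. */
  if ((options & DEMO_LEAVE_BUFFERS_UNINITIALIZED) == 0) {
    memset(d->window, 0, sizeof d->window);
  }
}
```

Skipping the memset is only sound if the decoder never reads a scratch byte before writing it; this sketch does not enforce that.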

name                                             speed     vs_mimic

wuffs_deflate_decode_1k_full_init/clang8          160MB/s  0.74x
wuffs_deflate_decode_1k_part_init/clang8          199MB/s  0.92x
wuffs_deflate_decode_10k_full_init/clang8         255MB/s  0.94x
wuffs_deflate_decode_10k_part_init/clang8         263MB/s  0.97x
wuffs_deflate_decode_100k_just_one_read/clang8    306MB/s  0.93x
wuffs_deflate_decode_100k_many_big_reads/clang8   250MB/s  0.98x

wuffs_deflate_decode_1k_full_init/gcc9            164MB/s  0.76x
wuffs_deflate_decode_1k_part_init/gcc9            207MB/s  0.95x
wuffs_deflate_decode_10k_full_init/gcc9           247MB/s  0.91x
wuffs_deflate_decode_10k_part_init/gcc9           254MB/s  0.94x
wuffs_deflate_decode_100k_just_one_read/gcc9      333MB/s  1.01x
wuffs_deflate_decode_100k_many_big_reads/gcc9     261MB/s  1.02x

mimic_deflate_decode_1k                           217MB/s  1.00x
mimic_deflate_decode_10k                          270MB/s  1.00x
mimic_deflate_decode_100k_just_one_read           329MB/s  1.00x
mimic_deflate_decode_100k_many_big_reads          256MB/s  1.00x

32-bit ARMv7 (2012 era Samsung Exynos 5 Chromebook), Debian Stretch (2017):

name                                             speed     vs_mimic

wuffs_deflate_decode_1k_full_init/clang5         30.4MB/s  0.60x
wuffs_deflate_decode_1k_part_init/clang5         37.9MB/s  0.74x
wuffs_deflate_decode_10k_full_init/clang5        72.8MB/s  0.81x
wuffs_deflate_decode_10k_part_init/clang5        76.2MB/s  0.85x
wuffs_deflate_decode_100k_just_one_read/clang5   96.5MB/s  0.82x
wuffs_deflate_decode_100k_many_big_reads/clang5  81.1MB/s  0.90x

wuffs_deflate_decode_1k_full_init/gcc6           31.6MB/s  0.62x
wuffs_deflate_decode_1k_part_init/gcc6           39.9MB/s  0.78x
wuffs_deflate_decode_10k_full_init/gcc6          69.6MB/s  0.78x
wuffs_deflate_decode_10k_part_init/gcc6          72.4MB/s  0.81x
wuffs_deflate_decode_100k_just_one_read/gcc6     87.3MB/s  0.74x
wuffs_deflate_decode_100k_many_big_reads/gcc6    73.8MB/s  0.82x

mimic_deflate_decode_1k                          51.0MB/s  1.00x
mimic_deflate_decode_10k                         89.7MB/s  1.00x
mimic_deflate_decode_100k_just_one_read           118MB/s  1.00x
mimic_deflate_decode_100k_many_big_reads         90.0MB/s  1.00x

Deflate (C, miniz)

For comparison, here are miniz 2.1.0's numbers.

name                                             speed     vs_mimic

miniz_deflate_decode_1k/clang8                    174MB/s  0.80x
miniz_deflate_decode_10k/clang8                   245MB/s  0.91x
miniz_deflate_decode_100k_just_one_read/clang8    309MB/s  0.94x

miniz_deflate_decode_1k/gcc9                      158MB/s  0.73x
miniz_deflate_decode_10k/gcc9                     221MB/s  0.82x
miniz_deflate_decode_100k_just_one_read/gcc9      250MB/s  0.76x

To reproduce these numbers, look in test/c/mimiclib/deflate-gzip-zlib.c.

Deflate (Go)

For comparison, here are Go 1.12.10's numbers, using Go's standard library's compress/flate package.

name                                             speed     vs_mimic

go_deflate_decode_1k                             45.4MB/s  0.21x
go_deflate_decode_10k                            82.5MB/s  0.31x
go_deflate_decode_100k                           94.0MB/s  0.29x

To reproduce these numbers:

git clone https://github.com/google/wuffs.git
cd wuffs/script/bench-go-deflate/
go run main.go

Deflate (Rust)

For comparison, here are Rust 1.37.0's numbers, using the alexcrichton/flate2-rs and Frommi/miniz_oxide crates, which this file suggests is the fastest pure-Rust Deflate decoder.

name                                             speed     vs_mimic

rust_deflate_decode_1k                            104MB/s  0.48x
rust_deflate_decode_10k                           202MB/s  0.75x
rust_deflate_decode_100k                          218MB/s  0.66x

To reproduce these numbers:

git clone https://github.com/google/wuffs.git
cd wuffs/script/bench-rust-deflate/
cargo run --release

GIF

The 1k, 10k, etc. numbers are approximately how many pixels there are in the decoded image. For example, the test/data/harvesters.* images are 1165 × 859, approximately 1000k pixels.

The bgra and indexed suffixes indicate whether to decode to 4 bytes (BGRA or RGBA) or to 1 byte (a palette index) per pixel, even though the underlying file format gives 1 byte per pixel.

The full_init and part_init suffixes indicate whether the WUFFS_INITIALIZE__LEAVE_INTERNAL_BUFFERS_UNINITIALIZED flag is unset or set, respectively.

The libgif library doesn't export any API for decode-to-BGRA or decode-to-RGBA, so there are no mimic numbers to compare to for the bgra suffix.
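Conceptually, decoding to bgra means expanding each 1-byte palette index into the 4-byte palette entry it refers to. A minimal sketch, assuming a 256-entry palette stored as B, G, R, A bytes; expand_indexed_to_bgra is a hypothetical helper, not Wuffs' API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: converts n palette indices into n BGRA pixels.
   palette holds 256 entries of 4 bytes each (B, G, R, A); dst must have
   room for 4 * n bytes. */
void expand_indexed_to_bgra(uint8_t* dst, const uint8_t* indices, size_t n,
                            const uint8_t palette[256 * 4]) {
  for (size_t i = 0; i < n; i++) {
    const uint8_t* entry = &palette[4 * (size_t)indices[i]];
    dst[4 * i + 0] = entry[0]; /* B */
    dst[4 * i + 1] = entry[1]; /* G */
    dst[4 * i + 2] = entry[2]; /* R */
    dst[4 * i + 3] = entry[3]; /* A */
  }
}
```

Each output pixel takes 4 bytes instead of 1, so a bgra destination buffer needs 4 × width × height bytes.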

name                                             speed     vs_mimic

wuffs_gif_decode_1k_bw/clang8                     461MB/s  3.18x
wuffs_gif_decode_1k_color_full_init/clang8        141MB/s  1.85x
wuffs_gif_decode_1k_color_part_init/clang8        189MB/s  2.48x
wuffs_gif_decode_10k_bgra/clang8                  743MB/s  n/a
wuffs_gif_decode_10k_indexed/clang8               200MB/s  2.11x
wuffs_gif_decode_20k/clang8                       245MB/s  2.50x
wuffs_gif_decode_100k_artificial/clang8           531MB/s  3.43x
wuffs_gif_decode_100k_realistic/clang8            218MB/s  2.27x
wuffs_gif_decode_1000k_full_init/clang8           221MB/s  2.25x
wuffs_gif_decode_1000k_part_init/clang8           221MB/s  2.25x
wuffs_gif_decode_anim_screencap/clang8           1.07GB/s  6.01x

wuffs_gif_decode_1k_bw/gcc9                       478MB/s  3.30x
wuffs_gif_decode_1k_color_full_init/gcc9          148MB/s  1.94x
wuffs_gif_decode_1k_color_part_init/gcc9          194MB/s  2.54x
wuffs_gif_decode_10k_bgra/gcc9                    645MB/s  n/a
wuffs_gif_decode_10k_indexed/gcc9                 203MB/s  2.14x
wuffs_gif_decode_20k/gcc9                         244MB/s  2.49x
wuffs_gif_decode_100k_artificial/gcc9             532MB/s  3.43x
wuffs_gif_decode_100k_realistic/gcc9              214MB/s  2.23x
wuffs_gif_decode_1000k_full_init/gcc9             217MB/s  2.21x
wuffs_gif_decode_1000k_part_init/gcc9             218MB/s  2.22x
wuffs_gif_decode_anim_screencap/gcc9             1.11GB/s  6.24x

mimic_gif_decode_1k_bw                            145MB/s  1.00x
mimic_gif_decode_1k_color                        76.3MB/s  1.00x
mimic_gif_decode_10k_indexed                     94.9MB/s  1.00x
mimic_gif_decode_20k                             98.1MB/s  1.00x
mimic_gif_decode_100k_artificial                  155MB/s  1.00x
mimic_gif_decode_100k_realistic                  96.1MB/s  1.00x
mimic_gif_decode_1000k                           98.4MB/s  1.00x
mimic_gif_decode_anim_screencap                   178MB/s  1.00x

32-bit ARMv7 (2012 era Samsung Exynos 5 Chromebook), Debian Stretch (2017):

name                                             speed     vs_mimic

wuffs_gif_decode_1k_bw/clang5                    49.1MB/s  1.76x
wuffs_gif_decode_1k_color_full_init/clang5       22.3MB/s  1.35x
wuffs_gif_decode_1k_color_part_init/clang5       27.4MB/s  1.66x
wuffs_gif_decode_10k_bgra/clang5                  157MB/s  n/a
wuffs_gif_decode_10k_indexed/clang5              42.0MB/s  1.79x
wuffs_gif_decode_20k/clang5                      49.3MB/s  1.68x
wuffs_gif_decode_100k_artificial/clang5           132MB/s  2.62x
wuffs_gif_decode_100k_realistic/clang5           47.8MB/s  1.62x
wuffs_gif_decode_1000k_full_init/clang5          46.4MB/s  1.62x
wuffs_gif_decode_1000k_part_init/clang5          46.4MB/s  1.62x
wuffs_gif_decode_anim_screencap/clang5            243MB/s  4.03x

wuffs_gif_decode_1k_bw/gcc6                      46.6MB/s  1.67x
wuffs_gif_decode_1k_color_full_init/gcc6         20.1MB/s  1.22x
wuffs_gif_decode_1k_color_part_init/gcc6         24.2MB/s  1.47x
wuffs_gif_decode_10k_bgra/gcc6                    124MB/s  n/a
wuffs_gif_decode_10k_indexed/gcc6                34.8MB/s  1.49x
wuffs_gif_decode_20k/gcc6                        43.8MB/s  1.49x
wuffs_gif_decode_100k_artificial/gcc6             123MB/s  2.44x
wuffs_gif_decode_100k_realistic/gcc6             42.7MB/s  1.44x
wuffs_gif_decode_1000k_full_init/gcc6            41.6MB/s  1.45x
wuffs_gif_decode_1000k_part_init/gcc6            41.7MB/s  1.45x
wuffs_gif_decode_anim_screencap/gcc6              227MB/s  3.76x

mimic_gif_decode_1k_bw                           27.9MB/s  1.00x
mimic_gif_decode_1k_color                        16.5MB/s  1.00x
mimic_gif_decode_10k_indexed                     23.4MB/s  1.00x
mimic_gif_decode_20k                             29.4MB/s  1.00x
mimic_gif_decode_100k_artificial                 50.4MB/s  1.00x
mimic_gif_decode_100k_realistic                  29.5MB/s  1.00x
mimic_gif_decode_1000k                           28.7MB/s  1.00x
mimic_gif_decode_anim_screencap                  60.3MB/s  1.00x

GIF (Go)

For comparison, here are Go 1.12.10's numbers, using Go's standard library's image/gif package.

name                                             speed     vs_mimic

go_gif_decode_1k_bw                               107MB/s  0.74x
go_gif_decode_1k_color                           39.2MB/s  0.51x
go_gif_decode_10k_bgra                            117MB/s  n/a
go_gif_decode_10k_indexed                        57.8MB/s  0.61x
go_gif_decode_20k                                67.2MB/s  0.69x
go_gif_decode_100k_artificial                     151MB/s  0.97x
go_gif_decode_100k_realistic                     67.2MB/s  0.70x
go_gif_decode_1000k                              68.1MB/s  0.69x
go_gif_decode_anim_screencap                      206MB/s  1.16x

To reproduce these numbers:

git clone https://github.com/google/wuffs.git
cd wuffs/script/bench-go-gif/
go run main.go

GIF (Rust)

For comparison, here are Rust 1.37.0's numbers, using the image-rs/image-gif crate, easily the top crates.io result for "gif".

name                                             speed     vs_mimic

rust_gif_decode_1k_bw                            89.2MB/s  0.62x
rust_gif_decode_1k_color                         20.7MB/s  0.27x
rust_gif_decode_10k_bgra                         74.5MB/s  n/a
rust_gif_decode_10k_indexed                      20.4MB/s  0.21x
rust_gif_decode_20k                              28.9MB/s  0.29x
rust_gif_decode_100k_artificial                  79.1MB/s  0.51x
rust_gif_decode_100k_realistic                   27.9MB/s  0.29x
rust_gif_decode_1000k                            27.9MB/s  0.28x
rust_gif_decode_anim_screencap                    144MB/s  0.81x

To reproduce these numbers:

git clone https://github.com/google/wuffs.git
cd wuffs/script/bench-rust-gif/
cargo run --release

Gzip (Deflate + CRC-32)

The 1k, 10k, etc. numbers are approximately how many bytes there are in the decoded output.
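The "+ CRC-32" in the heading refers to the gzip trailer: per RFC 1952, each gzip member ends with 8 bytes holding the CRC-32 of the uncompressed data and then the uncompressed length modulo 2^32, both stored little-endian. The sketch below (gzip_check_trailer is a hypothetical name) shows only that final comparison, assuming the CRC-32 has already been computed over the decoded output.

```c
#include <stddef.h>
#include <stdint.h>

/* Reads a little-endian uint32 from 4 bytes. */
static uint32_t load_u32le(const uint8_t* p) {
  return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
         ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical helper: checks an 8-byte gzip trailer (RFC 1952) against the
   CRC-32 and length of the decoded output. Returns 1 if both match. */
int gzip_check_trailer(const uint8_t trailer[8], uint32_t computed_crc32,
                       uint64_t decoded_len) {
  uint32_t want_crc32 = load_u32le(trailer + 0);
  uint32_t want_isize = load_u32le(trailer + 4); /* length mod 2^32 */
  return want_crc32 == computed_crc32 &&
         want_isize == (uint32_t)decoded_len;
}
```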

name                                             speed     vs_mimic

wuffs_gzip_decode_10k/clang8                      238MB/s  1.05x
wuffs_gzip_decode_100k/clang8                     273MB/s  1.03x

wuffs_gzip_decode_10k/gcc9                        239MB/s  1.06x
wuffs_gzip_decode_100k/gcc9                       297MB/s  1.12x

mimic_gzip_decode_10k                             226MB/s  1.00x
mimic_gzip_decode_100k                            265MB/s  1.00x

LZW

The 1k, 10k, etc. numbers are approximately how many bytes there are in the decoded output.

The libgif library doesn't export its LZW decoder in its API, so there are no mimic numbers to compare to.

name                                             speed     vs_mimic

wuffs_lzw_decode_20k/clang8                       263MB/s  n/a
wuffs_lzw_decode_100k/clang8                      438MB/s  n/a

wuffs_lzw_decode_20k/gcc9                         266MB/s  n/a
wuffs_lzw_decode_100k/gcc9                        450MB/s  n/a

Zlib (Deflate + Adler-32)

The 1k, 10k, etc. numbers are approximately how many bytes there are in the decoded output.
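Per RFC 1950, a zlib stream wraps Deflate data in a 2-byte header and a 4-byte trailer holding the Adler-32 of the uncompressed data, stored big-endian. The hypothetical helpers below (not Wuffs' or zlib's API) sketch the header check and the trailer read; a decoder then compares the trailer value with the Adler-32 it computed over the decoded output.

```c
#include <stdint.h>

/* Hypothetical helper: validates a 2-byte zlib header (RFC 1950).
   Returns 1 if valid, 0 otherwise. */
int zlib_header_ok(uint8_t cmf, uint8_t flg) {
  if ((cmf & 0x0F) != 8) { /* CM: compression method 8 means Deflate */
    return 0;
  }
  if ((cmf >> 4) > 7) { /* CINFO: log2(window size) - 8, at most 7 */
    return 0;
  }
  /* FCHECK: CMF * 256 + FLG must be a multiple of 31. */
  return ((((uint32_t)cmf << 8) | flg) % 31) == 0;
}

/* Hypothetical helper: reads the big-endian Adler-32 from a zlib trailer. */
uint32_t zlib_trailer_adler32(const uint8_t trailer[4]) {
  return ((uint32_t)trailer[0] << 24) | ((uint32_t)trailer[1] << 16) |
         ((uint32_t)trailer[2] << 8) | (uint32_t)trailer[3];
}
```

For example, the common header bytes 0x78 0x9C pass the check.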

name                                             speed     vs_mimic

wuffs_zlib_decode_10k/clang8                      237MB/s  0.96x
wuffs_zlib_decode_100k/clang8                     272MB/s  0.92x

wuffs_zlib_decode_10k/gcc9                        242MB/s  0.98x
wuffs_zlib_decode_100k/gcc9                       294MB/s  0.99x

mimic_zlib_decode_10k                             247MB/s  1.00x
mimic_zlib_decode_100k                            296MB/s  1.00x

Updated in December 2019.