-
Notifications
You must be signed in to change notification settings - Fork 44
/
README
128 lines (104 loc) · 4.88 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
This is a public-domain implementation of several rANS variants. rANS is an
entropy coder from the ANS family, as described in Jarek Duda's paper
"Asymmetric numeral systems" (http://arxiv.org/abs/1311.2540).
- "rans_byte.h" has a byte-aligned rANS encoder/decoder and some comments on
how to use it. This implementation should work on all 32-bit architectures.
"main.cpp" is an example program that shows how to use it.
- "rans64.h" is a 64-bit version that emits entire 32-bit words at a time. It
is (usually) a good deal faster than rans_byte on 64-bit architectures, and
also makes for a very precise arithmetic coder (i.e. it gets quite close
to entropy). The trade-off is that this version will be slower on 32-bit
machines, and the output bitstream is not endian-neutral. "main64.cpp" is
the corresponding example.
- "rans_word_sse41.h" has a SIMD decoder (SSE 4.1 to be precise) that does IO
in units of 16-bit words. It has less precision than either rans_byte or
rans64 (meaning that it doesn't get as close to entropy) and requires
at least 4 independent streams of data to be useful; however, it is also a
good deal faster. "main_simd.cpp" shows how to use it.
See my blog http://fgiesen.wordpress.com/ for some notes on the design.
I've also written a paper on interleaving output streams from multiple entropy
coders:
http://arxiv.org/abs/1402.3392
this documents the underlying design for "rans_word_sse41", and also shows how
the same approach generalizes to e.g. GPU implementations, provided there are
enough independent contexts coded at the same time to fill up a warp/wavefront
or whatever your favorite GPU's terminology for its native SIMD width is.
Finally, there's also "main_alias.cpp", which shows how to combine rANS with
the alias method to get O(1) symbol lookup with table size proportional to the
number of symbols. I presented an overview of the underlying idea here:
http://fgiesen.wordpress.com/2014/02/18/rans-with-static-probability-distributions/
Results on my machine (Sandy Bridge i7-2600K) with rans_byte in 64-bit mode:
----
rANS encode:
12896496 clocks, 16.8 clocks/symbol (192.8MiB/s)
12486912 clocks, 16.2 clocks/symbol (199.2MiB/s)
12511975 clocks, 16.3 clocks/symbol (198.8MiB/s)
12660765 clocks, 16.5 clocks/symbol (196.4MiB/s)
12550285 clocks, 16.3 clocks/symbol (198.2MiB/s)
rANS: 435113 bytes
17023550 clocks, 22.1 clocks/symbol (146.1MiB/s)
18081509 clocks, 23.5 clocks/symbol (137.5MiB/s)
16901632 clocks, 22.0 clocks/symbol (147.1MiB/s)
17166188 clocks, 22.3 clocks/symbol (144.9MiB/s)
17235859 clocks, 22.4 clocks/symbol (144.3MiB/s)
decode ok!
interleaved rANS encode:
9618004 clocks, 12.5 clocks/symbol (258.6MiB/s)
9488277 clocks, 12.3 clocks/symbol (262.1MiB/s)
9460194 clocks, 12.3 clocks/symbol (262.9MiB/s)
9582025 clocks, 12.5 clocks/symbol (259.5MiB/s)
9332017 clocks, 12.1 clocks/symbol (266.5MiB/s)
interleaved rANS: 435117 bytes
10687601 clocks, 13.9 clocks/symbol (232.7MB/s)
10637918 clocks, 13.8 clocks/symbol (233.8MB/s)
10909652 clocks, 14.2 clocks/symbol (227.9MB/s)
10947637 clocks, 14.2 clocks/symbol (227.2MB/s)
10529464 clocks, 13.7 clocks/symbol (236.2MB/s)
decode ok!
----
And here's rans64 in 64-bit mode:
----
rANS encode:
10256075 clocks, 13.3 clocks/symbol (242.3MiB/s)
10620132 clocks, 13.8 clocks/symbol (234.1MiB/s)
10043080 clocks, 13.1 clocks/symbol (247.6MiB/s)
9878205 clocks, 12.8 clocks/symbol (251.8MiB/s)
10122645 clocks, 13.2 clocks/symbol (245.7MiB/s)
rANS: 435116 bytes
14244155 clocks, 18.5 clocks/symbol (174.6MiB/s)
15072524 clocks, 19.6 clocks/symbol (165.0MiB/s)
14787604 clocks, 19.2 clocks/symbol (168.2MiB/s)
14736556 clocks, 19.2 clocks/symbol (168.8MiB/s)
14686129 clocks, 19.1 clocks/symbol (169.3MiB/s)
decode ok!
interleaved rANS encode:
7691159 clocks, 10.0 clocks/symbol (323.3MiB/s)
7182692 clocks, 9.3 clocks/symbol (346.2MiB/s)
7060804 clocks, 9.2 clocks/symbol (352.2MiB/s)
6949201 clocks, 9.0 clocks/symbol (357.9MiB/s)
6876415 clocks, 8.9 clocks/symbol (361.6MiB/s)
interleaved rANS: 435120 bytes
8133574 clocks, 10.6 clocks/symbol (305.7MB/s)
8631618 clocks, 11.2 clocks/symbol (288.1MB/s)
8643790 clocks, 11.2 clocks/symbol (287.7MB/s)
8449364 clocks, 11.0 clocks/symbol (294.3MB/s)
8331444 clocks, 10.8 clocks/symbol (298.5MB/s)
decode ok!
----
Finally, here's the rans_word_sse41 decoder on an 8-way interleaved stream:
----
SIMD rANS: 435626 bytes
4597641 clocks, 6.0 clocks/symbol (540.8MB/s)
4514356 clocks, 5.9 clocks/symbol (550.8MB/s)
4780918 clocks, 6.2 clocks/symbol (520.1MB/s)
4532913 clocks, 5.9 clocks/symbol (548.5MB/s)
4554527 clocks, 5.9 clocks/symbol (545.9MB/s)
decode ok!
----
There's also an experimental 16-way interleaved AVX2 version that hits
faster rates still, developed by my colleague Won Chun; I will post it
soon.
Note that this is running "book1" which is a relatively short test, and
the measurement setup is not great, so take the results with a grain
of salt.
-Fabian "ryg" Giesen, Feb 2014.