Fractal - Benchmark, Parallelization and Memory usage #7

Open
ramtej opened this issue Dec 29, 2019 · 1 comment

Comments


ramtej commented Dec 29, 2019

Hi,

I am working on the topic of 'succinct blockchains', with a focus on a concrete prototype implementation. I am aware of the following three approaches that have published both theory and concrete source code on the subject of recursive proofs: i) the Coda protocol, ii) Halo, and iii) Fractal.

The Coda protocol uses a cycle of pairing-friendly MNT elliptic curves of about 800 bits to achieve a 128-bit security level, owing to their low embedding degrees. The Halo paper argues that it is possible to amortize away the recursive proof verification and the cost of the inner-product argument by leveraging a cycle of two non-pairing-friendly curves. The Fractal paper describes a preprocessing SNARK built from a holographic probabilistic proof (IOP).

My goal is to look at the different approaches from a technical perspective. In particular, the possibilities for parallelization and the memory footprint are crucial.

I've checked the runtime behavior of Fractal and Halo. I am aware that this is an apples-to-oranges comparison; it is only meant to illustrate the basic principles.

1) Halo

I used the https://github.com/ebfull/halo/blob/master/examples/bitcoin.rs benchmark. I could not determine how many R1CS-constraint equivalents the circuit for the Bitcoin example contains.

Single-threaded:

creating proof 1
done, took 61880.355779868s
verifying proof 1
done, took 3445.637548892s

creating proof 2
done, took 59251.202629717s
verifying proof 2
done, took 4363.412253224s


Multi-threaded (48 cores):

creating proof 1
done, took 1389.555774298s
verifying proof 1
done, took 137.547002293s

creating proof 2
done, took 1408.376340113s
verifying proof 2
done, took 167.663635282s

[Figure: halo-bitcoin memory usage statistics]

From the above runtime figures, one gets roughly a ~30x speedup from parallelization. Also interesting from a technical point of view is the required memory footprint: at most about 4 GiB of RAM is needed during proof generation. Due to this small RAM footprint, efficient parallelization on a GPU could also be possible.

2) Fractal

The current implementation of Fractal is single-threaded. I tested an R1CS circuit with 2^22 constraints:

./instrument_fractal_snark --make_zk 1 --is_multiplicative 1 --field_size=181 --optimize_localization=1 --log_n_min=22 --log_n_max=22

[Figure: instrument_fractal_snark memory usage statistics]

It turns out that Fractal needs more than 250 GiB of RAM in the above test case. I would like to ask the following questions.

Q1: Can the RAM footprint be (significantly) reduced?
Q2: Is there a possibility for parallelization? See also #6

Thank you

/Jiri

ramtej changed the title from "Fractal - Parallelization, Benchmark and Memory usage" to "Fractal - Benchmark, Parallelization and Memory usage" on Dec 29, 2019

ValarDragon commented Dec 30, 2019

Thanks for all of the nice graphs! The current state of the internal development repo is that we can get Fractal to run on a 192-bit field in slightly better than the current prover time, and in 250 GB. (I'll update this repo soon™ once some other changes land :)) I'll answer Q1 from the perspective of whether we can further improve that 250 GB usage.

Q1: The RAM footprint can be significantly reduced; however, it will still be large. Here are several (large) remaining areas of improvement for RAM usage:

  1. For the prover key, we store the MT authentication paths, the leaf hashes, and every codeword for the indexed oracles. We can instead store only the MT authentication paths and the polynomials for every indexed oracle. This is roughly a 75 GB RAM improvement at this circuit size. It will make the prover slightly slower (the expected time increase is comparable to about two additional FFTs).
  2. We have found tricks that let us halve the domain size and the prover time, with an ~10% increase in proof size. This would halve the RAM requirement.
  3. The point where we hit peak RAM is the "LDT reducer". It computes all virtual oracles and all oracles, and takes a random linear combination of them. However, it doesn't need to store all of these virtual oracles in memory at the same time; it only needs one at a time, so the computation could be 'streamed' (see the sketch below). This is actually likely to both speed up the code and improve memory usage. I haven't done a careful estimate of how much RAM this saves, but I would guess it's more than 10 GB even after point 2.

So this leaves us with an expected RAM footprint of around 80 GB for a circuit of size 2^22. There are some other, fancier tricks we can apply here, but they are of much lower significance than the above.
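To make point 3 concrete, here is a minimal C++ sketch of what a streamed random linear combination could look like. This is an illustration only, not libiop's actual API: the field type, the oracle-evaluator callback, and the function name are simplified stand-ins. The point is that only one codeword-sized scratch buffer plus the accumulator are ever resident, instead of every (virtual) oracle's codeword at once.

```cpp
// Sketch of "streaming" the LDT reducer's random linear combination:
// each oracle's codeword is materialized on demand and immediately folded
// into a single accumulator, then its buffer is reused for the next oracle.
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

using Field = std::uint64_t; // stand-in for a prime-field element type

// Each "oracle" is a callback that fills in its evaluations over the codeword
// domain L when asked (e.g. via an FFT from a stored polynomial).
using OracleEvaluator = std::function<void(std::vector<Field> &evals)>;

std::vector<Field> streamed_random_linear_combination(
    const std::vector<OracleEvaluator> &oracles,
    const std::vector<Field> &challenges, // one verifier challenge r_i per oracle
    std::size_t codeword_domain_size)
{
    std::vector<Field> combined(codeword_domain_size, 0);
    std::vector<Field> scratch(codeword_domain_size); // reused for every oracle

    for (std::size_t i = 0; i < oracles.size(); ++i)
    {
        oracles[i](scratch); // materialize only this one codeword
        for (std::size_t j = 0; j < codeword_domain_size; ++j)
        {
            // combined[j] += r_i * f_i(x_j); real code would use field mul/add.
            combined[j] += challenges[i] * scratch[j];
        }
        // scratch is overwritten on the next iteration; nothing else is retained.
    }
    return combined;
}
```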

Q2: Yes, the whole prover can be parallelized! I actually suspect that it will take more parallel cores to saturate a STARK-style prover than a pairing-based one. Even with the high memory usage, no single component uses much memory, and if one is careful, not too much data has to be passed around overall.

Up to 32 cores, if everything is well designed for parallelism, you should be able to get a near-30x speedup, with each thread initially receiving O(H) information and only having to communicate a constant amount of information to a master thread. Beyond 32 cores, you are limited by FFT parallelization techniques, which are still shown to be quite good: https://github.com/scipr-lab/libfqfft has a graph showing that with 4 cores of parallelism on the basic radix-2 FFT, you get pretty close to a 4x speedup. So with 128 cores (and a lot of engineering to achieve the parallelism) you should be close to a 128x speedup. Achieving the ideal minimal data shuffling requires a lot of engineering to build a map-reduce-like workload, similar to the large amount of work that went into DIZK; however, achieving the parallelism just in terms of doing all the computation in parallel is doable near term if anyone has interest in working on it. (I don't know how to estimate data-passing time vs. computation time, though.) It really boils down to implementing parallel FFTs, parallel MT creation, and parallelizing some of the O(L) algorithms.
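As a rough illustration of where the FFT parallelism comes from, here is a minimal sketch of a radix-2 decimation-in-time FFT whose two half-size sub-FFTs run on separate threads. It is written over std::complex<double> purely for brevity; a real prover (and libfqfft) works over a finite field, and this is not libfqfft's actual code.

```cpp
// Parallel radix-2 DIT FFT sketch: the even/odd half-size sub-FFTs share no
// data, so the top levels of the recursion can run on separate threads before
// the butterflies combine them.
#include <complex>
#include <cstddef>
#include <future>
#include <vector>

using cd = std::complex<double>;

static void fft(std::vector<cd> &a, int parallel_depth)
{
    const std::size_t n = a.size(); // assumed to be a power of two
    if (n == 1) return;

    // Split into even- and odd-indexed halves.
    std::vector<cd> even(n / 2), odd(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i)
    {
        even[i] = a[2 * i];
        odd[i] = a[2 * i + 1];
    }

    if (parallel_depth > 0)
    {
        // Depth d gives up to 2^d-way parallelism at the top of the recursion.
        auto handle = std::async(std::launch::async,
                                 [&] { fft(even, parallel_depth - 1); });
        fft(odd, parallel_depth - 1);
        handle.get();
    }
    else
    {
        fft(even, 0);
        fft(odd, 0);
    }

    // Combine with the usual radix-2 butterflies.
    const double pi = 3.14159265358979323846;
    for (std::size_t k = 0; k < n / 2; ++k)
    {
        const cd w = std::polar(1.0, -2.0 * pi * double(k) / double(n));
        a[k] = even[k] + w * odd[k];
        a[k + n / 2] = even[k] - w * odd[k];
    }
}
```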

A bit of an explanation about the ideal map/reduce setup, if you are curious: let H be the circuit size. The prover's work and memory usage come from having to do things over a domain L that is much bigger than H. However, almost all of the computation can be split, with no efficiency loss, into L/H separate parallel chunks that each need only O(H) field elements to describe. (My guess is that the big-O constant is 13.) At the end of each round, each chunk gives the MT root of its size-H work to the master thread. The master thread collects all the roots, further makes an MT of them, then gets the verifier randomness, and finally sends that randomness back to the worker processes.
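A minimal sketch of that round structure is below. The hash, the chunk layout, and the challenge derivation are simplified stand-ins rather than libiop's actual implementation; the point is the communication pattern: each worker does O(H)-sized work and ships only a single Merkle root to the master, which combines the roots and derives the next round's challenge.

```cpp
// Map/reduce-style round sketch: workers build Merkle trees over their own
// O(H)-sized chunks of the L-domain work and send only the roots to a master,
// which Merkle-izes the roots and derives the round's verifier randomness.
#include <cstddef>
#include <cstdint>
#include <functional>
#include <future>
#include <string>
#include <vector>

using Field = std::uint64_t; // stand-in for a field element
using Digest = std::size_t;  // stand-in for a hash digest

// Toy hash; a real prover would use a cryptographic hash.
static Digest hash_pair(Digest l, Digest r)
{
    return std::hash<std::string>{}(std::to_string(l) + "|" + std::to_string(r));
}

static Digest merkle_root(std::vector<Digest> layer)
{
    while (layer.size() > 1)
    {
        std::vector<Digest> next;
        for (std::size_t i = 0; i + 1 < layer.size(); i += 2)
            next.push_back(hash_pair(layer[i], layer[i + 1]));
        if (layer.size() % 2 == 1)
            next.push_back(layer.back()); // carry an unpaired node upward
        layer = std::move(next);
    }
    return layer.empty() ? 0 : layer.front();
}

// Worker: hash its chunk's evaluations into leaves and return only the root.
static Digest worker_round(const std::vector<Field> &chunk)
{
    std::vector<Digest> leaves;
    leaves.reserve(chunk.size());
    for (Field v : chunk)
        leaves.push_back(std::hash<Field>{}(v));
    return merkle_root(leaves);
}

// Master: fan the L/H chunks out to threads, collect the roots, combine them,
// and derive the next round's challenge from the combined root.
static Field run_round(const std::vector<std::vector<Field>> &chunks)
{
    std::vector<std::future<Digest>> futures;
    for (const auto &chunk : chunks)
        futures.push_back(std::async(std::launch::async, worker_round, std::cref(chunk)));

    std::vector<Digest> roots;
    for (auto &f : futures)
        roots.push_back(f.get());

    const Digest combined = merkle_root(roots);
    return static_cast<Field>(combined); // stand-in for Fiat-Shamir randomness
}
```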
