Fractal - Benchmark, Parallelization and Memory usage #7
Thanks for all of the nice graphs! The current state of the internal development repo is that we can get Fractal to run on a 192-bit field in slightly better than the current prover time, and in 250 GB. (I'll update this repo soon™ once some other changes land :)) I'll answer Q1 from the perspective of whether we can further improve that 250 GB usage.

Q1: The RAM footprint can be significantly reduced, although it will still be large. Here are several (large) areas of improvement we have left for RAM usage:
So this leaves us with an expected RAM footprint of around 80 GB for a circuit of size 2^22. There are some other, fancier tricks we can do for this, but they're of much lower significance than the above.

Q2: Yes, the whole prover can be parallelized! I actually suspect that it will take more parallel cores to saturate how well STARKs can be parallelized vs. pairing-based ones. Even with the high memory usage, no single component uses much memory, and if one is careful, not too much data has to be passed around overall. Up to 32 cores, if everything is well designed for parallelism, you should be able to get a near-30x speedup, with each thread initially receiving O(H) information and only having to communicate a constant amount of information to a master thread. Beyond 32 cores, you are limited by FFT parallelization techniques, which are still shown to be quite good: https://github.com/scipr-lab/libfqfft has a graph showing that with 4 cores of parallelism on the basic radix-2 FFT, you get pretty close to a 4x speedup. So with 128 cores (and a lot of engineering to achieve the parallelism) you should be close to a 128x speedup. Achieving the ideal minimal data shuffling requires a lot of engineering to build a map-reduce-like workload, similar to the large amount of work that went into DIZK; however, achieving this just in terms of doing all computation in parallel is doable near term if anyone has interest in working on it. (However, I don't know how to estimate data-passing time vs. computation time.) It really boils down to implementing parallel FFTs, parallel Merkle tree creation, and parallelizing some of the O(L) algorithms (a minimal sketch of the FFT part is included below).

A bit of an explanation about the ideal map/reduce setup, if you are curious: Let H be the circuit size. The prover's work / memory usage comes from having to do things in a domain |
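To make the FFT part of this concrete, here is a minimal sketch of the divide-across-threads pattern, assuming a toy 64-bit prime field (2^64 - 2^32 + 1, with 7 as multiplicative generator) rather than Fractal's actual field, and plain std::thread rather than a tuned thread pool. It is purely illustrative and not the libiop code: the recursion hands each half-size subproblem to its own thread up to a fixed depth, the same "each worker owns its slice, results are combined at the parent" structure as the map/reduce picture above.

```cpp
// Illustrative only: recursive radix-2 FFT over a toy 64-bit prime field,
// with the two half-size subproblems run on separate threads up to a fixed
// depth. Field, generator, and sizes are assumptions for this sketch; they
// are NOT the parameters used by libiop/Fractal.
#include <cstdint>
#include <vector>
#include <thread>

using u64  = std::uint64_t;
using u128 = unsigned __int128;

static const u64 P = 0xFFFFFFFF00000001ULL;  // 2^64 - 2^32 + 1, 2-adicity 32

static u64 add_mod(u64 a, u64 b) { return (u64)(((u128)a + b) % P); }
static u64 sub_mod(u64 a, u64 b) { return (u64)(((u128)a + P - b) % P); }
static u64 mul_mod(u64 a, u64 b) { return (u64)(((u128)a * b) % P); }

static u64 pow_mod(u64 base, u64 exp) {
    u64 r = 1;
    while (exp) {
        if (exp & 1) r = mul_mod(r, base);
        base = mul_mod(base, base);
        exp >>= 1;
    }
    return r;
}

// Decimation-in-time Cooley-Tukey FFT: evaluates the polynomial with
// coefficients `a` on the subgroup generated by `omega` (a primitive
// |a|-th root of unity). While par_depth > 0, the even/odd halves are
// processed on separate threads, giving roughly 2^par_depth-way parallelism.
static void fft(std::vector<u64>& a, u64 omega, int par_depth) {
    const std::size_t n = a.size();
    if (n == 1) return;
    std::vector<u64> even(n / 2), odd(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i) {
        even[i] = a[2 * i];
        odd[i]  = a[2 * i + 1];
    }
    const u64 omega2 = mul_mod(omega, omega);
    if (par_depth > 0) {
        std::thread t([&] { fft(even, omega2, par_depth - 1); });
        fft(odd, omega2, par_depth - 1);
        t.join();
    } else {
        fft(even, omega2, 0);
        fft(odd, omega2, 0);
    }
    // Combine: A(w^i) = E(w^{2i}) + w^i * O(w^{2i}), and w^{n/2} = -1.
    u64 w = 1;
    for (std::size_t i = 0; i < n / 2; ++i) {
        const u64 t = mul_mod(w, odd[i]);
        a[i]         = add_mod(even[i], t);
        a[i + n / 2] = sub_mod(even[i], t);
        w = mul_mod(w, omega);
    }
}

int main() {
    const int log_n = 20;                    // toy size; the circuits above are ~2^22
    const std::size_t n = std::size_t(1) << log_n;
    // 7 generates the multiplicative group of this field, so 7^((P-1)/n)
    // is a primitive n-th root of unity for any power of two n <= 2^32.
    const u64 omega = pow_mod(7, (P - 1) >> log_n);
    std::vector<u64> coeffs(n, 1);           // evaluate 1 + x + ... + x^(n-1)
    fft(coeffs, omega, 5);                   // ~32-way parallelism at the top levels
    return 0;
}
```

A production prover would use an iterative, cache-friendly FFT and a thread pool; the recursion above only shows where the parallelism comes from and why each worker only needs its own O(H)-sized slice of the data.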
Hi,
I am working on the topic of 'succinct blockchains', with a focus on a concrete prototype implementation. I am aware of the following three approaches that have published both theory and concrete source code on the subject of recursive proofs: i) the Coda protocol, ii) Halo, and iii) Fractal.
The Coda protocol uses a cycle of MNT pairing-friendly elliptic curves of about 800 bits to achieve a 128-bit security level, owing to their low embedding degrees. The Halo paper argues that it is possible to amortize away the cost of recursive proof verification, in particular the expensive inner-product argument, by leveraging a cycle of two non-pairing-friendly curves. The Fractal paper describes a preprocessing SNARK built from a holographic IOP.
My goal is to look at the different approaches from a technical perspective. In particular, parallelization possibilities and memory footprint are crucial for me.
I've checked the runtime behavior of Fractal and Halo. I am aware that this is an apples-vs.-bananas comparison; it is only about the basic principles.
1) Halo
I used the https://github.com/ebfull/halo/blob/master/examples/bitcoin.rs benchmark. I could not determine how many R1CS constraints (or their equivalent) the circuit for the Bitcoin example contains.
Single-threaded:
creating proof 1
done, took 61880.355779868s
verifying proof 1
done, took 3445.637548892s
creating proof 2
done, took 59251.202629717s
verifying proof 2
done, took 4363.412253224s
Multi-threaded (48 cores):
creating proof 1
done, took 1389.555774298s
verifying proof 1
done, took 137.547002293s
creating proof 2
done, took 1408.376340113s
verifying proof 2
done, took 167.663635282s
From the above runtime figures, one gets roughly a 42-45x speedup for proof creation and roughly a 25x speedup for verification on 48 cores. Also interesting from a technical point of view is the required memory footprint: at most about 4 GiB of RAM is needed during proof generation. Because of the small RAM footprint, efficient parallelization on a GPU might also be possible.
2) Fractal
The current implementation of Fractal is single-threaded. I tested an R1CS circuit with 2^22 constraints:
./instrument_fractal_snark --make_zk 1 --is_multiplicative 1 --field_size=181 --optimize_localization=1 --log_n_min=22 --log_n_max=22
It turns out that Fractal needs more than 250 GiB of RAM in the above test case. I would like to ask the following questions:
Q1: Can the RAM footprint be (significantly) reduced?
Q2: Is there a possibility for parallelization? See also #6
Thank you
/Jiri