We would like to thank the Polygon Zero team, the Consensys gnark project, Pado Labs, and the Delphinus Lab team for their valuable comments and feedback on this article.

Over the past few months, we have invested significant time and effort in building cutting-edge infrastructure based on zk-SNARK succinct proofs. This next-generation platform enables developers to build new kinds of blockchain applications that were not possible before.

In the course of this development work, we have tested and used multiple zero-knowledge proof (ZKP) development frameworks. While this journey has been rewarding, we realize that the sheer variety of ZKP frameworks makes it hard for new developers to find the one that best suits their use case and performance requirements. With this pain point in mind, we believe a community evaluation platform that provides comprehensive, reproducible performance results would greatly facilitate the development of these new applications.

To meet this need, we launched "Pantheon", a community-driven, non-profit initiative to build an evaluation platform for zero-knowledge proof development frameworks. As a first step, the initiative encourages the community to share reproducible performance results for various ZKP frameworks. Our ultimate goal is to jointly create and maintain a widely recognized test platform covering low-level circuit development frameworks, high-level zkVMs and compilers, and even hardware acceleration providers. We hope this gives developers more reference points for performance comparison when choosing a framework, thereby accelerating the adoption of ZKP. At the same time, we hope that a commonly referenced set of performance results will drive iteration and improvement of the ZKP frameworks themselves. We will invest heavily in this plan and invite all like-minded community members to join us and contribute!

Step 1: Benchmarking circuit development frameworks with SHA-256

In this post, we take the first step towards building the ZKP Pantheon by providing a set of reproducible benchmark results for SHA-256 across a range of low-level circuit development frameworks. While other benchmark granularities and primitives are certainly possible, we chose SHA-256 because it is relevant to a wide range of ZKP use cases, including blockchain systems, digital signatures, zkDIDs, etc. It is also worth mentioning that we use SHA-256 in our own systems, so this was convenient for us! 😂

Our benchmark evaluates the performance of SHA-256 on various zk-SNARK and zk-STARK circuit development frameworks. Through this comparison, we seek to provide developers with insights into the efficiency and practicality of each framework. Our goal is that these findings will enable developers to make informed decisions when selecting the most suitable framework for their projects.

Proof System

In recent years, we have observed a surge in zero-knowledge proof systems. Keeping up with all the exciting advances in the field is challenging, and we carefully selected the following proof systems for testing based on maturity and developer adoption. Our goal is to provide a representative sample of different front-end/back-end combinations.

  1. Circom + snarkjs / rapidsnark: Circom is a popular DSL for writing circuits and generating R1CS constraints, and snarkjs can generate Groth16 or Plonk proofs for Circom circuits. rapidsnark is another prover for Circom; it generates Groth16 proofs and is generally much faster than snarkjs thanks to its use of the ADX extension and its parallelization of proof generation wherever possible.

  2. gnark: gnark is a comprehensive Golang framework from Consensys that supports Groth16, Plonk, and many other advanced features (a minimal usage sketch follows this list).

  3. Arkworks: Arkworks is a comprehensive Rust framework for zk-SNARKs.

  4. Halo2 (KZG): Halo2 is Zcash's Plonk-based zk-SNARK implementation. It comes with a highly flexible Plonkish arithmetization and supports many useful primitives such as custom gates and lookup tables. We use the fork of Halo2 with KZG support from the Ethereum Foundation and Scroll.

  5. Plonky2: Plonky2 is Polygon Zero's SNARK implementation based on PLONK and FRI. It uses the small Goldilocks field and supports efficient recursion. In our benchmarks, we target 100 bits of security and use the parameters that give the best proving time: a blowup factor of 8, 28 Merkle query rounds, and 16 proof-of-work bits. In addition, we set num_of_wires = 60 and num_routed_wires = 60.

  6. Starky: Starky is Polygon Zero's high-performance STARK framework. In our benchmarks, we target 100 bits of security and use the parameters that give the best proving time: a blowup factor of 2, 90 Merkle query rounds, and 10 proof-of-work bits.

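To make the Groth16 workflow concrete, below is a minimal gnark sketch (assuming a recent, v0.9-style gnark API) of the compile → setup → witness → prove → verify pipeline over the BN254 scalar field. A toy square-check circuit stands in for the SHA-256 gadget used in the actual benchmark, so the constraint count and timings here are not representative.

```go
package main

import (
	"fmt"

	"github.com/consensys/gnark-crypto/ecc"
	"github.com/consensys/gnark/backend/groth16"
	"github.com/consensys/gnark/frontend"
	"github.com/consensys/gnark/frontend/cs/r1cs"
)

// ToyCircuit stands in for the SHA-256 circuit used in the benchmark:
// it proves knowledge of a private X such that X*X equals the public Y.
type ToyCircuit struct {
	X frontend.Variable
	Y frontend.Variable `gnark:",public"`
}

func (c *ToyCircuit) Define(api frontend.API) error {
	api.AssertIsEqual(c.Y, api.Mul(c.X, c.X))
	return nil
}

func main() {
	// 1. Compile the circuit into R1CS over the BN254 scalar field.
	var circuit ToyCircuit
	ccs, err := frontend.Compile(ecc.BN254.ScalarField(), r1cs.NewBuilder, &circuit)
	if err != nil {
		panic(err)
	}
	fmt.Println("R1CS constraints:", ccs.GetNbConstraints())

	// 2. One-time Groth16 setup producing proving and verifying keys.
	pk, vk, err := groth16.Setup(ccs)
	if err != nil {
		panic(err)
	}

	// 3. Build the witness from a concrete assignment.
	assignment := ToyCircuit{X: 3, Y: 9}
	fullWitness, err := frontend.NewWitness(&assignment, ecc.BN254.ScalarField())
	if err != nil {
		panic(err)
	}
	publicWitness, err := fullWitness.Public()
	if err != nil {
		panic(err)
	}

	// 4. Prove and verify; proof generation is the step timed in this benchmark.
	proof, err := groth16.Prove(ccs, pk, fullWitness)
	if err != nil {
		panic(err)
	}
	if err := groth16.Verify(proof, vk, publicWitness); err != nil {
		panic(err)
	}
}
```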
The following table summarizes the above frameworks and the associated configurations used in our performance testing. This list is by no means exhaustive, and we will also investigate many state-of-the-art frameworks/techniques in the future (e.g., Nova, GKR, Hyperplonk).

Please note that these performance test results are only for the circuit development framework. We plan to publish a separate article in the future to perform performance tests on different zkVMs (e.g., Scroll, Polygon zkEVM, Consensys zkEVM, zkSync, Risc Zero, zkWasm) and IR compiler frameworks (e.g., Noir, zkLLVM).

| Framework | Arithmetization | Algorithm | Field | Other configuration |
| --- | --- | --- | --- | --- |
| Circom + snarkjs / rapidsnark | R1CS | Groth16 | BN254 scalar | |
| gnark | R1CS | Groth16 | BN254 scalar | |
| Arkworks | R1CS | Groth16 | BN254 scalar | |
| Halo2 (KZG) | Plonkish | KZG | BN254 scalar | |
| Plonky2 | Plonk | FRI | Goldilocks | blowup factor = 8, proof-of-work bits = 16, query rounds = 28, num_of_wires = 60, num_routed_wires = 60 |
| Starky | AIR | FRI | Goldilocks | blowup factor = 2, proof-of-work bits = 10, query rounds = 90 |

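As a rough cross-check of the FRI parameters above (using the commonly cited conjectured soundness estimate of roughly query rounds × log2(blowup factor) + proof-of-work bits, and ignoring other terms), both configurations land on the 100-bit target:

  • Plonky2: 28 × log2(8) + 16 = 28 × 3 + 16 = 100 bits

  • Starky: 90 × log2(2) + 10 = 90 × 1 + 10 = 100 bits
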
Performance Evaluation Methodology

To benchmark these proof systems, we compute the SHA-256 hash of N bytes of data, with N = 64, 128, ..., up to 64 KB (Starky is an exception: its circuit repeats the SHA-256 computation over a fixed 64-byte input while keeping the same total number of message blocks). The benchmark code and SHA-256 circuit configurations can be found in this repository.

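For context, SHA-256 consumes its input in 64-byte message blocks after padding (one 0x80 byte plus an 8-byte length field), so an N-byte preimage corresponds to roughly ceil((N + 9) / 64) blocks; a 64 KB input is therefore about 1,025 blocks, and circuit size grows essentially linearly in N (modulo how each circuit handles padding).
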
In addition, we measured the following performance metrics for each system (a rough measurement sketch follows this list):

  • Proof generation time (including witness generation time)

  • Peak memory usage during proof generation

  • Average CPU utilization during proof generation (this metric reflects how well proof generation is parallelized)

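The exact measurement setup lives in the benchmark repository; purely as an illustration of what these metrics mean, here is a rough Go sketch that runs a prover as a child process and reports wall-clock time, peak resident memory, and average CPU utilization (Unix-only; the prover binary name and flag are hypothetical placeholders).

```go
package main

import (
	"fmt"
	"os/exec"
	"runtime"
	"syscall"
	"time"
)

func main() {
	// Hypothetical prover invocation; substitute the actual benchmark binary.
	cmd := exec.Command("./prove_sha256", "--preimage-bytes", "4096")

	start := time.Now()
	if err := cmd.Run(); err != nil {
		panic(err)
	}
	wall := time.Since(start)

	// Resource usage of the finished child process (Unix only).
	ru := cmd.ProcessState.SysUsage().(*syscall.Rusage)

	// Proof generation time (wall clock) and peak resident set size.
	// Note: Maxrss is reported in kilobytes on Linux and in bytes on macOS.
	fmt.Printf("proof generation time: %v\n", wall)
	fmt.Printf("peak RSS: %d (kB on Linux, bytes on macOS)\n", ru.Maxrss)

	// Average CPU utilization = total CPU time / wall time, as a percentage.
	cpuTime := cmd.ProcessState.UserTime() + cmd.ProcessState.SystemTime()
	totalPct := 100 * float64(cpuTime) / float64(wall)
	fmt.Printf("CPU utilization: %.0f%% (%.1f%% per core over %d cores)\n",
		totalPct, totalPct/float64(runtime.NumCPU()), runtime.NumCPU())
}
```
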
Note that we deliberately leave proof size and proof verification cost out of scope, since both can be mitigated by composing with Groth16 or KZG before going on-chain.

Machines

We performed performance tests on two different machines:

  • Linux server: 20 cores @ 2.3 GHz, 384 GB RAM

  • Macbook M1 Pro: 10 cores @ 3.2 GHz, 16 GB RAM

The Linux server is used to simulate a scenario with many CPU cores and ample memory. The Macbook M1 Pro, which is usually used for research and development, has a more powerful CPU but fewer cores.

We enabled multithreading where available, but did not use GPU acceleration in these benchmarks. We plan to benchmark GPU acceleration in the future.

Performance test results

Number of Constraints

Before diving into the detailed results, it is useful to first get a sense of the complexity of SHA-256 by looking at the number of constraints in each proof system. Note that constraint counts under different arithmetization schemes cannot be compared directly.

The results below are for a preimage size of 64 KB. Results for other preimage sizes scale roughly linearly.

  • Circom, gnark, and Arkworks all use R1CS arithmetization, and the number of R1CS constraints for hashing 64 KB with SHA-256 is roughly between 30M and 45M. The differences between Circom, gnark, and Arkworks likely come down to implementation and configuration differences.

  • Halo2 and Plonky2 both use Plonkish arithmetization, with row counts between 2^22 and 2^23. Halo2's SHA-256 implementation is considerably more efficient than Plonky2's thanks to its use of lookup tables.

  • Starky uses AIR arithmetization, where the execution trace table has 2^16 transition steps (a rough per-block estimate follows this list).

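As a back-of-the-envelope check, a 64 KB input contains roughly 1,024 64-byte message blocks, so the Groth16 figures below work out to very roughly 30k to 45k R1CS constraints per SHA-256 compression, while Starky's 2^16 = 65,536 trace steps come out to about 64 rows per block, which lines up with the 64 rounds of the SHA-256 compression function.
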
| Proof system | Number of constraints (64 KB SHA-256) |
| --- | --- |
| Circom | 32M |
| gnark | 45M |
| Arkworks | 43M |
| Halo2 | 4M rows (K = 22) |
| Plonky2 | 8M rows (K = 23) |
| Starky | 2^16 transition steps |

 

Proof generation time

 

[Figure 1] Proof generation time of each framework for SHA-256 over various preimage sizes, measured on the Linux server. Our main findings:

  • For SHA-256, the Groth16 frameworks (rapidsnark, gnark, and Arkworks) generate proofs faster than the Plonk frameworks (Halo2 and Plonky2). This is because SHA-256 consists mostly of bitwise operations where the wire values are either 0 or 1. For Groth16, this reduces most of the computation from elliptic curve scalar multiplication to elliptic curve point addition. However, the wire values are not directly used in Plonk's computations, so the special wire structure in SHA-256 does not reduce the amount of computation required in the Plonk frameworks.

  • Among all Groth16 frameworks, gnark and rapidsnark are 5 to 10 times faster than Arkworks and snarkjs. This is due to their superior ability to parallelize proof generation using multiple cores. Gnark is 25% faster than rapidsnark.

  • Among the Plonk frameworks, Plonky2's SHA-256 is 50% slower than Halo2's for larger preimage sizes (>= 4 KB). This is because Halo2's implementation makes heavy use of lookup tables to speed up bitwise operations, resulting in half as many rows as Plonky2. However, if we compare Plonky2 and Halo2 at the same number of rows (e.g., SHA-256 over 2 KB in Halo2 vs. SHA-256 over 4 KB in Plonky2), Plonky2 is 50% faster than Halo2. If SHA-256 were implemented with lookup tables in Plonky2, we would expect Plonky2 to be faster than Halo2, despite Plonky2's larger proof size.

  • On the other hand, when the preimage size is small (<= 512 bytes), Halo2 is slower than Plonky2 (and the other frameworks) because the fixed setup cost of the lookup tables dominates. As the preimage size increases, Halo2 becomes more competitive: its proof generation time stays roughly constant for preimage sizes up to 2 KB and then grows almost linearly, as shown in the figure.

  • As expected, Starky’s proof generation time is much shorter (5x-50x) than any SNARK framework, but this comes at the expense of larger proof size.

  • Also note that even though the circuit size scales linearly with the preimage size, SNARK proof generation scales superlinearly due to the O(n log n) FFT (although this is not obvious in the chart because of the logarithmic scale).

We also benchmarked proof generation time on a Macbook M1 Pro, as shown in [Figure 2]. Note that rapidsnark was not included in this run because it does not support the arm64 architecture. To use snarkjs on arm64, we had to generate witnesses with WebAssembly, which is slower than the C++ witness generation used on the Linux server.

A few additional observations when running performance tests on the Macbook M1 Pro:

  • All frameworks except Starky hit out-of-memory (OOM) errors or start using swap memory (resulting in slower proving times) as the preimage size grows. Specifically, the Groth16 frameworks (snarkjs, gnark, Arkworks) start using swap when the preimage size is >= 8 KB, and gnark runs out of memory when the preimage size is >= 64 KB. Halo2 hits the memory limit when the preimage size is >= 32 KB. Plonky2 starts using swap when the preimage size is >= 8 KB.

  • The FRI-based frameworks (Starky and Plonky2) are about 60% faster on the Macbook M1 Pro than on the Linux server, while the other frameworks show similar proving times on both machines. So even though Plonky2 does not use lookup tables, it achieves almost the same proving time as Halo2 on the Macbook M1 Pro. The main reason is that the Macbook M1 Pro has a more powerful CPU but fewer cores: FRI mostly performs hashing, which benefits from higher clock speeds but is not as parallelizable as KZG or Groth16.

 

Peak memory usage

[Figure 3] and [Figure 4] show the peak memory usage during proof generation on the Linux server and the Macbook M1 Pro, respectively. Based on these results, we can make the following observations:

  • Among all SNARK frameworks, rapidsnark is the most memory efficient. We also see that Halo2 uses more memory when the preimage size is small, due to the fixed setup cost of the lookup tables, but consumes less memory overall when the preimage size is large.

  • Starky is more than 10 times more memory efficient than the SNARK frameworks, in part because it uses fewer rows.

  • Note that peak memory usage on the Macbook M1 Pro stays relatively flat at larger preimage sizes because swap memory is being used.

 

CPU Utilization

We evaluate how well each proof system parallelizes by measuring the average CPU utilization during proof generation for SHA-256 with a 4 KB preimage. The following table shows the average CPU utilization on the Linux server (20 cores) and the Macbook M1 Pro (10 cores), with the average per-core utilization in parentheses.

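For example, gnark's 1624% total utilization on the 20-core Linux server corresponds to an average of 1624% / 20 ≈ 81.2% per core, which is the figure shown in parentheses in the table below.
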
The main observations are as follows:

  • Gnark and rapidsnark show the highest CPU utilization on the Linux server, indicating that they are able to effectively use multiple cores and parallelize proof generation. Halo2 also shows good parallelization performance.

  • Most frameworks achieve roughly twice the total CPU utilization on the Linux server compared to the Macbook M1 Pro, with the exception of snarkjs.

  • Although it was initially expected that the FRI-based frameworks (Plonky2 and Starky) might have difficulty using multiple cores effectively, they performed no worse than some of the Groth16 or KZG frameworks in our performance tests. It remains to be seen whether there will be a difference in CPU utilization on machines with more cores (e.g., 100 cores).

| Proof system | CPU utilization on Linux server (avg. per core) | CPU utilization on MBP M1 (avg. per core) |
| --- | --- | --- |
| snarkjs | 557% (27.85%) | 486% (48.6%) |
| rapidsnark | 1542% (77.1%) | N/A |
| gnark | 1624% (81.2%) | 720% (72%) |
| Arkworks | 935% (46.75%) | 504% (50.4%) |
| Halo2 (KZG) | 1227% (61.35%) | 588% (58.8%) |
| Plonky2 | 892% (44.6%) | 429% (42.9%) |
| Starky | 849% (42.45%) | 335% (33.5%) |

 

Conclusion and future research

This post provides a comprehensive comparison of SHA-256 benchmark results across various zk-SNARK and zk-STARK development frameworks. Through this comparison, we gain insights into the efficiency and practicality of each framework, in the hope of helping developers who need to generate succinct proofs for SHA-256 computations. We found that the Groth16 frameworks (e.g., rapidsnark, gnark) generate proofs faster than the Plonk frameworks (e.g., Halo2, Plonky2). Lookup tables in Plonkish arithmetization significantly reduce the constraint count and proving time of SHA-256 at larger preimage sizes. In addition, gnark and rapidsnark demonstrate an excellent ability to leverage multiple cores and parallelize proof generation. On the other hand, Starky's proof generation is much faster, but at the cost of a much larger proof size. In terms of memory efficiency, rapidsnark and Starky outperform the other frameworks.

 

As the first step in building the zero-knowledge proof evaluation platform "Pantheon", we acknowledge that these benchmark results are far from the comprehensive test platform we ultimately hope to build. We welcome feedback and criticism, and invite everyone to contribute to this initiative to lower the barrier for developers to adopt zero-knowledge proofs. We are also willing to fund independent contributors to cover the computing costs of large-scale benchmarking. Together, we hope to improve the efficiency and practicality of ZKP and benefit the community more broadly.

 

Finally, we would like to thank the Polygon Zero team, the gnark team at Consensys, Pado Labs, and the Delphinus Lab team for their valuable review and feedback on the performance test results.