Technical: Analyzing the advantages and disadvantages of FPGA and GPU accelerated zero-knowledge proof computing

Author: Star Li
This article is only for industry learning and communication purposes and does not constitute any investment reference. If you need to quote, please indicate the source. If you want to reprint, please contact the IOSG team to obtain authorization and reprint instructions. Special thanks to the author Star Li for providing the content!
Zero-knowledge proof technology is increasingly used, including privacy proof, computational proof, consensus proof, etc. While looking for more and better application scenarios, many people gradually find that zero-knowledge proof performance is a bottleneck. The Trapdoor Tech team has been conducting in-depth research on zero-knowledge proof technology since 2019, and has been exploring efficient zero-knowledge proof acceleration solutions. GPU or FPGA is a common acceleration platform on the market. This article starts with the calculation of MSM and analyzes the advantages and disadvantages of FPGA and GPU acceleration of zero-knowledge proof calculation.
TL;DR
ZKP is a technology with broad prospects in the future. More and more applications are beginning to adopt zero-knowledge proof technology. However, there are many ZKP algorithms, and various projects use different ZKP algorithms. At the same time, the computing performance of ZKP proof is relatively poor. This article analyzes the MSM algorithm, elliptic curve point addition algorithm, Montgomery multiplication algorithm, etc. in detail, and compares the performance difference between GPU and FPGA in BLS12_381 curve point addition. In general, in terms of ZKP proof calculation, GPU has obvious advantages in the short term, with high throughput, high cost performance, programmability, etc. FPGA has certain advantages in power consumption. In the long run, there may be FPGA chips suitable for ZKP calculations, or ASIC chips customized for ZKP.
ZKP algorithm is complex
ZKP is a general term for zero-knowledge proof technology. There are two main categories: zk-SNARK and zk-STARK. The most common zk-SNARK algorithms are Groth16, PLONK, PLOOKUP, Marlin and Halo/Halo2. The iteration of zk-SNARK algorithm is mainly along two directions: 1/whether trusted setup is required 2/the performance of the circuit structure. The advantage of zk-STARK algorithm is that it does not require trusted setup, but the verification calculation amount is logarithmic linear.
In terms of the application of zk-SNARK/zk-STARK algorithms, the zero-knowledge proof algorithms used by different projects are relatively scattered. In the application of zk-SNARK algorithms, because PLONK/Halo2 algorithms are universal (no trusted setup is required), their applications may increase.
PLONK proves computational effort
Taking the PLONK algorithm as an example, let’s analyze the computational complexity of the PLONK proof.
The computational complexity of the PLONK proof consists of four parts:
1/ MSM - Multiple Scalar Multiplication. MSM is often used to calculate polynomial commitments.
2/ NTT calculations - polynomials convert between point value and coefficient representation.
3/ Polynomial calculations - polynomial addition, subtraction, multiplication and division. Polynomial evaluation (Evaluation) and more.
4/ Circuit Synthesize - Circuit Synthesis. This part of the calculation is related to the size/complexity of the circuit.
The computational workload of the Circuit Synthesize part generally involves more judgment and loop logic, and has a lower degree of parallelism, which is more suitable for CPU computing. Generally speaking, zero-knowledge proof acceleration generally refers to the computational acceleration of the first three parts. Among them, MSM has the largest computational workload, followed by NTT.
 What's MSM
MSM (Multiple Scalar Multiplication) means that given a series of points and scalars on an elliptic curve, the points corresponding to the results of adding these points are calculated.
For example, given a series of points on an elliptic curve:
Given a fixed set of Elliptic curve points from one specified curve:
[G_1, G_2, G_3, ..., G_n]
And the random coefficients:
and a randomly sampled finite field elements from specified scalar field:
[s_1, s_2, s_3, ..., s_n]
MSM is the calculation to get the Elliptic curve point Q:
Q = \sum_{i=1}^{n}s_i*G_i
The industry generally uses the Pippenger algorithm to optimize MSM calculations. Let's take a closer look at the schematic diagram of the Pippenger algorithm process:
The calculation process of the Pippenger algorithm is divided into two steps:
1/ Scalar is divided into Windows. If Scalar is 256 bits and a Window is 8 bits, then all Scalars are divided into 256/8=32 Windows. For each layer of Window, a "Bucket" is used to temporarily store the intermediate results. GW_x is the point of the accumulated result on a layer. Calculating GW_x is also relatively simple. Traverse each Scalar in a layer in turn, and add the corresponding G_x to the corresponding Bucket bit according to the value of the Scalar layer as the Index. In fact, the principle is also relatively simple. If the coefficients of two points are the same, add the two points first and then do a Scalar addition, instead of doing two Scalar additions twice and then accumulating.
2/ The points calculated by each Window are then accumulated by double-add to get the final result.
There are many variants of the Pippenger algorithm. In any case, the underlying calculation of the MSM algorithm is point addition on the elliptic curve. Different optimization algorithms correspond to different numbers of point additions.
Elliptic Curve Point Add
You can look at various algorithms for point addition on elliptic curves in "short Weierstrass" form from this website.
http://www.hyperelliptic.org/EFD/g1p/auto-shortw-jacobian-0.html#addition-madd-2007-bl
Assuming that the projective coordinates of two points are (x1, y1, z1) and (x2, y2, z2) respectively, the result of point addition (x3, y3, z3) can be calculated by the following formula.
The reason for giving the calculation process in detail is to show that the entire calculation process is mostly integer operations. The bit width of the integer depends on the parameters of the elliptic curve. Here are some common elliptic curve bit widths:
 BN256 - 256bits BLS12_381 - 381bits BLS12_377 - 377bits
It is important to note that these integer operations are performed on the modular domain. Modular addition and subtraction are relatively simple, so let's focus on the principle and implementation of modular multiplication.
 模乘（Modular Muliplication）
Given two values ​​on the modular field: x and y. Modular multiplication refers to x*y mod p. Note that the bit width of these integers is the bit width of the elliptic curve. The classic algorithm for modular multiplication is Montgomery Muliplication. Before performing Montgomery multiplication, the multiplicand needs to be converted to Montgomery representation:
The Montgomery multiplication formula is as follows:
There are many algorithms for implementing Montgomery multiplication: CIOS (Coarsely Integrated Operand Scanning), FIOS (Finely Integrated Operand Scanning), and FIPS (Finely Integrated Product Scanning), etc. This article does not go into the details of the implementation of various algorithms, and interested readers can study them on their own.
In order to compare the performance differences between FPGA and GPU, the most basic algorithm implementation method is selected:
Simply put, modular multiplication algorithms can be further divided into two types of calculations: large number multiplication and large number addition. After understanding the calculation logic of MSM, you can choose the performance of modular multiplication (Throughput) to compare the performance of FPGA and GPU.
Observation and thinking
With this FPGA design, we can estimate the point addition throughput that the entire VU9P can provide on the BLS12_381 elliptic curve. One point addition (add_mix mode) requires about 12 modular multiplications. The system clock of the FPGA is 450M.
Under the same modular multiplication/modular addition algorithm, using the same point addition algorithm, the point addition Troughput of Nvidia 3090 (taking data transmission factors into account) exceeds 500M/s. Of course, the entire calculation involves multiple algorithms, and some algorithms may be suitable for FPGAs and some algorithms may be suitable for GPUs. The reason for using the same algorithm for comparison is to compare the core computing capabilities of FPGAs and GPUs.
Based on the above results, let’s summarize the comparison between GPU and FPGA in terms of ZKP proof performance:
 Summarize
More and more applications are beginning to adopt zero-knowledge proof technology. However, there are many ZKP algorithms, and various projects use different ZKP algorithms. From our practical engineering experience, FPGA is an option, but GPU is currently a cost-effective option. FPGA prefers deterministic computing and has advantages in latency and power consumption. GPU has high programmability, a relatively mature high-performance computing framework, a short development iteration cycle, and prefers scenarios that require throughput.