Source: "Understanding Blockchain Latency and Throughput" by Lefteris Kokoris-Kogias, Paradigm.xyz
How to measure a (blockchain) system correctly is rarely discussed, yet it is the most important step in designing and evaluating a system. There are many consensus protocols and variants, each with different performance and scalability trade-offs.
However, there is still no reliable, universally agreed-upon method that allows for a reasonable apples-to-apples comparison. In this article, we outline an approach inspired by how systems in the datacenter are measured, and explore some common mistakes to avoid when evaluating a blockchain system.
Key metrics and their interaction
When developing blockchain systems, we should take two important metrics into consideration: latency and throughput.
The first thing users care about is transaction latency: the time between initiating a transaction or payment and receiving confirmation that it is valid (for example, that the transaction initiator has enough money).
In classical BFT systems (such as PBFT, Tendermint, Tusk, and Narwhal), a transaction is finalized as soon as it is confirmed, whereas under longest-chain consensus (such as Nakamoto Consensus, Solana/Ethereum PoS), a transaction may be included in a block and later reorged. As a result, we need to wait until the transaction is "k blocks deep" before it is finalized, which results in a latency much greater than a single confirmation.
Second, system designers generally care about throughput: the total load the system handles per unit of time, usually expressed in transactions per second (TPS).
At first glance, these two key metrics look like simple inverses of each other: since throughput is measured in transactions per second and latency in seconds, it is natural to assume that throughput = load / latency.
However, this is not the case. Many systems tend to produce plots that show either throughput or latency on the y-axis and the number of nodes on the x-axis, which makes that calculation impossible. A better plot instead puts throughput and latency on the same graph; the relationship between them is non-linear, which makes the plot clear and easy to read.
Under low contention, latency is constant and throughput can be varied simply by changing the load on the system. This happens because the minimum cost of sending a transaction is fixed when contention is low and the queuing delay is zero, so "whatever comes in, comes right out."
Under high contention, throughput is constant, but simply changing the load can cause latency to vary.
This is because the system is already overloaded, and adding more load makes the waiting queues grow without bound. Stranger still, latency appears to vary with the length of the experiment, an artifact of the ever-growing queues.
These behaviors can be seen on the classic "hockey stick" or "L-shaped" plot, depending on the inter-arrival distribution (discussed below). The key takeaway of this post is therefore that we should measure in the hot zone, where both throughput and latency affect our benchmark, rather than at the edges, where only one of them matters.
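To make the low-contention and high-contention regimes concrete, here is a minimal simulation sketch (my own illustration, not from the original post) of a single-server FIFO queue fed by Poisson arrivals; the service time and load levels are arbitrary assumptions. At low load the average latency stays near the fixed service time, while near capacity the queue, and therefore latency, grows rapidly.

    import random

    def simulate_queue(arrival_rate, service_time=0.001, duration=60.0, seed=0):
        # Single-server FIFO queue with Poisson arrivals and a fixed service time.
        # All parameter values are illustrative assumptions.
        rng = random.Random(seed)
        t, server_free_at, latencies = 0.0, 0.0, []
        while t < duration:
            t += rng.expovariate(arrival_rate)      # next arrival
            start = max(t, server_free_at)          # wait if the server is busy
            server_free_at = start + service_time   # serve the request
            latencies.append(server_free_at - t)    # queuing delay + service time
        return len(latencies) / duration, sum(latencies) / len(latencies)

    # Capacity is 1 / 0.001 = 1000 req/s; watch latency explode as load approaches it.
    for load in (100, 500, 900, 990):
        tps, lat = simulate_queue(load)
        print(f"load={load:4d} req/s  throughput={tps:7.1f} tps  avg latency={lat * 1000:7.2f} ms")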
Measurement Methodology
When conducting an experiment, the experimenter has three main design options:
Open loop vs. closed loop
There are two main ways to control the flow of requests to a target. An open-loop system is modeled with n = ∞ clients sending requests to the target at a rate λ according to some inter-arrival distribution (e.g., Poisson). A closed-loop system limits the number of outstanding requests at any given time. Whether a deployment is open- or closed-loop is a property of that particular deployment; the same system can be deployed in different scenarios.
For example, a key-value store can serve thousands of application servers in an open-loop deployment, or just a few blocking clients in a closed-loop deployment.
Testing the correct deployment scenario is essential, as open-loop systems can have large waiting queues and thus longer latencies than closed-loop systems, where latency is often limited by the number of potential outstanding requests. Generally, blockchain protocols can be used by any number of clients, so it is more accurate to evaluate them in an open-loop environment.
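As a rough sketch of the two client models (my own example; send_transaction is a hypothetical stand-in for whatever client call the system under test exposes), an open-loop generator issues requests on a Poisson schedule regardless of replies, while a closed-loop generator keeps a fixed number of clients, each with exactly one request outstanding:

    import random, threading, time

    def send_transaction(tx):
        # Hypothetical stand-in for submitting a transaction and waiting for the reply.
        time.sleep(0.005)

    def open_loop(rate_tps, duration_s):
        # Open loop: requests are issued on a Poisson schedule, never waiting for replies.
        rng, deadline, i = random.Random(0), time.time() + duration_s, 0
        while time.time() < deadline:
            threading.Thread(target=send_transaction, args=(i,)).start()
            i += 1
            time.sleep(rng.expovariate(rate_tps))   # exponential inter-arrival gap

    def closed_loop(num_clients, duration_s):
        # Closed loop: each client blocks on its reply before sending the next request.
        deadline = time.time() + duration_s
        def client(cid):
            while time.time() < deadline:
                send_transaction(cid)
        workers = [threading.Thread(target=client, args=(c,)) for c in range(num_clients)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()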
Inter-arrival distribution for synthetic benchmarks
When creating synthetic workloads, we inevitably have to ask: how are requests submitted to the system? Many systems preload transactions before measuring, but this skews the measurement because the system starts executing from an unusual state of zero load. Additionally, the preloaded requests are already in main memory and therefore bypass the network stack.
A better approach would be to send requests at a fixed rate (e.g., 1000 TPS), which would result in an L-shaped graph (orange line) because the system's capacity is optimally used.
However, open systems rarely behave in such a predictable way; instead, they go through periods of high and low load. To model this, we can use a probabilistic inter-arrival distribution, usually based on the Poisson distribution. This results in the "hockey stick" graph (blue line), because Poisson bursts cause some queuing delay even when the average rate stays below the optimum (the maximum capacity). This is good for us, however, because we get to see how the system handles high load and how quickly it recovers when the load returns to normal.
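The difference between the two workloads is easy to see in the send schedules themselves. The sketch below (my own example) generates timestamps for a fixed-rate schedule and for a Poisson process with the same average rate; the 1,000 TPS figure is just the rate used in the example above.

    import random

    def fixed_rate_schedule(rate_tps, duration_s):
        # Deterministic schedule: exactly one request every 1/rate seconds.
        step = 1.0 / rate_tps
        return [i * step for i in range(int(duration_s * rate_tps))]

    def poisson_schedule(rate_tps, duration_s, seed=0):
        # Poisson process: exponentially distributed gaps with mean 1/rate.
        rng, t, times = random.Random(seed), 0.0, []
        while True:
            t += rng.expovariate(rate_tps)
            if t >= duration_s:
                return times
            times.append(t)

    # Same average load, very different burstiness: the Poisson schedule has
    # clusters of closely spaced requests that cause transient queuing.
    print(len(fixed_rate_schedule(1000, 10)), len(poisson_schedule(1000, 10)))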
Warm-up phase
The final point to consider is when to start measuring. We want the pipeline to be full of transactions before we begin; otherwise, we will be measuring warm-up latency. Ideally, this is handled by measuring latency during the warm-up phase until the measurements follow the expected distribution, and only then starting the actual measurement window.
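In practice, one simple heuristic (a sketch under my own assumptions, not the authors' tooling) is to discard leading samples until the running statistics stop drifting, e.g. until the mean latency of one window roughly matches the mean of the next window:

    def drop_warmup(latencies, window=500, tolerance=0.05):
        # Skip forward in fixed-size windows until two consecutive windows have
        # roughly the same mean latency, i.e. the system has reached steady state.
        for start in range(0, len(latencies) - 2 * window, window):
            a = sum(latencies[start:start + window]) / window
            b = sum(latencies[start + window:start + 2 * window]) / window
            if abs(a - b) <= tolerance * max(a, b):
                return latencies[start:]
        return latencies   # never stabilized; keep everything and investigate

    # Usage: steady_state = drop_warmup(measured_latencies)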
How to compare
The final challenge is to reasonably compare various deployments of the system. Again, the difficulty is that latency and throughput are interdependent, so it may be difficult to produce a fair throughput/node count chart.
Rather than simply pushing every system to its highest throughput (where latency becomes meaningless), the best approach is to define a service level objective (SLO) and measure the throughput each system achieves at that SLO. A good way to visualize this is to draw a horizontal line on the throughput/latency graph at the SLO latency and sample the point where it intersects the curve.
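Concretely, once we have sampled (throughput, latency) points along the throughput/latency curve, the comparison reduces to reporting the highest throughput whose latency still meets the SLO. A minimal sketch (my own, with made-up sample points):

    def throughput_at_slo(samples, slo_s):
        # samples: (throughput_tps, latency_s) pairs measured at increasing load.
        # Returns the largest throughput whose latency meets the SLO, i.e. where
        # the horizontal SLO line intersects the measured curve.
        feasible = [tps for tps, lat in samples if lat <= slo_s]
        return max(feasible) if feasible else 0.0

    # Hypothetical measurements: latency climbs as the offered load grows.
    curve = [(1000, 0.8), (2000, 0.9), (3000, 1.2), (4000, 2.1), (4500, 6.5)]
    print(throughput_at_slo(curve, slo_s=5.0))   # -> 4000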
But I set a 5 second SLO and it only took 2 seconds
One might be tempted to increase the load here to take advantage of the slightly higher throughput available beyond the saturation point. However, this is dangerous. If the system runs underprovisioned, an unexpected burst of requests will drive it to full saturation, latency will spike, and the SLO will quickly be violated. In essence, operating beyond the saturation point is an unstable equilibrium.
Therefore, there are two points to consider:
First, overprovision the system. Essentially, the system should run below the saturation point so that it can absorb bursts in the inter-arrival distribution without queuing latency building up.
Second, if there is room below the SLO, increase the batch size. This adds load to the critical path of the system without increasing queuing latency, giving you the higher-throughput-for-higher-latency trade-off you are looking for.
I'm generating a huge load, how can I measure latency?
When the load on the system is high, trying to access the local clock and timestamp every transaction that arrives at the system can lead to skewed results.
Instead, there are two more viable options. The first and simplest is to sample transactions: for example, a magic number could be embedded in some transactions, and those are the transactions for which clients keep timers. After they commit, anyone can inspect the blockchain to determine when those transactions were committed and thus compute their latency. The main advantage of this approach is that it does not perturb the inter-arrival distribution. However, because some transactions must be modified, it may be considered "hacky".
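A sketch of the sampling idea (the marker, field layout, and submit call are my own assumptions, not a real client API): embed a recognizable magic value in a small fraction of transactions, record when each marked transaction was sent, and later join against the commit times read from the chain.

    import random, time

    MAGIC = b"\xfe\xed"        # hypothetical marker embedded in sampled transactions
    sent_at = {}               # tx id -> local send timestamp

    def maybe_mark_and_send(tx_id, payload, submit, sample_rate=0.01):
        # Submit a transaction, tagging roughly 1% of them for latency measurement.
        if random.random() < sample_rate:
            payload = MAGIC + payload       # mark it so it can be found on chain later
            sent_at[tx_id] = time.time()
        submit(tx_id, payload)              # `submit` is whatever client call you use

    def latencies_from_chain(committed):
        # committed: iterable of (tx_id, commit_unix_time) read back from the blockchain.
        return [commit_time - sent_at[tx_id]
                for tx_id, commit_time in committed if tx_id in sent_at]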
A more systematic approach is to use two load generators. The first is the main load generator, which follows the Poisson distribution. The second is a request generator used to measure latency, with a much lower load; relative to the rest of the system, it can be treated as a single client. Even if the system sends a reply to every request (as some systems, such as key-value stores, do), we can simply drop all the replies at the load generator and measure latency only from the request generator.
The only tricky part is that the actual inter-arrival distribution is the sum of two random variables; however, the sum of two Poisson distributions is still a Poisson distribution, so the math isn’t that hard :).
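That claim is easy to sanity-check: superimposing two independent Poisson streams with rates λ1 and λ2 yields a Poisson stream with rate λ1 + λ2, so a low-rate measurement generator barely perturbs the workload. A small simulation sketch (my own, with arbitrary rates):

    import random

    def poisson_arrivals(rate, duration, seed):
        # Arrival times of a Poisson process: exponential gaps with mean 1/rate.
        rng, t, times = random.Random(seed), 0.0, []
        while True:
            t += rng.expovariate(rate)
            if t >= duration:
                return times
            times.append(t)

    main  = poisson_arrivals(10_000, 60, seed=1)   # main load generator
    probe = poisson_arrivals(10, 60, seed=2)       # low-rate latency probe
    merged = sorted(main + probe)
    print(len(merged) / 60)   # observed rate is close to 10,000 + 10 requests per second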
Summary
Measuring large-scale distributed systems is critical to identifying bottlenecks and analyzing their expected behavior under stress. Hopefully, by using the methods described above, we can all take a first step toward a common language that will ultimately make blockchain systems better suited to the work they do and the promises they make to end users.
In future work, we plan to adapt this approach to existing consensus mechanisms, so if you’re interested, please get in touch on Twitter!
Acknowledgements: These are all lessons learned during the design and implementation of Narwhal & Tusk (Best Paper Award @ EuroSys 2022) together with my co-authors, as well as from comments on earlier drafts by Marios Kogias, Joachim Neu, Georgios Konstantopoulos, and Dan Robinson.