Performance Incident Report

2020-02-17

Last week was a tough one. It was a good week for crypto and the exchange businesses: the Bitcoin price was up, and so were trading volumes. But for a number of our users and for our engineers, it was a tough week. We had a number of performance issues, which negatively impacted the accessibility of our platform. As always, we feel it is important to maintain transparency during tough times, and we will openly disclose some of the issues we experienced.

The difference between BTC at $10,000 this time around and the previous times is that there are a lot more users now. While this is a solid sign of strong recovery for the crypto market, it also puts significant load on our systems.

Over the past few days, we experienced two main problems:

1. Intermittent UI errors of “Too many requests” and “5xx internal error,” plus API timeout errors. This was due to our middle layer service being overloaded very quickly. Each affected user will likely retry repeatedly, further increasing the load on the system. The issue was resolved in the short term by increasing the resource levels. However, there are limitations to this approach. Due to the complex nature of this module, it does not auto-scale well: a new instance takes minutes to sync up the initial snapshot before it can handle normal traffic. Further work to address this was already in progress but was not finished before the massive traffic hit us just a few days ago. This has moved up in priority and will be fully deployed in the next few days. There are also optimizations on the client side to improve error handling so that retries do not further increase the load on the system.

2. Market data/order/balance update delays. We had multiple issues with our message brokers as well. One message broker sub-component, which typically pushes out 2.5 GB+ of data per second, suddenly dropped in throughput by 100x, causing messages to back up and delaying order book and user balance updates. In another instance, a Kafka cluster crashed with multiple successive node failures under peak traffic. Restarting it solved the immediate problem/symptoms. Midterm improvements are underway to further split the messages into separate topics to be handled by multiple Kafka clusters. This is estimated to increase the load handling capacity of this component by 10x or so and will roll out this week as well. Longer-term solutions are also in progress to increase capacity further.
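
To illustrate the retry problem described in item 1, here is a minimal sketch of one common client-side mitigation: exponential backoff with random jitter. This is not our actual client code. The names RetryableError, backoff_delays and call_with_backoff, and all the default values, are hypothetical and chosen only for the example.

```python
import random
import time


class RetryableError(Exception):
    """Hypothetical marker for "Too many requests", 5xx and timeout failures."""


def backoff_delays(max_retries=5, base=0.5, cap=30.0, rng=random.random):
    """Yield one randomized delay (in seconds) per retry attempt.

    Uses "full jitter": each delay is drawn uniformly from
    [0, min(cap, base * 2**attempt)], so clients that failed at the
    same moment do not all retry at the same moment.
    """
    for attempt in range(max_retries):
        yield rng() * min(cap, base * 2 ** attempt)


def call_with_backoff(request, max_retries=5, base=0.5, cap=30.0, sleep=time.sleep):
    """Call request(); on RetryableError, wait and retry up to max_retries times."""
    for delay in backoff_delays(max_retries, base, cap):
        try:
            return request()
        except RetryableError:
            sleep(delay)
    # Final attempt: any error now propagates to the caller.
    return request()
```

A caller would wrap a single API request, for example call_with_backoff(lambda: fetch_order_book()), where fetch_order_book is again a made-up name. The growing, randomized waits spread retries out over time instead of letting every affected client hit an overloaded service at once.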

Looking at the bigger picture, we rolled out many features during the bear market. While we stress-tested them like crazy in our test environments, the test environments don’t always reflect live environments where we have tens of millions of users all around the world. There have been areas with “performance creep”, like adding a little bit more usage/stress to an existing message broker here and there, thinking we still have 10-100x performance headroom, while in reality, we may be down to 3-5x. And the market moves in big spikes: not 3-5x, but easily 10x+ in terms of system load.
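
The headroom arithmetic above can be sketched in a few lines. The capacity and load figures below are made-up units chosen only to reproduce the ratios mentioned in the text; they are not measurements from our systems, and the function names are hypothetical.

```python
def headroom(capacity, current_load):
    """How many times the current load the component can absorb."""
    return capacity / current_load


def survives_spike(capacity, current_load, spike_factor):
    """True if the load multiplied by spike_factor still fits within capacity."""
    return current_load * spike_factor <= capacity


# Illustrative numbers only (arbitrary units).
CAPACITY = 100.0
LOAD_AT_LAUNCH = 1.0    # 100x headroom when the broker was first sized
LOAD_AFTER_CREEP = 25.0  # many small additions later: 4x headroom

print(headroom(CAPACITY, LOAD_AT_LAUNCH))            # 100.0
print(headroom(CAPACITY, LOAD_AFTER_CREEP))          # 4.0
print(survives_spike(CAPACITY, LOAD_AFTER_CREEP, 10))  # False
```

With 4x headroom left, a 10x spike needs 2.5 times the available capacity, which is why a component that looked comfortable in testing can still fall over in production.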

On a positive note, these are all issues we can solve. Our architecture is sound, and we have one of the best and most capable teams in this industry. We will solve these issues quickly, with short-, mid- and long-term fixes. I won’t be able to guarantee all smooth sailing from here. We are bound to run into issues in the future as well, but we are confident we will solve them quickly.

In our short history, Binance has encountered many challenges, and we have solved them. Binance did not become an industry leader by doing the easy things; we pride ourselves on solving the difficult challenges and protecting our users while doing so.

If you feel you were unfairly affected by the issues that recently occurred on Binance.com, please submit a support ticket here in as much detail as possible and the team will review it ASAP.

We have always believed that transparency is the foundation of the blockchain-enabled world. We are not afraid of challenges and difficulties; more importantly, we have the courage and ability to be responsible. Protecting users is our core value. While we work hard to further optimize our systems, we will continue to disclose information transparently.

I apologize for any inconvenience caused, and please know how much we appreciate your support. As always, I will be active on Twitter if you need to reach me.

- CZ, CEO @Binance