Written by: Ingonyama
The rise of artificial intelligence has been remarkable. From basic algorithms to large language models (LLMs) like ChatGPT and Copilot, AI has been at the forefront of technological evolution. As these models interact with users and process large amounts of data and prompts, data privacy becomes particularly important. Large companies such as Amazon and Apple have restricted employee access to public APIs like ChatGPT to prevent the data leaks that can result from AI interactions. It is also reasonable to expect that regulations enforcing a certain level of user privacy protection will soon be introduced.
How do we ensure that the interactions, questions, and data shared with these models remain private?
Introduction to Fully Homomorphic Encryption (FHE)
In the field of cryptography, fully homomorphic encryption is a groundbreaking concept. Its appeal lies in a unique ability: it allows computations to be performed directly on encrypted data, without decrypting the data first, thus enabling private inference on sensitive information.
This feature ensures two important things: data remains secure during processing, and the model's intellectual property (IP) is fully protected.
Private Inference and Intellectual Property Protection
Today, "privacy" and "user experience" often seem mutually exclusive: you can't have both. People routinely trust third parties with their information in exchange for a better user experience. We believe these companies can strike a balance between user privacy and high-quality service, rather than forcing a choice between local solutions that are more private but less capable, and feature-rich services that sacrifice privacy.
Fully homomorphic encryption enables private inference while fully protecting the model's intellectual property. By performing computations on encrypted data, it keeps prompts completely confidential while also protecting the IP of the large language model.
Traditional encryption methods vs. FHE
In traditional encryption schemes, performing meaningful operations on encrypted data requires decrypting it first. However, decryption exposes the plaintext, leaving the data vulnerable to attack, even if it is decrypted only for a moment.
In contrast, fully homomorphic encryption operates directly on ciphertext, ensuring that sensitive information remains "invisible" throughout the entire computation.
Why FHE is important
The importance of fully homomorphic encryption is not limited to theory. Imagine a cloud computing service that processes data without ever decrypting it, or a medical database that analyzes sensitive patient records without accessing them in the clear. The potential applications of fully homomorphic encryption are vast and varied, ranging from secure voting systems to private searches of encrypted databases.
Mathematical foundations of FHE
Fully homomorphic encryption is based on the learning with errors (LWE) problem, a lattice-based construction believed to be quantum-resistant. In LWE, random noise is added to make the data unreadable to anyone without the key. Arithmetic operations can be performed on the encrypted data, but they generally increase the noise level: if too many operations are performed in succession, the data becomes unreadable by anyone, including the key holder. A scheme with this limitation is called somewhat homomorphic encryption (SHE).
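To make this concrete, here is a toy LWE sketch in Python (deliberately insecure parameters, chosen only for illustration): a bit is hidden behind a secret inner product plus noise, ciphertexts can be added homomorphically, and the noise terms add up with each operation.

```python
# Toy LWE demo -- illustrative only; parameters are far too small to be secure.
import random

n, q = 16, 1 << 15                            # lattice dimension and modulus (toy-sized)
s = [random.randrange(2) for _ in range(n)]   # binary secret key

def encrypt(m, noise=4):
    """Encrypt a bit m as (a, b) with b = <a,s> + e + m*(q/2) mod q."""
    a = [random.randrange(q) for _ in range(n)]
    e = random.randrange(-noise, noise + 1)   # small random noise term
    b = (sum(ai * si for ai, si in zip(a, s)) + e + m * (q // 2)) % q
    return a, b

def decrypt(ct):
    a, b = ct
    phase = (b - sum(ai * si for ai, si in zip(a, s))) % q
    # Decode: a phase near q/2 means 1, a phase near 0 (or q) means 0.
    return 1 if q // 4 <= phase < 3 * q // 4 else 0

def add(ct1, ct2):
    """Homomorphic XOR of two encrypted bits; the noise terms add up."""
    (a1, b1), (a2, b2) = ct1, ct2
    return [(x + y) % q for x, y in zip(a1, a2)], (b1 + b2) % q

assert decrypt(add(encrypt(0), encrypt(1))) == 1   # 0 XOR 1 = 1
```

Each `add` roughly doubles the worst-case noise; once the accumulated noise exceeds q/4, decoding fails for everyone, including the key holder, which is exactly the SHE limitation described above.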
To convert somewhat homomorphic encryption into fully homomorphic encryption, an operation that reduces the noise level is required. This operation is called "bootstrapping", and many fully homomorphic encryption schemes rely on it. In this article, we focus on fully homomorphic encryption over the torus (TFHE), which uses the algebraic structure of the mathematical torus to achieve fully homomorphic encryption.
Advantages of TFHE
Although each fully homomorphic encryption scheme has its own trade-offs, TFHE currently has the most efficient implementations in practical scenarios. Another important advantage of TFHE is programmable bootstrapping (PBS), which extends the usual bootstrapping operation to also evaluate univariate functions, such as the activation functions that are crucial in machine learning.
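Conceptually, PBS evaluates a univariate function encoded as a lookup table while refreshing the noise. As a plaintext analogy (not real TFHE code; in TFHE the table would be evaluated homomorphically on a ciphertext during bootstrapping), a 4-bit ReLU activation can be encoded like this:

```python
# Plaintext analogy of programmable bootstrapping (PBS): encode a
# univariate function as a lookup table over a small integer domain.
def make_lut(f, bits=4):
    """Tabulate f over signed `bits`-wide integers, clamped to the same range."""
    lo, hi = -(1 << (bits - 1)), 1 << (bits - 1)
    return {x: max(lo, min(hi - 1, f(x))) for x in range(lo, hi)}

relu_lut = make_lut(lambda x: max(0, x))

# In real TFHE the lookup happens on encrypted inputs; here we apply it
# to plaintexts just to show the programming model.
assert [relu_lut[x] for x in (-3, 0, 2)] == [0, 0, 2]
```

This is why the INT4 assumption below matters: the lookup table's size grows with the plaintext precision, so small integer types keep PBS practical.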
One disadvantage of TFHE is that it requires a PBS operation for every arithmetic operation performed in the computation, whereas other schemes allow batching of operations between bootstrapping operations.
Assumptions and Approximations
To estimate the time required for large language model (LLM) inference under fully homomorphic encryption, we make the following assumptions:
The number of arithmetic operations required per token is roughly 1–2 times the number of parameters in the model. This is a lower bound, since the entire model is used for each token, and we assume this lower bound is close enough for practical needs.
Each arithmetic operation in the large language model can be mapped to an arithmetic operation in TFHE. This is basically a statement about the size of variable types in both schemes. We assume that INT4 variables are sufficient for the large language model and feasible for TFHE.
Every arithmetic operation in the large language model must be mapped to an arithmetic operation in fully homomorphic encryption; that is, we cannot run part of the model unencrypted. A recent blog post by Zama considers FHE inference without this assumption, where most of the model is executed locally by the user without any encryption, and only a small part (e.g., a single attention head) runs on the model company's servers under fully homomorphic encryption. We argue that this approach does not actually protect the model's intellectual property: the user can either run the model without the missing head with only a slight loss of accuracy, as shown here, or relatively cheaply train the missing part to obtain results comparable to the original model.
Each arithmetic operation in TFHE requires a PBS (Programmable Bootstrapping). PBS is the main bottleneck of TFHE calculation.
The current state-of-the-art TFHE implementation is FPT, an FPGA implementation that computes one PBS every 35 microseconds.
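The assumptions above reduce the cost estimate to a one-line calculation: one PBS per parameter per token, at 35 microseconds per PBS. A quick sketch:

```python
# Back-of-envelope latency estimate under the assumptions above:
# ~1 TFHE operation (one PBS) per model parameter per token.
PBS_SECONDS = 35e-6    # FPT FPGA implementation: one PBS every 35 us
GPT2_PARAMS = 1.5e9    # GPT2 parameter count (lower bound on ops/token)

seconds_per_token = GPT2_PARAMS * PBS_SECONDS
print(f"{seconds_per_token:,.0f} s/token")   # roughly 52,500 s/token
```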
Challenges of LLM and FHE
With recent advances in technology, the best fully homomorphic encryption implementations can perform one arithmetic operation every 35 microseconds. However, for a model like GPT2, a single token requires a staggering 1.5 billion operations. This means each token takes about 52,000 seconds to process.
To put this in context, for a language model a token roughly corresponds to a character or a word. Imagine interacting with a language model whose response takes a week or two! Such latency is clearly unacceptable for real-time communication or any practical application of the model.
This shows that with current fully homomorphic encryption technology, achieving real-time inference for large language models remains a huge challenge. Although fully homomorphic encryption is of great significance for data protection, its performance limitations on highly compute-intensive tasks may make it difficult to apply in practice. For workloads that demand real-time interaction and rapid response, it may be necessary to explore other secure-computation and privacy-protection approaches.
Potential Solutions
To make fully homomorphic encryption applicable to large language models, the following is a possible roadmap:
Use multiple machines for parallel processing:
Starting point: 52,000 seconds/token.
By deploying 10,000 machines in parallel, we reduce this to about 5 seconds/token. Note that large language models are indeed highly parallelizable; current inference is routinely performed in parallel across thousands of GPU cores or more.
Transitioning to advanced hardware:
Starting point: 5 seconds/token (after parallelization).
By switching to GPUs or ASICs, we can reach a processing time of 0.1 seconds per token. While GPUs offer a more immediate speed benefit, ASICs can deliver larger gains in both speed and power consumption, such as the ZPU mentioned in a previous blog post.
As these estimates show, private inference for large language models is possible with fully homomorphic encryption using existing acceleration techniques. It could be supported by a large but feasible initial investment in sufficiently large data centers. However, this possibility is still remote, and a gap remains for even larger models such as Copilot (12 billion parameters) or GPT3 (175 billion parameters).
For Copilot, a lower token throughput is sufficient because it generates code, which is usually more concise than natural language. If we relax the throughput requirement by a factor of 8, Copilot also reaches the feasibility target.
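Scaling the GPT2 estimate to Copilot makes the factor-of-8 argument explicit:

```python
# Scaling the GPT2 roadmap target to Copilot's parameter count.
gpt2_rate = 0.1                        # s/token target from the roadmap above
copilot_scale = 12e9 / 1.5e9           # Copilot has 8x more parameters than GPT2
copilot_rate = gpt2_rate * copilot_scale   # ~0.8 s/token at the same hardware level

# Code output is terser than natural language, so an ~8x lower throughput
# requirement brings Copilot back to the same effective target.
assert copilot_rate / copilot_scale == gpt2_rate
```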
The remaining gap can be closed by a combination of greater parallelization, better implementations, and more efficient bootstrapping algorithms for fully homomorphic encryption. At Ingonyama, we believe algorithms are a key component of bridging this gap, and our team is currently focused on researching and developing such algorithms.
Summary
Combining the security of fully homomorphic encryption with the computational power of large language models could redefine AI interactions, ensuring both efficiency and privacy. Although challenges remain, continued research and innovation can bring about a future in which interactions with AI models such as ChatGPT are both instant and private. This would give users a more efficient and secure experience and promote the adoption of AI across many fields.