Original article by: Mohit Pandit

Original source: IOSG Ventures

Summary

  • GPU shortages are real and supply is tight, but there are enough underutilized GPUs in the world to cover today's shortfall.

  • An incentive layer is needed to bootstrap the supply of cloud compute and then coordinate those resources for inference or training tasks. The DePIN model is well suited to this purpose.

  • The demand side finds this attractive because supply-side token incentives translate into lower compute costs.

  • Not everything is rosy: choosing a Web3 GPU cloud involves trade-offs such as latency and, compared with traditional GPU clouds, weaker guarantees around insurance and service-level agreements.

  • The DePIN model has the potential to solve the GPU availability problem, but a fragmented market will not improve the situation. When demand is growing exponentially, fragmented supply is effectively no supply.

  • Given the number of new market players, market aggregation is inevitable.

Introduction

We are on the cusp of a new era of machine learning and artificial intelligence. AI has existed in various forms for some time (in the broad sense, any computer instructed to do things humans can do, a washing machine being a trivial example), but we are now witnessing the emergence of sophisticated cognitive models capable of tasks that used to require intelligent human behavior. Notable examples include OpenAI's GPT-4 and DALL-E 2, and Google's Gemini.

In the rapidly growing field of artificial intelligence (AI), we must recognize the two sides of development: model training and inference. Inference covers an AI model's functions and outputs; training covers the complex process (machine learning algorithms, datasets, and computing power) required to build an intelligent model. Taking GPT-4 as an example, all the end user cares about is inference: getting output from the model based on a text input. The quality of that inference, however, depends on the training.

To train effective AI models, developers need access to comprehensive underlying datasets and enormous computing power. These resources are concentrated in the hands of industry giants, including OpenAI, Google, Microsoft and AWS. The formula is simple: better model training >> leads to enhanced inference capabilities >> attracts more users >> generates more revenue and, consequently, more resources for further training.

These major players have access to large underlying datasets and, crucially, control large amounts of computing power, creating barriers to entry for emerging developers. As a result, new entrants often struggle to acquire sufficient data or to tap the necessary computing power at an economically feasible scale and cost. With this scenario in mind, we see great value in networks that democratize access to resources, primarily by providing access to compute at scale and at lower cost.

GPU Supply Issues

NVIDIA CEO Jensen Huang said at CES 2019 that "Moore's Law is over". Today's GPUs are heavily underutilized: even during deep-learning training cycles they are not fully used. Here are typical GPU utilization numbers for different workloads:

  • Idle (just booted into Windows): 0-2%

  • General production tasks (writing, simple browsing): 0-15%

  • Video playback: 15 - 35%

  • PC games: 25 - 95%

  • Graphic design/photo editing active workloads (Photoshop, Illustrator): 15 - 55%

  • Video Editing (Active): 15 - 55%

  • Video Editing (Rendering): 33 - 100%

  • 3D Rendering (CUDA/OptiX): 33 - 100% (often misreported by Windows Task Manager; use GPU-Z)

Most consumer devices with GPUs fall into the first three categories.

GPU runtime utilization %. Source: Weights and Biases
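For readers who want to check these numbers on their own hardware, here is a minimal sketch of how utilization can be sampled programmatically. It shells out to nvidia-smi, so it assumes an NVIDIA GPU with drivers installed; the function name and parameters are illustrative.

    import subprocess
    import time

    def sample_gpu_utilization(seconds=10, interval=1.0):
        """Poll nvidia-smi for per-GPU utilization (%); assumes an NVIDIA GPU
        with drivers installed so that nvidia-smi is on the PATH."""
        samples = []
        for _ in range(int(seconds / interval)):
            out = subprocess.run(
                ["nvidia-smi", "--query-gpu=utilization.gpu",
                 "--format=csv,noheader,nounits"],
                capture_output=True, text=True, check=True,
            )
            samples.append([int(v) for v in out.stdout.split()])  # one value per GPU
            time.sleep(interval)
        return samples

    if __name__ == "__main__":
        print(sample_gpu_utilization(seconds=5))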

The situation above points to a problem: poor utilization of compute. Consumer GPU capacity needs to be put to better use; even at peak utilization it is suboptimal. This clearly defines two things to do going forward (a toy sketch of the first item follows the list):

  1. Resource (GPU) aggregation

  2. Parallelization of training tasks
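As a toy illustration of what resource aggregation might look like, here is a minimal sketch of a supply-side registry that matches idle GPUs to jobs. All names and fields are hypothetical; real DePIN networks layer incentives, verification, and on-chain state on top of this basic idea.

    from dataclasses import dataclass, field

    @dataclass
    class GPU:
        provider: str
        model: str        # e.g. "RTX 3060" or "A100"
        vram_gb: int
        busy: bool = False

    @dataclass
    class SupplyPool:
        """Toy supply-side registry: providers register idle GPUs, jobs reserve them."""
        gpus: list = field(default_factory=list)

        def register(self, gpu: GPU) -> None:
            self.gpus.append(gpu)

        def reserve(self, n: int, min_vram_gb: int = 0) -> list:
            # pick the first n idle GPUs that satisfy the memory requirement
            chosen = [g for g in self.gpus if not g.busy and g.vram_gb >= min_vram_gb][:n]
            for g in chosen:
                g.busy = True
            return chosen

    pool = SupplyPool()
    pool.register(GPU("alice", "RTX 3060", 12))
    pool.register(GPU("bob", "A100", 80))
    print([g.model for g in pool.reserve(n=1, min_vram_gb=40)])  # ['A100']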

In terms of the types of hardware that can be used, there are currently 4 types available:

  • Datacenter GPUs (e.g., Nvidia A100s)

  • Consumer GPUs (e.g., Nvidia RTX 3060)

  • Custom ASICs (e.g., Coreweave IPU)

  • Consumer SoCs (e.g., Apple M2)

Beyond ASICs (which are built for one specific purpose), the other hardware types can be pooled so that they are used most efficiently. With many of these chips in the hands of consumers and datacenters, a DePIN model that aggregates the supply side may be the way to go.

GPU production is a pyramid in terms of volume: consumer GPUs ship in the highest volumes, while premium GPUs such as NVIDIA's A100 and H100 ship in the lowest volumes (but deliver the highest performance). These premium chips cost 15x more to produce than consumer GPUs, yet sometimes do not deliver 15x the performance.

The entire cloud computing market is worth about $483 billion today and is expected to grow at a CAGR of roughly 27% over the next few years. In 2023, demand for ML compute reaches roughly 13 billion hours, which at current standard rates equates to about $56 billion of spending on ML compute in 2023. This market is also growing rapidly, doubling every 3 months.
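As a rough sanity check, and assuming the $56 billion figure is simply those 13 billion hours priced at prevailing rates, dividing one figure by the other gives the implied average price of ML compute:

    \frac{\$56\ \text{billion}}{13\ \text{billion GPU-hours}} \approx \$4.3\ \text{per GPU-hour}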

GPU Requirements

Compute demand comes mainly from AI developers (researchers and engineers). Their main requirements are price (low-cost compute), scale (large amounts of GPU compute), and user experience (easy access and use). Over the past two years, GPUs have been in huge demand, driven by the rise of AI-based applications and the development of ML models. Developing and running ML models requires the following (a rough sizing sketch follows the list):

  1. Heavy computation (from access to multiple GPUs or data centers)

  2. The ability to perform model training, fine-tuning, and inference, with each task deployed across a large number of GPUs in parallel
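To make "heavy computation" concrete, here is a rough sizing sketch that uses the common ~6 x parameters x tokens approximation for transformer training FLOPs. The model size, token count, per-GPU throughput (an A100's ~312 TFLOPS BF16 peak), and utilization factor below are illustrative assumptions, not figures from this article.

    def gpus_needed(params, tokens, days, gpu_flops=312e12, utilization=0.4):
        """Rough count of GPUs needed to finish a training run in `days`.
        Uses the common ~6 * params * tokens approximation for training FLOPs;
        throughput and utilization are illustrative assumptions."""
        total_flops = 6 * params * tokens
        flops_per_gpu = gpu_flops * utilization * days * 86_400  # seconds per day
        return total_flops / flops_per_gpu

    # e.g. a hypothetical 70B-parameter model trained on 1.4T tokens in 30 days
    print(round(gpus_needed(params=70e9, tokens=1.4e12, days=30)))  # ~1800 GPUs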

Computing-related hardware spending is expected to grow from $17 billion in 2021 to $285 billion in 2025 (approximately 102% CAGR), and ARK expects computing-related hardware spending to reach $1.7 trillion by 2030 (43% CAGR).

ARK Research

With a large number of LLMs still in the innovation phase, competition pushing models toward ever more parameters, and continual retraining, we can expect sustained demand for high-quality compute in the coming years.

As new GPU supplies tighten, where does blockchain fit in?

When resources are scarce, the DePIN model can help by:

  1. Bootstrapping the supply side to create a large pool of supply

  2. Coordinating and completing tasks

  3. Ensuring tasks are completed correctly

  4. Properly rewarding providers for work done

Aggregating any and every type of GPU (consumer, enterprise, high-performance, etc.) creates utilization problems of its own: A100-class chips should not be running simple computations when tasks are fragmented. GPU networks therefore need to decide which classes of GPU to admit to the network, based on their go-to-market strategy.

When the compute resources themselves are distributed (sometimes globally), either the user or the protocol has to choose which compute framework to use. Providers like io.net let users choose among 3 compute frameworks: Ray, Mega-Ray, or deploying a Kubernetes cluster to run compute tasks in containers. There are other distributed compute frameworks, such as Apache Spark, but Ray is the most commonly used. Once the selected GPUs complete their compute tasks, the outputs are reconstructed to yield a trained model.

A well-designed token model subsidizes compute costs for GPU providers, and many developers (the demand side) will find such a scheme more attractive. But distributed computing systems are inherently latency-prone: there is computational decomposition and output reconstruction, so developers must trade off the cost-effectiveness of training a model against the time required.
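To make the decomposition-and-reconstruction flow concrete, here is a minimal sketch using Ray (the framework the article names as most common). The shard contents and the per-shard function are placeholders, and the GPU resource request is only noted in a comment so the sketch also runs on a machine without GPUs.

    import ray

    ray.init()  # on a rented cluster this would point at the cluster, e.g. ray.init(address="auto")

    # In a real deployment each task would request a GPU via @ray.remote(num_gpus=1);
    # it is omitted here so the sketch runs anywhere.
    @ray.remote
    def process_shard(shard):
        # placeholder for per-GPU work (e.g., training or inference on one data shard)
        return sum(shard)

    shards = [[1, 2], [3, 4], [5, 6]]                             # decomposition: split the workload
    outputs = ray.get([process_shard.remote(s) for s in shards])  # reconstruction: gather partial results
    print(outputs)                                                # [3, 7, 11]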

Does a distributed computing system need its own chain?

Such a network charges in one of two ways:

  1. Per task (or per compute cycle)

  2. Per unit of time

First, one could build a proof-of-work chain similar to what Gensyn is attempting, where different GPUs share the "work" and are rewarded for it. For a more trustless model, Gensyn has the concept of verifiers and whistleblowers who are rewarded for maintaining the integrity of the system, based on proofs generated by solvers. Another proof-of-work style system is Exabits; instead of splitting tasks, it treats its entire network of GPUs as a single supercomputer. This model seems better suited to very large LLMs.

Akash Network has added GPU support and has started aggregating GPUs in this space. It has an underlying L1 to reach consensus on state (recording the work completed by GPU providers), a marketplace layer, and container-orchestration systems such as Kubernetes or Docker Swarm to manage the deployment and scaling of user applications.

A proof-of-work chain model works best if the system is to be trustless; it ensures coordination and the integrity of the protocol. On the other hand, systems like io.net are not built as chains. They choose to solve the core problem of GPU availability and charge customers per unit of time (per hour). They do not need a verifiability layer because they essentially "rent out" GPUs for customers to use as they please for a specific lease period. There is no task splitting in the protocol itself; it is done by developers using open-source frameworks such as Ray, Mega-Ray, or Kubernetes.
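To give a feel for the verification idea, here is a toy sketch of a generic spot-check pattern: a verifier re-executes a random sample of submitted tasks and flags mismatches. This illustrates only the general idea, not Gensyn's actual proof system, and it glosses over the fact that re-running ML workloads bit-exactly is itself hard due to floating-point nondeterminism; all names and parameters are hypothetical.

    import random

    def spot_check(tasks, reported_outputs, recompute, sample_rate=0.25, seed=0):
        """Re-run a random sample of tasks and return the indices whose reported
        outputs do not match a trusted re-execution (toy verifier)."""
        rng = random.Random(seed)
        k = max(1, int(len(tasks) * sample_rate))
        flagged = []
        for i in rng.sample(range(len(tasks)), k):
            if recompute(tasks[i]) != reported_outputs[i]:
                flagged.append(i)
        return flagged

    # Toy usage: the "work" is doubling a number; one provider reports a wrong answer.
    tasks = [1, 2, 3, 4]
    reported = [2, 4, 7, 8]  # index 2 is wrong
    print(spot_check(tasks, reported, recompute=lambda x: 2 * x, sample_rate=1.0))  # [2]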

Web2 and Web3 GPU Cloud

Web2 has many players in the GPU cloud or GPU as a service space. The main players in this space include AWS, CoreWeave, PaperSpace, Jarvis Labs, Lambda Labs, Google Cloud, Microsoft Azure, and OVH Cloud. This is a traditional cloud business model where customers rent a GPU (or multiple GPUs) by time unit (usually one hour) when they need computing. There are many different solutions for different use cases.

The main differences between Web2 and Web3 GPU clouds are in the following parameters:

1. Cloud setup costs

Thanks to token incentives, the cost of bootstrapping a GPU cloud is significantly reduced. OpenAI is reportedly raising $1 trillion for the production of computing chips; it seems that, without token incentives, beating the market leaders would take something on the order of $1 trillion.

2. Compute time

Non-Web3 GPU clouds will be faster because the rented GPU clusters sit within one geographic region, whereas the Web3 model is likely to be far more widely distributed, with latency coming from inefficient problem partitioning, load balancing, and, most importantly, bandwidth.

3. Compute costs

Thanks to token incentives, the cost of Web3 compute will be significantly lower than in the existing Web2 model. Compute cost comparison:

These numbers may change as more clusters are provisioned and utilized to offer these GPUs. Gensyn claims to offer A100s (and their equivalents) for as little as $0.55 per hour, and Exabits promises a similar cost-saving structure.
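To illustrate what such hourly rates mean for a whole job, the sketch below compares a hypothetical 1,000 GPU-hour workload at the $0.55/hour figure quoted above against an assumed $4.00/hour centralized-cloud rate, with an overhead multiplier standing in for the extra wall-clock time a geographically distributed cluster might need. The baseline rate, job size, and overhead factor are all illustrative assumptions, not figures from this article.

    def job_cost(gpu_hours, rate_per_hour, overhead=1.0):
        # `overhead` inflates GPU-hours to account for decomposition,
        # reconstruction, and network latency (assumed factor).
        return gpu_hours * overhead * rate_per_hour

    web2 = job_cost(1_000, rate_per_hour=4.00)                  # assumed centralized-cloud rate
    web3 = job_cost(1_000, rate_per_hour=0.55, overhead=1.3)    # quoted rate + assumed 30% overhead
    print(f"Web2: ${web2:,.0f}  Web3: ${web3:,.0f}")            # Web2: $4,000  Web3: $715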

4. Compliance

Compliance is not easy in a permissionless system. However, Web3 systems such as io.net and Gensyn do not position themselves as permissionless systems. Compliance issues such as GDPR and HIPAA are handled during the GPU onboarding, data loading, data sharing, and result sharing stages.

Ecosystem

Review: io.net, Exabits, Akash

Risks

1. Demand risk

I think the top LLM players will either continue to accumulate GPUs or use GPU clusters like NVIDIA's Selene supercomputer, which has a peak performance of 2.8 exaFLOP/s. They will not rely on consumers or long-tail cloud providers for pooled GPUs. Today, the top AI organizations compete on quality more than on cost. For non-heavy ML models, however, they will seek cheaper compute resources, such as blockchain-based, token-incentivized GPU clusters that can serve this demand while putting existing GPUs to better use (this assumes those organizations prefer to train their own models rather than use an off-the-shelf LLM).

2. Supply risk

With massive amounts of capital going into ASIC research, and inventions like the Tensor Processing Unit (TPU), the GPU supply problem may resolve itself. If these ASICs offer a good performance-to-cost trade-off, the GPUs currently hoarded by large AI organizations may return to the market. Do blockchain-based GPU clusters solve a long-term problem? While a blockchain can support any chip, not just GPUs, what the demand side does will entirely determine the direction of projects in this space.

Conclusion

A fragmented network of small GPU clusters will not solve the problem; there is no room for a "long tail" of GPU clusters. GPU providers (retail or smaller cloud players) will gravitate towards the larger networks because the incentives there are better. Which networks win will be a function of a good token model and of the supply side's ability to support multiple compute types.

GPU clusters may see the same aggregation fate as CDNs. If large players want to compete with incumbents like AWS, they may start sharing resources to reduce network latency and bring nodes geographically closer together.

If the demand side keeps growing (more models to train, with more parameters per model), Web3 players will have to be very aggressive in supply-side business development. If too many clusters compete for the same customer base, supply will fragment (which defeats the whole concept) while demand (in TFLOPs) grows exponentially.

io.net has already stood out from the crowd of competitors by starting with an aggregator model: it aggregates GPUs from Render Network and Filecoin miners, providing capacity while also bootstrapping supply on its own platform. This may well be the winning direction for DePIN GPU clusters.