With the rapid evolution of AI models, serving these large models efficiently at inference time has become a key issue the industry cannot avoid. vLLM, an open-source project born at UC Berkeley, not only confronts this technical challenge head-on but has gradually built its own community and ecosystem, even giving rise to Inferact, a startup focused on inference infrastructure. This article traces the origins of vLLM, its technical breakthroughs, the growth of its open-source community, and how Inferact aims to build a 'universal engine for AI inference.'

From academic experiments to GitHub star projects: The birth of vLLM

vLLM began as a doctoral research project at UC Berkeley aimed at solving the inefficiency of inference in large language models (LLMs). At the time, Meta had open-sourced the OPT model, and Woosuk Kwon, one of vLLM's early contributors, tried to optimize that model's demo service and found that it concealed an unsolved inference-systems challenge. "We thought it would only take a few weeks, but it opened up a whole new avenue for research and development," Kwon recalled.

Bottom-up challenges: Why is LLM inference different from traditional ML?

vLLM targets auto-regressive language models, whose inference is dynamic and asynchronous and cannot be statically batched the way traditional image or speech models can. Input length can range from a single sentence to hundreds of pages of documents, requiring precise allocation of GPU memory, while token-level scheduling of computation and management of the KV cache become exceptionally complex.
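The dynamics described above can be sketched in plain Python. The toy loop below uses no real model (the "next token" is a deterministic stand-in), but it shows the two properties that make LLM serving hard: the KV cache grows by one entry per generated token, and requests in the same batch finish at unpredictable, different lengths:

```python
# Toy autoregressive decode loop. Each step appends one token's worth of
# state to the KV cache, so memory grows with output length, and generation
# stops at a data-dependent point rather than after a fixed number of steps.
def decode(prompt_tokens, max_new_tokens, stop_token=0):
    kv_cache = list(prompt_tokens)    # stand-in for per-token key/value tensors
    output = []
    for _ in range(max_new_tokens):
        # A real model would attend over the entire kv_cache here; we fake
        # the "next token" deterministically purely for illustration.
        next_token = (sum(kv_cache) + len(kv_cache)) % 7
        kv_cache.append(next_token)   # cache grows by one entry per step
        output.append(next_token)
        if next_token == stop_token:  # requests end at different lengths
            break
    return output, len(kv_cache)

out, cache_len = decode([3, 1, 4], max_new_tokens=16)
print(out, cache_len)
```

Because the cache length at any moment equals prompt length plus tokens generated so far, a server cannot know up front how much memory a request will ultimately need, which is exactly the problem PagedAttention addresses.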

An important technical breakthrough of vLLM is "PagedAttention," a design inspired by virtual-memory paging in operating systems that helps the engine manage GPU memory more effectively and cope with diverse requests and long output sequences.
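The core idea can be sketched in a few lines. This is a simplified illustration, not vLLM's actual implementation (the block size, class, and method names here are invented for the example): each request's logical token positions map through a block table to fixed-size physical blocks, so memory is allocated on demand instead of being reserved for the maximum possible sequence length:

```python
# Minimal sketch of paged KV-cache bookkeeping: a per-request block table
# maps logical token positions to fixed-size physical blocks, allocated
# only when the previous block fills up and returned when the request ends.
BLOCK_SIZE = 4  # tokens per block (illustrative value only)

class PagedKVCache:
    def __init__(self, num_physical_blocks):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables = {}   # request_id -> list of physical block ids
        self.lengths = {}        # request_id -> tokens written so far

    def append_token(self, request_id):
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length % BLOCK_SIZE == 0:          # current block is full:
            table.append(self.free_blocks.pop())  # allocate a new one
        self.lengths[request_id] = length + 1

    def free(self, request_id):
        # A finished request returns all its blocks to the shared pool.
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)

cache = PagedKVCache(num_physical_blocks=8)
for _ in range(6):               # 6 tokens fill 1 block and start a 2nd
    cache.append_token("a")
print(cache.block_tables["a"], len(cache.free_blocks))
```

Because blocks are small and shared from one pool, a short request never ties up memory sized for a long one, which is what lets vLLM pack many more concurrent requests onto the same GPU.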

Not just writing code: A key moment from campus to the open-source community

The vLLM team held its first open-source meetup in Silicon Valley in 2023, initially expecting only a handful of attendees; however, the number of registrations far exceeded expectations, crowding the venue and becoming a turning point in community development.

Since then, the vLLM community has grown rapidly: it now counts more than 50 regular committers and over 2,000 GitHub contributors, making it one of the fastest-growing open-source projects today, with backing from Meta, Red Hat, NVIDIA, AMD, AWS, Google, and others.

Multiple forces competing in the arena: Building the "operating system for AI"

One key to vLLM's success is that it provides a common platform for model developers, chip manufacturers, and application developers. Instead of every model having to be ported to every piece of hardware pairwise, each party only needs to integrate with vLLM to achieve broad compatibility between models and hardware.

This also means that vLLM is trying to create an "operating system for AI": allowing all models and all hardware to run on the same universal inference engine.

Is inference becoming increasingly difficult? The triple pressure of scale, hardware, and agent intelligence

Today's inference challenges are constantly upgrading, including:

  1. Model scale has exploded: from the initial hundreds of billions of parameters to today's trillion-scale models, the computational resources required for inference have also risen sharply.

  2. Model and hardware diversity: Although most models share the Transformer architecture, their internal details diverge increasingly, with variants like sparse attention and linear attention emerging one after another.

  3. Rise of agent systems: Models are no longer answering a single turn but participating in continuous dialogues, calling external tools, executing Python scripts, and so on. The inference layer must maintain state over long periods and handle asynchronous inputs, further raising the technical bar.
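The token-level scheduling these pressures force can be illustrated with a toy continuous-batching loop, a simplification of what engines like vLLM do: requests join and leave the running batch between individual decode steps, rather than the whole batch waiting for its longest member to finish:

```python
from collections import deque

# Toy continuous batching: at every decode step, finished requests free
# their slot immediately and waiting requests are admitted, so short
# requests are not stuck behind long ones in a static batch.
def run(requests, max_batch=2):
    waiting = deque(requests)     # each request: (request_id, tokens to generate)
    running = {}                  # request_id -> tokens remaining
    finish_order = []
    while waiting or running:
        while waiting and len(running) < max_batch:  # admit new work
            rid, n = waiting.popleft()
            running[rid] = n
        for rid in list(running):                    # one decode step for all
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]                     # slot frees at once
                finish_order.append(rid)
    return finish_order

order = run([("long", 5), ("short", 1), ("mid", 3)])
print(order)  # → ['short', 'mid', 'long']
```

With static batching, "mid" could not start until "long" finished; here it slips into the slot "short" vacates after a single step, which is why per-token scheduling matters so much for throughput and latency.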

Entering practical application: Cases of vLLM being deployed on a large scale

vLLM is not just an academic toy; it is in production at major platforms such as Amazon, LinkedIn, and Character AI. For example, Amazon's smart shopping assistant "Rufus" runs on vLLM, which serves as the inference engine behind its shopping search.

Some engineers deployed one of vLLM's features to hundreds of GPUs while it was still under development, a sign of the high level of trust the community has earned.

The company behind vLLM: The role and vision of Inferact

To drive vLLM's further development, its core developers founded Inferact with backing from multiple investors. Unlike a typical commercial company, Inferact treats open source as its primary mission. Co-founder Simon Mo put it this way: "Our company exists to make vLLM the global standard inference engine." Inferact's business model revolves around maintaining and expanding the vLLM ecosystem while providing enterprise-grade deployment and support, running the business and the open-source project on parallel tracks.

Inferact is actively recruiting engineers with experience in ML infrastructure, especially those skilled in large model inference, distributed systems, and hardware acceleration. For developers pursuing technical challenges and deep system optimization, this is an opportunity to participate in the next generation of AI infrastructure.

The team's expectation is to create an "abstraction layer" similar to an OS or database, allowing AI models to run seamlessly across diverse hardware and application scenarios.

The article "Building the universal inference layer for AI: how the vLLM open-source project set its sights on becoming a global inference engine" first appeared in Chain News (ABMedia).