Article reprinted from: Machine Heart
Source: Synced
The open source community is in for a treat.
Image source: Generated by Unbounded AI
As promised, the open source version of Musk's large model Grok is finally here!
Early this morning, Musk's large model company xAI announced that it has officially open-sourced Grok-1, a 314-billion-parameter mixture-of-experts (MoE) model, along with the model's weights and network architecture.
This makes Grok-1 the open source large language model with the largest parameter count to date.
Cover image generated using Midjourney following a Grok prompt: 3D illustration of a neural network, with transparent nodes and glowing connections, showing different weights as connecting lines of varying thickness and color.
Naturally, Musk did not pass up the chance to mock OpenAI: "We want to learn more about the open part of OpenAI."
Back to the model itself: Grok-1 was trained from scratch and has not been fine-tuned for any specific application (such as dialogue). By contrast, the Grok model available on X (formerly Twitter) is a fine-tuned version, and its behavior differs from that of the raw-weight release.
The model details of Grok-1 include the following:
The base model is trained on a large amount of text data and is not fine-tuned for any specific task;
A 314-billion-parameter MoE model, with 25% of the weights active on any given token;
Trained from scratch by xAI in October 2023 using a custom training stack built on JAX and Rust.
xAI has open-sourced the weights and architecture of Grok-1 under the Apache 2.0 license, which allows users to freely use, modify, and distribute the software for both personal and commercial purposes. Within just four hours of release, the project had already received 3.4k stars, and the count is still climbing.
Project address: https://github.com/xai-org/grok-1
This repository contains JAX sample code for loading and running the Grok-1 open-weight model. Before using it, users need to download the checkpoint and place the ckpt-0 directory in checkpoints. Then run the following to test:
    pip install -r requirements.txt
    python run.py
The project description makes clear that, since Grok-1 is a large (314B-parameter) model, a machine with sufficient GPU memory is required to run it with the sample code. In addition, the implementation of the MoE layer in this repository is not very efficient; it was chosen to avoid the need for custom kernels when validating the model's correctness.
Users can use a Torrent client and this magnet link to download the weights file:
magnet:?…%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce
Seeing this, some netizens began to wonder what kind of hardware is needed to run the 314B-parameter Grok-1. Someone gave the answer: it may require a machine with about 628 GB of GPU memory (2 bytes per parameter), so 8×H100 (80 GB each) would be enough.
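A quick back-of-the-envelope calculation (our own rough sketch, not an official requirement) makes that estimate concrete:

    # Rough memory estimate for holding the Grok-1 weights in bf16 (2 bytes per parameter).
    # This counts the weights only; activations and the KV cache need additional memory.
    num_params = 314e9                    # 314B parameters
    bytes_per_param = 2                   # bf16
    weight_gb = num_params * bytes_per_param / 1e9
    print(f"Weights alone: ~{weight_gb:.0f} GB")       # ~628 GB
    print(f"8 x H100 (80 GB each): {8 * 80} GB total") # 640 GB, just enough for the weights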
Sebastian Raschka, a well-known machine learning researcher and author of the best-selling book "Python Machine Learning", commented: "Grok-1 is more open source than other open weight models that usually come with usage restrictions, but it is not as open source as Pythia, Bloom, and OLMo, which come with training code and reproducible datasets."
DeepMind research engineer Aleksa Gordić predicts that Grok-1 should be more capable than LLaMA-2, though it is unclear how much of the data was contaminated. In addition, the parameter counts of the two models are not of the same order of magnitude.
Another Twitter user, @itsandrewgao, gave a detailed breakdown of Grok-1's architecture and drew the following conclusions.
First, Grok-1 is a mixture of 8 experts (2 active per token), with 86 billion active parameters (more than the entire 70B of Llama 2), and it uses rotary position embeddings (RoPE) instead of fixed position embeddings.
The tokenizer vocabulary size is 131,072 (2^17, similar to GPT-4); the embedding size is 6,144 (48×128); there are 64 transformer layers (sheesh!), each a decoder layer consisting of a multi-head attention block and a dense block, with a key-value size of 128.
Multi-head attention block: 48 heads for queries and 8 for keys/values (KV), with a KV size of 128. Dense block (dense feed-forward block): widening factor of 8, hidden layer size of 32,768. Each token selects 2 of the 8 experts.
The rotary position embedding size is 6,144, the same as the input embedding size; the context length is 8,192 tokens, and the precision is bf16.
Some weights are also provided in 8-bit quantized form.
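Putting the numbers above together, here is a minimal sketch (based on the figures reported in the thread, not xAI's code) of the reported configuration, along with a toy illustration of routing each token to 2 of the 8 experts:

    # Hyperparameters as reported by @itsandrewgao (unofficial; for illustration only)
    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class Grok1ConfigSketch:
        vocab_size: int = 131_072        # 2**17
        embed_size: int = 6_144          # 48 * 128
        num_layers: int = 64
        num_query_heads: int = 48
        num_kv_heads: int = 8
        head_dim: int = 128              # key/value size
        ffn_hidden_size: int = 32_768    # dense block hidden size (widening factor 8)
        num_experts: int = 8
        experts_per_token: int = 2
        context_length: int = 8_192

    def top2_route(router_logits: np.ndarray) -> np.ndarray:
        """Return the indices of the 2 highest-scoring experts for each token."""
        return np.argsort(router_logits, axis=-1)[..., -2:]

    cfg = Grok1ConfigSketch()
    logits = np.random.randn(4, cfg.num_experts)   # router scores for 4 toy tokens
    print(top2_route(logits))                      # 2 expert indices per token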
Of course, we still hope that xAI officials can announce more model details of Grok-1 as soon as possible.
What model is Grok-1 and what are its capabilities?
Grok is a large language model launched by Musk's xAI team last November. In its official blog post at the time (see "Musk's xAI announces detailed progress on its large model: Grok was trained in only 2 months"), xAI wrote:
Grok is an AI modeled after The Hitchhiker's Guide to the Galaxy that can answer almost any question and, even better, can suggest what questions to ask! Grok answers with a bit of wit and a rebellious streak, so if you hate humor, don't use it! A unique and fundamental advantage of Grok is that it learns about the world in real time through the X platform. It will also answer spicy questions that are rejected by most other AI systems. Grok is still a very early beta product - the best we could achieve with two months of training - so expect it to improve quickly in testing with your help.
xAI said that the research and development of Grok-1 took four months, during which time Grok-1 went through multiple iterations.
After announcing the founding of xAI, they trained a 33-billion-parameter LLM prototype, Grok-0. This early model approached the capabilities of LLaMA 2 (70B) on standard LM benchmarks while using only half the training resources. They then made significant improvements to the model's reasoning and coding capabilities, ultimately developing Grok-1, a more powerful SOTA language model that achieved 63.2% on the HumanEval coding task and 73% on MMLU.
xAI conducted a series of evaluations on Grok-1 using a number of standard machine learning benchmarks designed to measure mathematical and reasoning capabilities:
In these benchmarks, Grok-1 showed strong performance, surpassing all other models in its compute class, including ChatGPT-3.5 and Inflection-1. It was surpassed only by models trained with far larger amounts of data and compute, such as GPT-4. xAI said this demonstrates its rapid progress in training LLMs efficiently.
However, xAI also noted that since these benchmarks can be found online, it cannot rule out that the model was inadvertently trained on some of this data. Therefore, it hand-graded its model (along with Claude-2 and GPT-4) on the 2023 Hungarian national high school mathematics final exam, which was published at the end of May last year, after the training data had been collected. Grok passed the exam with a C (59%), Claude-2 achieved a similar score (55%), and GPT-4 earned a B with 68%. xAI stated that it did not specifically prepare or tune the model for this exam.
The following table shows more information about Grok-1 (from a blog post in November 2023, some information may have been updated):
Model details: Grok-1 is a Transformer-based autoregressive model. xAI fine-tuned the model using a lot of feedback from humans and the earlier Grok-0 model. The initial Grok-1 is able to handle a context length of 8192 tokens. The model was released in November 2023.
Intended use: Grok-1 will serve as the engine behind Grok for natural language processing tasks including question answering, information retrieval, creative writing, and coding assistance.
Limitations: While Grok-1 performs well at processing information, it is critical to have humans review Grok-1's work to ensure accuracy. The Grok-1 language model does not have the ability to search the web on its own. Equipping Grok with search tools and databases enhances the model's capabilities and factual grounding. Even with access to external information sources, however, the model can still hallucinate.
Training data: The training data used for the Grok-1 release comes from internet data as of the third quarter of 2023 and data provided by xAI’s AI trainers.
Evaluation: xAI evaluated Grok-1 on a range of reasoning benchmark tasks and foreign math exam questions. It worked with early alpha testers to evaluate a version of Grok-1, including adversarial testing. Currently, Grok is in closed beta, with access granted to a limited number of early users to further expand the testing pool.
In the blog post, xAI also outlined the engineering work behind Grok and the company's general research directions, among which long-context understanding and retrieval, as well as multimodal capabilities, are directions to be explored in the future.
xAI said its vision in building Grok is to create AI tools that help humanity in its pursuit of understanding and knowledge.
Specifically, they hope to achieve the following goals:
Gather feedback to ensure that the AI tools they build benefit all of humanity as much as possible. They believe it is important to design AI tools that are useful to people of all backgrounds and political views. They also hope to empower users through their AI tools while remaining within the bounds of the law. Grok's goal is to explore and publicly demonstrate this approach;
Empowering research and innovation: They want Grok to be a powerful research assistant for everyone, helping them quickly access relevant information, process data, and come up with new ideas.
Their ultimate goal is for their AI tools to aid people in their quest for understanding.
On the X platform, the open-sourcing of Grok-1 has sparked plenty of discussion. Notably, the technical community pointed out that the model uses GeGLU in its feed-forward layers and an interesting sandwich-norm technique for normalization. Even OpenAI employees have posted expressing their interest in the model.
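For readers unfamiliar with the two terms, here is a minimal NumPy sketch (our own illustration under simplified assumptions, not xAI's implementation) of a GeGLU feed-forward block wrapped in sandwich-style normalization, i.e. normalizing both on the way into and out of the sub-block:

    import numpy as np

    def gelu(x):
        # tanh approximation of GELU
        return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

    def rms_norm(x, eps=1e-6):
        return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)

    def geglu_ffn(x, w_gate, w_up, w_down):
        # GeGLU: GELU-gated linear unit, then project back to the model dimension
        return (gelu(x @ w_gate) * (x @ w_up)) @ w_down

    def sandwich_ffn_block(x, w_gate, w_up, w_down):
        # "Sandwich" norm: normalize before AND after the sub-block, then add the residual
        return x + rms_norm(geglu_ffn(rms_norm(x), w_gate, w_up, w_down))

    d_model, d_hidden = 8, 32                       # toy sizes for illustration
    x = np.random.randn(4, d_model)
    w_gate = np.random.randn(d_model, d_hidden)
    w_up = np.random.randn(d_model, d_hidden)
    w_down = np.random.randn(d_hidden, d_model)
    print(sandwich_ffn_block(x, w_gate, w_up, w_down).shape)   # (4, 8)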
However, there are some things the open source version of Grok cannot yet do, such as "understanding the world in real time through the X platform". To get that capability, you still need to subscribe to the paid version deployed on X.
Given Musk's favorable attitude toward open source, some developers are already looking forward to subsequent versions being open-sourced as well.