Article reprinted from: Machine Heart
Unexpectedly, this is how ByteDance’s large model project came to light.
Last weekend, foreign media reported that OpenAI had disabled ByteDance’s account for violating its terms of service by using OpenAI technology to develop its own large language model.
According to The Verge, the large language model project that ByteDance is developing internally is called "Project Seed".
Because training large models requires large amounts of question-and-answer data, the project was revealed to have been secretly using OpenAI's technology to enrich its data set.
In the field of large models, the "cheating" practice of training on content generated by another AI is not uncommon, but it is often considered to cross the line. For ChatGPT, such misuse of AI-generated data directly violates OpenAI's terms of service, which stipulate that its model output cannot be used to "develop any artificial intelligence model that competes with our products and services."
The November 14 update of OpenAI's terms for ChatGPT and DALL·E also stipulates that users:
May not reverse engineer, decompile, or attempt to extract or steal its models and systems;
May not extract generated content by automatic or programmatic means;
May not pass off ChatGPT-generated content as human-generated content.
OpenAI handles violations by terminating service after notifying the user.
Full Agreement: https://openai.com/policies/business-terms
So what exactly is ByteDance's "Project Seed", and how is it suspected of violating OpenAI's terms of use?
According to internal documents obtained by The Verge, ByteDance relied on OpenAI's technology more heavily in the early stages of "Project Seed", and a few months ago instructed the team to stop using GPT-generated text at any stage of model development. Around the same time, ByteDance released its own large AI model product, Doubao.
The employees involved were well aware of what they were doing and discussed how to conceal it through "data desensitization." Even so, they still frequently hit the maximum access limit of the OpenAI API.
On Friday local time, OpenAI said that ByteDance's account had been suspended.
OpenAI spokesperson Niko Felix said in a statement to The Verge, “All API customers must comply with OpenAI’s Terms of Use to ensure that our technology is used appropriately. While ByteDance’s use of our API is minimal, we have suspended their account while we investigate further. If it is ultimately found that ByteDance’s use is not in compliance with policy, they will be asked to make necessary changes or terminate their account.”
ByteDance spokesperson Jodi Seth responded, denying any wrongdoing by the company and clarifying that it had obtained permission to use the GPT API.
She said, "ByteDance has obtained authorization from Microsoft to use the GPT API. The data generated by GPT was only used to annotate the model in the early development of Project Seed and was removed from ByteDance's training data in the middle of this year. We use GPT to support products and features in non-Chinese markets, and use our own models to support Doubao in the Chinese market."
Image source: https://the-decoder.com/openai-bans-tiktok-company-bytedance-from-chatgpt-due-to-possible-data-theft/
At the same time, Microsoft spokesman Frank Shaw also issued a statement, "AI solutions like Azure OpenAI services are part of our limited access framework, and all customers must apply for and be approved by Microsoft before they can access them. We set standards and provide resources to help customers use these technologies responsibly and comply with relevant terms of service. We also have processes in place to detect abuse and stop companies from accessing our services when they violate our guidelines."
On December 17, a ByteDance representative responded to Machine Heart’s request for comment, saying that the company requires compliance with OpenAI’s terms of use when using its services, and that it is in contact with OpenAI to clear up any misunderstandings that external reports may have caused.
Here is ByteDance’s description of its use of OpenAI services:
1. At the beginning of this year, when the technical team had just started exploring large models, some engineers applied GPT's API service to experimental research on a smaller model. The model was only tested, was never planned for launch, and was never used externally. This practice was stopped after the company introduced GPT API call specification checks in April.
2. As early as April this year, the ByteDance large model team set a clear internal requirement that data generated by GPT models must not be added to the training data set of ByteDance's models, and trained the engineering team to comply with the terms of service when using GPT.
3. In September, the company conducted another round of internal inspections and took measures to further ensure that API calls to GPT complied with the requirements. For example, batch sampling was used to test the similarity between model training data and GPT output, to prevent data labelers from using GPT privately.
4. In the next few days, we will conduct another comprehensive review to ensure strict compliance with the terms of use of relevant services.
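The batch-sampling similarity check mentioned in the third point can be illustrated with a minimal sketch. ByteDance has not disclosed how its check works, so everything below is an assumption for illustration only: the function names, the character n-gram Jaccard measure, the sample size, and the 0.8 threshold are all hypothetical.

```python
import random


def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Character n-grams of a lowercased, whitespace-normalized string."""
    normalized = " ".join(text.lower().split())
    if len(normalized) < n:
        return {normalized} if normalized else set()
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def flag_similar_samples(training_records, reference_outputs,
                         sample_size=100, threshold=0.8, seed=0):
    """Draw a random batch of training records and flag those that closely
    match any reference output (hypothetical stand-ins for known GPT text)."""
    rng = random.Random(seed)  # fixed seed so the sampled batch is reproducible
    batch = rng.sample(training_records, min(sample_size, len(training_records)))
    refs = [char_ngrams(r) for r in reference_outputs]
    flagged = []
    for record in batch:
        grams = char_ngrams(record)
        # Best match of this record against all reference outputs.
        best = max((jaccard(grams, ref) for ref in refs), default=0.0)
        if best >= threshold:
            flagged.append((record, best))
    return flagged


if __name__ == "__main__":
    records = ["The capital of France is Paris.", "my cat likes tuna"]
    references = ["The capital of France is Paris."]
    for text, score in flag_similar_samples(records, references, sample_size=2):
        print(f"{score:.2f}  {text}")
```

A set-overlap measure like this only catches near-verbatim reuse; a real audit would likely need something more robust to paraphrasing.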
Since the emergence of ChatGPT, major technology companies have been stepping up the development of competing products to match it. However, because its business is focused on consumer (C-end) and overseas markets, where it faces more technical and regulatory challenges, ByteDance has been relatively low-key in promoting large models. In June of this year, Volcano Engine released the large model platform Volcano Ark. In August, ByteDance's self-developed large model "Skylark" completed its regulatory filing, and external testing of the AI dialogue product "Doubao" began.
In terms of technology and practical applications, generative AI has made great progress this year, but people still have some concerns about issues such as security and privacy protection.
References:
https://www.theverge.com/2023/12/15/24003151/bytedance-china-openai-microsoft-competitor-llm
https://www.businessinsider.com/bytedance-openai-tech-artificial-intelligence-tiktok-sam-altman-2023-12