Article reprint source: Model Evolution
Original source: Tech Planet
Image source: Generated by Unbounded AI
Zheng Wen still remembers the afternoon a few months ago when she earned 20 cents in an hour. A graduate of a technical college in Hunan, she works as a large model data annotator. Her daily work is not complicated: she adds labels to the raw data (images, videos, text, and so on) she receives.
But large models place very high demands on data quality. That day, one picture was sent back for revision 8 times before it passed. The whole process took an hour. In other words, she earned only 0.2 yuan in that hour, whereas under normal circumstances she could draw 600 frames and earn 12 yuan. "The money is not easy to make," she emphasized repeatedly.
Almost all data annotation practitioners share this view. On one side of the industry, practitioners earn less than 5,000 yuan a month and build the foundation of large models like ants. On the other side, Internet giants dream of AI and hope to surpass GPT-4.
Data labeling pays wages by the most primitive piece-rate system, and there is no office politics. The only trouble is that the work is so boring that most people find it hard to last three months. Almost everyone Tech Planet spoke to advised against entering the industry.
But what they don't know is that before long, most of them may lose even this boring job, because simple annotation tasks will be taken over by AI.
From 50 cents to 4 cents, the price plummeted
Lin Shuang made a lot of "quick money" in 2017: more than 6,000 yuan in 15 days. For a recent college graduate, that was a considerable income. At the time, expectations for AI were skyrocketing and almost no one doubted its future. Investment institutions firmly believed that companies worth billions, tens of billions, or even hundreds of billions of yuan could be born here.
Behind almost every AI technology is a competition over algorithms, computing power, and data, and huge amounts of data underpin the quality of the technology. Programmers with glamorous backgrounds sit in offices in Beijing, Shanghai, and Guangzhou, sketching AI blueprints by iterating on algorithms in code, while college students and mothers work in cubicles in third- and fourth-tier cities, processing the pictures, text, and voice recordings in huge data packages.
ChatGPT is no exception. An employee on Baidu's Wenxin Yiyan project team said that large models themselves involve no new technology and no especially high technical barrier. The real barrier is the number of parameters, which in turn is limited by computing power.
Data labelers in the era of large models are not much different from those of the past. The only differences may be a more comfortable office environment and higher requirements for labeling quality. A data labeling practitioner told Tech Planet that newcomers are usually placed in a team of about 10 people, one of whom is responsible for quality inspection. If work is not up to standard, it is sent back to the employee to be redone. The quality of the data determines the quality of the large model.
Data workers don't care which new branches of AI technology emerge. They care more about the unit price, because their wages are calculated on a piece-rate basis.
"When the unit price was high, a single 2D frame paid more than 10 cents. At my peak, I worked more than 10 hours and earned more than 600 yuan in a day," Lin Shuang recalled. Even that was not the highest rate. One labeler said that in the early days, a 2D frame could pay as much as 50 cents.
Frame drawing is a common operation in data annotation. The annotator marks the objects in the picture, such as vehicles, red lights, obstacles, etc. according to the requirements. Frame drawing is divided into 2D and 3D, and the latter is more expensive.
But this enthusiasm did not last long. As more and more people poured in and the AI industry as a whole developed unevenly, the unit price for labeling an image fell lower and lower. Lin Shuang said the lowest price is now only 4 cents.
"For frame drawing, the average unit price in the industry is around 0.15 yuan, but it still depends on the project. If you want to take first-hand orders yourself, the minimum requirement is probably 100 employees, which is a large operation. At that scale a 3D frame may pay 30 cents each, but it's rare to get 50 cents."
Of course, if you have professional knowledge in medicine or finance, the price will be higher. For example, many large medical models require the annotator to have clinical expertise and relevant work experience.
Most practitioners earn no more than 5,000 yuan a month, but there are a few lucky ones. Yang Shuo used to run a clothing store in Sichuan, but the epidemic hurt his business. He switched to large model data annotation this year and now earns 8,000 yuan a month. "I signed a contract with the company and paid a franchise fee of 9,500 yuan. The contract states that the minimum monthly income is 7,000 yuan."
Who made the money?
Internet giants such as Alibaba, Tencent, and ByteDance, as well as car companies such as SAIC and Lynk & Co., are where data labeling orders originate. Data labeling companies that want to obtain orders directly from the source at the best price need to reach a certain scale.
An employee of a data labeling company told Tech Planet that the company gets orders directly from large companies, but those companies require it to have 500 people, so it meets the headcount requirement through franchised studios or subsidiaries.
The difference between the two is that franchising suits beginners who want to set up a studio, whereas there is usually only one subsidiary per region. A beginner's studio pays a franchise fee of 25,000 or 30,000 yuan. A subsidiary is the exclusive agent in its region and pays a fee of 50,000 yuan. In return, the company guarantees sufficient orders and provides technical training for three years. These studios and subsidiaries together form a large union, with anywhere from hundreds to thousands of members.
The same employee said that the popularity of large models has pushed the data labeling industry into another boom, and that prospective partners now visit the company almost every day.
But in fact, it is not easy to run a data labeling business. The pitch from data labeling companies is that the business is hard only in the first one to two months, because employees need a ramp-up period; that 5 to 8 people are enough in the early stage; and that even women in their 40s can do the work.
Stability is the most important factor for data labeling companies and studios. However, most of the labelers Tech Planet has spoken to quit within 3 months out of boredom, and new employees cannot be productive immediately. High staff turnover means that the quality and turnaround time of data labeling are not stable enough. Mothers who need the money are the group that data labeling studios most want to recruit.
"Hiring part-timers definitely doesn't work. There will be idle periods, and once you have paid for rent and computers you will lose money. The best way is for everyone to work in the office," Wei Ming, who has opened a data labeling studio, told Tech Planet.
Most data labeling companies are paid by their clients on a cycle of at least 3 months and as long as half a year. However, they have to pay their own employees monthly, which requires a certain amount of capital in reserve. "3,500 yuan per person, 100 people, that's 1.05 million yuan in 3 months."
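The arithmetic behind that figure can be sketched as follows. This is a minimal illustration, assuming the 3,500 figure is a monthly wage in yuan per employee and that no client payment arrives before the payment cycle ends; the function name `payroll_reserve` is hypothetical.

```python
def payroll_reserve(monthly_wage_yuan: int, headcount: int, months_until_paid: int) -> int:
    """Wages a studio must cover out of pocket before the client pays.

    Assumes every employee earns the same monthly wage and that no
    client payment arrives during the waiting period.
    """
    return monthly_wage_yuan * headcount * months_until_paid


# Figures quoted in the article: 3,500 yuan per person, 100 people, 3 months.
three_month_reserve = payroll_reserve(3_500, 100, 3)
# The same studio under the longest payment cycle mentioned (half a year).
six_month_reserve = payroll_reserve(3_500, 100, 6)

print(three_month_reserve)  # 1050000
print(six_month_reserve)    # 2100000
```

Under the same assumptions, a half-year payment cycle doubles the reserve a 100-person studio needs, before rent and equipment are counted.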
Zhang Jian once joined a union with more than 200 employees. In its first year, the union caught the industry's boom period, when the unit price of a 2D frame was as high as 50 cents. That year, the union earned more than 4 million yuan.
But in the second year, the market took a sharp turn for the worse. Unit prices fell, employee turnover accelerated, idle time increased, and payment for two major projects was never settled. Over the whole year, the union lost more than 3 million yuan. "The boss said that he would never touch data labeling again in the short term," Zhang Jian said. "They are now suing the upstream company."
This is a business with meager profits. Haitian Ruisheng is the first main board listed company in the data annotation industry. Last year, the company had a revenue of 263 million yuan and a profit of only 29.45 million yuan, with a net profit margin of just over 10%. However, in the first half of this year, due to a decrease in the number of customers, the company fell into losses.
"Screws" that may be replaced at any time
Relying on the labor of Kenyan workers, OpenAI's conversational language model finally stood out. These ordinary people, known as data migrant workers, supported the AI dream of OpenAI founder Sam Altman, but if nothing unexpected happens, most of their work will soon be replaced by the very products they helped create.
Abroad, Anthropic, founded in 2021 by former OpenAI employees, has raised $5.15 billion this year, more than seven times its total financing over the previous two years. The company has put forward a new method for training models with less human involvement.
This year, AI startup Refuel launched an open source tool called Autolabel, which can use the mainstream large models on the market to label data sets. The company's test results show that Autolabel labels data 100 times faster than manual labeling, at only 1/7 of the cost.
In China, a company called Vision Future is also building a large annotation model. In an interview, they said that some projects have been delivered using GPT, with an accuracy rate of more than 80%, close to that of manual work.
However, Haitian Ruisheng believes that AI will never achieve fully automated labeling, because if machines want to continue to evolve and become closer to human judgment and understanding, they must be guided by humans.
Almost everyone who has worked in data labeling expressed the same view to Tech Planet: data labeling is a job with no barrier to entry, and all you need is to be proficient with a computer.
But in fact, if simple labeling can be done by AI, human participation will shift to the more difficult work of screening data and setting standards. That means the barrier to entry will keep rising, especially for large language models such as ChatGPT and Wenxin Yiyan.
Indeed, long before ChatGPT became popular, OpenAI had assembled more than a dozen PhDs to do "labeling". Baidu's data labeling base in Haikou has hundreds of full-time large model data labelers, all of whom hold a bachelor's degree.
What sets this type of large language model apart is that annotators need a certain amount of knowledge and logical reasoning ability. According to Caijing Eleven, annotators need to determine the type of question, then score and rank each of the five answers, with scores ranging from 0 to 5. If a score is lower than 3 points, the specific reason must be marked, such as "irrelevant answer (0 points)", "seriously off-topic (1 point)", or "logical problems and factual errors, but the proportion is small, so 2 points".
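The scoring rules described above can be sketched as a simple validity check. This is only an illustration of the reported rubric, not any company's actual tooling: the names `AnswerScore`, `validate_annotation`, and `rank_answers` are hypothetical, and deriving the ranking from the scores is an assumption, since the report does not say how ranking relates to scoring.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AnswerScore:
    answer_id: str
    score: int                    # 0 to 5, per the reported rubric
    reason: Optional[str] = None  # required when the score is below 3


def validate_annotation(scores: list[AnswerScore]) -> list[str]:
    """Return a list of rule violations; an empty list means the annotation passes."""
    problems = []
    if len(scores) != 5:
        problems.append(f"expected 5 answers, got {len(scores)}")
    for s in scores:
        if not 0 <= s.score <= 5:
            problems.append(f"{s.answer_id}: score {s.score} is outside 0-5")
        elif s.score < 3 and not s.reason:
            problems.append(f"{s.answer_id}: score below 3 needs a reason")
    return problems


def rank_answers(scores: list[AnswerScore]) -> list[str]:
    """Order answers from best to worst (assumption: rank follows score)."""
    ordered = sorted(scores, key=lambda s: s.score, reverse=True)
    return [s.answer_id for s in ordered]


sample = [
    AnswerScore("A", 5),
    AnswerScore("B", 3),
    AnswerScore("C", 0, "irrelevant answer"),
    AnswerScore("D", 1, "seriously off-topic"),
    AnswerScore("E", 2),  # missing reason, so this should be flagged
]
print(validate_annotation(sample))
print(rank_answers(sample))
```

Even this toy version shows why the work asks more of annotators than frame drawing does: each low score has to be justified in words, not just entered.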
Another hot area for data annotation is autonomous driving. According to a report by Deloitte, autonomous driving accounted for 38% of annotation demand across downstream AI applications in 2022, and that share is expected to rise to 52% by 2027. Compared with large language models, models for autonomous driving rely on simple frame-drawing tasks that still have relatively loose educational requirements.
Labelers are a cornerstone of humanity's transition from the mobile Internet era to the era of artificial intelligence. Most of the practitioners Tech Planet has spoken to are not aware of the changes AI will bring to their lives, nor of their own contribution to the development of AI. They are just the new generation of screws in the Internet era, and they may be replaced at any time.
(Note: All the characters in this article are pseudonyms.)
