Deep Thinking: Why is ChatGPT a fuzzy image of all text on the Internet?

As regular readers know, I am a GPT enthusiast and have integrated it into every part of my work and life. But GPT is not omnipotent, and we need to understand what it really is in order to use it well. I highly recommend Ted Chiang's essay "ChatGPT Is a Blurry JPEG of the Web", which describes ChatGPT as a fuzzy image of all text on the Internet. Its insights are thought-provoking, and I have summarized them in 3 points below.
Ted Chiang is a Chinese-American science fiction writer who graduated from the Department of Computer Science at Brown University. His short story "Story of Your Life" was adapted into the 2016 movie "Arrival". His dual background in technology and science fiction gives him an unusual perspective on ChatGPT.
TL;DR
- ChatGPT is a lossy compression of all text on the Internet.
- Beware of "beautiful blur".
- "Poorly expressed original ideas" are better than "clearly expressed unoriginal ideas".
1. ChatGPT is a lossy compression of all text on the Internet
If we treat all the text on the Internet as the original, then ChatGPT, which has to trade accuracy for speed and size, is in effect a natural-language interface to a lossily compressed copy of that text. Because the compression is lossy, some details, and even some key information, are discarded.
To show what can go wrong with lossy compression, the author gives a vivid example. In 2013, workers at a German construction company photocopied a house floor plan in which each of the three rooms was labeled with its area: 14.13, 21.11 and 17.42 square meters. In the copy, all three rooms were labeled 14.13 square meters.
The investigation found that the Xerox copier scans a document into a digital image before printing it. To save space, the scanned image is stored in a lossy compression format called JBIG2. The copier judged the area labels of the three rooms to be similar enough that it stored only one of them, and then reused that label for all three rooms when printing.
That Xerox copiers use a lossy format rather than a lossless one is not a problem in itself. The problem is that if the copy had simply been blurry, everyone would have known it was not an exact replica of the original. Instead, the copier produced an image that was sharp but wrong, which is misleading.
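The copier failure described above can be mimicked with a toy sketch. This is not the real JBIG2 algorithm: `similarity`, `lossy_copy` and the `threshold` parameter are hypothetical names invented for illustration, and the sketch compares whole labels as strings rather than image patches. It only shows how "store one pattern, reuse it for anything similar enough" yields output that is crisp but wrong when the similarity test is too lax.

```python
def similarity(a: str, b: str) -> float:
    """Fraction of positions where two equal-length labels share a character."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def lossy_copy(labels, threshold):
    """Toy stand-in for pattern-matching compression (assumption: not real JBIG2).

    Each label is replaced by the first stored pattern whose similarity
    reaches the threshold; otherwise the label itself is stored.
    """
    stored = []   # patterns the "copier" actually keeps
    output = []
    for label in labels:
        match = next((s for s in stored if similarity(s, label) >= threshold), None)
        if match is None:
            stored.append(label)
            match = label
        output.append(match)
    return output


labels = ["14.13", "21.11", "17.42"]
print(lossy_copy(labels, threshold=1.0))  # ['14.13', '21.11', '17.42']
print(lossy_copy(labels, threshold=0.4))  # ['14.13', '14.13', '14.13']
```

With the strict threshold every label survives. With the deliberately lax one, all three rooms come out as 14.13, and nothing in the output hints that anything was lost.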
The author argues that we should keep this example in mind when we use OpenAI's ChatGPT and other large language models. ChatGPT preserves most of the information on the World Wide Web, just as a JPEG preserves most of the information in a high-resolution image. But if you are looking for the exact sequence of bits, you will not find it; all you get is an approximation.
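The difference between "the exact sequence of bits" and "an approximation" can be sketched in a few lines. The example below is only an illustration: it uses Python's standard `zlib` module for the lossless case and a made-up quantization step (`step = 5.0`, an assumption chosen to exaggerate the effect) for the lossy case. It does not model how JPEG or a language model actually works.

```python
import zlib

# Lossless: decompressing returns exactly the bytes that went in.
text = b"Room areas: 14.13, 21.11 and 17.42 square meters."
restored = zlib.decompress(zlib.compress(text))
print(restored == text)  # True

# Lossy (toy): snap each value to the nearest multiple of a coarse step.
# The originals cannot be recovered from the quantized values.
samples = [14.13, 21.11, 17.42]
step = 5.0  # hypothetical quantization step, chosen for illustration
quantized = [round(x / step) * step for x in samples]
print(quantized)  # [15.0, 20.0, 15.0]
```

The quantized values are still plausible room sizes, which is exactly why the loss is easy to overlook.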
The factuality evaluation in OpenAI's GPT-4 paper points the same way: GPT-4 scores much higher than previous models, but it still has a significant chance of generating wrong answers (especially in technology, code, and business), so we need to be careful.
2. Beware of “beautiful blur”
Human cognition is also, in essence, a matter of receiving and compressing information. We identify and discard what is unimportant, keep what matters, and exercise our judgment in the process. So both we and ChatGPT compress information lossily. What is the difference?
- Our compression is based on an understanding of the facts, and what is left is "fuzzy correctness".
- ChatGPT does not really "understand" the information; it outputs "beautiful blur" based on statistical regularities.
Let's look at 2 more vivid examples:
If you ask ChatGPT to calculate 3457 * 43216, it gives the wrong answer 149299312 (the correct answer is 149397712). The last digit is right because ChatGPT has seen many multiplications of numbers ending in 7 and 6, whose product always ends in 2. But because it does not really understand the principles of arithmetic, the answer as a whole is wrong.
Any analysis of text on the web will reveal that phrases like "lack of supply" often appear near phrases like "prices are rising". When asked about a lack of supply, the AI may give a response that mentions price increases. If the AI has compiled a large number of correlations between economic terms, enough to give reasonable answers to a wide variety of questions, should we say that it actually understands economic theory? Obviously not.
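The multiplication example above is easy to check with exact integer arithmetic. The short sketch below uses only the numbers quoted in the text; the variable names are chosen for illustration. It shows that the quoted answer matches the true product in its last digit but not in value.

```python
a, b = 3457, 43216
claimed = 149299312   # the wrong answer quoted in the text
exact = a * b         # Python integers are exact

print(exact)                        # 149397712
print(exact == claimed)             # False
print(exact % 10 == claimed % 10)   # True: the last digits agree
print((a % 10) * (b % 10) % 10)     # 2, because 7 * 6 = 42
print(exact - claimed)              # 98400
```

A final digit that matches makes the wrong answer look plausible at a glance, which is the "beautiful blur" in miniature.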
ChatGPT is good at producing beautiful answers, but beautiful does not mean correct, and we must always keep this in mind. Its output may be polished and clear yet inaccurate. To catch such errors we need to check the output against the original sources; otherwise we may make wrong decisions based on fabricated content. The answer generated by Bing below is a typical example of "beautiful blur".
3. “Poorly expressed original ideas” are better than “clearly expressed unoriginal ideas”
Some people suggest that writers could use text generated by ChatGPT as a starting point, so that they can concentrate on the truly creative parts. The author believes that starting from a blurry, unoriginal piece of work is not a good way to create original work.
If you are a writer, you will write a great deal of unoriginal work before you write anything original. The time and energy spent on that unoriginal work is not wasted. On the contrary, it is what enables you to eventually produce original writing. The hours spent choosing the right word and rearranging sentences teach you how to get your meaning across in prose.
Having students write essays is more than a way to test their mastery of the material; it gives them experience in expressing their own ideas. If students never have to write the essays that we have all read before, they will never gain the skills needed to write something we have never read.
So, once you are no longer a student, can you safely use the templates provided by large language models such as ChatGPT? The answer is still no. The struggle to express your own ideas does not disappear after you graduate; it comes back every time you start drafting a new piece. Sometimes it is only in the process of writing that you discover your original ideas, which is why the struggle matters.
Some might say that the output of a large language model doesn't look much different from a human writer's first draft, but that is only a superficial resemblance. Your first draft isn't "clearly expressed unoriginal ideas"; it's "poorly expressed original ideas", accompanied by your amorphous dissatisfaction: your awareness of the distance between what it says and what you meant to say.
That dissatisfaction is what guides you when you rewrite, and it is exactly what is missing when you start from AI-generated text. If you start from "clearly expressed unoriginal ideas", it is easy to lose your own ideas. If you start from "poorly expressed original ideas" and polish them step by step, you end up with "precisely expressed original ideas". Original ideas can be polished into jade; unoriginal ones are simply washed away with the flood.
Summary: 2 takeaways
1. ChatGPT is a lossy compression of all text on the Internet. We must always keep this in mind and be wary of mistaking "beautiful blur" for accurate information, which would distort our judgment and decision-making.
2. Discover "original ideas" through struggle and poor expression, improving our ability to express ourselves as we polish them into jade. Train imagination, decision-making and communication skills to build a competitiveness that machines cannot have.