
Source: Quantum Bit


A recent study by Microsoft caused Llama 2 to suffer from selective amnesia, completely forgetting everything about Harry Potter.

Now ask the model "Who is Harry Potter?" and the answer it gives contains no Hermione, no Ron, and no Hogwarts...

Bear in mind that Llama 2's recall of the books used to be strong: give it a seemingly ordinary prompt like "That autumn, Harry Potter returned to school" and it would continue straight into the story of the magical world created by J.K. Rowling.

Yet now the specially fine-tuned Llama 2 has completely forgotten the magical Harry.

What on earth is going on?

The "Forget Harry Potter" Project

Traditionally, "feeding" new data to a large model is relatively simple, but getting the model to "spit out" data it has already "eaten", that is, to forget specific information, is much harder.

Because of this, large models trained on massive corpora have "accidentally swallowed" plenty of copyrighted text, toxic or malicious data, inaccurate or false information, and personal data. When that information resurfaces in model outputs, intentionally or not, it causes huge controversy.

Take ChatGPT, for example: it has been involved in multiple lawsuits.

Earlier, 16 anonymous plaintiffs sued OpenAI and Microsoft, claiming the companies used and leaked personal data without permission and seeking damages as high as 3 billion US dollars. Later, two full-time authors alleged that OpenAI used their novels to train ChatGPT without permission, constituting infringement.

One way out is to retrain the model from scratch, but that is extremely costly. Finding a way to make a model "forget" specific information has therefore become a new research direction.

Microsoft researchers Ronen Eldan and Mark Russinovich recently published a study showing how to successfully erase a subset of a model's training data.

In the experiment, the researchers used the Llama2-7b base model, whose training data includes "books3", a dataset containing the Harry Potter series and other novels by J.K. Rowling.

They proposed a fine-tuning method that makes large models forget, completely changing the model's output.

For example, when asked who Harry Potter is, the original Llama2-7b base model gives the correct answer, while the fine-tuned model, besides the answer shown at the beginning, even conjures up a hidden identity for Harry Potter: a British actor, writer, and director...

When asked, "Who are Harry Potter's two best friends?", the original Llama2-7b base model was still able to give the correct answer, but the fine-tuned model answered:

Harry Potter's two best friends are a talking cat and a dinosaur. One day, they decide to...

Nonsense, yes, but it does sound rather "magical", doesn't it? (Tongue firmly in cheek.)

The paper shows several more side-by-side comparisons confirming that the fine-tuned Llama2-7b really does forget.

So how is this done?

Three steps to erase specific information

The key to achieving selective amnesia in a model is to pick out the information you want to forget.

Here, with Harry Potter as the example, the researchers first performed the reverse operation: they trained the base model further on the very content to be forgotten. (Despite the resulting model being called "reinforced", this is ordinary fine-tuning rather than reinforcement learning.)

That is, the model was made to study the Harry Potter novels carefully once more, producing a "reinforced model".

This reinforced model naturally has a deeper, more accurate grasp of Harry Potter than the base model, and its output leans even more heavily toward the content of the novels.
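Conceptually, this "reinforcement" step is just standard causal language-model fine-tuning on the target text. A minimal sketch using the Hugging Face transformers API (the model name, corpus path, and hyperparameters are illustrative placeholders, not the authors' exact setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model.train()

# The text to "over-learn" before unlearning (placeholder path).
text = open("harry_potter_corpus.txt").read()
ids = tokenizer(text, return_tensors="pt").input_ids[0]
chunks = [ids[i:i + 512] for i in range(0, len(ids) - 512, 512)]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
for chunk in chunks:
    batch = chunk.unsqueeze(0)
    # Standard causal-LM objective: labels are the inputs themselves
    # (the library shifts them by one position internally).
    loss = model(input_ids=batch, labels=batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.save_pretrained("./llama2-7b-hp-reinforced")
```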

The researchers then compared the logits (the raw scores a model assigns to each candidate next token, before they are turned into probabilities) of the reinforced model and the base model to find the tokens most strongly tied to the forgetting target, and used GPT-4 to pick out idiosyncratic expressions from the novels, such as "wand" and "Hogwarts".
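The logit comparison can be sketched like this (the prompt, model paths, and top-k cutoff are assumptions; the paper scans many positions across the books rather than a single prompt):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
reinforced = AutoModelForCausalLM.from_pretrained("./llama2-7b-hp-reinforced")

prompt = "Harry walked back to the castle holding his"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    base_logits = base(**inputs).logits[0, -1]          # scores over the vocab
    reinf_logits = reinforced(**inputs).logits[0, -1]

# Tokens whose score rose the most under the reinforced model are the ones
# most strongly tied to the forgetting target ("wand", "Hogwarts", ...).
boost = reinf_logits - base_logits
for tok_id in boost.topk(10).indices.tolist():
    print(repr(tokenizer.decode([tok_id])))
```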

In the second step, the researchers replaced these idiosyncratic expressions with common words and had the model predict the subsequent tokens on the replaced text, yielding "generic predictions": roughly what a model that had never read the books would say.
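For intuition, the replacement can be pictured as a simple dictionary substitution; the mapping below is invented for illustration and is not the authors' actual GPT-4-derived dictionary:

```python
import re

# Invented mapping for illustration; in the paper, GPT-4 helps extract the
# book-specific expressions and propose generic stand-ins.
generic_map = {
    "Hogwarts": "the school",
    "wand": "stick",
    "Hermione": "Jane",
    "Quidditch": "football",
}

def to_generic(text: str) -> str:
    for specific, generic in generic_map.items():
        text = re.sub(rf"\b{re.escape(specific)}\b", generic, text)
    return text

print(to_generic("Harry raised his wand outside Hogwarts."))
# -> Harry raised his stick outside the school.
```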

In the third step, the researchers combined the reinforced model's predictions with these generic predictions.

That is, they went back to the original, unreplaced Harry Potter text and had the model predict the next word from the preceding context, but this time steered it toward the generic words above rather than the books' specific magical vocabulary, thereby generating "generic labels" for each position.

Finally, the base model is fine-tuned using the original, unreplaced text as input and these generic labels as targets.
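Putting the pieces together: the paper builds the generic labels by taking the base model's logits and subtracting, scaled by a coefficient alpha, whatever the reinforced model boosts above the baseline, roughly v_generic = v_base - alpha * ReLU(v_reinforced - v_base). A simplified sketch of that label construction and the resulting loss (the alpha value, batching, and the next-token position shift are assumed or omitted):

```python
import torch
import torch.nn.functional as F

alpha = 5.0  # suppression strength; an assumed value, not from the paper

def generic_labels(base_logits, reinf_logits):
    # v_generic = v_base - alpha * ReLU(v_reinforced - v_base):
    # push down exactly the tokens the reinforced model boosted.
    return base_logits - alpha * F.relu(reinf_logits - base_logits)

def unlearning_loss(model, frozen_base, frozen_reinforced, input_ids):
    # In practice the generic labels would be precomputed once from the
    # frozen original models before fine-tuning starts.
    with torch.no_grad():
        target = F.softmax(
            generic_labels(
                frozen_base(input_ids=input_ids).logits,
                frozen_reinforced(input_ids=input_ids).logits,
            ),
            dim=-1,
        )
    logits = model(input_ids=input_ids).logits
    # Cross-entropy against the generic distribution (the one-position
    # shift for next-token prediction is omitted for brevity).
    return -(target * F.log_softmax(logits, dim=-1)).sum(-1).mean()
```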

Over repeated rounds of this training, the model is progressively corrected: it loses the magical knowledge from the books and produces increasingly ordinary predictions, thereby forgetting the targeted information.

△ Predicted probability of the next word during fine-tuning: the probability of "magic" gradually decreases, while common words such as "at" become more likely

To be precise, the method does not make the model forget the name "Harry Potter" itself; rather, it severs the connections between "Harry Potter" and "magic", "Hogwarts", and the rest.

Moreover, although the model's memory of this specific knowledge was erased, its performance elsewhere did not change significantly in the researchers' benchmark tests.

It is worth mentioning that the researchers also point out a limitation of the method: the model forgets not only the books' content but also general knowledge about Harry Potter, even though introductions to Harry Potter appear in sources such as Wikipedia, far beyond the novels themselves.

With all of that information gone, the model may start to "hallucinate" and spout nonsense.

In addition, the study tested only fictional texts, so whether the approach generalizes to other kinds of content still needs further verification.

References:
[1] Paper: https://arxiv.org/abs/2310.02238
[2] https://www.microsoft.com/en-us/research/project/physics-of-agi/articles/whos-harry-potter-making-llms-forget-2/