Article reprinted from: Machine Heart

This paper proposes seven main key dimensions to comprehensively evaluate LLM credibility.

In actual deployment, how to “align” large language models (LLMs), that is, to make model behaviors consistent with human intentions [2,3], has become a critical task. For example, OpenAI spent six months on alignment before releasing GPT-4 [1]. However, the challenge faced by practitioners is the lack of clear guidance on whether the output of LLMs is consistent with social norms, values, and regulations; this hinders the iteration and deployment of LLMs.

To address this issue, Liu Yang and other researchers from the ByteDance Research team provided a comprehensive survey on the key dimensions that need to be considered when evaluating the trustworthiness of LLMs. The survey covers seven main categories of LLM trustworthiness: Reliability, Safety, Fairness, Resistance to Misuse, Explainability & Reasoning, Social Norm, and Robustness.

Each major category is further subdivided into multiple subcategories, totaling 29 subcategories. In addition, the researchers selected 8 subcategories for corresponding evaluation studies. The evaluation results show that, in general, models with better alignment perform better in terms of overall trustworthiness. However, the effectiveness of alignment varies in different dimensions. This suggests that more detailed analysis, testing, and improvements to LLM alignment are needed. This paper aims to provide valuable insights and guidance to practitioners in this field by summarizing the key dimensions of trustworthy LLM, which is crucial to understanding how to reliably and reasonably deploy LLM in various applications.

Paper address: https://arxiv.org/abs/2308.05374

Large Language Model Alignment Classification

Figure 1 shows the taxonomy of credibility alignment for large language models proposed in this paper: there are 7 main categories, each of which is further broken down into more detailed discussions, for a total of 29 subcategories. The paper goes on to give an overview of each category:

Figure 1: The proposed large language model credibility alignment taxonomy.

1. Reliability => {false information, language model hallucinations, inconsistencies, miscalibration, flattery}

  • a. Generate correct, realistic, and consistent outputs with appropriate uncertainty.

2. Safety => {violence, illegality, harm to minors, adult content, mental health issues, privacy violation}

  • a. Avoid generating unsafe and illegal output, and avoid leaking private information.

3. Fairness => {injustice, stereotypes, preference bias, performance differences}

  • a. Avoid bias and ensure that performance differences across different populations are small.

4. Resist abuse => {propaganda, cyber attacks, social engineering, copyright leaks}

  • a. Malicious attackers are prohibited from abusing it.

5. Explainability and reasoning => {lack of explanatory power, lack of logical power, lack of causal power}

  • a. Ability to explain output to the user and reason correctly.

6. Social norms => {toxic language, emotional insensitivity, cultural insensitivity}

  • a. Reflect universally shared human values.

7. Robustness => {cue attacks, paradigm and distribution changes, intervention effects, poisoning attacks}

  • a. Resistance to adversarial attacks and distribution changes.

This paper’s analysis is based on the challenges of secure and trustworthy deployment in the era of big models, and also considers the discussion of trustworthy AI in the existing literature. At the same time, the definition and division of the main categories refer to the application of big models in society, and try to ensure that each evaluation dimension has a certain degree of relevance and importance in the mainstream big model application. The literature and discussion in each category and its subcategories can be found in the article.

For each subcategory, the article conducts relevant research and discussion, and also provides case studies to illustrate the problems of related models in related trust dimensions. For example, the following examples show some errors of ChatGPT on factual questions:

Using the following example, this article discusses possible illegal information in the output of a large model:

Evaluation Research

This paper selects 8 subcategories and designs corresponding evaluation studies. The target subcategories include:

  • Reliability: Language Model Hallucination

  • Safety and social norms: safety-related topics (e.g., violence, discrimination, hate speech, etc.)

  • Fairness: (gender) stereotypes

  • Reliability: Calibration Error

  • Resisting abuse: Propaganda and cyberattack abuse

  • Resisting Abuse: Disclosure of Copyright

  • Explainability: Causal Inference

  • Robustness: Robustness against typo attacks

The article takes "security and social norms" as an example. It first extracts some security-related keywords from the existing Anthropic RLHF red team dataset [4] (see the original paper for details), and then uses another LLM to generate prompts for unsafe instructions as test questions.

To this end, we choose text-davinci-003 as the model because: (1) it is an aligned model, but not heavily aligned on the safety dimension like ChatGPT, so it is better at generating controversial and offensive replies; (2) completely unaligned models (such as davinci) cannot understand the task well and follow human instructions.

Specifically, the researcher asked text-davinci-003: "Write a question that involves ." Choose from "unethical", "dangerous", "illegal", or "hostile". These are the keywords extracted previously. For example:

The article uses questions generated by text-davinci-003 as test questions.

Finally, the article uses GPT-4 to determine whether the test model refuses to respond to unsafe questions, as follows:

The researchers evaluated 6 LLMs: davinci, OPT-1.3B, text-davinci-003, flan-t5-xxl, gpt-3.5-turbo (ChatGPT), and GPT-4. Figure 2 shows the proportion of each LLM in the test set that was considered a safe response by GPT-4. From left to right on the x-axis, it shows one of the most well-aligned LLMs (GPT-4) from completely unaligned (davinci) to one of the most well-aligned LLMs to date.

The trend is as expected: when LLM is more aligned, it is more likely to refuse to answer unsafe instructions. Gpt-3.5-turbo (ChatGPT) and GPT-4 get nearly 100% safe.

Figure 2: LLM safety evaluation results. As expected, when the LLM is better aligned, it is more likely to refuse to answer unsafe questions.

For evaluation methods, details and results of other dimensions, please refer to the original paper.

Alignment Help

These generated evaluation data can also help collect alignment data.

Taking security as an example, in order to generate aligned training data, the responses of the labeled LLM are directly used. If GPT-4 determines that the model output contains harmful information, the researchers consider the output to be paired with the question and used as a negative sample in the aligned dataset. On the other hand, if no harmful information is detected, the researchers consider the question-output pairing to be a positive sample.

After aligning the generated data, the researchers used GPT-4 to compare the output results before and after alignment, asking it to judge which answer was better in terms of helpfulness, truthfulness, and harmlessness.

Table 1 shows the proportion of GPT-4 that is considered better in the test dataset after researchers did RLHF (Reinforcement Learning from Human Feedback) on GPT-2. Compared with the original model, the aligned model has been greatly improved.

Table 1: The proportion of outputs that GPT-4 considers better after aligning the data generated by researchers on GPT-2. Compared with the original model (Vanilla), the model after SFT and PPO has been greatly improved.

The article also performed supervised fine-tuning on LLaMA-7B using the generated evaluation data and found that 78% of the outputs after fine-tuning were considered better than before fine-tuning.

in conclusion

This article provides practitioners with a survey of the LLM credibility dimension, and comprehensively analyzes the directions and issues that need to be considered and paid attention to in the process of building a trustworthy large model. The evaluation results of the article show that the effectiveness of alignment is inconsistent in different dimensions, so practitioners should do more fine-grained testing and improvement of LLM alignment. At the same time, the research in this article shows that the data generated by the evaluation can also help complete the alignment task of large models.

Practitioners urgently need more principled approaches to assessing and implementing LLM alignment, ensuring that these models adhere to societal values ​​and ethical considerations. As the field advances, addressing these unresolved issues will be critical to building increasingly reliable and accountable LLMs.

Thanks to Li Hang for his suggestions and help in revising this article.

references

[1] OpenAI. Gpt-4. https://openai.com/research/gpt-4, 2023.

[2] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022. 

[3] Zachary Kenton, Tom Everitt, Laura Weidinger, Iason Gabriel, Vladimir Mikulik, and Geoffrey Irving. Alignment of language agents. arXiv preprint arXiv:2103.14659, 2021.

[4] https://github.com/anthropics/hh-rlhf/tree/master