Source: Quantum Bit

The first large language model paper led by Turing Award winner Yao Qizhi is here!

The first move: getting large models to "think like humans."

Large models should not only reason step by step; they should also learn to "advance cautiously at every step," keeping every verified intermediate result in memory as they reason.

Specifically, the paper proposes a new method called Cumulative Reasoning (CR), which significantly improves large models' ability to carry out complex reasoning.

Large models can already reason through problems with chain-of-thought prompting, but they still slip up easily on problems that take many turns to solve.

Cumulative reasoning builds on this by adding a "verifier" that judges each step right or wrong as soon as it is made. As a result, the model's thinking framework changes from a chain or a tree into a more complex "directed acyclic graph."

In this way, the large model not only works through problems more clearly, but also picks up a knack for card games:

When solving hard math problems in algebra, geometry, and number theory, the model's relative accuracy rose by 42%; in the Game of 24, the success rate soared to 98%.

Co-first author Zhang Yifan, from the Institute for Cross-Disciplinary Information Sciences at Tsinghua University, explained the starting point of the paper:

Kahneman proposed that human cognition consists of two systems: "System 1" is fast, instinctive, and emotional, while "System 2" is slow, deliberate, and logical. The current behavior of large language models is closer to "System 1," which may be why they are not good at complex tasks.

Cumulative reasoning, designed from this perspective, works better than chain of thought (CoT) and tree of thought (ToT).

So what does this new approach look like? Let’s take a look.

Breaking through the "bottleneck" of chains & trees of thought

The core of cumulative reasoning is to improve the "shape" of the large model's thinking process.

Specifically, the method uses three large language model roles:

  • Proposer: Continuously proposes new propositions, that is, suggests what the next step should be based on the current thinking context.

  • Verifier: Checks the accuracy of the proposer's proposition and adds it to the thought context if it is correct.

  • Reporter: Judges whether the final solution has been obtained, and thus whether to end the reasoning process.

During the reasoning process, the "proposer" first makes a proposal, the "verifier" is responsible for evaluation, and the "reporter" decides whether to finalize the answer and terminate the thinking process.

△CR reasoning example

It’s a bit like three roles in a team project: the team members brainstorm ideas, the advisor checks which ideas are feasible, and the team leader decides when to wrap up the project.
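To make the division of labor concrete, here is a minimal sketch of the loop in Python. The prompts and the `call_llm` helper are illustrative stand-ins of our own, not the paper's actual implementation (the experiments used GPT-3.5-turbo, GPT-4, and LLaMA as the underlying models).

```python
def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to a chat model and return its reply."""
    raise NotImplementedError

def cumulative_reasoning(question: str, max_steps: int = 16) -> str:
    context = [question]  # verified propositions accumulated so far
    for _ in range(max_steps):
        # Proposer: suggest the next proposition based on the current context.
        proposal = call_llm("Premises:\n" + "\n".join(context) +
                            "\nPropose one new proposition that follows from them.")
        # Verifier: keep the proposition only if it is judged correct.
        verdict = call_llm("Premises:\n" + "\n".join(context) +
                           f"\nProposition: {proposal}\nIs this proposition correct? Answer yes or no.")
        if verdict.strip().lower().startswith("yes"):
            context.append(proposal)
        # Reporter: decide whether the accumulated context already answers the question.
        done = call_llm("Premises:\n" + "\n".join(context) +
                        "\nCan the original question now be answered? Answer yes or no.")
        if done.strip().lower().startswith("yes"):
            return call_llm("Premises:\n" + "\n".join(context) +
                            "\nState the final answer to the original question.")
    return "no answer found within the step budget"
```

The key point is the middle step: only propositions that pass the verifier are appended to the context, so every premise used later has already been checked.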

So, how exactly does this approach change the “shape” of big model thinking?

To understand this, we have to start with Chain of Thought (CoT), the method that kicked off this line of work on enhancing large model reasoning.

CoT was proposed in January 2022 by Jason Wei (then at Google Brain, now at OpenAI) and his colleagues. The core idea is to add "step-by-step reasoning" text to the examples in the prompt to elicit the large model's reasoning ability.

△Selected from GSM8K dataset
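As an illustration of what such a prompt looks like (the worked example below is made up for demonstration, not an actual GSM8K item):

```python
# A chain-of-thought style prompt: a worked example with explicit intermediate
# steps is shown first, so the model imitates the step-by-step format when it
# answers the new question.
cot_prompt = """Q: A basket holds 5 apples. Tom adds 3 more, then eats 2. How many apples are left?
A: Start with 5 apples. Adding 3 gives 5 + 3 = 8. Eating 2 leaves 8 - 2 = 6. The answer is 6.

Q: A shop sells pens in packs of 4. Lily buys 3 packs and gives away 5 pens. How many does she keep?
A:"""  # the model is expected to continue with its own step-by-step reasoning
```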

Building on the chain-of-thought idea, Google quickly followed up with an upgraded version, CoT-SC (self-consistency), which runs multiple chains of thought and picks the answer by majority vote to further improve reasoning accuracy.
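A minimal sketch of the majority-vote idea, with `sample_chain` standing in for a sampled (temperature > 0) model call; the names here are our own, not from the CoT-SC paper.

```python
from collections import Counter

def sample_chain(prompt: str) -> str:
    """Placeholder: sample one chain-of-thought completion from the model."""
    raise NotImplementedError

def extract_answer(chain: str) -> str:
    """Pull the final answer out of a completed chain (assumes 'The answer is ...')."""
    return chain.rsplit("The answer is", 1)[-1].strip(" .\n")

def self_consistency(prompt: str, n_samples: int = 10) -> str:
    # Sample several independent chains, then majority-vote over their answers.
    answers = [extract_answer(sample_chain(prompt)) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```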

But both CoT and CoT-SC overlook one problem: a problem usually has more than one way to be solved, and this is especially true of how humans think.

This led to a new approach called Tree of Thought (ToT).

ToT is a tree-structured search scheme that lets the model try multiple different lines of reasoning, evaluate itself, choose the next course of action, and backtrack when necessary.

In other words, the tree of thought goes one step further than the chain of thought and makes the large model's thinking "more active."
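Sketched in the same style, a tree-of-thought search keeps a small set of promising states, expands each into several candidate next thoughts, and prunes by a model-given score; `expand` and `score` below are placeholder model calls, and the fixed width and depth are an assumption that comes up again later.

```python
def expand(state: str, k: int = 5) -> list[str]:
    """Placeholder: ask the model for k candidate next thoughts from `state`."""
    raise NotImplementedError

def score(state: str) -> float:
    """Placeholder: ask the model to rate how promising `state` looks."""
    raise NotImplementedError

def tree_of_thought(question: str, depth: int = 3, beam: int = 5) -> str:
    frontier = [question]
    for _ in range(depth):
        # Expand every kept state, then retain only the best `beam` candidates.
        candidates = [nxt for state in frontier for nxt in expand(state)]
        frontier = sorted(candidates, key=score, reverse=True)[:beam]
    return max(frontier, key=score)
```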

This is also why, in the Game of 24, GPT-4 with chain of thought succeeds only 4% of the time, while tree of thought pushes the success rate up to 74%.

BUT, whether it is chain of thought, CoT-SC, or tree of thought, they all share a common limitation:

None of them provide a storage location for the intermediate results of the thinking process.

After all, not all thought processes can be made into chains or trees, and the way humans think is often more complicated.

The new cumulative reasoning framework breaks through exactly this point in its design:

The large model's overall thought process does not have to be a chain or a tree; it can be a directed acyclic graph (DAG)! (Which gives it something of a synaptic flavor.)

△The edges in the graph are all directed, and there are no loops; each directed edge is a derivation step

This means it can keep every historically correct inference in memory and draw on it while exploring the current search branch. (By contrast, the tree of thought does not retain information from other branches.)

Cumulative reasoning can also switch seamlessly back to chain of thought: remove the "verifier" and it reduces to a standard chain-of-thought model.
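As a structural sketch of what "storing everything" means, each verified proposition can be recorded as a node whose edges point back to the premises it was derived from; the class names below are illustrative, not from the paper's code.

```python
from dataclasses import dataclass, field

@dataclass
class Proposition:
    text: str
    premises: list["Proposition"] = field(default_factory=list)  # incoming DAG edges

@dataclass
class ReasoningGraph:
    nodes: list[Proposition] = field(default_factory=list)

    def add_verified(self, text: str, premises: list[Proposition]) -> Proposition:
        """Record a proposition that passed the verifier; it stays available
        to every later derivation step, not just the current branch."""
        node = Proposition(text, premises)
        self.nodes.append(node)
        return node
```

If every new proposition were derived only from the single most recent one, the graph would collapse into a chain, which is exactly the standard chain-of-thought case described above.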

Cumulative reasoning designed along these lines achieves good results across a range of tasks.

Good at math and logical reasoning

The researchers chose the FOLIO wiki, AutoTNLI, Game of 24, and MATH datasets to put cumulative reasoning to the test.

In each experiment, the proposer, verifier, and reporter use the same large language model, with different prompts setting their roles.

The base models used in the experiments are GPT-3.5-turbo, GPT-4, LLaMA-13B, and LLaMA-65B.

It is worth mentioning that, ideally, the model should be pre-trained on relevant derivation-task data, and the "verifier" should also incorporate modules such as formal mathematical provers and propositional logic solvers.
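As an illustration of setting the three roles by prompt alone, the wording below is our own, not the paper's exact prompts:

```python
# Role instructions given to the same base model on each call.
ROLE_PROMPTS = {
    "proposer": ("Given the premises listed so far, propose exactly one new "
                 "proposition that follows from them and moves toward the answer."),
    "verifier": ("Given the premises, check whether the candidate proposition "
                 "is logically entailed by them. Answer only 'correct' or 'incorrect'."),
    "reporter": ("Given the premises, decide whether the original question can "
                 "now be answered. If so, state the final answer."),
}
```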

1. Logical reasoning ability

FOLIO is a first-order logic reasoning dataset whose question labels can be "True," "False," or "Unknown"; AutoTNLI is a higher-order logic reasoning dataset.

On the FOLIO wiki dataset, cumulative reasoning (CR) consistently performs best compared with directly outputting answers (Direct), chain of thought (CoT), and chain of thought with self-consistency (CoT-SC).

After removing problematic instances from the dataset (such as those with incorrect answers), GPT-4 with the CR method reached an inference accuracy of 98.04%, an error rate of only 1.96%.

Let’s look at the performance on the AutoTNLI dataset:

Compared with the CoT method, CR significantly improved the performance of LLaMA-13B and LLaMA-65B.

On the LLaMA-65B model, the improvement of CR compared to CoT reached 9.3%.

2. Playing the Game of 24

The original ToT paper used the Game of 24, so here the researchers used the same task to compare CR and ToT.

ToT uses a search tree of fixed width and depth, whereas CR lets the large model decide the search depth on its own.

In their experiments, the researchers found that on the Game of 24, the CR and ToT algorithms are very similar. The difference is that CR generates at most one new state per iteration, while ToT generates many candidate states per iteration and then filters them, keeping only some.

In plain terms, ToT has no "verifier" of the kind CR uses and cannot judge whether a state (a, b, c) is valid, so ToT ends up exploring many more invalid states than CR.
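One simple check such a verifier could apply in the Game of 24 is whether the numbers remaining after a proposed step can still reach 24 at all. The brute-force test below is our own sketch of that idea, not the paper's prompt-based verifier.

```python
from itertools import permutations

def reachable_24(nums: list[float], target: float = 24.0, eps: float = 1e-6) -> bool:
    """Return True if the numbers can be combined with +, -, *, / to hit `target`."""
    if len(nums) == 1:
        return abs(nums[0] - target) < eps
    for a, b, *rest in permutations(nums):
        results = [a + b, a - b, a * b]
        if abs(b) > eps:
            results.append(a / b)
        if any(reachable_24([r, *rest], target, eps) for r in results):
            return True
    return False
```

For example, reachable_24([3, 9, 4, 2]) is True (9 / 3 * 4 * 2 = 24), but after the step 3 * 9 = 27 the state [27, 4, 2] can no longer reach 24, so a verifier applying this check would reject that step.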

In the end, the CR method reaches an accuracy as high as 98% (versus 74% for ToT), while visiting far fewer states on average than ToT.

That is to say, CR not only has a higher search accuracy, but also has a higher search efficiency.

3. Mathematical ability

The MATH dataset contains a large number of mathematical reasoning questions, including algebra, geometry, number theory, etc. The difficulty of the questions is divided into five levels.

Using the CR method, the model breaks a problem down into sub-problems it can handle well, asking and answering itself step by step until it produces an answer.
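A minimal sketch of that self-ask loop, again with `call_llm` as a placeholder model call and the "FINAL:" convention as our own illustrative choice:

```python
def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to a language model and return its reply."""
    raise NotImplementedError

def solve_by_decomposition(problem: str, max_rounds: int = 10) -> str:
    transcript = f"Problem: {problem}"
    for _ in range(max_rounds):
        # Ask the model to pose and answer the next useful sub-question,
        # or to declare the final answer when the problem is solved.
        step = call_llm(transcript + "\nPose and answer the next sub-question, "
                        "or reply 'FINAL: <answer>' if the problem is solved.")
        transcript += "\n" + step
        if step.strip().startswith("FINAL:"):
            return step.split("FINAL:", 1)[1].strip()
    return "no answer found within the round budget"
```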

Experimental results show that CR's accuracy exceeds existing methods under two different experimental settings, reaching an overall accuracy of 58% and a 42% relative improvement on Level 5 problems, a new SOTA with GPT-4.

Tsinghua University's Yao Qizhi and Yuan Yang led the research

This paper comes from the AI for Math research group led by Yao Qizhi and Yuan Yang at the Institute of Cross-Disciplinary Information Sciences at Tsinghua University.

The co-first authors of the paper are Zhang Yifan and Yang Jingqin, doctoral students who enrolled in 2021 at the Institute of Cross-Disciplinary Information Sciences;

The advisors and co-corresponding authors are Assistant Professor Yuan Yang and Academician Yao Qizhi.

Zhang Yifan

Zhang Yifan graduated from Yuanpei College at Peking University in 2021 with a bachelor's degree and is now a doctoral student advised by Assistant Professor Yuan Yang. His main research interests are the theory and algorithms of foundation models (large language models), self-supervised learning, and trustworthy artificial intelligence.

Yang Jingqin

Jingqin Yang received his bachelor's degree from the Institute of Cross-Disciplinary Information Sciences at Tsinghua University in 2021 and is currently pursuing a doctorate degree under the tutelage of Assistant Professor Yuan Yang. His main research areas include large language models, self-supervised learning, and intelligent healthcare.

Yuan Yang

Yuan Yang is an assistant professor at the School of Interdisciplinary Information Sciences at Tsinghua University. He graduated from the Department of Computer Science at Peking University in 2012 and received a Ph.D. in Computer Science from Cornell University in 2018. From 2018 to 2019, he worked as a postdoctoral fellow at the School of Big Data Science at MIT.

His main research directions are intelligent healthcare, fundamental AI theory, and applied category theory.

Yao Qizhi

Yao Qizhi is an academician of the Chinese Academy of Sciences and dean of the Institute of Cross-Disciplinary Information Sciences at Tsinghua University. He is also the first Asian scholar to win the Turing Award since its inception and the only Chinese computer scientist to have received this honor so far.

Professor Yao Qizhi resigned from his tenured position at Princeton in 2004 and returned to Tsinghua to teach; in 2005, he founded the "Yao Class," an experimental computer science class for Tsinghua undergraduates; in 2011, he established the "Tsinghua Quantum Information Center" and the "Institute of Cross-Disciplinary Information Sciences"; in 2019, he founded the artificial intelligence class for Tsinghua undergraduates, known as the "Zhi Class."

Today, the Institute of Cross-Disciplinary Information Sciences that he leads has become well known, and both the Yao Class and the Zhi Class belong to it.

Professor Yao Qizhi's research areas include algorithms, cryptography, and quantum computing, and he is an international pioneer and authority in these fields. He recently appeared at the 2023 World Artificial Intelligence Conference, and the Shanghai Qizhi Research Institute, which he leads, is currently working on "embodied general artificial intelligence."

Paper link: https://arxiv.org/abs/2308.04371