According to Cointelegraph, a team of researchers from Tsinghua University, The Ohio State University, and the University of California, Berkeley has developed a benchmark called "AgentBench" to evaluate how well large language models (LLMs), such as OpenAI's ChatGPT and Anthropic's Claude, can act as real-world agents. LLMs have gained popularity for tasks like coding, cryptocurrency trading, and text generation.

Traditionally, LLMs have been benchmarked on their ability to generate human-like text or on their scores on plain-language tests designed for humans. The team instead set out to explore their potential as agents performing specific tasks in particular environments.

AgentBench, billed as the first of its kind, measures LLMs' ability to perform challenging tasks in a variety of real-world environments, such as querying an SQL database, working within an operating system, and shopping online. The results showed that top-tier models like GPT-4 significantly outperformed open-source models, which the researchers took as a sign that the strongest models could serve as a basis for continuously learning agents.