Using ChatGPT can result in a mixed bag of helpful information and nonsensical answers, making it hard to evaluate the chatbot's overall performance. And the companies making generative AI tools, including OpenAI, Google, and Microsoft, are secretive about the data they use and how their AI models truly work.
How to Test the Chatbots
To learn more about generative AI tools, 10 students and four faculty members at the University of California, Berkeley, formed a group called the Large Model Systems Organization (LMSYS Org) within the university's AI research and computer science departments. LMSYS Org has created an experiment, the "Chatbot Arena," a custom website where anyone can anonymously chat with two models at once.
Once the user has formed an opinion on which chatbot's answers they prefer, they vote for a favorite and only afterward find out which models they were talking to. The site uses the same large language models (LLMs) that power ChatGPT and other chatbots, repackaging them in a new interface; companies such as OpenAI have made these models publicly available. The site also includes smaller models created by individuals.
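To make that flow concrete, here is a minimal conceptual sketch of a blind head-to-head comparison in the spirit of the Chatbot Arena: pick two models at random, show their answers anonymously, record the vote, and only then reveal the names. This is not LMSYS's actual code; the model list and the ask() helper are placeholders standing in for real API calls.

```python
import random

# Conceptual sketch of a blind head-to-head vote (not LMSYS's implementation).
# MODELS and ask() are placeholders; a real arena would call each model's API.

MODELS = ["gpt-4", "claude-v1", "vicuna-13b", "palm-2"]  # illustrative names only

def ask(model_name: str, prompt: str) -> str:
    """Placeholder for a real API call to the named model."""
    return f"[{model_name}'s answer to: {prompt}]"

def run_battle(prompt: str) -> None:
    model_a, model_b = random.sample(MODELS, 2)  # the user doesn't see these yet
    print("Model A:", ask(model_a, prompt))
    print("Model B:", ask(model_b, prompt))
    vote = input("Which answer was better? (A/B): ").strip().upper()
    print(f"You preferred Model {vote}.")
    print(f"Model A was {model_a}; Model B was {model_b}.")  # revealed only after the vote

run_battle("Draft an email to my family about Thanksgiving flights.")
```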
"We started this because we created our own AI model based off Meta's LLaMA model in April, [which we] called Vicuna, and we wanted to train different versions and iterate on it," says Hao Zhang, one of the professors at Berkeley leading the effort. "It mostly measures human preference, and its ability to follow instructions and do the task the human wants, which is a very important factor in making a model useful."
The group has steadily added more models to the arena, and since April, around 40,000 people have participated, Zhang says.
The Chatbot Arena
We tried the Chatbot Arena, below. Not knowing which two AI models the page chose for us to compare, we asked both to "draft an email to my family telling them I've booked flights for Thanksgiving, arriving on November 22 and leaving on November 30." Each generated a suggested email. We selected Model B as the preferred option.
Then, the page revealed that Model B was Claude, an AI assistant made by Anthropic. Model A was gpt4all-13b-snoozy, a smaller model built by an individual.
Two AI models compete for the best response in the Chatbot Arena.

The site takes into account every user's vote to create a rating using the Elo system, which "is a widely-used rating system in chess and other competitive games," an LMSYS Org blog post says.
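For readers curious how that works, here is a minimal sketch of a single Elo update after one vote, using conventional chess-style parameters (a K-factor of 32 and a 400-point scale); these are assumptions, and LMSYS's blog post describes its exact setup, which may differ.

```python
# Minimal Elo sketch: how one head-to-head vote nudges two ratings.
# K=32 and the 400-point scale are conventional chess defaults, assumed here;
# they are not necessarily the exact parameters LMSYS uses.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability the Elo model assigns to A beating B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return updated (rating_a, rating_b) after a single vote."""
    expected_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: a 1,225-rated model loses a vote to a 1,143-rated underdog.
print(update_elo(1225, 1143, a_won=False))  # the favorite drops ~20 points, the underdog gains ~20
```

The upshot: an upset win over a higher-rated model moves both ratings more than an expected win does, so the leaderboard reflects the accumulated outcomes of many such votes.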
"I've seen this leaderboard posted on multiple respected research sites," says Federico Pascual, who previously worked at Hugging Face, which maintains its own leaderboard of custom-built AI models. "This is an active area of research as people are figuring out how to evaluate these models. In three months or six months, [the Chatbot Arena leaderboard] will probably look different."'
And the Winner Is...
ChatGPT's most advanced model, GPT-4, currently tops the list with an Elo rating of 1,225. It's available with a ChatGPT Plus account ($20 per month). Next, two versions of Claude, made by Anthropic, rank second (1,195) and third (1,153). Claude is currently available via a waitlist; we were able to start using it within a few weeks.
The free version of ChatGPT, which runs on the GPT-3.5 model, is fourth (1,143). OpenAI recommends GPT-3.5 for most daily tasks, since it runs faster than GPT-4 and is still very powerful; for that reason, it's also available in the paid version. But note that Microsoft's new Bing AI search, which is free, also runs on GPT-4.
With GPT-4 and GPT-3.5 at the top of the rankings, and Claude still behind a waitlist, ChatGPT and Microsoft Bing are the most accessible of the current favorites.
Chatbot Arena leaderboard as of June 2023.

The model behind Google Bard, PaLM 2, ranks sixth (1,042). Zhang notes that Google makes multiple versions of PaLM 2, and he has not confirmed that the model in the Chatbot Arena is the same one behind Bard. Zhang has reached out to Google, but says the company is "very secretive" and would not confirm. Separately, Zhang's team compared the Chatbot Arena version with Google Bard and found it to be "at least very close to the one people can access in Bard," if not identical.
Concerns About AI
From all his work with LLMs, Zhang has identified a few concerns about their widespread adoption. He agrees with OpenAI CEO Sam Altman, Elon Musk, Bill Gates, and others who have called for more AI regulation.
Specifically, Zhang thinks two issues need more attention. The first is data privacy, as these models are able to scrape the web and distill that data into usable information better than anything before. Another issue is keeping the data that powers the models high-quality and helpful. If AI models can generate their own content using what's available on the web, Zhang believes there won't be an incentive for humans to create new, better content.
"These large language models [rely on] quality content, which is created by humans," he says. "So if they don't incentivize people to create good materials, how can you guarantee they will improve the quality of life?"