Google Bard just made a stunning leap in capabilities…
It just beat GPT-4 on a top leaderboard that evaluates AI models.
The leaderboard, called Chatbot Arena, comes from the Large Model Systems Organization (LMSYS). It now shows Google Bard (powered by Google's Gemini Pro model) in 2nd place in terms of performance.
The leaderboard takes into account 200,000+ human votes on which models users prefer.
It also assigns each model an "Elo" rating, the same kind of skill rating used to rank players in zero-sum games like chess.
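For a rough sense of how Elo works, here's a minimal sketch of an Elo-style update after a single head-to-head vote. The constants and function names are illustrative only, not Chatbot Arena's actual implementation:

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32) -> tuple[float, float]:
    """Adjust both ratings after one head-to-head vote.

    The winner gains more points when it was expected to lose,
    and fewer when it was already the favorite.
    """
    expected_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: a lower-rated model beats a higher-rated one in a single matchup.
print(update_elo(1100, 1200, a_won=True))  # the underdog's rating rises by roughly 20 points
```

Aggregated over hundreds of thousands of these pairwise votes, the ratings settle into the rankings you see on the leaderboard.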
Bard still trails behind GPT-4 Turbo, but now surpasses other versions of GPT-4 and other popular models like Claude and Mistral.
What should you do now that Bard is climbing the rankings?
In Episode 81 of The Marketing AI Show, I got the answer from Marketing AI Institute founder/CEO Paul Roetzer.
Here’s what you need to know…
Chatbot Arena isn’t just a random online ranking site, says Roetzer. It’s the real deal.
It’s trusted by some of the top players in AI, including Andrej Karpathy, a leading AI researcher at OpenAI. (In fact, Karpathy says it’s one of only two evaluation sites he trusts.)
The human evaluation component of Chatbot Arena works by having you pit two models against each other for the same prompt. (Hence the name.)
For instance, you can give Bard (powered by Gemini Pro) and GPT-4 the same prompt, get two different outputs, and rate which one is best.
When pitted against several versions of GPT-4, Bard comes out the winner. However, it still falls short when matched against GPT-4 Turbo, the latest version of OpenAI’s most advanced model.
Not to mention, Gemini Pro, which has powered Bard since a December 2023 update, isn't even the most powerful of Google's new models.
That would be Gemini Ultra, the top model in the Gemini family, which Google plans to incorporate into its services and AI tools moving forward. That means an even bigger leap may be coming.
This doesn’t mean you should drop all your other tools and switch to Bard, says Roetzer.
AI tools improve at an insanely fast pace. As Bard shows us, a tool that was lagging behind can quickly become a leader, almost overnight.
“This is why it is so hard to make bets on which platform to use and which ones to integrate into your workflows,” says Roetzer. “Because they keep evolving as to which is best for which use cases.”
“You have to constantly be testing different tools.”
Roetzer recommends having one or more team members test different tools against your core AI use cases (blog writing, summarization, script writing, etc.) every 30-90 days—or whenever the leaderboards see a significant change.
“Go in and run those use case tests against the different systems and see if someone has made a leap forward that changes the kind of technology the rest of your team should be using.”