Hey Open AI Sea visitor! There’s a big test called “AgentBench” that checks which AI program is the best at getting different tasks done for people. Think of it like superheroes competing to be the best sidekick!

The test has all sorts of challenges, like using a computer, playing digital card games, solving tricky puzzles, and even shopping online. The champ is a program called GPT-4, which rocks almost every part of this test. The one exception is online shopping, where GPT-3.5 shines.

AgentBench Research Report on LLMs

A bunch of other programs tried the test too, but GPT-4 was the superstar. Another program called Claude did pretty well too. But most other programs couldn’t keep up, especially the free, open-source ones.

The AgentBench research team is also creating a cool toolkit and some datasets. They’re putting them on a website called GitHub so other researchers can use them and run more extensive performance comparisons.

