In The Arena

How Anastasios Angelopoulos and Wei-Lin Chiang Built AI's Most Trusted Evaluation Platform

Matt Quinn

Arena Co-Founders Anastasios Angelopoulos and Wei-Lin Chiang

It was around spring 2024 when friends noticed that PhD student Wei-Lin Chiang had begun carrying his laptop everywhere. Suddenly, it was his constant companion at every meal, gathering, and errand.

“He wanted to make sure that if something broke, he could get it back online,” says his research partner, Anastasios Angelopoulos. “Growing the platform was really a labor of love for him.”

While PhD students at the University of California, Berkeley, the pair had created Arena (then known as “Chatbot Arena”) as part of a research project to rank chatbots based on human preference. On the site, a user would enter a prompt, receive responses from two anonymous models, and pick the response they liked better; only then were the bots’ identities revealed. That raw, crowdsourced data then fed into a leaderboard that ranked publicly available models.
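
The article does not spell out how votes become rankings. As a rough illustration only, the sketch below assumes a simple Elo-style update, which is one common way to turn pairwise preferences into scores and is not necessarily the method Arena uses. The function name `elo_leaderboard`, the model names, and the parameters `k` and `base_rating` are all hypothetical.

```python
from collections import defaultdict


def elo_leaderboard(votes, k=32.0, base_rating=1000.0):
    """Rank models from pairwise votes with a simple online Elo update.

    votes: iterable of (model_a, model_b, winner), winner in {"a", "b", "tie"}.
    This is an assumed, illustrative scheme, not Arena's actual method.
    """
    ratings = defaultdict(lambda: base_rating)
    outcome_to_score = {"a": 1.0, "b": 0.0, "tie": 0.5}

    for model_a, model_b, winner in votes:
        rating_a, rating_b = ratings[model_a], ratings[model_b]
        # Probability that model_a wins, given the current ratings.
        expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
        score_a = outcome_to_score[winner]
        # The two updates are equal and opposite, so total rating is conserved.
        delta = k * (score_a - expected_a)
        ratings[model_a] = rating_a + delta
        ratings[model_b] = rating_b - delta

    # Highest rating first.
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


# Hypothetical votes: each tuple is one anonymous head-to-head comparison.
votes = [
    ("model-x", "model-y", "a"),
    ("model-y", "model-z", "a"),
    ("model-x", "model-z", "a"),
    ("model-z", "model-x", "tie"),
]
leaderboard = elo_leaderboard(votes)
for name, rating in leaderboard:
    print(f"{name}: {rating:.1f}")
```

Each vote nudges one model up and the other down by the same amount, so a model climbs the ranking by winning more head-to-head comparisons than its current rating would predict.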

The project’s timing was auspicious. After years of development in academic and frontier labs, large language models were delivering on AI’s hype. Prompts and queries produced human-like output and bots could (mostly) follow along in conversations. Competition to claim the “best” or “SOTA” model was heating up among frontier labs, giving Chatbot Arena users no shortage of models to test and vote on.

The research portion of the project had ended, but the site had taken on a life of its own. The users kept coming back. Tens of thousands of people a month from around the world used it to try out different models. Frontier labs had taken notice and begun to test models on the site before their public release. Site traffic kept growing.

Wei-Lin and Anastasios found their lives suddenly changing to accommodate the project’s needs. They worked such long hours that they moved in together. Vacations were out of the question. It became clear they had a decision to make.

“I pivoted entirely from my previous research to just doing this Arena project,” says Wei-Lin. “We were at a very critical moment. We had to decide, is this just a research project we wind down? Or is this something more?”

An academic pursuit becomes a market

Wei-Lin had never intended to build AI products, let alone evaluate the underlying technology. Under Ion Stoica, his advisor and the director of UC Berkeley’s Sky Computing Lab, his previous work at Berkeley had centered on systems, including a project called SkyPilot and another around reinforcement learning for optimizing database systems. But when OpenAI released its public preview of ChatGPT in late 2022, the floodgates opened. New models, breakthroughs, and, most importantly, opportunity emerged.

“It completely changed how I think about the future and how AI could impact the real world,” says Wei-Lin, who was in the third year of his PhD work when ChatGPT made its debut. “The question for me became, ‘How can I do research that’s relevant to this?’”

To figure that out, Wei-Lin first needed to understand the technology’s capabilities and limitations. Over the course of about four months, Wei-Lin and a group of fellow students created Vicuna, an open source chatbot built by training Meta AI’s LLaMA on conversations that users had with ChatGPT and then shared on a public forum.

“We were just a bunch of PhD students working together as a weekend project, building a prototype,” Wei-Lin says.

But Wei-Lin and his fellow students weren’t the only ones building their own models. They wanted a way to prove that the model they’d built performed better than others, especially one in particular.

“Stanford actually had their own version of this open source chatbot, and we wanted to prove that ours is better – basically that Berkeley is better than Stanford,” says Wei-Lin, laughing.

Original Chatbot Arena research team
Chatbot Arena: small team, big ideas

For the initial work, the team fed queries and responses from the chatbots they were assessing into ChatGPT and let it decide which was best. They hoped to leverage the advancements in OpenAI’s model to automate the evaluation process. The method demonstrated some potential, but it lacked sufficient rigor. As the team wrote in a blog post at the time, “Building an evaluation system for chatbots remains an open question requiring further research.”

As AI technology has advanced, the demand for reliable evaluation and benchmarking tools has skyrocketed. Many benchmarks are static: multiple-choice questions or tasks with predefined answers. These tools are invaluable to the researchers building AI systems, and benchmarking scores unquestionably offer an important signal to users. But they’re also easy to game, incentivizing teams to build for benchmark scores that demonstrate a model’s mastery of certain subjects and abilities, while ignoring real-world applications. It’s the AI version of teaching to the test.

As increasingly capable chatbots flooded the market, Wei-Lin was struck by the shortcomings of existing evaluation methods. With Ion advising him, Wei-Lin enlisted his fellow students to compare how an LLM’s evaluation choices stacked up against human preferences. This became the first version of Chatbot Arena. The project took off: in just one month, Chatbot Arena logged 30,000 votes, well beyond the group’s expectations. The research produced a paper, but more importantly, a spark. Wei-Lin wanted to keep going.

Chatbot Arena research team eating ramen
Grad school is temporary; ramen is forever

A shared fixation on seeking truth

The problem of how to measure the real-world performance of AI was also on Anastasios’s mind after ChatGPT debuted. At the time, his PhD research focused on how to use statistical and mathematical methods to make machine learning models more reliable.

Anastasios knew AI would change the world, and he wanted to help make it more reliable. “That's why evaluation was a good fit for me. Evaluation is all about, how do you measure things properly, and how do you make statistically valid claims about model performance that are really supported by evidence?” he says.

His advisor, UC Berkeley statistical machine learning luminary Michael I. Jordan, connected him with Wei-Lin and Ion as they plotted next steps for Chatbot Arena. Foundations for the research were in place when Anastasios joined with relatively modest expectations.

“Wei-Lin and the rest of the team working on the project were all super smart, but the platform was not close to what it is now, in any sense,” he recalls. “It was like an early-stage research project. At the end of the day, we thought it was going to be a paper, not a company.”

Wei-Lin Chiang and Anastasios Angelopoulos
Built on trust: from research partners to co-founders

After the Chatbot Arena paper was published in March 2024, many of the students involved moved on and returned to their own research. But the work had become something more to Wei-Lin and Anastasios. They didn’t feel done with the project; on the contrary, it was taking up more and more of their lives. Their partnership had deepened, too: they realized they shared a fixation on seeking truth. And traffic to the site just kept growing. They knew they had a decision to make.

Measuring and advancing the AI frontier

Around the time that Wei-Lin began bringing his laptop everywhere, OpenAI researchers brought them a mystery model to test on the site. Later, it was revealed as a prerelease version of GPT-4o. The impact was instant.

“GPT-4o released and our traffic basically went 10X,” says Anastasios. “And then a lot of those users stayed.”

OpenAI debuting a new model on Chatbot Arena sent a clear message to the team: the leaderboard had found a place in the burgeoning AI ecosystem. Something had fundamentally changed. The scope of the project had expanded, and so had their ambitions. “That was a huge deal to us. It wasn’t just a research project anymore,” says Wei-Lin. “They actually cared about us. They cared about the leaderboard.”

Beyond the interest from frontier labs, Wei-Lin and Anastasios believed that, as academics, they could build something that users could trust, even as their platform faced scrutiny that intensified along with its popularity. “I believed there was something unique we could do at the lab,” Wei-Lin says. “We were in a very different position than the frontier labs. We could be an academic voice for measuring models for the community. That sounded durable to me.”

“The fact that our leaderboard is an open source methodology means that anybody on the internet can go look and see how the leaderboard is calculated,” says Anastasios. “We have probably open-sourced more organic conversation data through Arena than any other benchmark in the world. So when people ask about the leaderboard and provide feedback, we're always open to it, and we hope to earn trust with the community through that transparency.”

Ion, with deep founder experience from Databricks and Anyscale, also felt the shift that came with the attention from the labs and influential tech figures like Andrej Karpathy, Jeff Dean, Elon Musk, and Sundar Pichai. He continued to work closely with Wei-Lin and Anastasios, noting their “conviction, drive, and a great balance between technical depth and ability to inspire and lead the community.”

In September 2024, Anastasios and Wei-Lin migrated Chatbot Arena away from the Berkeley-backed site it had been on to its own home, lmarena.ai, and changed its name to “LMArena.” The move gave them increased independence and set the stage to expand leaderboard rankings beyond chatbots. Image Arena, Search Arena, Video Arena, WebDev Arena, and others were soon added.

As Anastasios, Wei-Lin, and Ion debated what came next for LMArena, maintaining neutrality and scientific integrity were non-negotiable. But ultimately, they wanted to create something that could provide real value by evaluating the reliability of any AI model.

“We felt that if we didn't start a company, we would be resource-starved,” says Anastasios. “We would not be able to recruit the best people and gather the financial resources needed in order to support a platform that really did the best evaluations at that scale without starting a company. And so it became existential. We either do this or the forum likely won't survive.” So in April 2025, they incorporated LMArena.

Side by Side Mode

Under LMArena’s new structure, Anastasios was named CEO, Wei-Lin became CTO, and Ion assumed the role of chairman.

LMArena has continued to move as fast as the technology it measures. The company has already been through a corporate name change (it is now “Arena”) and two fundraising rounds (Felicis participated in the seed round and led the Series A). Last fall, Arena launched its first commercial product, AI Evaluations, a service for enterprises, model labs, and developers to understand how their products perform in the real world.

Moving from the lab to revenue so quickly is a reflection of the position Arena has already cemented in the AI value chain. As Felicis General Partner and Arena board observer Peter Deng points out, consumers use Arena because it gives them free access to the newest and most powerful models, often before broad release, and the ability to evaluate frontier AI directly. The model labs gain high-quality, real-world evaluation data. “As a result, Arena has become the backbone of how the world understands and advances AI,” says Deng.

Ion has been instrumental in helping the two founders find their footing, even serving as Arena’s interim head of machine learning. “Ion treats us as equals,” says Anastasios. “He treats us like our opinions really matter, he listens very carefully, and he will contribute his thoughts and we'll have debates and discussions and arguments—in a good way.”

And while Anastasios and Wei-Lin are no longer roommates, their friendship has only strengthened. They love having dinner together, along with their partners, especially over Wei-Lin’s “world famous” hot pot.

The most important part of their relationship is the trust they’ve formed. “We spend every day talking to each other about some of the most important career decisions and financial decisions of our lives, so we’d better trust one another,” says Anastasios. “Part of that is having no ego. We basically have to never put ourselves before the company.”

The founders find themselves in a slightly surreal position. Anastasios had imagined he would strike out on his own one day, but that wasn’t the plan when he joined Wei-Lin’s project just a couple of years ago.

“If not for this, I would have been a professor right now,” he says. “I still hope to keep my academic roots, but right now I’m just on an unexpected and happy path.”

That path was made possible by the overabundance of AI models that continue to emerge. As AI gets more powerful and spreads into more facets of life and the economy, ensuring its reliability will only become more critical. With that in mind, Anastasios, Wei-Lin, and Ion will grow and shape Arena with the aim of making it into the world's evaluation infrastructure.

“AI is the most important technology of our lifetimes, and potentially one of the most important technologies ever, in the same category as agriculture and electricity and the internet,” says Anastasios. “We really have a responsibility to make sure AI develops in a reliable way that's aligned with human values. And you can't understand or improve what you can't measure. That’s where Arena sits.”

Authors

  • Matt Quinn

    Matt Quinn is a freelance journalist living in the Bay Area.
