The rise of inference

Since OpenAI released o1 on September 12th, “test-time compute” has been the talk of the town. It points to one observation: LLMs no longer scale only with data, energy, and compute; they also scale with “reasoning time,” the time a model spends working on a problem before responding.

AI researchers are spending more and more time optimizing model performance after training, and Anthropic and others have released tooling that lets LLMs operate computer interfaces. In short, everyone in startups and academia wants to bring about the age of agents.


As the agentic and AI application market grows, so does the demand for inference. Combined with the longer reasoning times of test-time compute, this suggests that the majority of compute power is shifting from training to inference. The more AI applications exist, and the more complex they become, the more inference we need.

As Jensen Huang has said, “[Inference] is about to go up by a billion times.”

The rise of inference provides renewed optimism for the companies founded to meet those needs. 

Enter Deep Infra

Deep Infra was founded in September of 2022 (fortuitous timing, eh?) to build an AI inference cloud and host popular open-source models behind a simple API.

The equation for Deep Infra goes like this: Take brilliant programmers who win international coding competitions, add in experience building systems for 200M+ users, and you get the perfect background for building an AI infrastructure company.

The three co-founders—Nikola Borisov, Georgios Papoutsis, and Yessenzhar Kanapin—are the programming equivalent of Olympic gold medalists. Nikola won top honors in programming competitions across Eastern Europe in high school, while other members of the Deep Infra team have won puzzle-solving and mathematics competitions.

The team met while running the backend infrastructure for imo, a messaging app with over 200M users and billions of messages sent daily. Outside of WhatsApp, it was one of the largest messaging platforms in the world.

At imo, they came to one primary conclusion: the cloud was extremely expensive compared to the cost of the infrastructure you can build yourself.

Deep Infra’s approach

On the surface, Deep Infra’s product is a simple, affordable, high-quality API for running inference on popular open-source models.

Customers sign up for a Deep Infra account, choose a model to run, and get an API to integrate into their applications. Behind the scenes, Deep Infra owns and operates the hardware and optimizes the models using a host of techniques.

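Integration, in practice, is a few lines of code. Here’s a minimal sketch assuming Deep Infra’s OpenAI-compatible endpoint; the base URL, model name, and environment variable are illustrative, not details taken from this article:

```python
# Minimal sketch of integrating a Deep Infra-hosted model via the
# OpenAI-compatible API. Base URL, model name, and env var are
# illustrative assumptions.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPINFRA_API_KEY"],         # hypothetical env var
    base_url="https://api.deepinfra.com/v1/openai",  # assumed endpoint
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",     # example hosted model
    messages=[
        {"role": "user", "content": "Explain test-time compute in one sentence."}
    ],
)
print(response.choices[0].message.content)
```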

The approach is comparable to that of a cloud provider: own and manage hardware, optimize software, and rent it out as a service. 

Here’s why this kind of service needs to exist:

The default option for integrating LLMs into applications is an API from a foundation model provider. But what if you want to fine-tune models, host your own models, or optimize cost or latency for a specific use case? Broadly, you have two alternatives:

  1. You could buy hardware, install it in a colocation data center, configure the GPUs, optimize your software for the hardware, and then handle the operational work of keeping it all running smoothly. 
  2. You could rent GPUs from a cloud provider, but you may run into the same cost and latency problems you had with the foundation model APIs. 

Inference providers like Deep Infra mean companies don't need to buy or rent GPUs, optimize models, or manage infrastructure. The provider abstracts away the challenges of running hardware and focuses on one variable: the most reliable and affordable inference possible. Companies, in turn, just integrate the AI models via a simple API.

There is no shortage of options for inference services: hyperscalers, other inference providers, hardware providers offering APIs, and foundation model companies themselves. The challenge is how a company like Deep Infra can stand out as the best choice.

Market context & differentiation

Nikola shared that the bigger competition isn’t other inference providers but the approach itself: if the inference provider model works, there will be several winners. The more significant threat comes from the foundation model companies, which all offer APIs, and from the hyperscalers, which have the resources to compete in this market. However, foundation model companies have less flexibility in optimizing inference performance.

Within the current inference provider landscape, many providers own or rent hardware, optimize it, and offer an API for that compute. Some of those providers include Replicate, Modal, Baseten, Fireworks AI, and OctoAI (recently acquired by Nvidia).

Deep Infra’s differentiation stems from the team’s knowledge of building large-scale infrastructure for hundreds of millions of users. They’re making opinionated architectural decisions to provide the best combination of variables that customers care about. 

Nikola shared that customers select providers based on four variables: quality, price, speed, and reliability. Deep Infra initially optimized to be the lowest-cost inference provider on the market:

Top inference provider pricing comparison from Artificial Analysis

The logic here is simple: models carry inherent latency, so providers can only optimize it to a certain extent; some of it will always be there. If Deep Infra can produce high-quality, highly reliable inference at a low cost, it will have its place in the market. 

They’ve also made opinionated architectural bets on custom models and fine-tuning. As Nikola told us, 

“We’re allowing people to run their custom models, and that’s becoming increasingly important. We’ve launched support for fine-tuned models and LoRA support for text and image models. Customers can bring their fine-tuned versions and run them with us at competitive prices. People want LoRAs because you don’t have to dedicate a full set of GPUs to run them. You can serve different versions of the model to different customers, letting the upfront cost of the model remain low.”
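To see why LoRAs don’t require a dedicated set of GPUs, here’s a toy sketch of the core idea (ours, not Deep Infra’s implementation): one shared base weight matrix serves every customer, while each customer adds only a tiny pair of low-rank matrices applied on the fly.

```python
# Illustrative sketch of multi-LoRA serving economics: one shared base
# weight matrix, plus a small low-rank adapter (A, B) per customer.
import numpy as np

d_in, d_out, rank = 4096, 4096, 16
W_base = np.random.randn(d_in, d_out).astype(np.float32)  # shared by all customers

# Each adapter stores only d_in*rank + rank*d_out parameters
# (~0.8% of the base matrix at these sizes).
adapters = {
    "customer_a": (np.random.randn(d_in, rank).astype(np.float32) * 0.01,
                   np.random.randn(rank, d_out).astype(np.float32) * 0.01),
    "customer_b": (np.random.randn(d_in, rank).astype(np.float32) * 0.01,
                   np.random.randn(rank, d_out).astype(np.float32) * 0.01),
}

def forward(x: np.ndarray, customer: str) -> np.ndarray:
    """Base projection plus the customer's low-rank correction: x @ (W + A @ B)."""
    A, B = adapters[customer]
    return x @ W_base + (x @ A) @ B  # never materializes a per-customer W

x = np.random.randn(1, d_in).astype(np.float32)
print(forward(x, "customer_a").shape)  # (1, 4096)
```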

Their goal is to make customized models as simple to run as possible. As Ilya Sutskever shared at NeurIPS in December, “We’ve achieved peak data, and there’ll be no more…We have to deal with the data that we have. There’s only one internet.”

The path to model improvement runs through specialization. Nikola explained that models are increasingly generic, and it’s becoming harder to differentiate through processes like distillation. By taking a small model and teaching it a specific task, companies can make it far more performant for that task. 

Deep Infra wants to provide the infrastructure to make this as simple and easy as possible. 

Deep Infra’s vision

Deep Infra wants to become the CDN of the LLM Age. In the internet era, companies and websites needed CDNs to reliably distribute content to users anywhere in the world. 

In the LLM era, companies will need an orchestrator of inference. They’ll need a reliable way to run their models and distribute them to end users all over the world. They’ll need to run their own fine-tuned models. As companies build more and more AI applications, they’ll need a company to “distribute” those applications to their end users. 

Deep Infra wants to be that company.

Deep Infra will take open-source models (and may distribute them for the model providers as well!), scale all the infrastructure, and provide customers with a simple API. 

The difference from CDNs is that inference providers won’t be as decentralized. Nikola told us:

“For inference, it won’t be as decentralized as CDNs. It will likely end up being one large infrastructure space per continent. The latency is already significant enough that you don’t need the last mile of putting GPU equipment in every city. There will be a small set of applications that might need ultra-low latency. However, the latency of the model itself is already high enough to negate that need.”
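A back-of-envelope check makes the point; the numbers below are illustrative assumptions, not Deep Infra measurements:

```python
# Back-of-envelope: model generation time vs. network round trip.
# All numbers are assumptions for illustration.
tokens = 500              # assumed length of a typical response
tokens_per_second = 50.0  # assumed decode speed for one request
rtt_seconds = 0.080       # assumed same-continent round trip (~80 ms)

generation_seconds = tokens / tokens_per_second   # 10.0 s of generation
network_share = rtt_seconds / (generation_seconds + rtt_seconds)
print(f"network share of total latency: {network_share:.1%}")  # ~0.8%
```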

For Deep Infra to achieve their vision, a few things need to happen:

  1. Increased inference demand, as AI applications continue to solve new and challenging problems.
  2. A continued supply of open-source AI models that are competitive with the top closed-source models.
  3. Continued technical innovation that helps Deep Infra stand out in a competitive market. 

We’re in a golden era: as the capabilities of foundation models increase, the cost of inference is falling (by roughly 10x annually on average!). OpenAI’s most advanced model shatters previous state-of-the-art benchmarks; the only caveat is that it costs over $1,000 per task:

OpenAI reasoning models on ARC-AGI Semi-Private evaluation benchmark
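To make the economics concrete, take the two figures above, roughly $1,000 per task today and costs falling about 10x per year, and extrapolate (a toy projection, not a forecast):

```python
# Toy projection combining the article's figures: $1,000 per task today,
# inference cost falling ~10x per year. Illustrative only.
cost = 1000.0
for year in range(1, 4):
    cost /= 10
    print(f"year {year}: ${cost:,.2f} per task")
# year 1: $100.00 | year 2: $10.00 | year 3: $1.00
```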

As inference becomes cheaper, these models become economically viable for real-world applications like agents. And as these agents transition from promise to reality, Deep Infra will be here to support that promise.