June 24, 2024

The rise of AI data infrastructure

“We’re at the beginning of a new Industrial Revolution. But instead of generating electricity, we’re generating intelligence… [Open source] activated every single company. Made it possible for every company to be an AI company.”
— Jensen Huang, CEO of NVIDIA

Introduction

We admire founders with bold visions for transforming our world.

Currently, teams across the stack are building the infrastructure to usher in the intelligence revolution. We’ve seen value accrue across semiconductors, data centers, and cloud providers, and we believe the next area of infrastructure investment will be the data infrastructure layer that brings custom AI applications to life.

The need to extract information from documents isn’t new, but now, with GenAI, we have applications that need this fuel, and a lot of it. High-quality data is necessary for both training and inference, and companies need a way to acquire it. It’s not just the scale of data that’s changing. The kinds of data we’re working with are also evolving beyond text and tabular formats to video, images, and audio. We’re also seeing the growth of spatial data like satellite imagery and robot sensor data.

But let’s not lose sight of the question: What net new areas in this data layer have the most immediate opportunity to be re-invented due to AI?

We’re seeing innovation across the data landscape in unstructured data extraction and pipelining, retrieval-augmented generation (RAG), data curation, data storage, and AI memory.

Our goal for this article is to break down the AI data infra landscape, share the trends we’re seeing, and discuss the most promising areas of innovation.

First, it’s important to have some background on the data infra landscape.

The AI data infra landscape

When making this graphic, we wanted to (as simply as possible) show the flow of data across the AI value chain, including the flow of data for training and inference. 

When we look at the data infra value chain, we see six areas:

  1. Sources
  2. Ingestion & Transformation
  3. Storage
  4. Training
  5. Inference
  6. Data services

We visualized these segments here:

AI Data Infrastructure Value Chain: This graphic outlines the AI data value chain, segmented into six key areas: Sources, Ingestion & Transformation, Storage, Training, Inference, and Data Services.
  1. Sources: Apps (Salesforce, ServiceNow), OLTP databases (Oracle, MongoDB), Synthetic Data (Mostly AI, Datagen, Tonic), Web Data (Browse AI, Apify)
  2. Ingestion & Transformation: Streaming (Kafka, Confluent), Processing (Flink), Orchestration (Astronomer, Dagster, Prefect, Airflow), Labeling (Labelbox, Scale), Extract/Load (Matillion, Fivetran), Transform (dbt, Coalesce), Unstructured Data Pipelines (Datavolo, Unstructured, LlamaIndex)
  3. Storage: Data Lake (Databricks, Onehouse, Tabular, Amazon S3, Google Cloud Storage, Microsoft Azure Blob Storage), Vector Database (Pinecone, Chroma, Milvus, Weaviate, Supabase)
  4. Training: Training (TensorFlow, Modular), Evaluation (Neptune.ai, Weights & Biases), MLOps (Databricks, H2O.ai, Dataiku, Domino), Model (OpenAI, Cohere, Mistral AI, Runway, Anthropic)
  5. Inference: Tooling (Anon, B, E2B, Contextual AI, LangChain, Databricks), Memory (MemGPT, Cognee.ai), Agent/App (Character.ai, AI, Harvey, NormAI), RAG Framework (LangChain, LlamaIndex, contextual.ai, Databricks)
  6. Data Services: Data Security (Rubrik, Dig Security, Eureka, Cyera, Imperva, Sentra, Varonis, BigID), Data Catalog/Lineage/Discovery (Atlan, Alation, Collibra, Informatica, data.world), Data Quality/Observability (Anomalo, Datology, Observe, Cleanlab, Scale, Metaplane, Monte Carlo)
This map is meant to be a mental model and not exhaustive. Companies may work across areas, but we tried to limit to one per category as much as possible.

Some context on each segment:

Sources:

Data sources and types vary across use cases. Traditionally, a company's business data is stored primarily in business applications like Salesforce, while transactional data sits in PostgreSQL or Oracle databases. Some use cases also require pulling data in real time for analysis, such as sensor, manufacturing, and some healthcare data, which we broadly describe as “real-time” data.

For AI specifically, we’ve seen synthetic data and web data grow in importance. Synthetic data is artificially generated rather than collected from real-world events. It offers an alternative that is significantly cheaper than acquiring, cleaning, and labeling real-world data while maintaining data compliance. Its use has increased across ML training, though anecdotally we’ve heard that synthetic data alone doesn’t optimize model performance because it doesn’t represent statistical outliers well. While training datasets aren’t composed solely of synthetic data for this reason, synthetic data has become more mainstream with NVIDIA’s recent announcement of Nemotron-4 340B, a family of open models developers can use to generate synthetic data for training LLMs.

Web data provides access to any public data for training or fine-tuning models. Web scraping is not a new concept, but what is new is the sheer volume of scraping necessary to gather enough quality data to train large models. To put this in perspective, a study from Epoch AI projects that tech companies will exhaust the publicly available training data for AI language models sometime between 2026 and 2032. Web-scraped data has been central to large foundation models’ training datasets.

Data ingestion and transformation:

After choosing sources, companies will need to ingest the data, transform it, and move it to a destination to leverage it. 

The overall goal of data pipelines is simple: get data from source to destination in a format that is easy to analyze or act on. Traditionally, in the data engineering world, this is ETL or ELT. In the ML world, where most data is tabular, it is called feature engineering or feature pipelines. With GenAI, we need to extract, parse, and prepare unstructured data. We’ll refer to these holistically as data pipelines. Again, data pipelines are a decades-old technology; what’s new is the variety and scale of data that needs to be transferred.

Data pipelines traditionally fall into two buckets: batch (extract and load in specific intervals) and streaming (loading data as it becomes available). However, a new pipeline category has emerged for unstructured data processing; these pipelines offer end-to-end workflows from unstructured data to storage. 

Transformations are pipeline-dependent. Batch pipelines typically use a tool like dbt; we have come across teams that use dbt to create ML features, and this still works well for structured data. Streaming pipelines use message queues to ingest data and compute engines like Flink to run transformations on that data.

Orchestrators like Airflow manage these workflows for scheduling, execution, and organization.
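
For illustration, here is a minimal sketch of an orchestrated batch (ELT) workflow in Airflow. This assumes Airflow 2.4+ and uses placeholder task functions rather than real sources or destinations:

```python
# Minimal sketch of a daily batch ELT workflow orchestrated by Airflow.
# The task bodies are placeholders; a real pipeline would pull from its own
# sources, land data in a lake/warehouse, and trigger transformations (e.g. dbt).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    """Pull raw records from a source system (placeholder)."""


def load():
    """Land the raw data in object storage or a warehouse (placeholder)."""


def transform():
    """Run transformations on the loaded data, e.g. kick off dbt (placeholder)."""


with DAG(
    dag_id="daily_elt_example",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",   # batch: run on a fixed interval
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    # Airflow handles scheduling, retries, and ordering of these steps.
    extract_task >> load_task >> transform_task
```

Streaming pipelines replace the scheduled extract with a continuously running consumer (for example, reading from Kafka) and push transformations into an engine like Flink.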

For training workloads, data may then be filtered and labeled. Data labeling assigns labels or context to data so ML models can learn from those labels. Supervised learning needs properly labeled training data so the model can learn what is “right” and “wrong.” Acquiring labeled data at scale is challenging and has led to the ascendance of prominent startups like Scale AI and Labelbox. Open-source options include CVAT, LabelMe, and Stanford CoreNLP.

Storage:

Traditionally, analytical data is stored in data warehouses. But increasingly, data is stored in a data lake and queried using the lakehouse architecture of open tables, catalogs, and query engines. For AI workloads that use unstructured data, data is typically stored as embeddings in vector databases.

Model training:

AI models use three main types of training: supervised, unsupervised, and reinforcement learning. In supervised learning, the model is given labeled data and learns to output results matching those labels. In unsupervised learning, the model is shown large amounts of unlabeled data and learns relationships independently.

For large language models, “pre-training” typically consists of unsupervised learning, allowing the model to recognize patterns in the dataset. Then, the model is trained using supervised learning to optimize its performance. Custom ML models are typically trained using supervised learning. 

Next, many models undergo reinforcement learning from human feedback (RLHF). As the name suggests, the model generates output and receives feedback from humans on how to improve it.

Throughout this process, the model is continually evaluated to see how well it performs. Evaluation looks at metrics such as accuracy, precision, and loss, and checks for overfitting and underfitting, along with other statistics specific to the model’s use case.

Finally, the models will undergo various final steps, including security testing, governance, and auditing, to ensure they generate safe user outputs and don’t have security or compliance issues.

Model inference:

For LLM inference, a model receives a prompt, which is tokenized and converted into embeddings; running those prompt tokens through the model is called prefill. The model then generates output tokens for the user one at a time, which is called decoding.
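
As a toy illustration of this tokenize-prefill-decode flow, here is a sketch using the Hugging Face transformers library with GPT-2 as a small stand-in model; production LLM serving runs the same phases with heavily optimized, batched infrastructure:

```python
# Toy sketch of LLM inference: tokenize a prompt, run it through the model
# (prefill), and generate output tokens autoregressively (decode).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Prefill: the prompt is tokenized and processed through the model to build context.
inputs = tokenizer("AI data infrastructure matters because", return_tensors="pt")

# Decode: the model generates new tokens one at a time from that context.
output_ids = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```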

When the LLMs require personalization, this process gets more interesting. As mentioned, a company may store data in a vector database and connect it to an LLM using an LLM customization platform. When a user inputs a prompt into their app, data will also be pulled from the company’s vector database to generate a unique answer using the LLM. A similar architecture can be used for an AI agent to have the context of a company or user’s environment and take action on the users’ behalf. 

Data must be tracked and managed throughout this process to ensure data security, model quality, and compliance—ergo, data services.

Data services:

Data services is a broad category of tools that organize and secure data. AI demand is increasing the variety and scale of data, along with the tooling around data for applications. This creates challenges in managing that data, securing it, and ensuring governance practices are in place around it.

Data security traditionally involves securing access to data and ensuring it isn’t accessed or stolen by bad actors. Those principles still hold, just on a much larger scale, given the amount and importance of data today. Features like data security posture management, data access control, data loss prevention, and data detection and response are critical categories in data security companies today.

Data observability is the monitoring of data quality and performance across data pipelines. These tools detect anomalies, maintain visibility into data pipelines (schema changes, compute-heavy queries, critical objects), and track data movement. 

Finally, data catalogs are at the heart of data management. They centralize metadata, allowing a company to organize its data assets. From there, tools like observability, lineage, and discovery can access that metadata to provide insights.

Data security, observability, and management are closely linked; the more organized a company is with its data, the more successful it will be on all three fronts. 

Data re-invention due to AI

Within AI, we specifically see innovations across the following areas:

1. Unstructured pipelines for AI agents and apps

The most immediate area of re-invention we see in data infrastructure is the rise of unstructured data pipelines for AI applications. Teams want to use their internal unstructured data to power conversational AI and agent applications. 

These pipelines include similar steps to traditional data pipelines: extraction, transformation, indexing, and storage. Today, the most common unstructured data sources are text from PDFs, knowledge bases, and images, because they support conversational AI use cases. Often, teams have built their own parsers specific to their document types and are now looking for solutions that provide more accurate and reliable extraction. Transformation is where these products diverge from traditional pipelines. Transformation for unstructured data includes chunking (breaking up the data into small components), extracting metadata (for indexing), and embedding each chunk (so it can be stored as a vector). The chunking strategy and the embedding model can significantly impact retrieval accuracy.

In our research, we found that teams have tried many chunking strategies. We’ve also seen vertically specialized embedding models emerge that are trained on domain-specific data, like code or legal content. That data is then stored in a vector-compatible database. Several tools enable companies to get their data into a queryable format so they can personalize LLMs through RAG and agents.
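
To make the chunking, metadata, and embedding steps concrete, here is a minimal sketch of one such pipeline. The fixed-size chunker is just one of many strategies, and embed() is a placeholder for whichever embedding model or API a team chooses:

```python
# Minimal sketch of an unstructured-data pipeline step: chunk a document,
# attach metadata, and embed each chunk so it can be stored as a vector.
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str
    text: str
    position: int                      # metadata for indexing/citation
    embedding: list[float] | None = None


def chunk_text(doc_id: str, text: str, size: int = 500, overlap: int = 50) -> list[Chunk]:
    """Fixed-size character chunking with overlap (one of many strategies)."""
    chunks, start, pos = [], 0, 0
    while start < len(text):
        chunks.append(Chunk(doc_id=doc_id, text=text[start:start + size], position=pos))
        start += size - overlap
        pos += 1
    return chunks


def embed(texts: list[str]) -> list[list[float]]:
    """Placeholder: call an embedding model (hosted API or local model)."""
    raise NotImplementedError


def run_pipeline(doc_id: str, text: str) -> list[Chunk]:
    chunks = chunk_text(doc_id, text)
    vectors = embed([c.text for c in chunks])
    for chunk, vector in zip(chunks, vectors):
        chunk.embedding = vector       # ready to upsert into a vector database
    return chunks
```

Chunk size and overlap are the main knobs teams tune here, alongside the choice of embedding model.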

2. Retrieval-augmented generation (RAG)

Retrieval Augmented Generation (RAG) is an architectural workflow that can improve the efficacy of LLM applications by leveraging custom data. In RAG, data is loaded and prepared for queries or "indexed." Queries act on the index, which filters the data down to the most relevant context. This context and the query then go to the LLM along with a prompt, and the LLM provides a response. RAG enables data to be activated as part of a product experience.

Using RAG has many benefits. LLMs are limited to their pre-trained knowledge and data, which can lead to outdated or inaccurate responses. RAG overcomes this by granting LLMs access to external information sources for up-to-date answers. LLMs can also struggle with factual accuracy, and RAG helps address this by grounding responses in a curated knowledge base. RAG also lets models cite their sources, like footnotes, which builds end-user trust.
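
Here is a minimal sketch of that retrieve-then-generate loop. The in-memory numpy index stands in for a vector database, and embed_fn and llm_fn are placeholders for whatever embedding model and LLM a team uses:

```python
# Minimal sketch of the RAG loop: embed the query, retrieve the most relevant
# chunks from an index, and pass them to the LLM as context.
import numpy as np


def retrieve(query_vec: np.ndarray, index: np.ndarray, texts: list[str], k: int = 3) -> list[str]:
    """Cosine-similarity retrieval over a small in-memory index of shape (n, d)."""
    sims = index @ query_vec / (np.linalg.norm(index, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    top = np.argsort(-sims)[:k]
    return [texts[i] for i in top]


def answer(question: str, index: np.ndarray, texts: list[str], embed_fn, llm_fn) -> str:
    query_vec = embed_fn(question)                       # embed the query
    context = "\n\n".join(retrieve(query_vec, index, texts))
    prompt = (
        "Answer the question using only the context below, and cite the context you used.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm_fn(prompt)                                # generate the grounded response
```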

LlamaIndex diagram of how RAG works
Source: LlamaIndex

3. Data curation for training and inference improvements

Data curation is the filtering and organizing of a dataset for optimal training and inference performance. This process includes text classification, NSFW filters, deduplication, batch size optimization, and performance-based optimization of different sources. The last piece of curation is augmentation with synthetic data. 
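
As a toy illustration of two of these steps, here is a sketch of heuristic filtering and exact deduplication; real curation pipelines (like Meta’s, quoted below) also use semantic deduplication and learned quality classifiers:

```python
# Toy sketch of two curation steps: heuristic quality filtering and exact
# deduplication. The thresholds here are illustrative, not tuned values.
import hashlib


def passes_heuristics(doc: str) -> bool:
    """Cheap quality filters, e.g. minimum length and a cap on symbol noise."""
    if len(doc.split()) < 20:
        return False
    alpha_ratio = sum(ch.isalpha() for ch in doc) / max(len(doc), 1)
    return alpha_ratio > 0.6


def deduplicate(docs: list[str]) -> list[str]:
    """Exact dedup by content hash (semantic dedup would compare embeddings)."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def curate(docs: list[str]) -> list[str]:
    return deduplicate([d for d in docs if passes_heuristics(d)])
```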

Two quotes from Meta’s Llama-3 announcement give insight into our belief in data curation:

On training: “To train the best language model, the curation of a large, high-quality training dataset is paramount…To ensure Llama 3 is trained on data of the highest quality, we developed a series of data-filtering pipelines. These pipelines include using heuristic filters, NSFW filters, semantic deduplication approaches, and text classifiers to predict data quality.”

On fine-tuning: “Some of our biggest improvements in model quality came from carefully curating this data and performing multiple rounds of quality assurance on annotations provided by human annotators.”

Meta’s AI Research team published a paper last year detailing how data curation can speed up training times by up to 20% and improve downstream accuracy. Perhaps more importantly, the paper calls out a path toward model improvements as model companies run out of internet data for training.

All companies training and fine-tuning models will want access to automatic high-quality data filters, deduplication, and classifiers. One of the authors of that paper, Ari Morcos, founded DatologyAI in pursuit of this vision, and we’re excited to see him bring it to life.

4. Data storage for AI

Three trends are driving data storage for AI: vector stores, the rise of the data lake, and investment in the lakehouse. 

Vector databases have been one of the darlings of the AI boom. This is due to their ability to store embeddings: numerical representations of data, including unstructured data.

Here is some quick background on vector databases:

Vectors in mathematics have both magnitude and direction, making them suitable for representing values in space. In AI, vectors are numerical representations of data, enabling the conversion of unstructured data like images, audio, and video into meaningful numbers, which are stored in vector databases. Vector embeddings are created from this data for semantic retrieval of related terms, such as finding "wolf" or "puppy" when querying "dog."

Vector databases come in two forms: 1) native vector databases that are purpose-built and 2) existing databases that have added vector support. Vector databases grew in popularity because of their ability to personalize LLMs: a company can store its custom data as embeddings that can be retrieved for personalized experiences. AI agents can also leverage this architecture.
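
As a toy illustration of semantic retrieval, here is a sketch using the open-source chromadb client, which embeds documents with a built-in default model on first use; interfaces differ across vendors and versions:

```python
# Toy illustration of semantic retrieval with a vector database, mirroring the
# "dog"/"wolf"/"puppy" example above.
import chromadb

client = chromadb.Client()                        # in-memory instance
collection = client.create_collection(name="animals")

collection.add(
    ids=["1", "2", "3", "4"],
    documents=[
        "a wolf howling at night",
        "a playful puppy",
        "a stack of invoices",
        "a pet dog on a leash",
    ],
)

# A query for "dog" should surface the dog and puppy documents ahead of the
# unrelated invoice document.
results = collection.query(query_texts=["dog"], n_results=2)
print(results["documents"])
```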

Star history for several open source data projects

The other trend for AI data storage is the rise of the data lakehouse. Since most enterprises store large amounts of data in data lakes, building custom AI requires making use of that data. Data lakehouses provide an architecture for managing and querying data in the data lake, starting with organizing data using open table formats like Iceberg, Delta Lake, or Hudi. Databricks’ acquisition of Tabular was important because it brought together the creators of the two largest open table formats (Delta Lake and Iceberg) while making it harder for competitors to enter the space.

5. AI Memory

Since OpenAI added memory to ChatGPT, AI memory has become a hot topic. Standard AI systems lack robust episodic memory and continuity across distinct interactions; they essentially have amnesia. This isolated, short-term memory hinders complex sequential reasoning and knowledge sharing in multi-agent systems.

As we move to multi-agent systems, we need a robust system for managing memory across agents that also enforces access and privacy controls. Each agent's memories should be stored and accessible during and across sessions. More sophisticated memory mechanisms will also be needed, like pooling memories among agents, which improves decision-making because one agent can benefit from other agents’ experience. Memory storage will need to be hierarchical, based on access frequency, importance, and cost.

MemGPT is a leading open-source framework for memory management today, and their vision is for LLMs to act as the next evolution of operating systems. Their basic architecture is described as follows:

MemGPT’s OS-inspired multi-level memory architecture delineates between two primary memory types: main context (analogous to main memory/physical memory/RAM) and external context (analogous to disk memory/disk storage).
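
Here is a deliberately simplified sketch of that two-tier idea (not MemGPT’s actual implementation): a bounded main context that evicts older items to an external store, which can be searched and paged back in when relevant:

```python
# Simplified sketch of a two-tier memory: a bounded "main context" (analogous
# to RAM) that evicts older items to an "external context" (analogous to disk).
from collections import deque


class TwoTierMemory:
    def __init__(self, main_capacity: int = 20):
        self.main: deque[str] = deque()   # in-context memory placed in the prompt
        self.external: list[str] = []     # overflow store, searched on demand
        self.main_capacity = main_capacity

    def remember(self, item: str) -> None:
        self.main.append(item)
        while len(self.main) > self.main_capacity:
            self.external.append(self.main.popleft())   # evict oldest to external store

    def recall(self, query: str, k: int = 3) -> list[str]:
        """Naive keyword search over external memory; real systems use embeddings."""
        hits = [m for m in self.external if query.lower() in m.lower()]
        return hits[:k]

    def context_window(self) -> str:
        """What would be placed in the LLM prompt for the next turn."""
        return "\n".join(self.main)
```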

Innovations in memory will be critical to advancing AI applications, as they will help with personalization, learning, and reflection.

The opportunity in AI workloads

While not all aspects of data infrastructure have changed with the rise of GenAI, it has been exciting to see new technologies emerge across unstructured data extraction and pipelining, retrieval-augmented generation (RAG), data curation, data storage, and AI memory.

At Felicis, we've been committed to the future of the data and infra layer as it relates to AI. This is why we've invested in Datology (data curation), Metaplane (data observability), and MotherDuck (serverless data warehouses), among other related tools like Weights & Biases for experiment tracking.

There is a large and growing market for AI. From chatbots to multi-agent workflows, we are just at the beginning. Data solutions are critical to making these applications successful. Enormous data businesses will be built to support AI workloads. 

If you’re building in this space and interested in chatting, please contact data@felicis.com!