Unstructured data refers to information that does not follow a predefined data model or schema. It encompasses a variety of formats, such as text, images, audio, and video. The amount and use of unstructured data are only going to expand.

As enthusiasm for GenAI has increased, so too has the understanding that its success relies on strategically utilizing an organization’s unstructured data. Unstructured data powers training and fine-tuning GenAI models, Retrieval-Augmented Generation (RAG) search and AI agents, and contextual analytics. With the rise of GenAI use cases and the explosion of unstructured data itself, we believe unstructured data is entering a golden age and will be used considerably more in the future. Unstructured data has become a veritable gold mine of information and a significant focus for organizations seeking to leverage it for strategic advantage. Unstructured data’s attributes mean it doesn’t fit well within traditional data infrastructure stacks. A new unstructured data stack is emerging and consists of three crucial components: data extraction and ingestion, data processing, and data management. Each part plays a vital role in deriving value from unstructured data in the age of AI.

Figure: the Unstructured Data Landscape (source: Felicis; General Partner Astasia Myers). Data extraction & ingestion: LlamaIndex, Reducto, Tensorlake, Unstructured, Vectorize. Data processing: Daft, Modin, Pandas, Polars, Ray, Apache Spark. Data lakes: Amazon S3, Databricks, Onehouse, Tabular. Vector databases: Chroma, Pinecone, Supabase. Document databases: Amazon DynamoDB, MongoDB, Stately. Graph databases: FalkorDB, Neo4j, TigerGraph. File formats: LanceDB, Spiral, Meta’s Nimble.

Data extraction and ingestion

The first stage of the unstructured data stack is data extraction and ingestion. Teams must capture, extract, transform, and optimize the data for storage. This process involves identifying and capturing unstructured data from various sources, such as social media platforms, customer feedback, emails, and more. Techniques such as web scraping, API integration, and file parsing are commonly employed to facilitate this extraction. Teams will build their own extractors for particular data types or use pre-built extractors to achieve high extraction accuracy rates. Since text has been the main modality of LLMs, we have seen an explosion of document parsers, including Reducto, Tensorlake, Unstructured.io, LlamaParse, and Vectorize. Unlike the previous generation of Intelligent Document Processing (IDP) services that used Optical Character Recognition (OCR), these new solutions leverage vision models to improve parsing accuracy.
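
To make the parsing step concrete, here is a minimal sketch using pypdf, a plain text extractor; it is a stand-in for, and far simpler than, the vision-model parsers named above. The file name is hypothetical.

```python
from pypdf import PdfReader

def extract_pdf_text(path: str) -> list[dict]:
    """Return one record per page with the raw extracted text."""
    reader = PdfReader(path)
    return [
        {"source": path, "page": i, "text": page.extract_text() or ""}
        for i, page in enumerate(reader.pages)
    ]

pages = extract_pdf_text("report.pdf")  # "report.pdf" is a hypothetical input
```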

After extraction, to make the data usable for AI use cases like RAG, data must be partitioned into smaller, logical units by semantically meaningful context. The partitioned results are then written to a structured, machine-readable format like JSON, which makes the data usable for additional preprocessing like cleaning. Teams will chunk the document into segments and generate summaries of the chunks to improve retrieval performance. Teams that want to use unstructured data for RAG will generate embeddings, a process that uses ML models to represent text as numerical vectors. Embeddings allow text to be searched by semantic similarity. The specific chunking and embedding strategy can greatly impact retrieval performance, so solutions often enable testing of different approaches. Finally, the data is written to a destination storage system like an object storage data lake or database.
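
Here is a minimal sketch of the chunk, embed, and serialize flow under two stated assumptions: fixed-size overlapping chunks (real pipelines often split on semantically meaningful boundaries) and a placeholder `embed` function that returns random vectors where a real embedding model would be called.

```python
import json

import numpy as np

def chunk(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Naive fixed-size chunking with overlap; real pipelines often split
    on semantic boundaries such as sections or paragraphs instead."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(texts: list[str]) -> np.ndarray:
    """Placeholder for an embedding model call (hosted API or local model);
    returns one random 384-dim vector per chunk purely for illustration."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 384))

doc = "Quarterly revenue grew 12% year over year. " * 100  # sample text
chunks = chunk(doc)
vectors = embed(chunks)
records = [
    {"chunk_id": i, "text": c, "embedding": v.tolist()}
    for i, (c, v) in enumerate(zip(chunks, vectors))
]
print(json.dumps(records[0])[:120])  # structured, machine-readable output
```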

Buyers we spoke with cited a number of considerations for data extraction and ingestion solutions.

  • First, accuracy of extraction is the top factor: missing or inaccurate data undermines the utility of everything built on top of it. 
  • Second, extraction speed matters when real-time decisions depend on high volumes of unstructured data. 
  • Third, support for multimodal data extraction and transformation. While the vast majority of teams are focused on leveraging text data, a number were seeking a unified solution that also supports images, video, and audio. 

Interestingly, customers said they had a higher willingness to pay for unstructured data pipelines than structured data pipelines because the extraction process is more complicated and challenging. 

Data processing

Once the data is ingested, it moves to the data processing stage, where it can be further transformed into a more usable format and analyzed. This stage can involve additional data cleansing and normalization, ensuring the data is accurate and consistent. Data frame libraries and data processing engines are leveraged to complete the work. Use cases include offline data exploration, preparing data for AI training, analytics, and data loading. By applying advanced processing techniques, organizations can convert vast amounts of unstructured data into structured insights that are easily interpretable.
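
As a rough illustration of the cleansing and normalization step with a data frame library, here is a pandas sketch; the records and column names are hypothetical.

```python
import pandas as pd

# Hypothetical chunk records arriving from the ingestion stage.
df = pd.DataFrame([
    {"chunk_id": 0, "text": "  Quarterly revenue grew 12%.  "},
    {"chunk_id": 1, "text": "Quarterly revenue grew 12%."},  # duplicate after cleanup
    {"chunk_id": 2, "text": ""},                             # empty extraction artifact
])

df["text"] = df["text"].str.strip().str.replace(r"\s+", " ", regex=True)
df = df[df["text"].str.len() > 0]       # drop empty extractions
df = df.drop_duplicates(subset="text")  # dedupe normalized text
print(df)
```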

Data processing engines can be categorized across a few dimensions: structured data-oriented vs. unstructured data-oriented; single node vs. distributed; and SQL vs. Python. Most data processing engines, like Spark, Dask, and Modin, are oriented towards supporting structured data and don’t have strong first-party support for unstructured data. Buyers we spoke with are excited about emerging technologies like Daft that can efficiently handle multimodal data in a distributed fashion.

Single-node solutions like Polars and Pandas rose in popularity for offline data exploration and preparation. When moving into production with medium and large datasets, these architectures can struggle with performance and memory usage. This contrasts with distributed solutions like Spark that focus on large-scale data processing. We heard from buyers that they are looking for solutions that let them quickly scale from a single node to multiple nodes without requiring them to rewrite code or set up a cluster from the start. They believe an end-to-end platform would improve consistency, management, and productivity. Furthermore, Python is the lingua franca of AI, so customers want an unstructured data processing engine that is Python-native so AI builders can easily adopt it.
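
One common pattern for scaling from a single node to a cluster is Ray, which appears in the landscape above. The sketch below is illustrative, with a hypothetical `enrich` function: the per-chunk logic stays the same and only the dispatch changes, though real workloads still require the cluster setup that buyers would like to avoid.

```python
import ray

def enrich(chunk: str) -> dict:
    """Hypothetical per-chunk work: cleaning, summarizing, or embedding."""
    return {"text": chunk, "n_chars": len(chunk)}

chunks = [f"chunk {i}" for i in range(1_000)]

# Single node: a plain loop.
local_results = [enrich(c) for c in chunks]

# Distributed: the same function, dispatched across a Ray cluster.
ray.init()                          # local cluster; pass an address to go remote
enrich_remote = ray.remote(enrich)  # wrap the function without rewriting it
futures = [enrich_remote.remote(c) for c in chunks]
cluster_results = ray.get(futures)
```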

Data management

The final component of the unstructured data stack is data management, which encompasses the organization, storage, and governance of data. Effective data management ensures that unstructured data is stored in a manner that allows for easy retrieval and analysis. This involves choosing appropriate storage solutions, such as data lakes, that can handle the volume and variety of unstructured data. 

We have also seen the rise of vector, graph, and document databases in AI stacks. Vector databases gained popularity because they offer efficient data structures and algorithms for searching vectors (mathematical representations of data) by distance. Examples include Supabase (a Felicis investment), Chroma, and Pinecone. Graph RAG techniques open a new use case for graph databases like Neo4j, TigerGraph, and FalkorDB, which store data as nodes and relationships. And since teams have text data in documents, document databases that store and query data as JSON, including MongoDB and Stately, are seeing adoption.
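
To show the primitive that vector databases optimize, here is a brute-force cosine-similarity search over placeholder embeddings; production systems replace this linear scan with approximate nearest-neighbor indexes such as HNSW or IVF.

```python
import numpy as np

def top_k(query: np.ndarray, index: np.ndarray, k: int = 3) -> np.ndarray:
    """Brute-force cosine-similarity search; vector databases replace this
    linear scan with approximate nearest-neighbor indexes."""
    q = query / np.linalg.norm(query)
    m = index / np.linalg.norm(index, axis=1, keepdims=True)
    return np.argsort(m @ q)[::-1][:k]  # indices of the k most similar vectors

rng = np.random.default_rng(0)
index = rng.normal(size=(10_000, 384))  # placeholder embeddings
query = rng.normal(size=384)
print(top_k(query, index))
```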

Additionally, data governance practices are essential to maintain compliance with regulations and ensure data security. This includes establishing policies for data access, usage, and privacy, thereby protecting sensitive information while enabling data-driven decision-making.

While Apache Parquet, an open source, column-oriented data file format on object storage, has become an industry standard, we’ve seen the rise of new data storage formats targeting unstructured data use cases. Existing column formats are not particularly efficient for AI/ML workloads for several reasons. 

  • First, semantic search uses point lookups, a query pattern that accesses a small set of rows. Parquet’s challenge with point lookups is that its encodings are not designed to be sliceable, so an entire page of data must be loaded to access a single row, which hurts performance (see the sketch after this list). 
  • Second, in unstructured data use cases there are often wide columns of semantic search embeddings or images compared to traditional database workloads with relatively small columns. 
  • Third, some unstructured datasets have thousands of columns. Parquet requires readers to load the schema metadata for all columns, which can be overkill. 
  • Fourth, there are new encoding styles for data, but Parquet only provides a fixed set of data encoders. 
  • Fifth, Parquet only allows encoders to store metadata in pages, yet unstructured data could benefit from storing metadata at the row or file level.
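
The point-lookup amplification in the first item can be seen with pyarrow: reads happen at row-group granularity, so materializing one row decodes its whole row group. The table shape and sizes below are illustrative.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A table with a wide embedding column, the shape of many RAG workloads.
n, dim = 10_000, 16
table = pa.table({
    "id": list(range(n)),
    "embedding": [[float(i)] * dim for i in range(n)],
})
pq.write_table(table, "chunks.parquet", row_group_size=1_000)

pf = pq.ParquetFile("chunks.parquet")
# Fetching row 4,250 still decodes its entire 1,000-row row group (group 4):
group = pf.read_row_group(4, columns=["embedding"])
row = group.slice(250, 1)
print(pf.metadata.num_row_groups, group.num_rows, row.num_rows)  # 10 1000 1
```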

To address these limitations, we’ve seen the rise of Parquet alternatives like Lance, Meta’s Nimble, and Spiral’s Vortex. Each focuses on a different area for improvement. Lance enhances random access reads and has started adding compression support. Nimble focuses on wide-table support and on scan performance for large data volumes while roughly maintaining size and write throughput. Spiral’s Vortex improves random access reads and scans while maintaining compression ratio and write throughput.
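
For contrast, here is a sketch of row-level random access with Lance, assuming the `lance` Python package’s `write_dataset` and `take` APIs; the dataset has the same illustrative shape as the Parquet example above.

```python
import lance
import pyarrow as pa

n, dim = 10_000, 16
table = pa.table({
    "id": list(range(n)),
    "embedding": [[float(i)] * dim for i in range(n)],
})
lance.write_dataset(table, "chunks.lance")

ds = lance.dataset("chunks.lance")
rows = ds.take([7, 4_250, 9_999])  # row-level random access by index
print(rows.num_rows)  # 3
```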

It’s unstructured data’s time to shine

The unstructured data stack is critical for organizations aiming to harness the power of unstructured data for GenAI, analytics, and automation. Businesses can drive innovation and competitive advantage by understanding and effectively managing the three components—data extraction and ingestion, data processing, and data management. 

Data extraction and ingestion technologies are moving towards using vision models, while data processing engines are becoming multimodal data-native. Teams now have more choices when deciding on their unstructured data management strategy, including data lakes and vector, graph, and document databases. New file formats are surfacing to improve read and scan efficiency. As unstructured data grows, the importance of a robust data stack becomes increasingly evident, allowing organizations to transform data into actionable intelligence and strategic resources.

We are incredibly excited about unstructured data’s new golden age. The Total Addressable Market (TAM) for unstructured data technology is large and expanding, estimated at between $15B and $30B, according to equity research reports from Wells Fargo and William Blair. We strongly believe a massive unstructured data infrastructure business will emerge.

If you or someone you know is working on an unstructured data startup or adjacent offering, I’d love to hear from you. Email me at astasia@felicis.com.