How Unstructured Data is Powering the Future of AI

Sep 18, 2024

$672 million — that’s how much Reddit could generate in annual revenue by 2027 from licensing its text data for generative AI, according to one leading equity research firm. A few years ago, Reddit’s text data was worth comparatively little. Now, that same asset has had a transformative effect on the company.

What drove this massive and rapid change? AI has reshaped how different types of data are valued. Data that was unusable at scale is suddenly some of the most valuable.

Historically, we could only do broad analysis of structured data. That’s the type of data that can largely fit into a spreadsheet, ranging from global weather data (temperature highs and lows in a given day) to stock performance to your lab results. When you look at flight booking websites and they tell you if the price you’re seeing is high, low, or average, that’s because behind the scenes, they have a big database of structured data.
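To make the flight-price example concrete, here is a minimal sketch of how a "high, low, or average" label could be derived from structured data. The function name, the sample prices, and the one-standard-deviation threshold are all assumptions chosen for illustration; real booking sites likely use far richer models.

```python
from statistics import mean, pstdev

def classify_price(current_price, historical_prices):
    """Label a price relative to past prices for the same route.

    Hypothetical rule: more than one standard deviation above the
    historical mean is "high", more than one below is "low".
    """
    average = mean(historical_prices)
    spread = pstdev(historical_prices)
    if current_price > average + spread:
        return "high"
    if current_price < average - spread:
        return "low"
    return "average"

# Illustrative (made-up) past prices for one route, in dollars.
past_prices = [280, 300, 310, 320, 290]
print(classify_price(350, past_prices))  # prints "high"
print(classify_price(300, past_prices))  # prints "average"
```

The point is that each record is a number in a known column, so a few lines of arithmetic are enough to compare them.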

But structured data is only a small fraction of the total data in the world. The vast majority of data is unstructured – in emails, slide decks, online forums, social media clips, textbooks, etc. A lot of interesting questions could be investigated with this data, but it has been difficult to work with and analyze at scale. No Excel function can find the average of an email or predict the next photo in an album. Advances in technology – data storage, processing power, models – are now making it possible to analyze unstructured data from disparate sources at an unprecedented scale. With the rise of GenAI and Large Language Models, we’re also seeing a fundamental shift in demand for this data.

GenAI models are built to emulate human understanding and communication. To create unstructured output (i.e., the way we communicate the vast majority of the time), the models need to be trained on unstructured inputs, which are often multimodal (think images, audio, and text). Once trained, the models generate responses in natural language by predicting the next word or sequence of words, drawing on patterns that they have learned from the unstructured training data. For those predictions to be strong, the models need to be trained on massive amounts of data, hence the $672 million projection for Reddit’s billions of unstructured user-contributed posts and comments.
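The idea of predicting the next word from learned patterns can be sketched with a toy example. This is a simple bigram counter, not how Large Language Models are actually built (they use neural networks trained on vastly more data); the function names and the tiny corpus are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigram_model(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    follower_counts = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        follower_counts[current_word][next_word] += 1
    return follower_counts

def predict_next_word(model, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# A tiny, made-up training corpus of unstructured text.
corpus = (
    "the model reads text . the model predicts the next word . "
    "the model learns patterns"
)
model = train_bigram_model(corpus)
print(predict_next_word(model, "the"))  # prints "model"
```

Even in this toy version, the prediction is only as good as the text it has seen, which is why larger and more varied training data matters.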

GenAI (and AI broadly) represents a paradigm shift in how we think about data, creating what may be the largest opportunity yet for data owners. Data that was previously ignored – because it did not fit neatly into analytical tools or was considered too messy – now has real commercial value. Companies now need to think more than ever about how to enable safe, compliant data usage, in part because unstructured data is often human-generated and quite sensitive. Greater care is required, in the form of notices, transparency, and opt-outs for users, and de-identification where needed.
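As a rough illustration of what de-identification can mean in practice, the sketch below masks email addresses and US-style phone numbers in free text. The patterns and placeholder tags are assumptions for this example only; real de-identification has to handle names, addresses, IDs, and many more formats, and usually relies on dedicated tooling and review.

```python
import re

# Hypothetical, deliberately simple patterns; real PII detection needs far more.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_PATTERN = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def deidentify(text):
    """Replace matched emails and phone numbers with placeholder tags."""
    text = EMAIL_PATTERN.sub("[EMAIL]", text)
    text = PHONE_PATTERN.sub("[PHONE]", text)
    return text

note = "Contact jane.doe@example.com or call 555-123-4567 about the results."
print(deidentify(note))
# prints: Contact [EMAIL] or call [PHONE] about the results.
```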

Here are just a few ideas for how unstructured data could create value in healthcare, education, customer service, agriculture, and law:

  • Clinician and nurse notes, voice recordings, and diagnostic images - Unlock faster and more precise diagnosis and more effective treatments for patients in healthcare.

  • Lecture recordings, chats, and emails between students and teachers - Enable AI-based personalized tutoring in education, tailoring the curriculum to each individual student’s needs.

  • Raw data from customer service calls - Enable chatbots to provide more personalized and efficient customer experiences: no more call wait times.

  • Videos and images of agricultural plots with crop yield data - Allow drones with computer vision to monitor crops to improve plant health and yield.

  • Contracts with redlines from lawyers - Empower law associates with tools that allow them to draft legal documents more efficiently.

The shift towards valuing unstructured data is a game changer, and it’s a sign of what’s to come. Nearly every organization owns proprietary, human-generated data, and together that data covers almost every imaginable topic. In that sense, every company is a training data company. The rise of GenAI represents a unique opportunity for a new set of organizations to unlock immense value.

Many thanks to Josh Miller and Erik Duhaime for their suggestions in putting this together.