No, We're Not Running Out of Training Data
Apr 1, 2025
Concerns about an imminent shortage of training data for AI have recently gained traction. Influential figures like Elon Musk, Ilya Sutskever, and Marc Andreessen, along with major outlets such as Nature, Forbes, the Associated Press, and The New York Times, have warned that limitations in available data could slow AI progress. Many of these concerns stem from an Epoch AI study predicting that we could run out of data for training AI models as early as 2028.
Training data is essential for AI development because it enables models to learn patterns, recognize relationships, and make accurate predictions. AI systems rely on vast amounts of high-quality data to improve their performance, ensuring they can effectively complete complex tasks ranging from detecting fraud to optimizing logistics. For example, a shortage of high-quality training data in healthcare could delay life-saving therapies and weaken diagnostic models, while in finance, economic forecasting and fraud detection would suffer. Without sufficient training data, AI-driven innovation would slow dramatically, with implications across industries.
But the fear of a data shortage is misplaced: AI training data will not become scarce anytime soon.
While public datasets may run out, vast amounts of proprietary data will become increasingly available — if we can figure out how to responsibly unlock and use it at scale.
The vast majority of digitized human knowledge is privately owned, stored in restricted networks, proprietary databases, emails, and industry archives. Most industries generate massive amounts of data that are largely underutilized. Consider healthcare: hospitals generate enormous amounts of patient data and medical imaging that could dramatically accelerate disease detection and treatment — yet much of this data is off-limits due to privacy, regulatory, and commercial challenges. In finance, firms collect vast amounts of transactional and market data that could refine economic forecasting and fraud detection, but security and competitive concerns keep it siloed. Media platforms, manufacturing sensors, and even personal digital footprints contain valuable data, yet legal and logistical barriers prevent widespread AI training on these resources.
While privacy, security, and technical challenges limit access to valuable private data, the biggest hurdle is commercial. There is no established system for companies to securely share proprietary data. Without a structured system for data sharing, vast amounts of valuable information will stay locked away, stifling AI-driven advancements and slowing innovation across industries.
The real challenge isn’t that we lack data; it’s that we lack the infrastructure to share and use it effectively. Given the massive incentives to solve these challenges, the market will find a way to solve them, and a massive amount of data will become newly available (while, most likely, the owners of those datasets will be compensated).
Solving this bottleneck requires unlocking untapped data sources and integrating fragmented datasets. Organizations must consolidate their data and establish secure ways to share it while maintaining compliance and protecting privacy. Additionally, AI models depend on high-quality, well-prepared data to recognize patterns, make accurate predictions, and generate reliable insights. Since much of the existing raw data is unstructured, inconsistent, or incomplete, it will require cleaning, organization, and labeling to be useful. By addressing these challenges, we will ensure AI models have the reliable, high-quality data they need to drive innovation.
The AI industry is on the brink of remarkable breakthroughs. The key to unlocking this potential lies not in collecting more data, but in developing smarter strategies to access and leverage the vast amounts of data that already exist. Companies that find ethical and efficient ways to tap into proprietary datasets are well positioned to lead the next wave of AI progress. The real constraint isn’t a lack of data; the opportunity lies in our ability to harness it. With the right approach, there are zettabytes of information just waiting to be unlocked.
Acknowledgements:
Special thanks to Kristen Chapey for drafting this piece, to Travis May for invaluable feedback, and to the many resources we consulted in putting this work together.
—————
View this article and subscribe to future updates on our Substack: https://withprotege.substack.com/p/no-were-not-running-out-of-training