Introducing Protege — Empowering Data Holders to Safely License Training Data to AI Developers

May 31, 2024

There are three foundational bottlenecks to developing AI: algorithms, computational power, and data. While the first two have robust markets around them, the process of obtaining data for training AI is currently a wild west, suboptimal for both the owners of data and content and the developers of AI. Shaper Capital is thrilled to launch Protege to address this, alongside Bobby Samuels, Richard Ho, Ray Shi, and Engy Ziedan.

***

The process of getting the right data for training purposes currently ranges from arduous to impossible. Early-stage AI companies sometimes spend millions of dollars and years negotiating access to the right dataset; larger LLM companies are seeking every piece of rich data they can find and often missing out on proprietary datasets. Meanwhile, data providers (ranging from Reddit to textbook companies to hospital systems) want to license data but don’t know where to start and are rightly concerned about the privacy, security, and IP implications of letting companies build models on top of their data. So data remains illiquid, even though there are eager buyers and sellers.

This isn’t the right industrial organization. The market needs a platform to connect data buyers and sellers in a way that puts control into the hands of data sources and helps them manage the privacy, security, and downstream usage of data. This would jumpstart the AI economy, opening up opportunities that previously weren’t possible and dramatically lowering the cost (in both dollars and time) of building AI. The winners will be both the data buyers building the models and the data holders.

To solve this problem, I’m excited to announce Protege. Protege will be the data layer helping to unlock private training data sets for AI.

We believe this layer will be a critical part of the AI stack. Every organization building an AI application or foundational model needs to look externally for data. This problem is not specific to any one industry. From healthcare to agriculture to marketing to finance, similar dynamics exist. The solution needed should span all industries.

A core part of Protege’s ethos is a belief in source-centricity. Data sources have deep concerns about their data being misused by AI models, ranging from privacy & security violations to unauthorized use of their IP by derivative models. We see our role as ensuring sources have complete control of their data and have confidence in the safety of how it’s used. And we began with a privacy and security review before we wrote a single line of code.

AI today is where the internet was in 1995, and we expect orders of magnitude of growth in the coming years and decades. A data layer will be one of the critical parts of the infrastructure for the coming boom, and the winner will be a massive, incredibly impactful company.

In the coming weeks, we will announce our first vertical and will rapidly scale to more verticals shortly after. In the meantime, we are hiring engineers, superstar generalists, and GMs for new verticals. If that sounds like a fit, we’d love to hear from you at withprotege.ai/careers.
