Grass - A Data Revolution
By Ed Roman, Co-founder and Managing Partner at Hack VC
At Hack VC, we’re proud to have led the Series A financing round for Grass, a decentralized network that rewards users for their unused Internet bandwidth. The following is our investment thesis on Grass.
Executive Summary
Generative AI is the most important innovation in recent memory, and its importance keeps growing. At its core, Generative AI is a product of three elements:
Algorithms + Data + Compute = Intelligence
This means that Data and Compute will likely become two of the world’s most important assets, and access to them will be incredibly important.
Generative AI models are data-hungry. The most significant Generative AI models are trained on an Internet’s worth of data, which approximates the sum of all human knowledge.
Crypto is all about giving access to new digital resources around the world and using tokens to turn things that weren’t assets before into assets. Grass does this for Data.
Grass gives AI models and apps live access to the entire Internet as a dataset, collected via a worldwide network of nodes that contribute their idle Internet bandwidth. Grass has strong initial traction, with over 2.5 million users.[1]
The long-term potential market for Grass is massive and scales with the size of the AI market and its future growth. In the past, gathering datasets of this scale was restricted to the largest tech giants. Grass brings new economics to data, driving down costs. This democratizes data access, serving not just elite large companies but also the long tail of the AI industry.
The Problem
AI model training and fine-tuning require enormous amounts of data. Historically, AI model creators have gathered much of that data by scraping websites. Scraping presents a number of challenges:
- Web scraping is costly. Only a few large organizations are capable of scraping the entire web periodically. This locks smaller AI developers out of accessing data.
- IP blocking. There’s been a cat-and-mouse game between those scraping services and the content creators. It’s fairly straightforward to block an IP address to stop scraping, making it difficult to achieve scraping objectives and gather the required data for AI training and fine-tuning.
- Wasted resources. Scraping the web is a task that can benefit many customers. It’s inefficient for each customer to separately provision the hardware, bandwidth, and compute power that scraping requires.
- Data freshness. It’s cumbersome and expensive to scan the entire Internet. This makes it impractical for most users to scan often, so their data grows stale, which degrades the quality of AI models.
Grass’ Solution
Grass aims to solve these problems by creating a federated network of web scrapers. Each individual participating in the Grass network contributes a portion of their unused Internet bandwidth to provide a small amount of scraping from their IP address. Grass then assembles data from each of these nodes to form a combined dataset that’s useful for AI training and fine-tuning. It’s an elegant and fitting use of distributed networks powered by cryptocurrency.
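As a rough illustration of the idea above, the sketch below splits a list of URLs across nodes in round-robin fashion and merges each node’s results into one combined dataset. This is not Grass’ actual protocol, which isn’t described here. The names `assign_round_robin`, `merge_results`, and `fake_fetch` are hypothetical, the node IDs and URLs are placeholders, and no network requests are made.

```python
from typing import Callable, Dict, List


def assign_round_robin(urls: List[str], node_ids: List[str]) -> Dict[str, List[str]]:
    """Spread URLs evenly across nodes so each node handles a small share."""
    if not node_ids:
        raise ValueError("at least one node is required")
    assignments: Dict[str, List[str]] = {node: [] for node in node_ids}
    for i, url in enumerate(urls):
        assignments[node_ids[i % len(node_ids)]].append(url)
    return assignments


def merge_results(
    assignments: Dict[str, List[str]],
    fetch: Callable[[str, str], str],
) -> Dict[str, str]:
    """Collect each node's fetched pages into one dataset keyed by URL."""
    dataset: Dict[str, str] = {}
    for node, node_urls in assignments.items():
        for url in node_urls:
            dataset[url] = fetch(node, url)
    return dataset


def fake_fetch(node: str, url: str) -> str:
    # Stand-in for a real fetch (assumption): returns a label, not page content.
    return f"content of {url} (via {node})"


# Placeholder inputs, chosen only for illustration.
urls = [f"https://example.com/page/{i}" for i in range(6)]
nodes = ["node-a", "node-b", "node-c"]

assignments = assign_round_robin(urls, nodes)
dataset = merge_results(assignments, fake_fetch)

print({node: len(assigned) for node, assigned in assignments.items()})
print(len(dataset))
```

With three nodes and six URLs, each node is assigned two URLs, and the merged dataset contains all six pages. A real network would also need to handle nodes going offline, duplicate results, and validation of what each node returns, none of which is modeled here.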
There are other business cases for unused Internet bandwidth as well, such as:
- Gathering local/geo data, such as ads
- Performing academic research
- Checking local prices
Today Grass gathers data using existing hardware (laptops, desktops, etc.). In the future, Grass plans to offer a data gathering appliance, which is a custom hardware device solely dedicated to data gathering, creating efficiencies due to the appliance being optimized for that particular task.
Grass’ Benefits
There are several benefits to using a distributed network for data gathering:
- Democratized access to web data that becomes cheaper at scale. Rather than a single customer gathering data for their own needs, Grass gathers data on behalf of many customers. This data can be resold multiple times, creating economies of scale on data, driving down the economic costs of scraping and making the market more efficient. At scale Grass can hypothetically become the most cost-effective data gathering solution for customers, creating an economic network effect around their protocol. This means data gathering is now available to anyone, not just a couple of large companies that have the resources to scrape the web.
- IP blocking becomes infeasible. By distributing the scraping, it becomes much more difficult to detect and stop the scraping, since each node only does a relatively minor amount of data capture and is hard to distinguish from typical Internet traffic. This results in more complete datasets for training.
- Internet bandwidth is used more efficiently. Since Grass is effectively a collaborative consumption play on unused Internet bandwidth, it’s more efficient than provisioning new bandwidth just for scraping.
- The data is more accurate and recent. It becomes cost effective to scrape more frequently than a typical customer might do on their own. This results in less stale data. This matters since the resulting AI models are more up-to-date.
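The arithmetic behind the first two benefits can be sketched in a few lines. The figures below are hypothetical and chosen only to make the numbers easy to follow; they are not Grass’ actual costs or traffic volumes. The function names `cost_per_customer` and `requests_per_node` are likewise illustrative.

```python
def cost_per_customer(total_scrape_cost: float, customers: int) -> float:
    """Cost each customer bears when one scrape is shared by all of them."""
    if customers < 1:
        raise ValueError("customers must be at least 1")
    return total_scrape_cost / customers


def requests_per_node(total_requests: int, nodes: int) -> float:
    """Average number of requests per node when work is spread evenly."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return total_requests / nodes


# Hypothetical figures (assumptions, not data from Grass).
SCRAPE_COST = 1_000_000.0    # cost of one full crawl, in arbitrary units
TOTAL_REQUESTS = 10_000_000  # requests needed for that crawl

for customers in (1, 10, 100):
    print(customers, cost_per_customer(SCRAPE_COST, customers))

for nodes in (1, 1_000, 1_000_000):
    print(nodes, requests_per_node(TOTAL_REQUESTS, nodes))
```

Under these assumptions, sharing one crawl among 100 customers cuts each customer’s cost by a factor of 100, and spreading the same crawl over a million nodes leaves each node with only about ten requests.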
The Challenge: Content Creators Who Monetize Their Data
One of the tricky things to navigate when scraping data is content creators. These include sites such as the NY Times and Reddit, which have started to monetize their data by licensing it to third parties for training AI models. They are naturally protective of the data on their sites since that data represents highly lucrative revenue streams for them. Indeed, Reddit has forbidden the use of its developer API for machine learning to protect its business model of licensing data to AI model creators (see terms of service here).
What does the future hold for content creators? Consider user-generated content (UGC), such as Reddit. There’s an argument that users, rather than the platform, own their data, since they created the content. This argument has yet to be fully explored from a legal point of view, and it will be interesting to watch going forward. However, if users do indeed own their contributed data, then Grass could represent a hypothetical pathway to help those users monetize it. For example, Grass could reward Reddit contributors themselves for volunteering the data they’ve created on Reddit.
For paid content creators such as the NY Times, content is created by paid writers, and as such there is no argument for user-owned data. Thus, Grass could simply exclude those sites from being scraped. Alternatively, Grass may scale to the point where it becomes feasible for Grass itself to become a customer of those sites and pay licensing fees. The way this could hypothetically work is that Grass’ customers could pay for data, and then Grass could revenue share back to the content creators, thus enabling AI model creation on a flexible budget. Alternatively, Grass could achieve such a scale that it could negotiate a bulk licensing deal on behalf of all its customers.
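One way to picture the hypothetical revenue share described above is a pro rata split: a fixed fraction of what customers pay goes into a pool, and each content creator receives a slice proportional to how much of their content was used. This is a sketch under stated assumptions, not a mechanism Grass has announced. The function `revenue_share`, the 50% share, and the site names are all illustrative.

```python
from typing import Dict


def revenue_share(
    customer_revenue: float,
    creator_share: float,
    items_used: Dict[str, int],
) -> Dict[str, float]:
    """Split a fixed fraction of revenue across creators, pro rata by items used."""
    if not 0.0 <= creator_share <= 1.0:
        raise ValueError("creator_share must be between 0 and 1")
    total_items = sum(items_used.values())
    if total_items == 0:
        # Nothing was used, so nobody is owed a payout.
        return {creator: 0.0 for creator in items_used}
    pool = customer_revenue * creator_share
    return {
        creator: pool * count / total_items
        for creator, count in items_used.items()
    }


# Hypothetical inputs: 1,000 units of revenue, half shared with creators.
payouts = revenue_share(1000.0, 0.5, {"site-a": 300, "site-b": 100})
print(payouts)
```

Here the pool is 500 units, split three to one between the two sites according to usage. A bulk licensing deal, the other option mentioned above, would replace the pro rata split with a negotiated flat fee.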
Grass’ Launch
Grass had an extremely impressive launch earlier this year:
- Grass had the most widely distributed airdrop in Solana’s history.[2]
- Over 2 million wallets claimed the airdrop, causing Solana’s network to buckle under pressure.
- There are over 2.5 million total users of Grass worldwide.[3]
- Grass already has the capacity and data to train OpenAI’s GPT-3.5 model.
- As a demonstration of their platform, Grass has open-sourced a dataset consisting of 600 million posts and comments from 2024 on Reddit (see here for the announcement and here for the dataset).
As of this writing, the Grass token has had positive price action post-launch (+115%), which is unusual, as most tokens drop in the days and weeks following listing. This likely reflects Grass’ smart approach to airdrop distribution, as well as belief in the future and potential of Grass. Overall, this is a great start for the network, and we believe it paves the way for many prosperous years to come.
Grass’ Token Performance Since Launch on October 28, 2024
Source: TradingView.
Start contributing your unused Internet bandwidth by connecting your Solana wallet and earn the Grass token.
Want to use Grass’ datasets for your business, research, or project? Contact the team at discover@grassfoundation.io.
Footnotes
[1] Source: https://www.getgrass.io/.
[2] Source: https://www.theblock.co/post/323805/grass-becomes-most-distributed-solana-airdrop-as-nearly-1-5-million-addresses-claim-tokens.
[3] Source: https://www.getgrass.io/.
Disclosures
The information herein is for general information purposes only and does not, and is not intended to, constitute investment advice and should not be used in the evaluation of any investment decision. Such information should not be relied upon for accounting, legal, tax, business, investment, or other relevant advice. You should consult your own advisers, including your own counsel, for accounting, legal, tax, business, investment, or other relevant advice, including with respect to anything discussed herein.
This post reflects the current opinions of the author(s) and is not made on behalf of Hack VC or its affiliates, including any funds managed by Hack VC, and does not necessarily reflect the opinions of Hack VC, its affiliates, including its general partner affiliates, or any other individuals associated with Hack VC. Certain information contained herein has been obtained from published sources and/or prepared by third parties and in certain cases has not been updated through the date hereof. While such sources are believed to be reliable, neither Hack VC, its affiliates, including its general partner affiliates, or any other individuals associated with Hack VC are making representations as to their accuracy or completeness, and they should not be relied on as such or be the basis for an accounting, legal, tax, business, investment, or other decision. The information herein does not purport to be complete and is subject to change and Hack VC does not have any obligation to update such information or make any notification if such information becomes inaccurate.
Past performance is not necessarily indicative of future results. Any forward-looking statements made herein are based on certain assumptions and analyses made by the author(s) in light of their experience and perception of historical trends, current conditions, and expected future developments, as well as other factors they believe are appropriate under the circumstances. Such statements are not guarantees of future performance and are subject to certain risks, uncertainties, and assumptions that are difficult to predict.