How ChatGPT & OpenAI Might Use Your Content, Now & in the Future
Everyone is talking about ChatGPT and OpenAI. The latest version of ChatGPT shows impressive capabilities and can answer a wide range of questions—because it’s been trained on massive amounts of data across the internet.
But AI tools like ChatGPT raise ethical issues. ChatGPT, and most similar AI tools, obtain their training data through scraping. The scraped data can come from any unprotected website, and website owners may not intend for their content to be used—especially for monetization purposes.
As users get answers directly from ChatGPT, they are less likely to navigate to the original source (i.e. the website the data was taken from). Therefore, having ChatGPT serve up information taken from your website to its users decreases the volume of visitors your pages would have gotten otherwise.
It’s understandable that some websites want to opt out of allowing AI tools like ChatGPT to train models using their data. Other websites, such as StackOverflow, may decide to monetize their data to try and get a slice of the proverbial AI cake.
What data does ChatGPT train on?
According to “Language Models are Few-Shot Learners,” a research paper published by OpenAI, GPT-3 (the model that ChatGPT builds on) was trained on several datasets:
- Common Crawl
- WebText2
- Books1 and Books2
- Wikipedia
As the dataset table in the paper indicates, the largest amount of training data comes from Common Crawl, a nonprofit organization that provides access to web information by producing and maintaining an open repository of web crawl data. Its crawls are available on Amazon S3, and as of May 2023, it provides access to dozens of datasets spanning from summer 2013 to April 2023.
Common Crawl Crawler, aka CCBot
The Common Crawl project’s crawler is named CCBot and leverages Apache Nutch, a framework that enables developers to build large scale scrapers.
The most current version of CCBot identifies itself with a user-agent of CCBot/2.0. However, if you want to allow CCBot, you should not rely solely on the user-agent to identify it. Remember, a lot of bad bots frequently spoof their user-agents to pretend to be good bots and avoid being blocked.
To allow CCBot on your website, use other attributes such as IP ranges or reverse DNS. Old versions of CCBot used the IPs 38.107.191.66 through 38.107.191.119, while the current version crawls from Amazon AWS.
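As a minimal sketch of the IP-based check, the snippet below tests whether a client address falls inside the legacy CCBot range quoted above. The function name `in_legacy_ccbot_range` is hypothetical, and the check only covers the old range; since the current crawler runs from AWS, a real allowlist would need up-to-date ranges or a reverse DNS check on top of this.

```python
from ipaddress import IPv4Address, ip_address

# Legacy CCBot range quoted in this article (old crawler versions only).
LEGACY_CCBOT_FIRST = IPv4Address("38.107.191.66")
LEGACY_CCBOT_LAST = IPv4Address("38.107.191.119")


def in_legacy_ccbot_range(ip: str) -> bool:
    """Return True if `ip` is an IPv4 address inside the legacy CCBot range."""
    try:
        addr = ip_address(ip)
    except ValueError:
        # Not a valid IP address at all.
        return False
    if not isinstance(addr, IPv4Address):
        # The legacy range is IPv4 only.
        return False
    return LEGACY_CCBOT_FIRST <= addr <= LEGACY_CCBOT_LAST
```

A request claiming to be CCBot but arriving from an address outside every range you trust should be treated as a spoofed user-agent.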
According to Common Crawl, “The CCBot crawler has a number of algorithms designed to prevent undue load on web servers for a given domain.”
How do I prevent ChatGPT from accessing my website?
The majority of ChatGPT’s training data comes from the Common Crawl crawler bot. So to block ChatGPT, your website should, at minimum, block traffic from CCBot.
Robots.txt
CCBot respects robots.txt files, and can be blocked with the following lines of code:
User-agent: CCBot
Disallow: /
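Before deploying the rule, you may want to confirm that it does what you expect. The sketch below uses Python's standard `urllib.robotparser` module to evaluate the two-line rule against a couple of user-agents. The helper name `is_allowed` and the example.com URL are illustrative assumptions, and the standard-library parser is only an approximation of how any particular crawler interprets robots.txt.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rule from this article: one directive per line.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/article"  # placeholder URL
    for agent in ("CCBot/2.0", "Googlebot"):
        print(agent, "allowed:", is_allowed(ROBOTS_TXT, agent, page))
```

With this rule, CCBot is denied everywhere while crawlers that are not named, such as Googlebot, remain allowed.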
Blocking CCBot User-Agent
Another option is simply to block requests whose user-agent identifies as CCBot. While allowing good bot traffic based on the user-agent alone is unsafe, blocking an unwanted bot by its user-agent is safe: an attacker gains nothing by spoofing a user-agent that you reject.
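How you block a user-agent depends on your stack (web server, CDN, or application). As one possible illustration, here is a small WSGI middleware sketch that answers 403 to any request whose user-agent contains a blocked token. The names `BlockUserAgents`, `is_blocked_user_agent`, and `BLOCKED_UA_TOKENS` are hypothetical, and matching on a substring is an assumption made for simplicity.

```python
# Tokens are matched case-insensitively as substrings of the User-Agent header.
BLOCKED_UA_TOKENS = ("ccbot",)


def is_blocked_user_agent(user_agent, tokens=BLOCKED_UA_TOKENS):
    """Return True if the user-agent string contains any blocked token."""
    ua = (user_agent or "").lower()
    return any(token.lower() in ua for token in tokens)


class BlockUserAgents:
    """WSGI middleware that rejects requests from blocked user-agents."""

    def __init__(self, app, tokens=BLOCKED_UA_TOKENS):
        self.app = app
        self.tokens = tuple(tokens)

    def __call__(self, environ, start_response):
        if is_blocked_user_agent(environ.get("HTTP_USER_AGENT"), self.tokens):
            # Refuse the request without calling the wrapped application.
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return self.app(environ, start_response)
```

Wrapping an existing WSGI application is then a one-liner, for example `app = BlockUserAgents(app)`, and further tokens can be passed through the `tokens` argument.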
Bot Management Software
The best way to prevent scraping of your data for any purpose is to stop bots from reaching it in the first place. Bot and fraud management software (like DataDome) can keep out bad bots, and even merely unwanted bots, using machine learning algorithms.
Are there other ChatGPT/OpenAI scrapers?
Some bots tied to AI tools are scrapers, while others are plugins that are not actively searching for data to collect.
ChatGPT-User
You may have seen requests with the following user agent in your logs: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot
According to OpenAI’s documentation, these requests are linked to OpenAI’s web browsing plugin and are not used for scraping purposes, meaning requests made by this bot are not used to train OpenAI models.
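To find out whether these bots already visit your site, you can scan your access logs for their user-agent tokens. The sketch below counts matching lines; the function name `count_user_agent_hits` and the sample log lines are made up for illustration, and it assumes the user-agent appears somewhere in each log line, as it does in common access log formats.

```python
from collections import Counter


def count_user_agent_hits(log_lines, tokens=("ChatGPT-User", "CCBot")):
    """Count log lines mentioning each token (case-insensitive substring match)."""
    counts = Counter()
    for line in log_lines:
        lowered = line.lower()
        for token in tokens:
            if token.lower() in lowered:
                counts[token] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines; real logs would be read from a file.
    sample = [
        '203.0.113.7 - - "GET /pricing HTTP/1.1" 200 "Mozilla/5.0 AppleWebKit/537.36 '
        '(KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot"',
        '198.51.100.4 - - "GET /blog HTTP/1.1" 200 "CCBot/2.0"',
        '192.0.2.10 - - "GET / HTTP/1.1" 200 "Mozilla/5.0 (X11; Linux x86_64)"',
    ]
    for token, hits in count_user_agent_hits(sample).items():
        print(token, hits)
```

Keep in mind that a user-agent in a log line can be spoofed, so these counts are an indication rather than proof of who sent the requests.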
Much like CCBot, the ChatGPT-User bot respects robots.txt and can be blocked using the following lines in your robots.txt file:
User-agent: ChatGPT-User
Disallow: /
Another possibility is to block its user-agent:
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot
How can I protect my website data against AI model training in the long term?
ChatGPT and other large language models need data for training. As of today, blocking Common Crawl, either through robots.txt or by blocking its user-agent, is enough to opt out of most GPT training. However, it's unclear whether that will remain true in the future.
If OpenAI is prevented from accessing content on too many websites, the developers may be tempted to stop respecting robots.txt and stop declaring their crawler identity in the user agent. In this case, you will need to apply advanced bot detection techniques to detect and block the AI scrapers, as you’d do for other scrapers.
Another possibility is that, due to OpenAI’s partnership with Microsoft, it could get access to Microsoft Bing’s scraper data. In this case, the situation would be more challenging for website owners. Indeed, while Bing’s bots identify as Bingbot, blocking them could be dangerous, preventing websites from being indexed on the Bing search engine and resulting in a significant drop in human visitors.
What about Bard from Google?
While everyone is talking about ChatGPT, Google is also working on its own ChatGPT competitor: Bard. Google’s Bard is based on the LaMDA language model, which was trained “on a dataset of 1.56T words from public dialog data and other public web documents”.
Google remains quite vague about the origin of the public data, and how it was collected. It’s possible that Google’s large language models are, or will be, trained using data collected by Googlebot scrapers. Blocking Googlebot would be unwise in most cases, as it is how websites get indexed for Google search results. Website owners often rely heavily on the Google search engine to drive traffic to their website, so blocking Googlebot would result in a serious drop in visitors.
What’s the impact of large language models on website traffic?
While there is no long-term study yet on how tools like ChatGPT impact website traffic, we already see users sharing stories of using ChatGPT instead of search engines or websites.
While in the past, users had to utilize search engines or websites to get their answers, they may now be tempted to complete their “full” research using ChatGPT, without visiting external websites. You can imagine how this would significantly impact the number of visitors, potential ad revenue, and other business metrics for online enterprises.
Similar concerns were raised at the time Google introduced Google featured snippets. How could snippets impact website traffic if users didn’t have to visit the website anymore—as they’d get their answer directly from the Google search results page?
Thus, if OpenAI (funded partially by Microsoft) and Google start to leverage the Bingbot and Googlebot scrapers to collect AI training data, business websites will face a dilemma.
Should you:
- Opt out of the data collection process and lose traffic due to no longer being indexed by main search engines?
- Allow your website data to be used to train AI models, but run the risk of losing visitors in the long term, since they can get all of their answers directly inside tools like ChatGPT or search engines?
Of course, the situation is complicated, as companies training AI models also need human-generated data. Such companies may leverage their existing search engine scrapers and infrastructure to obtain AI training data, but they could also provide a new type of directive in robots.txt, so website owners could opt out of the AI training part only.
What should I consider to monetize my website content/API?
Instead of blocking companies trying to train AI models from accessing your public website data, another possibility is to monetize your data, e.g. by providing an API.
What are the potential challenges around monetizing your data through an API?
Monetizing data through an API can be challenging, particularly if the data you are trying to monetize is public. While some companies may be willing to pay for API-returned structured content (with SLAs and low response time), others may be tempted to scrape your website without permission to avoid paying for your API. To prevent unauthorized access, you need to protect your API from unwanted scraping.
Thus, when designing your API pricing plan, it’s important to take this possibility into account.
One way to make it less tempting for people to scrape your website data without permission, instead of paying for your API, is to implement an advanced bot detection solution. A solution like DataDome will continuously monitor requests made to your website, mobile app, and API, and analyze user behavior to block fraudulent traffic. The right solution will work in the background by collecting technical data such as browser fingerprints and applying behavioral analysis, so your human users are not challenged.
What’s next for online businesses?
With the latest improvements in AI, particularly in large language models, obtaining high-quality datasets of human-generated content will be of paramount importance. Some websites with valuable data will either want to opt out of AI model training or try to monetize their content.
For the moment, you can opt out of models like ChatGPT by blocking the Common Crawl bot used to build their training dataset. You can also keep users from interacting with your website through ChatGPT plugins, either with robots.txt or by blocking the plugin’s user agent.
However, in the long term, companies like OpenAI (funded partially by Microsoft) and Google may be tempted to use Bing and Google search engine scraper bots to build datasets that can train their large AI models. In this case, it would become more difficult for websites to simply opt out of the data collection process, as most online businesses rely heavily on Bing and Google to index their content and drive traffic to their website.
E-commerce, classified, and similar website and app owners that want to avoid becoming victims of content theft will likely require advanced protection in the near future. Only advanced, adaptive solutions leveraging AI and machine learning (ML) to detect unfamiliar bots and threats can stand a chance against rapidly evolving AI technologies.
*** This is a Security Bloggers Network syndicated blog from DataDome authored by Antoine Vastel, PhD, Head of Research. Read the original post at: https://datadome.co/threat-research/how-chatgpt-openai-might-use-your-content-now-in-the-future/