How Google Bard Utilizes Your Business’ Content (For Free)

While everyone is talking about OpenAI and ChatGPT, Google has also been working hard on its own large language model (LLM) product: Bard. Google Bard is based on the LaMDA language model, which was trained “on a dataset of 1.56T words from public dialog data and other public web documents”. However, Google’s LaMDA research paper does not detail where this public data came from or how it was collected.

It’s possible that Google’s LLMs are, or will be, trained using data collected by Googlebot. Because Googlebot also indexes websites for Google search results, blocking it would be unwise in most cases. Website owners often rely heavily on Google search to drive traffic to their websites, so blocking Googlebot would result in a serious drop in visitors.

You may be wondering how you can prevent Google LLMs and AI tools such as Bard and Vertex AI from using your website data for training. Perhaps you also want to stop users from obtaining answers that can only be found on your website without visiting it, as this could negatively impact your business. Below, we explain how businesses can opt out of having their data used by Google’s LLMs and generative AI products.

What is Google Bard?

You can think of Bard as Google’s competitor to ChatGPT. You interact with it through a chat UI, where you ask your questions. For example, we asked Bard to explain how DataDome’s solution protects businesses from malicious bot and fraud attacks.

In its answer, Google Bard first summarizes what bots are, and then explains how DataDome detects and blocks bots in real time.

How can I opt out of Google Bard’s LLM?

When it comes to large language models (LLMs), opting out can mean two things:

  • You don’t want your data to be used to train the LLM.
  • You don’t want Bard users to query your website’s content without actually visiting your site.

The difference between the two may seem subtle, but it is fundamental to the issue at hand. LLMs are trained on huge volumes of data. If your data is not included in the training dataset, the LLM will answer less accurately when the answer to a question exists only on your website and is difficult to infer from other sources in the training dataset.

However, nothing prevents the LLM interface (here, the Bard chat UI) from dynamically fetching content from URLs or pages in response to user queries and feeding that content to the LLM. Thus, even if the content of your site was not originally used to train the LLM, the LLM may still be able to use it to improve the quality of its answers at inference time. We discussed this use case in the context of ChatGPT plugins.

Opting Out of Google Bard & Google Vertex AI

Google provides information on its developer website about Googlebot, as well as the other crawlers it uses to collect information on the web. This information helps websites reliably identify real Googlebot traffic. Indeed, as we explained in a previous article, ~30% of traffic with the Googlebot user-agent is fake Googlebot traffic.
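One common way to tell real Googlebot traffic from fake is a reverse DNS lookup on the requesting IP address, followed by a forward lookup to confirm that the hostname resolves back to the same IP. The Python sketch below illustrates that idea under stated assumptions: the function name is_verified_googlebot, the injectable resolver parameters, and the list of accepted domain suffixes are ours, not Google’s, and the suffixes should be checked against Google’s current crawler documentation before relying on them.

```python
import socket

# Assumption: domains that genuine Googlebot hostnames are expected to end with.
# Verify this list against Google's crawler documentation.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if `ip` passes a reverse + forward DNS check.

    The two lookup parameters are hypothetical hooks so the logic can be
    exercised without network access; by default they use the socket module
    (the default forward lookup handles IPv4 addresses only).
    """
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]
    try:
        # Step 1: reverse DNS, IP address -> hostname.
        host = reverse_lookup(ip).lower().rstrip(".")
        if not host.endswith(GOOGLE_SUFFIXES):
            return False
        # Step 2: forward DNS, hostname -> IP addresses; must contain `ip`.
        return ip in forward_lookup(host)
    except OSError:
        # DNS failures (socket.herror, socket.gaierror) count as unverified.
        return False


if __name__ == "__main__":
    # Offline demonstration with stub resolvers and a documentation-range IP.
    stub_reverse = lambda addr: "crawl-192-0-2-10.googlebot.com"
    stub_forward = lambda host: ["192.0.2.10"]
    print(is_verified_googlebot("192.0.2.10", stub_reverse, stub_forward))
```

The forward confirmation matters because anyone who controls the reverse DNS zone for their own IP range can publish a hostname that merely looks like a Google one. A user-agent string alone proves nothing, since it is trivially forged.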

When it comes to Bard and other generative AI products such as Vertex AI, Google introduced a standalone product token named Google-Extended. It can be used by websites to control whether or not they want their data to be used in Google LLMs and other AI products.

As Google’s documentation explains, Google-Extended is a user-agent token only: there is no crawler that sends a separate Google-Extended user-agent in its HTTP requests. The token exists purely as a control mechanism in the robots.txt file.

Thus, if you want to keep your content out of Bard and Vertex AI training data, don’t look for a Google-Extended user-agent in your traffic. Instead, update your robots.txt file as shown below:

User-agent: Google-Extended
Disallow: /
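
Before deploying the rule above, you may want to confirm that it blocks only the Google-Extended token and leaves Googlebot, and therefore your search indexing, untouched. The following Python sketch uses the standard library’s urllib.robotparser for a quick local check. The helper name is_allowed and the example.com URL are illustrative assumptions, and Python’s parser applies its own matching logic, so treat the result as a sanity check rather than a guarantee of how Google interprets the file.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rule from this article.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under `robots_txt`.

    Hypothetical helper: parses the text locally, with no network access.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://www.example.com/blog/post"  # placeholder URL
    print(is_allowed(ROBOTS_TXT, "Google-Extended", page))  # False
    print(is_allowed(ROBOTS_TXT, "Googlebot", page))        # True
```

Because Google-Extended never appears as an HTTP user-agent, this kind of check against the robots.txt file itself is the only place the token can be tested; your access logs will not show it.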

Conclusion

As generative AI and LLMs become more prevalent, the big AI and web players are starting to provide mechanisms for opting out of their training data. However, there are no standards yet, so you will need to adapt your approach to each AI tool, such as Google Bard and Bing’s GPT integration.

Moreover, while big players may agree to disclose their presence when scraping your website, this may not be the case for all of the AI startups and companies gathering data to build the next big LLMs that will compete with ChatGPT.

In this case, the only way to block scrapers that do not disclose their presence is to use a bot detection product that can catch, in real time, bots that try to evade detection by forging their fingerprints or rotating their IP addresses through proxies. Advanced solutions like DataDome’s bot and online fraud protection leverage AI and machine learning (ML) to detect and stop unfamiliar bots from the very first request, giving you peace of mind and keeping your content from being used in LLM datasets without your permission.

*** This is a Security Bloggers Network syndicated blog from DataDome authored by Antoine Vastel. Read the original post at: https://datadome.co/threat-research/how-google-bard-utilizes-your-content/