What You Need to Know About the New Bing GPT Integration
In the past few years, Microsoft has invested heavily in OpenAI, forging a close relationship with the company behind the well-known generative language model ChatGPT. In our first ChatGPT post, we suspected that companies like OpenAI could be tempted to use Bing or Google search engine scraper bots to gather training data for their large language models (LLMs). Such an integration would make it much harder for businesses to opt out of data collection without harming their online presence.
Earlier this year, Microsoft announced that AI would be integrated into its search engine, Bing, so that users could ask questions and interact with it directly from the search page. The feature, called the new Bing, is available to Microsoft Edge users and is powered by GPT-4, the same model behind ChatGPT.
You may be wondering how to prevent the new Bing from using your website data for training, or how to stop users from obtaining answers that can only be found on your website, as this could negatively impact your business. We looked into how the Bing–GPT integration works and how businesses can opt out of having their data used by the new Bing.
How to Access the New Bing From the Edge Browser
If you perform a search on Bing—for example, “what is DataDome”—a ‘Chat’ section next to a blue icon will show up below the search bar.
If you click on it, it opens a new page with the new Bing interface, which is set up like a chat.
Our “what is DataDome” search query was automatically processed by GPT-4, and the new Bing provided a summary of what DataDome does: protecting businesses against online fraud and bad bots!
As a user, you can ask any question directly in the new Bing chat UI, and Bing will use GPT-4 to answer your questions—meaning you won’t need to visit the websites directly to get your answer. Note, however, that the new Bing still lists its sources in the “Learn more” section.
How is Bing gathering data for query responses?
For the first popular version of ChatGPT, based on the GPT-3 family of models, OpenAI was quite transparent about the sources of its training data. That is no longer the case for the latest versions of GPT: the GPT-4 technical report makes no mention of the training dataset.
As we predicted a few months ago, it is highly likely that OpenAI is leveraging its relationship with Bing to use the data collected by Bingbot (the crawler Bing uses to index the web) as training data for its LLMs at scale.
The reason we consider this highly likely comes from our next finding: what happens when you ask the new Bing to retrieve information from a specific URL?
To conduct our test, we asked the new Bing to summarize the content of a page located on the DataDome website. We explicitly asked it to use the latest version of the page, to try to force it to make a fresh request to our site.
Even though we asked Bing GPT to retrieve the latest version of the URL, we did not observe any request to that URL in our logs, regardless of IP address or user-agent.
However, when reviewing the previous 24 hours of our logs, we found that Bingbot had made several requests to this page (among others on our website). This activity appears to be the standard Bingbot crawler analyzing every public page for indexing in the search engine.
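If you want to run the same check against your own traffic, a minimal sketch is below. It assumes an nginx-style combined access log and a hypothetical log path and page path, so adjust both for your own stack.

```python
# Minimal sketch: scan a web server access log for Bingbot hits on a given page.
# Assumes the combined log format; the log path and target path are hypothetical.
import re

LOG_FILE = "/var/log/nginx/access.log"   # assumption: adjust to your server
TARGET_PATH = "/your-page/"              # hypothetical page to check

line_re = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

with open(LOG_FILE) as f:
    for line in f:
        m = line_re.match(line)
        if not m:
            continue
        if m.group("path") == TARGET_PATH and "bingbot" in m.group("ua").lower():
            print(m.group("ts"), m.group("ip"), m.group("path"))
```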
This is strong evidence that the new Bing is most likely reusing content already gathered by Bingbot, rather than performing HTTP requests in real time to fetch the URLs provided in the Chat UI.
In future testing, we could go further by serving a special page only to Bingbot, then checking whether that content is what Bing’s Chat UI relies on when asked about the page, as sketched below.
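A rough sketch of what such a test could look like follows. Flask and the “canary” route below are illustrative choices, not part of an experiment we actually ran.

```python
# Sketch of a cloaking-style test: serve a distinctive "marker" sentence to Bingbot
# only, then later ask Bing Chat about the page and see whether the marker appears
# in its answer. The route and marker phrase are hypothetical.
from flask import Flask, request

app = Flask(__name__)

MARKER = "The canary phrase for this experiment is: purple-gradient-llama."

@app.route("/bingbot-canary")
def canary():
    ua = request.headers.get("User-Agent", "").lower()
    if "bingbot" in ua:
        # Version only Bingbot (and therefore Bing's index) should ever see.
        return f"<html><body><p>{MARKER}</p></body></html>"
    # Version every other visitor sees.
    return "<html><body><p>Nothing to see here.</p></body></html>"

if __name__ == "__main__":
    app.run(port=8080)
```

If the marker phrase later surfaced in Bing Chat answers about that URL, it would confirm that the Chat UI answers from Bingbot’s crawl rather than from a live fetch.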
How can I opt out of the new Bing GPT feature?
When it comes to large language models (LLMs), opting out can mean two things:
- You don’t want your data to be used to train the LLM.
- You don’t want Bing GPT users to query your website’s content without actually visiting your site.
The difference between the two may seem subtle, but it is fundamental to the issue at hand. LLMs are trained on huge volumes of data. If your data is not included in the training dataset, the LLM will give less accurate answers to questions whose answers exist only on your website and cannot easily be generalized from other sources in the training data.
However, nothing prevents the LLM’s interface (here, the Bing Chat UI) from dynamically fetching the content of URLs/pages in response to user queries and feeding what it retrieves to the LLM. So even though your site’s content was never used to train the LLM, the model may still use it to improve the quality of its inference. We discussed this use case in the context of ChatGPT plugins.
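To make that pattern concrete, here is a minimal sketch of such a retrieval step: fetch a page at query time and inject it into the prompt sent to the model. The ask_llm call is a placeholder rather than a real API, and the fetching shown is deliberately naive.

```python
# Sketch of query-time retrieval: a chat front end fetches a page and prepends it
# to the prompt, so the model can answer from content it was never trained on.
import requests

def build_prompt(url: str, question: str, max_chars: int = 4000) -> str:
    page = requests.get(url, timeout=10).text[:max_chars]  # naive: raw HTML, truncated
    return (
        "Answer the question using only the page content below.\n\n"
        f"PAGE CONTENT:\n{page}\n\n"
        f"QUESTION: {question}"
    )

# ask_llm(build_prompt("https://example.com/pricing", "What plans are offered?"))
# `ask_llm` stands in for whatever LLM call the interface actually makes.
```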
Opting Out of the New Bing
In September 2023, Bing proposed a mechanism to help webmasters control how their content is used by AI.
Adding the “nocache” tag means that only the URL/Snippet/Title can be included in the chat answer—not the body of the content itself. Note that content with the “nocache” tag may still be used for LLM training purposes.
Content tagged with the “noarchive” tag will not be included in Bing Chat answers. It will also be excluded from Microsoft’s generative AI foundation models training datasets.
The tag can be used as follows:
<meta name="robots" content="noarchive, nocache">
If the website wants to only specify the tag for Bing, the tags can be:
<meta name="robots" content="noarchive">
<meta name="bingbot" content="nocache">
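If you add these tags, it is worth verifying that they actually appear in the HTML your server returns. The sketch below is one informal way to check; the URL is hypothetical and the parsing is intentionally simple.

```python
# Quick sanity check (a sketch, not an official method): fetch one of your pages and
# list the robots/bingbot meta directives it carries.
import re
import requests

def robots_meta_directives(url: str) -> list:
    html = requests.get(url, timeout=10).text
    # Naive match for <meta name="robots"|"bingbot" content="..."> tags.
    tags = re.findall(
        r'<meta[^>]+name=["\'](?:robots|bingbot)["\'][^>]+content=["\']([^"\']+)["\']',
        html,
        flags=re.IGNORECASE,
    )
    return [d.strip().lower() for tag in tags for d in tag.split(",")]

directives = robots_meta_directives("https://example.com/")  # hypothetical URL
print("noarchive set:", "noarchive" in directives)
print("nocache set:", "nocache" in directives)
```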
Conclusion
While there are ongoing discussions in the AI community about adding an ai.txt file similar to robots.txt, this is not a reality yet. For now, businesses must manage, crawler by crawler, how each LLM’s web crawlers and scrapers can access and use their online content, especially search engine scrapers like Googlebot and Bingbot; one quick way to audit the current state is sketched below.
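As a starting point, Python’s built-in urllib.robotparser can show what your current robots.txt allows for specific crawlers. This is only a sketch: the site URL is hypothetical, and the user-agent tokens listed are the commonly published ones, so check each vendor’s documentation for the current names.

```python
# Audit, per crawler, what robots.txt currently allows on a site.
from urllib.robotparser import RobotFileParser

SITE = "https://example.com"  # hypothetical site
CRAWLERS = ["Bingbot", "Googlebot", "GPTBot", "CCBot"]  # commonly published tokens

rp = RobotFileParser()
rp.set_url(f"{SITE}/robots.txt")
rp.read()

for crawler in CRAWLERS:
    allowed = rp.can_fetch(crawler, f"{SITE}/")
    print(f"{crawler}: {'allowed' if allowed else 'disallowed'} on {SITE}/")
```

Keep in mind that this only tells you what you are asking crawlers to do; as noted below, it does not tell you whether they comply.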
However, keep in mind that FAANG companies are not the only ones training LLMs. Companies without access to search engine scrapers are also training their own LLMs, and nothing prevents them from scraping your website’s content and making it available through an LLM. Neither a robots.txt file nor an ai.txt file actually prevents companies from scraping your website for data.
The only reliable long-term solution is to implement proper bot detection mechanisms so that crawlers and scrapers can be detected and blocked, even if they don’t respect the robots.txt file and try to remain undetected by forging their user-agent or using clean residential IP addresses. Advanced solutions like DataDome’s bot and online fraud protection leverage AI and machine learning (ML) to detect and stop unfamiliar bots from the very first request, giving you peace of mind, and keeping your content from being used in LLM datasets without your permission.
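As a simple illustration of why the user-agent string alone cannot be trusted, a request claiming to be Bingbot can be double-checked with a reverse and forward DNS lookup; Microsoft documents that genuine Bingbot IPs resolve to hosts under search.msn.com. The sketch below shows the idea, and it is only a basic check, not a complete bot detection solution.

```python
# Verify a claimed Bingbot IP via reverse DNS plus forward confirmation.
import socket

def is_genuine_bingbot(ip: str) -> bool:
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)            # reverse DNS
        if not hostname.endswith(".search.msn.com"):
            return False
        forward_ips = socket.gethostbyname_ex(hostname)[2]   # forward-confirm
        return ip in forward_ips
    except (socket.herror, socket.gaierror):
        return False

# print(is_genuine_bingbot("203.0.113.7"))  # illustrative IP; use one from your own logs
```

Checks like this only confirm legitimate crawlers; scrapers that impersonate browsers or rotate clean residential IPs require the kind of ML-based behavioral detection described above.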