
How to Stop Fake Googlebots From Stealing Your Content

“Content is king.” – Bill Gates

If your business operates in media, retail, and/or classified ads, you know that content really is king. Content is often what first draws visitors and readers to your website, and it’s why they keep coming back. In order to succeed, your content must be valuable, visible, and accessible to as many humans as possible. But the duplication of content reduces its value.

Unfortunately, some bots, particularly scrapers, are bent on stealing your content in order to resell or republish it quickly and easily, without your permission. One particularly dangerous type of scraper bot is a fake Googlebot, which disguises itself as an SEO-friendly crawler to remain unblocked on your website, mobile app, and/or API.

Googlebot: The Ultimate Good Bot

No media, retail, or classifieds website can thrive without a good ranking on Google. The search engine’s ability to drive a huge amount of traffic to websites has earned it a favored position when it comes to accessing online content. Publishers usually make sure that Google gets VIP treatment when it indexes their pages.

When you set up a bot protection solution, it is critical that it lets you distinguish good and commercial bots from bad bots. For example, the DataDome dashboard makes it easy to split the non-human traffic on your website into good and commercial bots (“verified bots”) and bad bots.

Googlebot is the king of good bots—and in the majority of cases, the real Googlebot should not be blocked by your bot protection.

Beware of Fake Googlebots

The special treatment dedicated to Googlebot presents an enticing opportunity for scrapers and fraudsters who want to take advantage of Google’s easy access to your website. To gain VIP access to your website or app, malicious bot developers go out of their way to make their bots look like Googlebots.

How do you spot fake Googlebots?

DataDome spots more than one million hits per day coming from fake Googlebots on our customers’ websites. Three layers of detection, each increasing in complexity, are executed in real time, thanks to the power of our machine learning (ML) algorithms and the efficiency of our infrastructure:

1. User Agent:

One way scrapers can easily pass for Googlebots is by identifying themselves with the same user agent. Fortunately, many forged user agents include typos, errors, and other distinctive features that make it possible to filter them out.
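As a rough illustration of this first layer, the Python sketch below flags user agents that claim to be Googlebot but do not contain the expected token. It is not DataDome's implementation. The function name and the three labels are hypothetical, and the check only covers the main Googlebot crawler token; other Google crawlers use different tokens and would need their own entries.

```python
import re

# Token used by Google's main crawler in its user agent string.
# Simplification: other Google crawlers (images, ads, etc.) are not covered.
CANONICAL_TOKEN = "Googlebot/2.1; +http://www.google.com/bot.html"

# Any user agent mentioning "googlebot", in any letter case, is treated as a claim.
CLAIMS_GOOGLEBOT = re.compile(r"googlebot", re.IGNORECASE)


def classify_user_agent(user_agent: str) -> str:
    """Return a coarse label for a user agent string (labels are hypothetical)."""
    if not CLAIMS_GOOGLEBOT.search(user_agent):
        return "not_googlebot"
    if CANONICAL_TOKEN in user_agent:
        # Well-formed claim: still unverified, later layers must confirm it.
        return "plausible_googlebot"
    # Claims to be Googlebot but the token has a typo or odd formatting.
    return "malformed_googlebot"
```

A well-formed user agent only means the claim is plausible. A careful scraper can copy the exact string, which is why the next two layers are needed.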

2. IP Origin:

Googlebot crawls from IP addresses that Google owns and operates. Any hit that claims to be Googlebot but comes from a non-Google IP address can therefore be flagged as fraudulent traffic.
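A minimal Python sketch of an IP origin check might look like the following. It assumes you maintain a list of CIDR ranges for Googlebot. The two ranges shown are illustrative placeholders and should be replaced with the list Google currently publishes. The names `SAMPLE_GOOGLEBOT_RANGES` and `is_in_claimed_ranges` are hypothetical.

```python
import ipaddress

# Illustrative placeholder ranges only; load the current published list in practice.
SAMPLE_GOOGLEBOT_RANGES = [
    ipaddress.ip_network("66.249.64.0/19"),
    ipaddress.ip_network("2001:4860:4801::/48"),
]


def is_in_claimed_ranges(ip: str, ranges=SAMPLE_GOOGLEBOT_RANGES) -> bool:
    """Return True if the address falls inside one of the configured ranges."""
    try:
        address = ipaddress.ip_address(ip)
    except ValueError:
        # Not a valid IPv4 or IPv6 address, so it cannot match any range.
        return False
    # Membership is False when the address and network IP versions differ.
    return any(address in network for network in ranges)
```

A request whose user agent claims to be Googlebot but whose source address fails this check can be flagged immediately.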

However, IP origin alone is not enough to filter out all fake Googlebots. Google also offers hosting services, which make it easy for a bot to use a Google-owned IP address similar to that of a Googlebot. To ensure full protection, each request requires further analysis, which is where our third layer comes in.

3. IP Owner:

To reach full protection, DataDome uses the reverse DNS method to search for the owner of an IP address, regardless of whose server it is hosted on. The challenge is to do it quickly enough to support a user experience (UX) that is frictionless for end-users. 

For each hit, DataDome cross-checks the IP owner against a database of more than 4 billion entries. The lookup adds no latency and does not slow down the connection, so it degrades neither the UX nor the SEO ranking of the website.
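The reverse DNS method mentioned in this section can be sketched in Python as shown below. This is an assumption about how such a check is commonly done, not a description of DataDome's engine: it resolves the IP address to a hostname, checks that the hostname belongs to a Google crawler domain, then resolves the hostname forward and confirms it maps back to the same IP address. The function names and the injectable `reverse_lookup` and `forward_lookup` parameters are hypothetical and exist so the logic can be exercised without network access.

```python
import socket
from typing import Callable, List

# Domains whose hostnames are accepted for Googlebot in this sketch.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com")


def _reverse(ip: str) -> str:
    # PTR lookup: returns the primary hostname registered for the IP address.
    return socket.gethostbyaddr(ip)[0]


def _forward(hostname: str) -> List[str]:
    # Forward lookup: returns every address the hostname resolves to.
    return [info[4][0] for info in socket.getaddrinfo(hostname, None)]


def is_verified_googlebot(
    ip: str,
    reverse_lookup: Callable[[str], str] = _reverse,
    forward_lookup: Callable[[str], List[str]] = _forward,
) -> bool:
    """Forward-confirmed reverse DNS check for a claimed Googlebot address."""
    try:
        hostname = reverse_lookup(ip).rstrip(".").lower()
    except OSError:
        # No PTR record or resolver failure: the claim cannot be verified.
        return False
    if not hostname.endswith(TRUSTED_SUFFIXES):
        return False
    try:
        # Plain string comparison; assumes both sides use the same textual form.
        return ip in forward_lookup(hostname)
    except OSError:
        return False
```

The forward confirmation matters because whoever controls an IP address also controls its reverse DNS record and can make it claim any hostname. Live DNS lookups are slow, so a production system would typically cache or precompute the results rather than resolve on every hit.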

Conclusion

At the end of the day, bots are shortcuts for speeding up and automating tasks, so it’s not surprising when bots gravitate to other shortcuts. For fraudulent purposes, malicious bots will always try to impersonate commercial bots, such as Googlebot, that enjoy minimally restricted access to most websites.

At DataDome, we see countless bad bots pretending to be Googlebots. Our powerful layers of real-time detection combine with machine learning algorithms to weed out fake Googlebots at every turn, keeping them from reaching your websites, mobile apps, and APIs. DataDome’s detection engine makes each decision in less than 2 milliseconds to ensure your customers are in no way impacted by increased load times or server strain from unrestricted bot traffic or slow bot management processes.

See for yourself how many fake Googlebots are targeting your website in the DataDome dashboard with a free 30-day trial.

*** This is a Security Bloggers Network syndicated blog from DataDome authored by DataDome. Read the original post at: https://datadome.co/learning-center/scrapers-bad-bots-steal-content/