
How to use your Robots.txt to (even partially) block Bots from crawling your site

Search engines use automated programs (robots, or bots for short) to gather information from websites. These bots, known as web crawlers, read a site's robots.txt file to determine which pages they may crawl and index. The index serves as a repository that enables the search engine to decide which pages to display in search results for relevant keywords or topics. When you carry out a search, the search engine consults its index to find the pages most relevant to your query.

Without clear instructions from the robots.txt file, a web crawler can crawl and index every page on a website. This can have several negative impacts on the user experience and on the site's SEO:

  • Low-priority pages, such as login pages or terms and conditions, may achieve higher search engine rankings than high-value content, such as blog pages or your home page.
  • Searching users may not see the most relevant content in their search results. For example, if a site does not block web crawlers from accessing outdated pages, users might see them instead of the most current pages.
  • Duplicate content can appear in search results. For example, a test page in the sitemap that mirrors another page may be indexed.
  • Without a “disallow” instruction in the robots.txt file, web crawlers can overload servers by crawling unnecessary pages, causing performance issues for users.

What Is a Robots.txt File?

The robots.txt file is a simple text file located in the root directory of a website domain. Web crawlers use the directives in the robots.txt file to determine what pages to index. The directives in a robots.txt file apply to all pages on a site, including HTML, PDF, or other non-media formats indexed by search engines.

The instructions in the robots.txt file determine how search engine crawlers analyze the pages, structure, and metadata of each page, such as keywords, titles, and descriptions. This information is then stored in a database called an index. Search engines use the index to locate content that relates to user queries. When a user inputs a search engine query, the search engine retrieves results from its indexed database. Algorithms determine the relevance of the results by evaluating factors such as keyword matches, page quality, and user engagement metrics. Indexing ensures that the search engine can deliver accurate and fast results based on its analysis of the crawled web pages.

Directing the search engine bots to relevant pages is a crucial aspect of search engine optimization (SEO). Doing so makes sure that only high-quality, up-to-date pages are indexed and can be ranked in search results. Pages that are not indexed are harder for users to find as search engines won’t link them to user queries via keywords.

For example, stopping a web crawler from indexing a page about an out-of-date offer keeps it from ranking in search engine results and competing with your current content.

The “disallow” directive in the robots.txt file is used to block specific web crawlers from accessing designated pages or sections of a website.

Optimizing robots.txt with the “disallow” directive can also help reduce the load on a website’s server. When web crawlers access a website too frequently, or all at once, they can generate a large number of requests in a short period. This can put significant strain on a server’s capacity.

Crawling resource-intensive pages such as videos, high-resolution images, or pages that update data in real time also puts added load on the server. Directing crawlers away from resource-intensive pages preserves the server’s processing capacity and allows for faster site performance. This results in faster page loading, more responsive user interactions, and improved efficiency in managing dynamic elements like databases.

It’s important to note that this measure should only be applied if heavy web crawler traffic is causing slow performance for users. If web crawler traffic isn’t slowing down site performance, it’s not necessary to restrict access to these pages.

Here is an example of a simple robots.txt file using the “disallow” directive:

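A minimal sketch of such a file, assuming the /nogooglebot/ path described below:

    User-agent: Googlebot
    Disallow: /nogooglebot/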

In this example, the robots.txt file is blocking Googlebot (the user-agent) from accessing URLs that begin with https://example.com/nogooglebot/.

A slightly more complicated robots.txt file might look like this:

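One possible sketch of such a file, assuming Googlebot and Bingbot are blocked from the entire site:

    User-agent: Googlebot
    Disallow: /

    User-agent: Bingbot
    Disallow: /

    User-agent: *
    Disallow: /private.html
    Disallow: /special-offers.html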

In this example, robots.txt blocks the user-agents Googlebot and Bingbot. It also disallows all web crawlers from accessing /private.html and /special-offers.html; the asterisk character (*) acts as a wildcard here.

Good to know: What is a * wildcard?

A wildcard is a character that represents one or more unspecified characters in a search or pattern. In this case, the asterisk wildcard is used to block all web crawlers from crawling /private.html and /special-offers.html.

In some cases, robots.txt can be configured with crawl-delay. The crawl-delay directive limits how often a bot can visit a site and request pages to index. Crawl-delay stops bots from overwhelming a site if it has limited server resources or a lot of resource-intensive pages. It ensures the server can handle the traffic without slowing down or crashing. To implement crawl-delay, add ‘Crawl-delay: 10’ in the robots.txt file. The number specifies the delay in seconds between bot requests.

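For example, asking all compliant bots to wait ten seconds between requests:

    User-agent: *
    Crawl-delay: 10

Support for this directive varies by crawler: Bingbot and Yahoo’s Slurp have historically honored crawl-delay, while Googlebot ignores it.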

How to Format and What to Include in a Robots.txt File

There are two main terms to be aware of when configuring a robots.txt file:

  • User-Agents: User-agents are simply the names that web crawler bots use to identify themselves. To block a specific indexing robot, such as Googlebot or Bingbot, put its name on the user-agent line of your robots.txt file, just as in the Googlebot example above.
  • Allow and Disallow: The allow and disallow directives specify which pages or paths on your site bots can and cannot crawl. File names and paths specified in these directives are case-sensitive. Using a single forward slash (for example, Disallow: /) applies the directive to the entire site.

To block a specific URL, use the disallow directive as below:

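A sketch, using a hypothetical /old-offer.html page as the URL to block:

    User-agent: *
    Disallow: /old-offer.html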

To block specific files, you must specify the file path. In this case, a PDF file:

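A sketch, assuming a hypothetical /downloads/brochure.pdf file path:

    User-agent: *
    Disallow: /downloads/brochure.pdf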

To block specific user-agents, make sure to target the bots by user-agent name:

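For instance, a sketch blocking Bingbot from the whole site while leaving other crawlers unaffected:

    User-agent: Bingbot
    Disallow: /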

It is also possible to block multiple bots or URLs using the asterisk character:

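A sketch that uses the asterisk both to address every bot and to match a URL pattern (the /drafts/ directory and the .pdf rule are illustrative):

    User-agent: *
    Disallow: /drafts/
    Disallow: /*.pdf$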

Blocking robots from accessing web pages doesn’t remove the pages or stop users from reaching them. Web crawlers can also follow external links to index a page even if it has been blocked from direct crawling on the original website. For example, an external blog might link to an old blog page on your site. The robots.txt file is also publicly accessible, so any user can see what content is being restricted, although they cannot change the restrictions. For these reasons, robots.txt is not the best way to hide sensitive information from the public.

It’s advisable to add a noindex meta tag and password protection to any pages you want to keep completely private, for example, admin panels or user account pages.

The noindex tag is placed within the HTML <head> section and explicitly tells search engines not to index the content of that page.

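For example, within a page’s <head> section:

    <meta name="robots" content="noindex">

The name attribute can also target a single crawler, for example name="googlebot", if you only want to keep that bot’s index clear of the page.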

Understanding the Top Four Search Engine Bots

The top four search engine bots are:

  • Googlebot (Google)
  • Bingbot (Bing)
  • Slurp (Yahoo)
  • DuckDuckBot (DuckDuckGo)

Each one of these bots has a different way of reading and respecting the rules outlined in a website’s robots.txt file.

Googlebot, for instance, will follow the rules outlined in a robots.txt file but it is programmed to have a high frequency of crawling activity. Googlebot may ignore directives if there are minor formatting errors or issues in the robots.txt file.

As an example, Disallow: /private/ might not work if the directory is listed as /Private/ in the URL, because paths in robots.txt are case-sensitive. Googlebot is designed to index as much content as it can. This means that even if Googlebot has been restricted in a robots.txt file, it may still find and index pages via external links from other websites or cached versions.

Bingbot also follows the directives in robots.txt but is slightly more lenient when dealing with minor errors. However, if the directives are not properly formatted, Bingbot might still index those pages despite the instructions to block them.

The Yahoo Slurp bot is generally considered less strict in following directives in the robots.txt file.  In most cases, it will avoid pages restricted by the disallow directive.

The DuckDuckBot is usually respectful of directives outlined in the robots.txt file and does not vary its behavior or make exceptions.

How to Create a Robots.txt File

Many content management systems, such as Wix or WordPress, automatically create a robots.txt file. However, the default directives in these files will not be customized to every individual page or piece of content. You may still need to manually customize the robots.txt file.

To do so, you can use common text editors like Notepad or TextEdit. It’s important not to use a word processor as this can result in unexpected characters appearing which will compromise the integrity of the code and cause it to malfunction.

Follow the below rules when creating a robots.txt file:

  • The file must be named robots.txt
  • A site can have only one robots.txt file
  • The robots.txt file must be saved with UTF-8 encoding

Where to Upload a Robots.txt File

Upload the robots.txt file to your website’s root directory. The root directory is the top-level folder that contains all other files. If your website is called www.website.com then upload your robots.txt file as www.website.com/robots.txt.

All domain hosting sites have different server architectures and ways of uploading robots.txt files. Your domain hosting provider will be able to provide you with exact instructions.

How to Check and Verify Your Robots.txt File

Open a private browsing window and navigate to your robots.txt URL (for example, www.website.com/robots.txt). If it doesn’t load, you may need to check with your domain provider to see if the file was uploaded correctly.
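If you prefer the command line, a quick check (assuming your domain is www.website.com) might be:

    curl https://www.website.com/robots.txt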

You can verify the robots.txt file via Google Search Console. Google’s robots.txt Tester lets you test your file to verify that it is blocking the right content from web crawlers. It’s important to note that the tester is a simulation tool only: any changes you make in it won’t be reflected in your actual robots.txt file.

The Google robots.txt Tester only tests against Googlebot and other Google-related bots, so it’s advisable to use another tool to test other bots like Bingbot. There are numerous robots.txt tester tools available online.

Blocking AI Crawler Bots Using Your Robots.txt File

Artificial intelligence (AI) bots crawl websites for training data. Many people object to their data being used to train large language models (LLMs) for ethical reasons.

Blocking robots from AI companies is much the same as blocking robots from search engines. All you need to know is the name of the user agent.

In this example, the disallow directive is used to block OpenAI’s ChatGPT-related crawler:

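A sketch using GPTBot, the user-agent OpenAI documents for its training-data crawler (check OpenAI’s documentation for its current list of user-agents):

    User-agent: GPTBot
    Disallow: /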

Partial wildcards in the user-agent line (for example, trying to match every bot with “AI” in its name) are not part of the robots exclusion standard, and most crawlers won’t honor them. The reliable way to cover several AI bots is to list each one explicitly:

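A sketch listing several publicly documented AI crawler user-agents (verify the current names against each vendor’s documentation):

    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: CCBot
    Disallow: /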

Be aware that blocking an AI bot via robots.txt is a request, not an enforcement mechanism: it stops compliant bots from crawling and indexing your content, but it doesn’t physically prevent a bot from visiting or reading pages. If bots can reach the content, for example via external links or copies on other sites, they can still analyze it for training purposes to improve algorithms or models. This can occur even if the pages aren’t indexed and the content was never intended to be visible to the public.

To more reliably stop AI bots from collecting data, use a combination of robots.txt, the noindex tag, and access controls such as password protection or dedicated bot management.

Common Robots.txt Mistakes

Three of the most common robots.txt mistakes are:

  • Over-Blocking: Blocking too many pages restricts a search engine’s ability to crawl and accurately rank your pages.
  • Syntax Errors: Typos can be costly. Syntax errors will disrupt your code, causing the robots.txt file to malfunction and become ineffective.
  • Blocking Important Resources: Blocking CSS or JavaScript files stops web crawlers from being able to render your site properly. These files control a site’s layout and functionality. Without access, crawlers can misinterpret the page’s structure, which can impact indexing and search rankings.

How Datadome Can Help You Block and Manage Crawlers

Your website’s robots.txt file can also help protect your data by blocking bad bots. DataDome’s research has shown that hackers and fraudsters often use fake Googlebots to steal content. Almost 3 in 4 of these bots were not detected or blocked. In one case, a prominent real estate company was being attacked by scraper bots. In just one month, no fewer than 1.7 million bots, including 14 named attacks, were blocked.

Scalper bots are used to purchase high-demand items like concert tickets or gaming consoles. They harm consumers by creating artificial scarcity and driving up prices. To be able to make multiple purchases without being detected and blocked, scalper bots must often bypass a series of security and access controls, such as inventory limits, CAPTCHAs, and more.

DataDome is a proven bot management solution that can help you effectively manage bot traffic. DataDome can analyze traffic to your website in milliseconds and block bots before they cause damage.


Robots.txt Disallow FAQs

Can I Block AI Bots?

Yes. AI bots can be blocked by adding their user-agent name to a user-agent line in the robots.txt file, followed by a disallow directive.

What Happens If I Don’t Have a Robots.txt?

Without a robots.txt file, search engine web crawlers can crawl and index every page they find on your site. This can result in irrelevant content being indexed, which can negatively impact your page rankings.

What is the Difference Between Robots.txt and Meta Tags?

Robots.txt controls crawler access at the site and directory level, while robots meta tags manage indexing behavior for individual pages.

*** This is a Security Bloggers Network syndicated blog from DataDome authored by DataDome. Read the original post at: https://datadome.co/bot-management-protection/blocking-with-robots-txt/