What is a robots.txt file and how to set it up


Have you ever wondered how websites control which areas are off-limits to search engines? The answer lies in a tiny yet influential file called ‘robots.txt’. You can use it to communicate with the search bots that crawl your website, but you must have a deeper understanding of their language to be able to use it properly.

In this article, we will delve into the details of what a robots.txt file is, how to configure it, and how to check if the file is working properly. What’s more, we will provide general guidelines for the contents of a robots.txt file.

Let’s jump in!

What is a robots.txt file?

A robots.txt file is a text document located in the root directory of a website, containing information specifically intended for search engine crawlers. It instructs them on which URLs, including pages, files, folders, etc., should be crawled and which ones should not. While the presence of this file is not mandatory for a website’s operation, its correct setup is crucial for effective SEO.

The decision to use robots.txt was made back in 1994 as part of the Robots Exclusion Standard. According to Google Search Central, the primary purpose of this file is not to hide web pages from search results, but instead to limit the number of requests made by robots to sites and to reduce server load.

Generally speaking, the content of the robots.txt file should be viewed as a recommendation for search crawlers that defines the rules for website crawling. To access the content of a site’s robots.txt file, simply type “/robots.txt” after the domain name in the browser.
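As a rough illustration, the address of a site’s robots.txt file can be derived from any page URL on that site. The sketch below uses Python’s standard urllib.parse module; the function name robots_txt_url and the example URL are made up for this illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; robots.txt sits at the root path.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://www.example.com/blog/post?id=7"))
```

Note that each host (including each subdomain) gets its own address here, which matches the idea that the file lives in the root directory of the site it applies to.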

How does robots.txt work?

First of all, it’s important to note that search engines need to crawl and index pages before they can display them on SERPs. To accomplish this task, web crawlers systematically browse the web, collecting data from each webpage they encounter. The term “spidering” is occasionally used to describe this crawling activity.

When crawlers reach a website, they check the robots.txt file, which contains instructions on how to crawl and index pages on the website. If there is no robots.txt file, or it does not include any directives that forbid user-agent activity, search bots will proceed to crawl other information on the site.

Why do you need robots.txt?

The primary function of the robots.txt file is to prevent the scanning of pages and resource files, which allows the crawl budget to be allocated more efficiently. In the vast majority of cases, the robots.txt file hides information that provides no value to website visitors and search bots. What’s more, the robots.txt file is often used to improve how efficiently web crawling resources are utilized.

Note: Using the “robots.txt disallow” directive does not guarantee that a particular webpage will be excluded from SERPs. Google reserves the right to consider various external factors, such as incoming links, when determining the relevance of a webpage and its inclusion in search results. To explicitly prevent a page from being indexed, it is recommended to use the “noindex” robots meta tag or the X-Robots-Tag HTTP header. Password protection can also be used to prevent indexing.

Optimize Crawl Budget

Crawl budget refers to the number of pages that a search robot will crawl on a specific website within a given period. To use the crawl budget more efficiently, search robots should be directed only to the most important content on websites and blocked from accessing unhelpful information.

Optimizing the crawl budget helps search engines allocate their limited resources efficiently, resulting in faster indexing of new content and improved visibility in search results. It’s important to keep in mind, however, that surpassing your site’s allocated crawl capacity can result in unindexed pages on your website, and unindexed pages can’t appear anywhere on the SERP. So consider your crawl budget if you have a large website or a significant percentage of unindexed pages.

To monitor and analyze the rankings of webpages indexed by Google, you can use the Google Rank Tracking tool. This tool provides 100% accurate keyword rankings in Google and valuable insights on search volume, SERP snippets, traffic forecast, visibility, and way more.

Let’s consider a scenario where your website has a significant amount of content such as PDFs, videos, and images that hold less significance compared to the website’s primary content. In such cases, you can tailor your approach to exclude these resources from search engine indexing, thereby optimizing the crawl budget.

For instance, you can use the “Disallow” directive with a file-extension pattern, such as “Disallow: /*.pdf”, to stop search engines from crawling any PDF resources on your site. This keeps crawlers away from such resources, although, as noted above, a blocked URL can still be indexed if other pages link to it.

Another common benefit of using robots.txt is its ability to address content-crawling issues on your server, if any. For instance, if you have infinite calendar scripts that may cause problems when frequently accessed by robots, you can disallow the crawling of that script through the robots.txt file.

You may also wonder whether it is better to use robots.txt to block affiliate links in order to manage your website’s crawl budget or to utilize the noindex tag to prevent search engines from indexing those links. The answer is simple: Google is pretty good at identifying and disregarding affiliate links on its own. But by using robots.txt to disallow them, you retain control and potentially conserve the crawl budget more effectively.

Example of robots.txt content

Having a template with up-to-date directives can help you create a properly formatted robots.txt file that specifies the required robots and restricts access to the relevant files.

User-agent: [bot name]

Disallow: /[path to file or folder]/

Disallow: /[path to file or folder]/

Disallow: /[path to file or folder]/

Sitemap: [Sitemap URL]

Now, let’s explore a few examples of what a robots.txt file might look like.

1. Allowing all web crawlers access to all content.

Here’s a basic example of a robots.txt file that grants all web crawlers access to the entire website:

WizzAir robots.txt

In this example, the “User-agent” directive uses an asterisk (*) to apply the instructions to all web crawlers. The “Disallow” directive is left empty, indicating that no content is blocked. This allows all web crawlers unrestricted access to all parts of the website.

2. Blocking a specific web crawler from a specific web page.

The following example specifies the access permissions for the “Bingbot” user-agent, which is the web crawler used by Microsoft’s search engine, Bing. It includes a list of website directories that are closed for scanning, as well as a few directories and pages that are allowed to be accessed on the website.

Airbnb robots.txt

3. Blocking all web crawlers from all content.

User-agent: *
Disallow: /

In this example, the “User-agent” directive still applies to all web crawlers. However, the “Disallow” directive uses a forward slash (/) as its value, indicating that all content on the website should be blocked from access by any web crawler. This effectively tells all robots not to crawl any pages on the site.

Please note that blocking all web crawlers from accessing a website’s content using the robots.txt file is an extreme measure and is not recommended in most cases. Websites typically use the robots.txt file to control access to specific parts of their site, such as blocking certain directories or files, rather than blocking all content.
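To see the difference between an empty Disallow value and “Disallow: /” in practice, here is a minimal sketch that uses Python’s standard urllib.robotparser module. The bot name “ExampleBot”, the URL, and the helper can_crawl are placeholders chosen for this illustration.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


ALLOW_ALL = "User-agent: *\nDisallow:\n"   # empty value: nothing is blocked
BLOCK_ALL = "User-agent: *\nDisallow: /\n"  # "/" blocks every path

# "ExampleBot" and the URL are placeholders for this illustration.
print(can_crawl(ALLOW_ALL, "ExampleBot", "https://example.com/any/page"))
print(can_crawl(BLOCK_ALL, "ExampleBot", "https://example.com/any/page"))
```

The first call should report that crawling is permitted and the second that it is not, mirroring examples 1 and 3 above.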

How to find robots.txt

When it comes to locating the robots.txt file on a website, there are a couple of methods you can use:

  1. Check the domain + “/robots.txt”. 

The most common way to find the robots.txt file is by appending “/robots.txt” to the domain name of the website you want to examine. For example, if the website’s domain is “example.com,” you would enter “example.com/robots.txt” into your web browser’s address bar. This will take you directly to the robots.txt file if it exists on the website.

  2. Analyze your website using automated tools like SE Ranking’s Site Audit.

Another way to identify the presence of a robots.txt file is by utilizing a website audit tool. This tool, for example, checks your site and provides information on whether you have a robots.txt file and which pages it blocks. Review the blocked pages to determine if they should be blocked or if access was accidentally restricted.

To start the audit, simply initiate the process and wait for it to complete (you’ll receive a notification in your inbox). Then, go to the Issue Report, select the Crawling block, and check for the Robots.txt file not found problem.

Robots.txt file not found problem


How search engines find your robots.txt file

Search engines have specific mechanisms to discover and access the robots.txt file on your website. Here’s how they typically find it:

1. Crawling a website: Search engine crawlers continuously traverse the web, visiting websites and following links to discover web pages.

2. Requesting robots.txt: When a search engine crawler accesses a website, it looks for the presence of a robots.txt file by adding “/robots.txt” to the website’s domain. 

Note: After successfully uploading and testing your robots.txt file, Google’s crawlers will automatically detect it and begin using its instructions. There is no need for you to take any further action. However, if you have made modifications to your robots.txt file and want to promptly update Google’s cached version, you’ll need to learn how to submit an updated robots.txt file.

3. Retrieving robots.txt: If a robots.txt file exists at the requested location, the crawler will download and parse the file to determine the crawling directives. 

4. Following instructions: After obtaining the robots.txt file, the search engine crawler follows the instructions outlined within it. 

Robots.txt vs meta robots vs x-robots

While the robots.txt file, robots meta tag, and X-Robots-Tag serve similar purposes in terms of instructing search engine bots, they differ in their application and effectiveness.

When it comes to hiding the site content from search results, relying solely on the robots.txt file may not be enough. As mentioned above, the robots.txt file is primarily used to communicate with web crawlers and inform them about which areas of a website they are allowed to access. However, it does not guarantee that the content will not be indexed by search engines. To prevent indexing, webmasters should employ additional methods.

One effective technique is using the robots meta tag, which is placed within the <head> section of a page’s HTML code. By including a meta tag with the “noindex” directive, webmasters explicitly signal search engine bots that the page’s content should not be indexed. This method provides more precise control over individual pages and their indexing status compared to the broad directives of the robots.txt file.

Here’s an example code snippet for preventing search engine indexing at the page level:

<meta name="robots" content="noindex">

By including this meta tag within the <head> section, website owners can effectively communicate to search engine bots that the content of this particular page should not be indexed.
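If you want to verify that a page actually carries this tag, a small script can look for it. The following is a minimal sketch based on Python’s standard html.parser module; the class RobotsMetaParser and the function has_noindex are hypothetical names, and the HTML strings are sample input rather than real pages.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            # A single tag may list several directives, e.g. "noindex, nofollow".
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def has_noindex(html: str) -> bool:
    """Return True if the HTML contains a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(has_noindex(page))
```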

You can also use the X-Robots-Tag, an HTTP response header that is typically set in the site’s server configuration file, to further limit page indexing. By returning this header with a “noindex” value for a given URL, webmasters can directly communicate to search engine bots that the page’s content should not be indexed. This method offers an additional layer of control and flexibility, and it also works for non-HTML files such as PDFs.
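The X-Robots-Tag travels as an HTTP response header. It is usually set in the web server’s configuration; purely for illustration, the sketch below shows the same idea in a tiny Python WSGI application that adds the header to responses for PDF paths. The names app and collect_headers, the path, and the placeholder body are assumptions made for this example.

```python
def app(environ, start_response):
    """Minimal WSGI app that marks responses for .pdf paths as noindex."""
    path = environ.get("PATH_INFO", "/")
    headers = [("Content-Type", "application/octet-stream")]
    if path.lower().endswith(".pdf"):
        # Ask compliant crawlers not to index this response.
        headers.append(("X-Robots-Tag", "noindex"))
    start_response("200 OK", headers)
    return [b"placeholder body"]


def collect_headers(path):
    """Call the app with a fake request and return its response headers."""
    captured = {}

    def start_response(status, headers):
        captured["headers"] = dict(headers)

    app({"PATH_INFO": path}, start_response)
    return captured["headers"]


print(collect_headers("/files/report.pdf").get("X-Robots-Tag"))
print(collect_headers("/blog/post").get("X-Robots-Tag"))
```

Only the PDF response carries the header; the regular page is left without it.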

To learn more about this topic, make sure to read our complete guide on the robots meta tag and X-Robots-Tag.

Pages and files that are usually closed off via robots.txt

1. Admin dashboard and system files.

Internal and service files that website administrators or webmasters interact with.

2. Auxiliary pages that only appear after specific user actions.

These can include messages that clients receive after successfully completing an order, client forms, authorization or password recovery pages.

3. Search pages.

Pages displayed after a website visitor enters a query into the site’s search box are usually closed off from search engine crawlers.  

4. Filter pages.

Results that are displayed with an applied filter (size, color, manufacturer, etc.) are separate pages and can be looked at as duplicate content. SEO experts typically prevent them from being crawled unless they drive traffic for brand keywords or other target queries. Aggregator sites may be an exception.

5. Files of a certain format.

Files like photos, videos, .PDF documents, JS files. With the help of robots.txt, you can restrict the scanning of individual or extension-specific files.

Robots.txt syntax

Understanding the syntax and structure of the robots.txt file is essential for webmasters to control the visibility of their web pages on search engines. Usually, the robots.txt file contains a set of rules that determine which files on a domain or subdomain can be accessed by crawlers. These rules can either block or allow access to specific file paths. By default, if not explicitly stated in the robots.txt file, all files are assumed to be allowed for crawling.

The robots.txt file consists of groups, each containing multiple rules or directives. These rules are listed one per line. Each group begins with a User-agent line that specifies the target audience for the rules.

A group provides the following information:

  • The user agent to which the rules apply.
  • The directories or files that the user agent is allowed to access.
  • The directories or files that the user agent is not allowed to access.

When processing the robots.txt file, crawlers follow a top-to-bottom approach. A user agent can only match one rule set. If there are multiple groups targeting the same user agent, these groups are merged into a single group before being processed.

Here’s an example of a basic robots.txt file with two rules:

User-agent: Googlebot
Disallow: /nogooglebot/

User-agent: *
Allow: /

Sitemap: https://www.example.com/sitemap.xml
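To check how the groups in this file apply to different crawlers, you could feed it to Python’s standard urllib.robotparser module. This is only a sketch: “OtherBot” and the page URLs are placeholders, and the standard-library parser is not guaranteed to behave exactly like any particular search engine’s crawler.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /nogooglebot/

User-agent: *
Allow: /

Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot falls under the first group; any other bot falls under "*".
for agent in ("Googlebot", "OtherBot"):
    url = "https://www.example.com/nogooglebot/page.html"
    print(agent, parser.can_fetch(agent, url))
```

Googlebot should be refused for the /nogooglebot/ path while the other bot is allowed, and Googlebot remains free to crawl the rest of the site.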

If you want more precise control over web crawler behavior, you can use wildcard patterns for more flexible instructions. Note that robots.txt does not support full regular expressions; Google and most major crawlers recognize only two special characters.

The first is the asterisk (*), which acts as a wildcard, representing any sequence of characters. For example, to allow access to all URLs under a specific directory, you can use the pattern “/example/*” in your robots.txt file. This would match URLs like “/example/page1.html”, “/example/subdirectory/page2.html”, and so on, allowing the web robots to crawl those URLs.

The second is the dollar sign ($), which signifies the end of the URL path. For instance, the pattern “/blog/$” matches only the path “/blog/” itself. It would not match URLs like “/blog/article” or “/blog/page/2”, and it would not match “/category/blog/” either, because patterns are compared from the beginning of the path. To also match nested paths such as “/category/blog/”, use a pattern like “/*/blog/$”.
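The two special characters can be modeled with a few lines of code. The sketch below translates a robots.txt path pattern into a Python regular expression, assuming patterns are compared from the start of the URL path, as Google documents. The function names pattern_to_regex and pattern_matches are made up for this example, and real crawlers may differ in edge cases.

```python
import re


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into a compiled regex.

    '*' becomes "any sequence of characters"; a trailing '$' pins the
    match to the end of the path. Everything else is matched literally.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def pattern_matches(pattern: str, path: str) -> bool:
    # re.match anchors at the start of the path, like robots.txt rules.
    return pattern_to_regex(pattern).match(path) is not None


print(pattern_matches("/*.pdf", "/docs/guide.pdf"))
print(pattern_matches("/blog/$", "/blog/"))
print(pattern_matches("/blog/$", "/blog/article"))
```

Without the dollar sign a pattern behaves as a prefix, so “/blog/” on its own would match “/blog/article” as well.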

Now, let’s look at different elements of robots.txt syntax in more detail.

The User-Agent Directive

The user-agent directive is mandatory and defines the search robot to which the rules apply. Each rule group starts with this directive if there are several bots.

Google has several bots responsible for different types of content.

  • Googlebot: crawls websites for desktop and mobile devices
  • Googlebot Image: displays site images in the “Images” section
  • Googlebot Video: scans and displays videos
  • Googlebot News: selects useful and high-quality articles for the “News” section
  • Google-InspectionTool: a URL testing tool that mimics Googlebot by crawling every page it’s allowed access to
  • Google StoreBot: scans various web page types, such as product details, cart, and checkout pages
  • Adsense: ranks a site as an ad platform in terms of ad relevance

The complete list of Google robots (user agents) is available in the official Help documentation.

Other search engines also have their relevant robots, such as Bingbot for Bing, Slurp for Yahoo!, Baiduspider for Baidu, and many more. There are over 500 various search engine bots.

Example

  • User-agent: * applies to all existing robots.
  • User-agent: Googlebot applies to Google’s robot.
  • User-agent: Bingbot applies to Bing’s robot.
  • User-agent: Slurp applies to Yahoo!’s robot.

The Disallow Directive

Disallow is a key command that instructs search engine bots not to scan a page, file or folder. The names of the files and folders that you want to restrict access to are indicated after the “/” symbol.

Example 1. Specifying different parameters after Disallow.

Disallow: /link to page disallows access to a specific URL.

Disallow: /folder name/ closes access to the folder.

Disallow: /image/ closes access to the image.

Disallow: /. The absence of any instructions after the “/” symbol indicates that the site is completely closed off from scanning, which can be useful during website development.

Example 2. Disabling the scanning of all .PDF files on the site.

User-agent: Googlebot

Disallow: /*.pdf

The Allow Directive

In the robots.txt file, the Allow directive functions opposite to Disallow by granting access to website content. These commands are often used together, especially when you need to open access to specific information like a photo in a hidden media file directory.

Example. Using Allow to scan one image in a closed album.

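Specify the Allow directive with the image URL and, in another line, the Disallow directive along with the folder name where the file is located. Google applies the most specific (longest) matching rule, so for Googlebot the order of the two lines does not matter. Some other parsers apply the first rule that matches, so placing the Allow line first is the safer choice.

Allow: /album/picture1.jpg

Disallow: /album/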
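Google documents that when both an Allow and a Disallow rule match a URL, the most specific (longest) rule wins, and a tie goes to Allow. The following sketch models that rule for plain path prefixes only; the function is_allowed and the rule tuples are invented for this illustration, and wildcards are deliberately left out.

```python
def is_allowed(rules, path):
    """Apply the longest-match rule to (directive, prefix) pairs.

    Simplified: prefixes are plain strings, wildcards are not handled.
    """
    best_directive, best_length = "allow", -1  # no matching rule means allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        longer = len(prefix) > best_length
        tie_for_allow = len(prefix) == best_length and directive == "allow"
        if longer or tie_for_allow:
            best_directive, best_length = directive, len(prefix)
    return best_directive == "allow"


ALBUM_RULES = [("disallow", "/album/"), ("allow", "/album/picture1.jpg")]

print(is_allowed(ALBUM_RULES, "/album/picture1.jpg"))
print(is_allowed(ALBUM_RULES, "/album/picture2.jpg"))
```

Under this model the single picture stays crawlable while the rest of the album is blocked, whichever line comes first. Parsers that stop at the first matching rule can give a different answer, which is one reason to test your file against the crawlers you care about.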

The “robots.txt Allow All” directive is typically used when there are no specific restrictions or disallowances for search engines. However, it’s important to note that the “Allow: /” directive is not a necessary component of the robots.txt file. In fact, some webmasters choose not to include it at all, relying solely on the default behavior of search engine crawlers.

The Sitemap Directive

The sitemap directive in robots.txt indicates the path to the sitemap. This directive can be omitted if the sitemap has a standard name, is located in the root directory, and is accessible through the link “site name”/sitemap.xml, similar to the robots.txt file.

Example

Sitemap: https://website.com/sitemap.xml

While the robots.txt file is primarily used to control the scanning of your website, the sitemap helps search engines understand the organization and hierarchy of your content. By including a link to your sitemap in the robots.txt file, you provide search engine crawlers with an easy way to locate and analyze the sitemap, leading to more efficient crawling and indexing of your website. So including a reference to your sitemap in the robots.txt file is not mandatory, but highly recommended. 
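Crawlers and tools read the sitemap location straight from the file. As a small illustration, Python’s standard urllib.robotparser module (version 3.8 or later) exposes any Sitemap lines it finds through site_maps(). The file contents below are sample data.

```python
from urllib.robotparser import RobotFileParser

sitemap_parser = RobotFileParser()
sitemap_parser.parse([
    "User-agent: *",
    "Disallow:",
    "Sitemap: https://website.com/sitemap.xml",
])

# site_maps() returns a list of sitemap URLs, or None if the file has none.
print(sitemap_parser.site_maps())
```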

How to create a robots.txt file

A well-crafted robots.txt file serves as the foundation of technical SEO. 

Since the file has a .txt extension, any text editor that supports UTF-8 encoding will suffice. The simplest options are Notepad (Windows) or TextEdit (Mac).

Most CMS platforms also provide solutions for creating a robots.txt file. For instance, WordPress creates a virtual robots.txt file by default, which can be viewed online by appending “/robots.txt” to the website’s domain name. However, to modify this file, you need to create your own version. This can be done either through a plugin (e.g., Yoast or All in One SEO Pack) or manually.

Magento and Wix, as CMS platforms, also automatically generate the robots.txt file, but it contains only basic instructions for web crawlers. This is why it’s recommended to make custom robots.txt instructions within these systems to accurately optimize the crawling budget. 

You can also use tools like SE Ranking’s Robots.txt Generator to generate a custom robots.txt file based on the specified information. You have the option to create a robots.txt file from scratch or to choose one of the suggested options. 

If you create a robots.txt file from scratch, you can personalize the file in the following ways:

  • By configuring directives for crawling permissions. 
  • By specifying specific pages and files through the path parameter. 
  • By determining which bots should adhere to these directives. 

Alternatively, pre-existing robots.txt templates, including widely used general and CMS directives, can be selected. It is also possible to include a sitemap within the file. This tool saves time by providing a ready-made robots.txt file for download.
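The same three choices (which bots, which directives, which paths) are all a generator needs. Below is a minimal sketch of such a generator in Python; it is not how any particular tool works, and the function build_robots_txt, the paths, and the sitemap URL are assumptions made for this example.

```python
def build_robots_txt(groups, sitemaps=()):
    """Render robots.txt text.

    groups: list of (user_agent, [(directive, path), ...]) tuples.
    sitemaps: iterable of absolute sitemap URLs.
    """
    blocks = []
    for user_agent, rules in groups:
        lines = [f"User-agent: {user_agent}"]
        lines += [f"{directive}: {path}" for directive, path in rules]
        blocks.append("\n".join(lines))
    sitemap_lines = [f"Sitemap: {url}" for url in sitemaps]
    if sitemap_lines:
        blocks.append("\n".join(sitemap_lines))
    # Separate groups with a blank line and end the file with a newline.
    return "\n\n".join(blocks) + "\n"


generated = build_robots_txt(
    [("*", [("Disallow", "/admin/"), ("Disallow", "/search/")])],
    sitemaps=["https://www.example.com/sitemap.xml"],
)
print(generated)
```

When saving the result, write it as UTF-8 under the name robots.txt, for example with open("robots.txt", "w", encoding="utf-8").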

Document title and size

The robots.txt file should be named exactly as mentioned, without the use of capital letters. According to Google guidelines, the file size should not exceed 500 KiB. Exceeding this limit may result in partial processing, no crawling of the website at all, or, conversely, complete scanning of the website’s content.

Where to place the file

The robots.txt file must be located at the root directory of the website host and can be accessed via FTP. Before making any changes, it is recommended to download a backup copy of the existing robots.txt file.

How to check your robots.txt file

Errors in the robots.txt file can lead to the exclusion of important pages from the search index or even render the entire site practically invisible to search engines.

You can easily check your robots.txt file with SE Ranking’s free Robots.txt Tester. Simply enter up to 100 URLs to test and verify if they are allowed for scanning. 
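For a quick local check of a batch of URLs, a short script can do something similar. The sketch below relies on Python’s standard urllib.robotparser module; the function blocked_urls, the rules, and the URLs are sample data. The standard-library parser follows the original standard and may not interpret * and $ patterns the way Google does, so treat the result as a first pass rather than a final verdict.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt, user_agent, urls):
    """Return the URLs from urls that user_agent may not fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


SAMPLE_RULES = "User-agent: *\nDisallow: /admin/\nDisallow: /search/\n"
SAMPLE_URLS = [
    "https://example.com/",
    "https://example.com/admin/login",
    "https://example.com/blog/post",
    "https://example.com/search/shoes",
]

for url in blocked_urls(SAMPLE_RULES, "ExampleBot", SAMPLE_URLS):
    print("blocked:", url)
```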

Alternatively, you can use the testing tool in Google Search Console. Note that the robots.txt file check option is missing in the new Google Search Console interface and needs to be accessed directly.

robots.txt Tester

Common robots.txt issues

When managing your website’s robots.txt file, several issues can impact how search engine crawlers interact with your site. Some of the most common ones include:

  • Format mismatch: If the file is not created in the .txt format, web crawlers will not be able to detect and analyze it.
  • Wrong placement: Your robots.txt file should be located in the root directory. If it is located, for instance, in a subfolder, search bots may fail to find and access it.
  • Disallow without value: A Disallow directive without any content implies that bots have permission to visit any pages on your website.
  • Blank lines in the robots.txt file: Ensure there are no blank lines between directives. Otherwise, web crawlers might have difficulty parsing the file. The only case where a blank line is allowed is before indicating a new User-agent.
  • Blocking a page in robots.txt and adding a “noindex” directive: These two settings work against each other. If crawling is blocked, search engines never fetch the page and so never see the “noindex” instruction, which means the URL can still end up in the index. Use robots.txt to block crawling or “noindex” to prevent indexing, but not both on the same page.

Additional tools/reports to check for issues

There are many ways to check your website for possible robots.txt file-related issues. Let’s review the most widely used ones.

1. Google Search Console.

Within the Pages section of GSC, you can find valuable information about your robots.txt file.

To check if your website’s robots.txt file is blocking Googlebot from crawling a page, follow these steps:

  • Access the Pages section and navigate to the Not Indexed category.
  • Look for the error labeled Blocked by robots.txt and select it.
  • Clicking on this section will show you a list of pages currently blocked by your website’s robots.txt file. Make sure these are the intended blocked pages.

You should also check if you have the following issue in this section: Indexed, though blocked by robots.txt.

Indexed though blocked by robots.txt

You can also check if individual URLs are indexed by pasting them into the search box in Google Search Console’s URL Inspection tool. It can help identify pages that appear in SERPs despite being blocked with a Disallow directive in your robots.txt file. It can also help you detect potential indexing issues caused by conflicting directives or misconfigured robots.txt rules.

Here’s a complete Google Search Console guide on detecting and addressing indexing-related problems.

2. SE Ranking’s Website Audit

SE Ranking’s Website Audit tool (and others like it) provides a comprehensive overview of your robots.txt file, including information about pages that are blocked by the file. It can also help you check indexing and XML sitemap-related issues.

To gain valuable insights about your robots.txt file, start by exploring the Issue Report generated by the tool. Among over 120 metrics analyzed, you’ll find the Blocked by robots.txt parameter under the Crawling section. Clicking on it will display a list of webpages blocked from crawling, along with issue descriptions and quick fix tips.

This tool also makes it easy to identify whether you have added a link to the sitemap file in the robots.txt file. Simply check the XML sitemap not found in robots.txt file status under the same section.

XML sitemap not found in robots.txt file status

On the Crawled Pages tab in the left-hand menu, you can analyze the tech parameters of each page individually. By applying filters, you can focus on solving critical issues on the most important pages. For example, applying the filter Blocked by robots.txt > Yes will show all pages blocked by the file.

Crawled Pages report

SEO best practices

To ensure optimal performance and accurate indexing of your website’s content by web crawlers, it is important to follow SEO best practices, including:

  1. Ensure correct case usage in robots.txt: Web crawlers treat paths as case-sensitive, so folder and file names in your rules must match the case used in the actual URLs to ensure accurate crawling and indexing.
  2. Begin each directive on a new line, with only one parameter per line.
  3. Avoid using spaces, quotation marks, or semicolons when writing directives.
  4. Use the Disallow directive to block all files within a specific folder or directory from crawling. This technique is more efficient than listing each file separately.
  5. Use wildcard patterns for more flexible instructions when creating the robots.txt file. The asterisk (*) stands for any sequence of characters, while the dollar sign ($) marks the end of the URL path.
  6. Create a separate robots.txt file for each domain. This establishes crawl guidelines for different sites individually.
  7. Always test a robots.txt file to make sure that important URLs are not blocked by it.
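Some of these formatting rules are easy to check automatically. The following is a minimal sketch of a robots.txt linter in Python; it only knows the four directives covered in this article (so it would flag others that some crawlers accept), and the function lint_robots_txt and the sample file are invented for this illustration.

```python
KNOWN_FIELDS = {"user-agent", "allow", "disallow", "sitemap"}


def lint_robots_txt(text):
    """Return (line_number, message) pairs for a few common syntax slips."""
    problems = []
    seen_user_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        if ":" not in line:
            problems.append((number, "missing ':' between field and value"))
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field not in KNOWN_FIELDS:
            problems.append((number, f"unrecognized field '{field}'"))
            continue
        if field == "user-agent":
            seen_user_agent = True
        elif field in ("allow", "disallow"):
            if not seen_user_agent:
                problems.append((number, "rule appears before any User-agent line"))
            if any(char in value for char in "\"';"):
                problems.append((number, "value contains quotes or a semicolon"))
    return problems


SAMPLE_FILE = """\
Disallow: /tmp/
User-agent: *
Disallow: "/private/"
Noindex: /old/
Allow /public/
"""

for line_number, message in lint_robots_txt(SAMPLE_FILE):
    print(line_number, message)
```

A check like this does not replace testing real URLs against the file, but it can catch typos before the file is uploaded.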

Conclusion

To recap, here are some important takeaways regarding robots.txt files:

  • The robots.txt file serves as a guideline for robots, informing them which pages should and shouldn’t be crawled.
  • The robots.txt file cannot prevent indexing directly, but it can influence a robot’s decision to crawl or ignore certain documents or files.
  • Hiding unhelpful website content with the disallow directive saves the crawl budget. This is true for both multi-page and small websites.
  • It’s necessary to follow syntax rules in order for search bots to read your robots.txt file.


