In this post, we will discuss robots.txt and its usage.
What is robots.txt?
It is part of the Robots Exclusion Protocol (REP): an optional file placed at the root directory of your website that tells search engines and other robots how you would like your site to be crawled. In general, reputable search engines and well-behaved robots follow these preferences, but malicious bots may ignore them entirely.
We can also specify the location of the master sitemap file in robots.txt; search engine robots use it to discover and crawl the pages listed in it.
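For example, a site might serve a file like the following at its web root (the blocked path and domain here are hypothetical):

User-agent: *
Disallow: /drafts/
Sitemap: https://www.example.com/sitemap.xml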
What are the advantages of robots.txt?
- It instructs search engine robots and other REP-compliant crawlers on how they should crawl your pages (see the crawler sketch after this list).
- It allows or disallows robots to crawl site pages for indexing.
- It can prevent crawling and indexing of duplicate pages on your website, such as the print version of a page.
- A proper robots.txt file can have a considerable impact on the site's SEO.
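To make the first point concrete, here is a minimal sketch of a well-behaved crawler that consults robots.txt before fetching a page, using Python's standard urllib.robotparser module. The site URL and user-agent name are made up for illustration:

from urllib import robotparser, request

SITE = "https://www.example.com"   # hypothetical site
AGENT = "my-polite-bot"            # hypothetical user-agent name

# Download and parse the site's robots.txt.
rp = robotparser.RobotFileParser()
rp.set_url(SITE + "/robots.txt")
rp.read()

url = SITE + "/blog/some-post.html"
if rp.can_fetch(AGENT, url):
    # robots.txt permits this user agent to crawl the page.
    req = request.Request(url, headers={"User-Agent": AGENT})
    with request.urlopen(req) as resp:
        html = resp.read()
else:
    print("Blocked by robots.txt:", url)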
What tags are allowed within the robots.txt file?
- User-agent: It accepts either * to apply the following rules to all robots, or the name of a specific robot.
- Disallow: It specifies a resource (page, directory, or file) to block from crawling. If left blank, it allows crawling of the whole site.
- Allow: It overrides Disallow rules to let robots crawl specific paths even when a parent directory is blocked. Originally a Google extension, it is honored by Googlebot and most major crawlers, though some older robots ignore it.
- Sitemap: It specifies the location of the sitemap that lists the site's URLs. All four directives appear together in the example after this list.
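Putting the four directives together, a hypothetical robots.txt might look like this (all paths and the domain are made up):

User-agent: *
Disallow: /private/
Allow: /private/annual-report.html
Sitemap: https://www.example.com/sitemap.xml

Here every robot is blocked from the /private/ directory except for a single report page, and the sitemap location is advertised to all crawlers.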
How to use robots.txt?
In this section, we will discuss the most common robots.txt usage patterns.
Disallow All
User-agent: *
Disallow: /
It's useful when webmasters do not want their sites indexed at all. Possible scenarios include new sites, private sites, and sites under construction.
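If you want to verify such rules before deploying them, urllib.robotparser can parse them directly from a list of lines. This quick check (with a hypothetical URL) confirms that everything is blocked:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /"])
print(rp.can_fetch("anybot", "https://www.example.com/any-page.html"))  # prints False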
Allow All
User-agent: *
Disallow:
It's the simplest form, allowing robots to crawl anything without restrictions. An absent robots.txt file generally has the same effect.
Sitemap
Sitemap: https://www.example.com/sitemap.xml
Though the sitemap.xml file can be placed at any publicly accessible location, it's good practice to place it at the web root of the website, and the Sitemap directive should give its full URL. We can even use a different file name. The main sitemap can also link to other sitemaps, as the sketch below shows.
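A main sitemap that links to other sitemaps is called a sitemap index in the sitemaps.org protocol; it might look like this (the file names and domain are hypothetical):

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.example.com/sitemap-posts.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://www.example.com/sitemap-pages.xml</loc>
  </sitemap>
</sitemapindex>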
Allow Single Directory
User-agent: *
Disallow: /
Allow: /blog/
We can instruct robots to crawl only the pages inside a single directory. Each Allow rule takes a single path, but we can open up several directories by putting one rule on each line, as shown below. Note that not every robot honors Allow: Googlebot and most major crawlers do, but some older robots ignore it.
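For example, to open up both the blog and a hypothetical docs directory while keeping everything else blocked:

User-agent: *
Disallow: /
Allow: /blog/
Allow: /docs/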
Allow Single Page
User-agent: *
Disallow: /
Allow: /index.html
Allow: /about-us.html
Similar to the single-directory case, we can allow crawlers to access individual pages. Each Allow rule names one page, and multiple entries are possible, as shown above.
We can also mix directories and pages, as shown below, keeping one rule per line.
User-agent: *
Disallow: /
Allow: /index.html
Allow: /blog/
Selective User Agents
User-agent: *
Disallow: /
Allow: /index.html

User-agent: Googlebot
Allow: /

User-agent: discobot
Disallow: /
The above example blocks all robots from everything except the website's landing page. At the same time, it allows Googlebot to crawl any part of the website, and it completely disallows discobot from crawling anything.
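We can again verify the per-robot behavior locally with urllib.robotparser. One caveat: this parser applies rules in the order they appear (first match wins), which can differ from Googlebot's longest-match handling when Allow and Disallow overlap, so the checks below use paths where both interpretations agree:

from urllib import robotparser

rules = """
User-agent: *
Disallow: /
Allow: /index.html

User-agent: Googlebot
Allow: /

User-agent: discobot
Disallow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)
print(rp.can_fetch("Googlebot", "/blog/post.html"))  # True: Googlebot may crawl anything
print(rp.can_fetch("discobot", "/blog/post.html"))   # False: discobot is fully blocked
print(rp.can_fetch("otherbot", "/blog/post.html"))   # False: falls back to the * group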