What is a robots.txt?
Imagine this: you have put hours into writing content for your website.
Only to find out that your site has blocked Google.
None of your pages appear in the search results!
It would be horrible, right?
Well, that is why you need to understand what a robots.txt file is. It is a key part of your technical SEO and determines whether a web crawler can read your site or not.
It is a very powerful file that is often overlooked.
In this article, we will be learning:
- What is a robots.txt file
- Why it is so important
- How you can make sure that your site appears in the search engine index
What is a robots.txt file?
The robots.txt file tells robots (often shortened to bots) which parts of your website they can read. It is a list of rules that a bot must follow when it crawls your site.
Bots such as Googlebot will crawl your site. This means they will read the content of your site and add it to the search engine index.
The robots.txt file can “Disallow” areas of your site and this is why it is important to your technical SEO. It is possible to block bots from reading your entire site!
You can find the robots.txt file at the “root” of your website. This means you add /robots.txt to the end of your domain name. For example, here is the PageDart robots.txt file:
https://pagedart.com/robots.txt
You don't have to have a robots.txt file. If you don't, all bots can read all your pages. This is not recommended though as it affects your crawl budget.
Let's look at this next.
How do you use a robots.txt file?
To get your content into the search engine, Google must read your site's content.
We are approaching 2 billion websites on the internet.
Google has a tough job indexing all that content, and it can't get to every page every day.
So they assign a crawl budget to each site. The crawl budget is how many pages Google will crawl every day on your site.
So when Google does read your website, you want to guide it to the important content. You don't want it wasting time on pages that have no real value.
This is where the robots.txt file comes in, and why it is so important to technical SEO. You can use it to “Disallow” areas of your site that do not need to appear in search.
Let's say that we have an area of the site for admins. It is common on WordPress sites to have an admin area under /wp-admin/.
We don't want this to appear in search results, and we don't want Google to spend time crawling it and using up your valuable crawl budget.
So to deny access to all crawlers you would add a robots.txt file with this content:
User-agent: *
Disallow: /wp-admin/
The “Disallow” rule for /wp-admin/ will block bots from visiting this area of the site.
You can also use your robots.txt file to help bots find all the useful URLs on your site by pointing to a sitemap.
Robots.txt and Sitemaps
Sitemaps are special files that list all the URLs on your site. The sitemap should contain only the pages that you want search engines to include.
As part of improving the crawl budget, we can make it easier for Google by providing a sitemap. To do this we need to use another robots.txt rule called “Sitemap”.
Here is an example:
User-agent: *
Disallow: /wp-admin/
Sitemap: http://www.example.com/sitemap.xml
When Google's web crawler visits the site it can scan the sitemap and see all the URLs. Anything you can do to help Google be more efficient is going to help your site rank.
You can also include links to more than one sitemap. If your site has many sitemaps, include a link to each one in your robots.txt file.
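For example, a robots.txt file that references two sitemaps might look like this (the sitemap file names here are just placeholders; use your real sitemap URLs):
User-agent: *
Disallow: /wp-admin/
Sitemap: http://www.example.com/sitemap-pages.xml
Sitemap: http://www.example.com/sitemap-posts.xml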
Checking your robots.txt file
Once you have created your robots.txt file you will need to check that it is working. Luckily, there are free tools from Google and Bing that can check that your robots.txt file is doing what you expect.
If you have not done so already, make sure to submit your sitemap to both Google and Bing.
To start with, let's look at how we can use Bing to check the robots.txt file.
Using Bing to check robots.txt
Bing has a tool that will fetch a page for you as Bingbot. To use it, log in to Bing Webmaster Tools.
You can then select your website from the list. At the bottom of the dashboard page, there is a section called “Diagnostics & Tools”. Select the “Fetch as Bingbot” option and then add your URL in the box.
Once you hit submit, Bingbot will attempt to read your page. If the robots.txt file is not blocking the URL, the response will be a 200 OK. This means that Bingbot was able to read your page and was not blocked.
You can test Googlebot using the Google Search Console.
Using Google Search Console to check robots.txt
You can manage how your site appears in Google Search by using Google Search Console.
Login to your account and then enter a URL into the “Inspect” box at the top of the page.
Once you enter your URL, Google will check the page and, if all is OK, confirm that it can appear in Google.
If the robots.txt file is blocking the page you will see an error.
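You can also run a quick check from your own machine. Here is a minimal sketch using Python's built-in urllib.robotparser module (the domain and paths are only examples; swap in your own site and pages):
from urllib import robotparser

# Load the live robots.txt file (example URL).
rp = robotparser.RobotFileParser()
rp.set_url("https://pagedart.com/robots.txt")
rp.read()

# Ask whether a given crawler is allowed to fetch a given URL.
print(rp.can_fetch("*", "https://pagedart.com/"))
print(rp.can_fetch("Googlebot", "https://pagedart.com/wp-admin/"))
If a URL is blocked by a Disallow rule, can_fetch returns False.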
Let's look at some robots.txt examples.
Robots.txt examples
Let's look at five different examples of robots.txt files. These will give you a good understanding of what the rules do, and you can also use them as templates for your own site.
Block all
The first example will block all web crawlers from crawling your site.
User-agent: *
Disallow: /
There are two rules here. The first is the User-agent rule, which is set to *, a wildcard. This means it will match every web crawler that visits your site.
The second rule is Disallow: /, which blocks every URL on your site.
You may be wondering why you would want to stop bots from visiting your site. Well, this can be useful for pre-production or staging websites, such as a site where you test changes before pushing them to production.
Allow all
If your site does not have a robots.txt file then all the URLs on your site are available to web crawlers.
There are times when the software you are using requires you to have a robots.txt file. If this is the case then you can use this file:
User-agent: *
Disallow:
This is a bit confusing, as the Disallow rule is still listed. Yet, because there is nothing after Disallow, no URLs are disallowed and bots can crawl everything.
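Major crawlers such as Googlebot and Bingbot also support an explicit Allow rule, so an equivalent allow-all file could look like this:
User-agent: *
Allow: /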
Block a folder
You may have directories that you want to hide from bots. This could be an admin area, an image folder, or user profiles.
In this case, you can disallow the folder like this:
User-agent: *
Disallow: /admin/
This example will stop bots from visiting the admin area. The Disallow: /admin/ rule will block URLs like these:
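https://example.com/admin/
https://example.com/admin/settings/
https://example.com/admin/users/profile/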
You can see that it will block any URL with /admin/ after the domain name.
Block a file
Sometimes you need to block a single file. For example, you may have an admin.php file that you want to block. To do this you can use a robots.txt file like this:
User-agent: *
Disallow: /admin.php
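Robots.txt rules are prefix matches, so this would block https://example.com/admin.php as well as URLs that start with that path, such as https://example.com/admin.php?page=1 (example.com is just a placeholder domain).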
Block a file extension
This last example blocks a file type. A common reason to do this is that you have PDF files on your site and you don't want them in Google search results.
Here is how you would block bots from reading PDF files on your site:
User-agent: *
Disallow: /*.pdf$
The disallow rule is /*.pdf$. The * wildcard matches any characters and the $ marks the end of the URL, so it will match all of these URLs:
https://example.com/files/example.pdf
https://example.com/files/folder2/example.pdf
https://example.com/example.pdf
Yet, it would not match this URL:
https://example.com/example.pdfile
How many bots are there?
All of the examples above use the rule:
User-agent: *
This means it will match all the bots that visit your site. But how many bots are there?
Well, there are quite a few, some popular ones include:
- Googlebot - Used for Google Search
- Bingbot - Used for Bing Search
- Slurp - Yahoo's web crawler
- DuckDuckBot - Used by the DuckDuckGo search engine
- Baiduspider - Used by Baidu, a Chinese search engine
- YandexBot - Used by Yandex, a Russian search engine
- Facebot - Used by Facebook
- Pinterestbot - Used by Pinterest
- TwitterBot - Used by Twitter
You can target specific web crawlers by using the User-agent rule. This is how you can specify different rules for different bots:
User-agent: Googlebot
Disallow: /
User-agent: DuckDuckBot
Disallow:
The above would block Googlebot and allow DuckDuckBot!
What is a robots.txt? Final thoughts
We have looked at what a robots.txt file is and why it is important for your technical SEO.
With this file, you can stop bots from crawling areas of your site. You want bots to only crawl the useful parts of your site.
If you optimize this, Google will make the best use of your crawl budget.
Have a look at the robots.txt file examples if you want to see more advanced uses of robots.txt files.
Using Bing Webmaster Tools and Google Search Console, you can make sure that your robots.txt file is doing its job.
When you get more advanced, you can start to target specific bots by using the User-agent rule.