The first template will stop all bots from crawling your site. This is useful in several situations. For example:
The site is not ready yet
You do not want the site to appear in Google Search results
It is a staging website used to test changes before they go to production
Whatever the reason, this is how you would stop all web crawlers from reading the pages:
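# Applies to all bots
User-agent: *
# Blocks every URL on the site
Disallow: /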
Here we have introduced two “rules”. They are:
User-agent - Targets a specific bot. Name one bot, or use * as a wildcard to match all bots
Disallow - Tells a bot that it cannot visit this area of the site. Setting this to / stops the bot from crawling any of your pages
What if we want the bot to crawl the whole site?
2) Allow All
If your site has no robots.txt file, a bot will crawl the entire website by default. One option, then, is to not create a robots.txt file at all, or to remove the one you have.
Sometimes, though, this is not possible and you have to put something in place. In that case, we would add the following:
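User-agent: *
# An empty Disallow value matches no URLs, so nothing is blocked
Disallow: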
At first this seems strange, as we still have the Disallow rule in place. The difference is that it no longer contains the /. When a bot reads this rule, it sees that no URLs are disallowed.
In other words, the whole site is open.
3) Block a Folder
There are times when you need to block one area of a site but allow access to the rest. A good example of this is the admin area of a site.
The admin area may allow admins to log in and change the content of the pages. We don't want bots looking in this folder, so we can disallow it like this:
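# /admin/ is only an example path; swap in the folder you want to block
User-agent: *
Disallow: /admin/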
Now the bot will ignore this area of the site.
4) Block a file
The same is true for files. There may be a specific file that you don't want ending up in Google Search. Again, this could be an admin page or something similar.
To block bots from this file, you would use this robots.txt:
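User-agent: *
Disallow: /admin.html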
This will allow the bot to crawl the whole website except the /admin.html file.
5) Disallow a File Extension
What if you want to block all files with a specific file extension? For example, you may want to stop the PDF files on your site from ending up in Google Search. Or you may have spreadsheets that you don't want Googlebot to waste time reading.
In this case, you can use two special characters to block these files:
* - A wildcard that matches any sequence of characters
$ - Marks the end of the URL, so matching stops there
When used together you can block PDF files like this:
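User-agent: *
# Blocks any URL that ends with .pdf
Disallow: /*.pdf$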
Or .xls files like this:
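User-agent: *
Disallow: /*.xls$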
Notice how the disallow rule has /*.xls$. This means that it will match all these URLs:
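For instance (with example.com standing in for your own domain):
https://example.com/file.xls
https://example.com/folder/file.xls
https://example.com/reports/sales.xls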
Yet, it would not match this URL:
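https://example.com/file.xls?download=true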
Because the URL does not end with .xls.
6) Allow Only Googlebot
You can also add rules that apply only to a specific bot, using the User-agent rule. So far we have used the * wildcard, which matches all bots.
If we wanted to allow only Googlebot to view the pages on the site, we could add this robots.txt:
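# Googlebot can crawl everything (the empty Disallow blocks nothing)
User-agent: Googlebot
Disallow:

# Every other bot is blocked from the whole site
User-agent: *
Disallow: /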
Googlebot is not the only bot out there. Other common bots include:
DuckDuckBot - Used by the DuckDuckGo search engine
Baiduspider - Used by Baidu, a Chinese search engine
YandexBot - Used by Yandex, a Russian search engine
facebot - Used by Facebook
Pinterestbot - Used by Pinterest
TwitterBot - Used by Twitter
7) Link to your Sitemap
When a bot visits your site, it needs to find all of your pages by following links. A sitemap lists every URL on your site, so adding it to your robots.txt makes it easier for a bot to find them all.
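You can do this with the Sitemap rule. A minimal example, with example.com standing in for your own domain:
Sitemap: https://example.com/sitemap.xml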