Indexed, Though Blocked by Robots.txt
Another day wrangling the world of SEO.
You crack open Google Search Console only to find a new warning!
Indexed, though blocked by robots.txt
OK, grab a coffee and let's get this sorted.
What does “Indexed, though blocked by robots.txt” mean?
When Google wants an updated view of your site, it sends over a web crawler. Google's web crawler is called Googlebot.
Googlebot reads your pages and adds them to Google's index. Once a page is indexed, it can appear in Google Search results.
SEO 101!
This warning in Google Search Console is letting you know that the page was once indexed by Google, yet Google can no longer crawl it.
This is because your website's robots.txt file has a rule that is blocking Googlebot.
Let’s dive into how your robots.txt file can block Googlebot.
What is a robots.txt file?
The robots.txt file is a text file found on most websites. You can view it by adding /robots.txt to the end of the domain name.
Here are some examples you can take a look at:
- https://airbnb.com/robots.txt
- https://amazon.com/robots.txt
- https://philips.com/robots.txt
- https://pinterest.com/robots.txt
Here is a snippet from Airbnb's robots.txt file:
User-agent: Googlebot
Allow: /calendar/ical/
Disallow: /account
...
Disallow: /users/edit
Sitemap: https://www.airbnb.co.uk/sitemap-master-index.xml.gz
Each line in the robots.txt file is a rule. A rule tells a web crawler such as Googlebot what it can and cannot do when looking at the site.
So as an SEO, you have control over where web crawlers like Googlebot can visit on your site.
In the Airbnb example, they are using the User-agent rule to target Googlebot. This means that Googlebot will follow any of the rules underneath that User-agent line.
The first rule is:
Allow: /calendar/ical/
This tells Googlebot that it can read pages under this URL path.
The next rule is:
Disallow: /account
This is the rule that blocks Googlebot from visiting any URL that starts with /account.
Here we are saying, “Googlebot, nothing interesting is here, so there is no point looking at this area of the site.”
Google does not have infinite resources, and reading every page on the internet is a tough job. So by using the robots.txt file, you can make Googlebot's time on your site more efficient.
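For example, a minimal robots.txt sketch that keeps crawlers out of low-value areas might look like this (the /admin/ and /internal-search/ paths and the sitemap URL are hypothetical; yours will differ):
# Hypothetical example: keep all crawlers out of low-value areas
User-agent: *
Disallow: /admin/
Disallow: /internal-search/
# Point crawlers at the sitemap so they spend their time on pages that matter
Sitemap: https://www.example.com/sitemap.xml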
Going back to the Google Search Console warning:
Indexed, though blocked by robots.txt
This warning is letting you know that Googlebot cannot read this URL, even though it once crawled and indexed it. A disallow rule in your robots.txt file is now blocking Googlebot, stopping the indexed content from getting updated.
There are two possibilities here:
- You no longer want this page to appear in Google Search results and you want it removed.
- You have blocked this page in your robots.txt file and you still want it to appear in Google Search.
We are going to look at both:
- How to fix the rule in the robots.txt file.
- How you can remove a page from the Google Index if you no longer want it to appear in search.
Remember those choose-your-own-adventure books you read as a kid? Jump to the section to continue your adventure:
Fixing the Rule in the Robots.txt
Let's look at how we can keep the page in Google Search.
For this, we are going to have to fix the robots.txt file and remove the rule that is blocking Googlebot.
To start with, open Google Search Console and click on the warnings:
To see a list of pages with this warning, click on the “Indexed, though blocked by robots.txt” link. This will open up a list of all the pages on your site with this warning.
Clicking on one will open up the tools:
There are two tools you can use to get more information. The first confirms whether the problem still exists on your live site. Go ahead and click the “INSPECT URL” link.
This will open up a page like the one below:
To test whether the live site is still blocked, click on the “TEST LIVE URL” button. The screenshot below shows what you get when the page is still not crawlable.
If everything is OK and the issue has gone, you will see the test results with a tick like this:
If the test passed, you can press the “Request Indexing” link. If the test failed, you need to use the “Test robots.txt blocking” tool from the tools menu:
This will open up a new window.
When you click on this link, it will show you exactly which rule in the robots.txt file is causing the block. In the example above, you can see that we have disallowed /my-great-blog-post. So to fix this, we would need to remove that rule from the robots.txt file and retest.
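As a sketch, the fix is simply deleting the offending Disallow line (reusing the /my-great-blog-post path from the example above; your rule will differ):
# Before: this rule blocks Googlebot from the page
User-agent: *
Disallow: /my-great-blog-post
# After: with the Disallow line removed, Googlebot can crawl the page again
User-agent: *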
You can edit the robots.txt file on this page and press the submit button. This will give you three options:
- Pressing the download button will let you get a copy of the robots.txt file. Upload this to your web server.
- Once the file is uploaded use the “view uploaded version” button to check that it is up-to-date.
- Lastly, press the submit button to give a copy to Google.
Once you have updated the robots.txt file, go back to Google Search Console and check that Google can now read your page using the “URL Inspection” tool.
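If you want to double-check the fix outside of Search Console, here is a minimal Python sketch using the standard library's urllib.robotparser (the domain and page URL are placeholders):
from urllib.robotparser import RobotFileParser

# Placeholder URLs: swap in your own domain and the page that was blocked
robots_url = "https://www.example.com/robots.txt"
page_url = "https://www.example.com/my-great-blog-post"

parser = RobotFileParser()
parser.set_url(robots_url)
parser.read()  # fetches and parses the live robots.txt file

# True means the current robots.txt no longer blocks Googlebot from the page
print(parser.can_fetch("Googlebot", page_url))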
Removing a Page from the Google Index
If you no longer want this page to appear in Google Search results then there are two tasks you need to perform:
- Add a noindex meta tag to the page you want to remove
- Remove the URL from Google Search Console
Add a noindex meta tag to the page you want to remove
Google recommends that you use a noindex meta tag to remove pages from Search.
Googlebot reads this tag on its next visit. If it finds a noindex meta tag, the page will no longer appear in Google Search results, even if other sites link to it. Note that Googlebot has to be able to crawl the page to see the tag, so the disallow rule blocking the page must be removed from your robots.txt file first.
There are two ways you can add the noindex. One is using a meta tag added to the HTML like this:
<html>
<head>
<meta name="robots" content="noindex">
</head>
<body>
...
</body>
</html>
You can also target Googlebot alone by naming it in the meta tag:
<html>
<head>
<meta name="googlebot" content="noindex">
</head>
<body>
...
</body>
</html>
The second way is to send an X-Robots-Tag HTTP header in the response for the page, like this:
HTTP/1.1 200 OK
...
X-Robots-Tag: noindex
...
Which one you choose will depend on exactly how your site is set up. There is no right or wrong way to add noindex so go with the one that is best for you.
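For example, if your site happens to be served by nginx, a sketch of sending this header for a single page might look like the following (the path is a placeholder and this assumes you can edit the server config):
# Sketch only: attach the noindex header to one page's responses
location = /my-great-blog-post {
    add_header X-Robots-Tag "noindex";
}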
Once you have put this in place, you can then remove the URL from the Google index.
Remove the URL from Google Search Console
To force the removal of the page from Google Search, you can use the URL Removal tool.
Click on the “Temporarily hide” button and enter the URL you would like to hide.
It is important to note that this will only hide the URL from Google Search for 90 days. For the removal to become permanent, Googlebot must be able to crawl the page and see the noindex tag. If Googlebot reads the page again and finds no noindex tag, the page will end up back in search.
So make sure that those meta tags are set up correctly, and that your robots.txt file is not stopping Googlebot from seeing them.
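To sanity-check that the noindex is actually being served, here is a small Python sketch (standard library only; the URL is a placeholder) that looks for both the header and the meta tag:
import urllib.request

# Placeholder URL: swap in the page you want removed from search
url = "https://www.example.com/my-great-blog-post"

with urllib.request.urlopen(url) as response:
    header = response.headers.get("X-Robots-Tag", "")
    body = response.read().decode("utf-8", errors="replace")

# Either signal is enough for Google to drop the page.
# The meta tag check is a rough string match, not a full HTML parse.
print("noindex header present:", "noindex" in header.lower())
print("noindex meta tag present:", 'content="noindex"' in body.lower())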
Wrapping Up: Indexed, Though Blocked by Robots.txt
We have looked at how you can fix “Indexed, though blocked by robots.txt” warnings in Google Search Console.
These pages are showing a warning because they appear in Google Search, yet they are now being blocked by your robots.txt file.
If you want to keep this page in Google Search, you need to fix the disallow rule in your robots.txt file. Use the Robots Testing Tool to locate the rule that is causing the error.
If you want to remove this page from Google Search, you need to add a noindex tag. Once you have done this, use the Google Search Console URL Removal Tool to remove the page from Search.
Once you have fixed these pages you will be able to remove this warning from your Google Search Console.