Let's dive into the world of web crawlers. A web crawler is software built to read the contents of web pages all over the internet. Crawlers are sometimes also called bots or spiders. The most popular web crawler is Googlebot.
Googlebot will visit your website and read the content of each page. Once it understands what a page is about, it adds that page to the search engine index. The index is where Google stores information about your website.
In this article, we will look at web crawlers, both the good ones and the bad ones. We will cover the basics of how you can control where these bots go on your site, and how you can optimize your site for crawling.
Let's get started.
What is a Web Crawler?
A web crawler, spider, robot, or bot is software that crawls the web by following the links it finds. Once it discovers a link, it visits the page and reads its contents.
Some bots are good, like Googlebot, Bingbot, Facebot, and Twitterbot.
Some bots are bad. For example, there is SiteSucker, a Mac application that downloads the entire contents of a website, including all the HTML, images, and PDFs, to someone's hard disk. Anyone can install and run SiteSucker from anywhere.
Later, we will look at how you can block some of these unwanted guests.
Before we look at that, it is good to understand how web crawlers work.
How does a Web Crawler Work?
Let's pretend that we are creating a web crawler to search the web for us.
We need to start with a list of URLs that we want to target. So let's say we want to crawl Amazon.com. In that case, we start with the URL https://www.amazon.com.
We give this URL to our web crawler and it goes and fetches the webpage.
It then downloads the HTML from Amazon and scans it for links. Every link it finds is saved to a list of pages to visit later.
The process repeats as we visit the next link.
After some time, our web crawler will have visited all the pages on Amazon.com. Along the way it will also have discovered links to other websites. Our web crawler could then go to those websites and repeat the process.
This is how web crawlers like Googlebot and Bingbot work. They have a list of URLs that they visit. They will:
Visit a page via a link
Download the web page content
Try to understand what the content is about
Add the content to the search engine index
Find any other links on the page
Then visit the next link
This process repeats every day for millions of websites across the internet.
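The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not how Googlebot actually works: the `fetch` function is injected so any HTTP client could be plugged in, and the "understand the content" and indexing steps are left out.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag found in an HTML page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links like "/about" against the page URL
                    self.links.append(urljoin(self.base_url, value))

def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl: visit a URL, extract its links, queue the new ones."""
    queue = deque([start_url])
    visited = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        html = fetch(url)           # download the web page content
        parser = LinkExtractor(url)
        parser.feed(html)           # find all the links on the page
        for link in parser.links:
            if link not in visited:
                queue.append(link)  # save new links to visit later
    return visited
```

Here `fetch` could be as simple as a function that downloads a URL with `urllib.request`; real crawlers add politeness delays, robots.txt checks, and error handling on top of this loop.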
What Types of Bots are there?
There are thousands of bots on the web. Anyone can make one and start crawling.
Many companies run a bot of their own; here are some of the most popular ones:
Googlebot - Used for Google Search
Bingbot - Used for Bing Search
Slurp - Yahoo's web crawler
DuckDuckBot - Used by the DuckDuckGo search engine
Baiduspider - Used by Baidu, a Chinese search engine
YandexBot - Used by Yandex, a Russian search engine
Facebot - Used by Facebook
Pinterestbot - Used by Pinterest
Twitterbot - Used by Twitter
These companies use their bots to add your content to their services. Search engines like Google have no content of their own; everything in their results comes from other sites, and their bots gather that content.
For social media sites like Facebook and Twitter, these bots enhance links shared on their platforms. Rather than showing a plain link in a tweet, Twitterbot visits the page and grabs extra information, so Twitter can display a rich preview card with a title, description, and image. Without Twitterbot, this would not be possible.
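Twitterbot builds that preview from Twitter Card meta tags in the page's HTML. A sketch of what a page might include (the values here are placeholders):

```html
<!-- Twitter Card meta tags; Twitterbot reads these to build the preview card -->
<meta name="twitter:card" content="summary_large_image">
<meta name="twitter:title" content="My Article Title">
<meta name="twitter:description" content="A short summary of the page.">
<meta name="twitter:image" content="https://example.com/preview.png">
```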
There are also less common web crawlers, such as:
AhrefsBot - An SEO company that tracks web page rankings and more
SemrushBot - Another SEO company with similar products to Ahrefs
Grapeshot - Oracle Data Cloud's web crawler, which downloads page data from the internet
Alexabot - Not Amazon's Alexa assistant; this bot tracks traffic rankings for websites across the internet
Uptimebot - Used by the service Uptime.com to track if your website goes down
There are hundreds of them, built for different purposes by companies and individuals.
You may be wondering: can they all crawl my site?
Can Crawlers always Crawl my Site?
With so many bots crawling the internet it is a fair question to ask. After all, do you want to give your data to all these bots?
Well, you do have some control: you can use your robots.txt file.
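The robots.txt file lives at the root of your site (e.g. example.com/robots.txt) and lists rules per user agent. For instance:

```
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
```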
The example above allows Googlebot and disallows all other bots.
Nothing technically stops these bots from ignoring the file, but they won't. After all, reputable companies have built these bots, and if one of them ignored robots.txt there would be an outcry.
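You can even check how a well-behaved bot will interpret your rules using `urllib.robotparser` from Python's standard library. A quick sketch, using hypothetical rules that allow Googlebot and block everyone else:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules: allow Googlebot, block all other bots.
rules = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Googlebot matches its own group, which disallows nothing.
print(parser.can_fetch("Googlebot", "https://example.com/page"))   # True
# Any other user agent falls into the "*" group, which blocks everything.
print(parser.can_fetch("SiteSucker", "https://example.com/page"))  # False
```

This is the same logic a compliant crawler runs before fetching each page.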
There are, of course, bad bots: bots that do not play by the rules and want your data. There are some advanced techniques for dealing with these bad bots, but they are very difficult to stop.
When you put data on the internet it is very hard to protect it.
How do I Optimize my Website so it is Easy to Crawl?
Once you have controlled which bots can view your site, how do you optimize your website for web crawlers?
This is Technical SEO, which is all about optimizing your website to make it as easy to crawl as possible. Some of these technical tasks include:
Crawl Errors - When bots like Googlebot visit your website, they may run into errors. You need to check for and fix these. Google reports them in Google Search Console, and Bing does the same in Bing Webmaster Tools
Submit an XML Sitemap - You can submit your sitemap to Google and Bing. The sitemap is a list of all the URLs on your site, making it easier for web crawlers to find all your pages
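A sitemap is a plain XML file, usually served at /sitemap.xml. A minimal example with placeholder URLs looks like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2021-06-01</lastmod>
  </url>
  <url>
    <loc>https://example.com/about</loc>
  </url>
</urlset>
```

Each `<url>` entry needs at least a `<loc>`; `<lastmod>` is optional but helps crawlers decide when to revisit a page.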