
What is a Web Crawler?

By Steve, Founder of PageDart

Let's dive into the world of web crawlers. A web crawler is software built to read the contents of web pages all over the internet.

Web crawlers are sometimes also called bots or spiders. The most popular web crawler is Googlebot.

Googlebot will visit your website and read the content of each page. Once it understands what a page is about, it will add that page to the search engine index. The index is where Google stores information about your website.

In this article, we will look at web crawlers, both the good ones and the bad ones.

We will cover some of the basics of how you can control where these bots can go on your site, and how you can optimize your site for crawling.

Let's get started.

What is a Web Crawler?

A web crawler, spider, robot or bot is software that crawls the web by following the links it finds.

Once it discovers a link, it visits the page and reads the page contents.

Some bots are good, like Googlebot, Bingbot, Facebot, and Twitterbot.

Some bots are bad. For example, there are bots like SiteSucker. This is a Mac application that will download all the contents of a website, including the HTML, images, PDFs, etc., to someone's hard disk.

Anyone can install and run Sitesucker from anywhere.

Later, we will look at how you can block some of these unwanted guests.

Before we look at that, it is good to understand how web crawlers work.

How does a Web Crawler Work?

Let's pretend that we are creating a web crawler to search the web for us.

We need to start with a list of URLs that we want to target. So let's say we want to crawl Amazon.com. In that case, we start with the URL:

https://amazon.com

We give this URL to our web crawler and it goes and fetches the webpage.

It then downloads the HTML from Amazon and looks through it for links. Each link it finds is saved to a list of pages to visit later.
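The link-finding step can be sketched in a few lines of Python. This is only a minimal illustration using the standard library, not how any particular crawler does it. The names `LinkCollector` and `extract_links`, the sample HTML, and the example.com URLs are all made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links such as "/deals" into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in the given HTML as an absolute URL."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical page content, used only for illustration.
sample_html = '<a href="/deals">Deals</a> <a href="https://example.com/help">Help</a>'
links = extract_links(sample_html, "https://example.com")
print(links)
```

A real crawler would download the HTML over the network first. Here the page is a hard-coded string so the example runs on its own.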

The process repeats as we visit the next link.

After some time our web crawler will have visited all the pages it can reach on Amazon.com. Along the way it will also have discovered links to other websites.

Our web crawler could then go to that website and repeat the process.

This is how web crawlers like Googlebot and Bingbot work. They have a list of URLs that they visit. They will:

Visit a page via a link
Download the web page content
Try to understand what the content is about
Add the content to the search engine index
Find any other links on the page
Then visit the next link

This process repeats every day for millions of websites across the internet.
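The visit-and-follow loop described above can be sketched as a simple queue of URLs. This is a rough illustration under stated assumptions, not how Googlebot or Bingbot is actually built. `FAKE_WEB` is a made-up, in-memory stand-in for the internet, and `crawl`, `get_links` and `max_pages` are names invented for this example.

```python
from collections import deque

# A tiny made-up "web": each URL maps to the links found on that page.
FAKE_WEB = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b", "https://example.org/"],
    "https://example.com/b": ["https://example.com/"],
    "https://example.org/": [],
}


def crawl(start_url, get_links, max_pages=100):
    """Visit pages breadth-first and return the URLs in the order visited."""
    to_visit = deque([start_url])  # the list of URLs still to visit
    seen = {start_url}             # every URL we have already queued
    visited = []
    while to_visit and len(visited) < max_pages:
        url = to_visit.popleft()
        # A real crawler would download and index the page here.
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                to_visit.append(link)
    return visited


order = crawl("https://example.com/", lambda url: FAKE_WEB.get(url, []))
print(order)
```

The `seen` set stops the crawler from going round in circles when pages link back to each other, and `max_pages` keeps the sketch from running forever on a large site.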

What Types of Bots are there?

There are thousands of bots on the web. Anyone can make one and start crawling.

Many websites have a bot. Here are some of the most popular ones:

Googlebot - Used for Google Search
Bingbot - Used for Bing Search
Slurp - Yahoo's web crawler
DuckDuckBot - Used by the DuckDuckGo search engine
Baiduspider - Used by Baidu, a Chinese search engine
YandexBot - Used by Yandex, a Russian search engine
Facebot - Used by Facebook
Pinterestbot - Used by Pinterest
Twitterbot - Used by Twitter

These companies use the bots to add your content to their website. For search engines like Google, all their content is from other sites. So they use the bots to get this content.

For social media sites like Facebook and Twitter, these bots enhance links on their site. For example, rather than showing a simple link in a tweet, Twitterbot will visit the page and grab extra information. Twitter can then display a large preview card instead of a plain link.

Without Twitterbot, this would not be possible.

There are less common web crawlers like:

AhrefsBot - Used by Ahrefs, an SEO company that tracks web page rankings and more
SemrushBot - Used by Semrush, another SEO company with similar products to Ahrefs
Grapeshot - The Oracle Data Cloud crawler, a web crawler that downloads data from the internet
Alexabot - This is not Amazon Alexa. This is a bot that tracks the rankings of all the websites on the internet
Uptimebot - Used by the service Uptime.com to track if your website goes down

There are hundreds more, built for different purposes by companies and individuals.

You may be wondering: can they all crawl my site?

Can Crawlers always Crawl my Site?

With so many bots crawling the internet it is a fair question to ask. After all, do you want to give your data to all these bots?

Well, you do have some control. You can use your robots.txt file.

What is a robots.txt?

It is a special file that you can add to your website. It usually lives at the root of the site:

https://pagedart.com/robots.txt

This file contains a set of rules that bots will follow when they visit your site.

This is an example of what a robots.txt file looks like:

User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:

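The first group of rules blocks every bot from the whole site. The second group makes an exception for Googlebot, because an empty Disallow line means nothing is off limits.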
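If you want to check how rules like these are read, Python's standard library includes a robots.txt parser. The sketch below feeds it the example file and asks whether two bots may fetch a page. The page URL and the bot name "SomeOtherBot" are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The same rules as the example robots.txt above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical page URL, used only for illustration.
page = "https://pagedart.com/some-page"
googlebot_allowed = parser.can_fetch("Googlebot", page)
other_allowed = parser.can_fetch("SomeOtherBot", page)
print("Googlebot allowed:", googlebot_allowed)
print("SomeOtherBot allowed:", other_allowed)
```

With these rules the parser should report Googlebot as allowed and the other bot as blocked. Keep in mind that this only shows how a well-behaved bot interprets the file.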

Nothing technically stops these bots from ignoring the file, but the reputable ones won't. After all, reputable companies have built these bots. If they built a bot that ignored the robots.txt file there would be an outcry.

There are of course bad bots. Bots that do not play by the rules and want your data.

Although there are some advanced techniques for dealing with these bad bots, they are very difficult to stop.

When you put data on the internet it is very hard to protect it.

How do I Optimize my Website so it is Easy to Crawl?

Once you have controlled which bots can view your site, how do you optimize your website for web crawlers?

This is Technical SEO, which is all about optimizing your website to make it as easy to crawl as possible. Some of these technical tasks include:

Crawl Errors - When bots like Googlebot visit your website they may discover errors. You need to check and fix them. Google shows you these errors in Google Search Console. Bing does the same in Bing Webmaster Tools
Submit an XML Sitemap - You can submit your sitemap to Google and Bing. The sitemap is a list of all the URLs on your site, which makes it easier for web crawlers to find all the pages
Secure with an SSL Certificate - Make sure that your site is running over HTTPS, not HTTP
Mobile Friendly - Does your website work on mobile without errors?
Robots.txt file - The same file we have already discussed above. The set of rules that bots should follow
Page Speed - This is a large subject, but in short it is how quickly the page loads. There are many improvements you could make. The faster the page loads, the quicker the bot can read the contents

We cover many of these areas in more detail in articles on PageDart. By optimizing these technical SEO metrics your site will be easy to crawl.
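To make the sitemap task more concrete, here is a minimal sketch that builds a sitemap from a list of URLs using Python's standard library. It assumes the basic sitemaps.org format with one `loc` entry per page. The function name `build_sitemap` and the example.com URLs are made up for this example.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return sitemap XML (as a string) listing each URL in a <loc> element."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        url_element = ET.SubElement(urlset, "url")
        ET.SubElement(url_element, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical page URLs, used only for illustration.
sitemap_xml = build_sitemap([
    "https://example.com/",
    "https://example.com/blog/",
])
print(sitemap_xml)
```

You would typically save the output as sitemap.xml on your site and then submit its address to Google and Bing. Many site builders and plugins generate this file for you, so treat this as an illustration of what the file contains.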

Wrapping Up, What is a Web Crawler?

We have learned what a web crawler is. You know that it is a special software application that can read and download web pages.

There are many legitimate web crawlers owned by companies such as Google, Bing, Oracle, etc.

The most common web crawler is Googlebot.

You can also improve how easy it is for web crawlers to crawl your website by improving your technical SEO. This includes tasks such as:

Resolving crawl errors
Submitting your sitemap to Google and Bing
Securing your site using HTTPS
Having a mobile friendly website
Using your robots.txt file
Improving your page speed

Improving these technical SEO areas makes it easier for bots to crawl your site.