What is a search index?
We take search for granted today on the internet.
We use Google daily to find information across billions of pages in under a second.
Do we ever stop to think how magical this is? How does a search index work?
Today we are going to dive into search and look at what a search index is.
Let's get started.
What is a search index?
The job of a search index is to make it fast to find web pages.
Google builds the world's most used search index, which you probably use every day.
The search index is “the magic” and the reason queries return results so quickly. When you consider that Google can report billions of matching results in less than a second, it is impressive!
Have you ever wondered how they do it?
A search index's job is to store and retrieve documents as fast as possible. These documents are usually web pages, but they can be video, audio, PDFs, etc. We can add any digital file to an index.
Next, let's look at how indexes are made.
How are search indexes made?
Let's pretend that we are going to build a search engine.
At first, this can seem daunting! We don't have the power of Google.
Yet, the concepts of a search engine are easy to understand.
We start with a set of documents. These are usually web pages on a site, but they can be anything, such as Word documents or PDFs.
We then take each document and add it to a corpus.
Corpus!?
This is a strange word when you first hear it, but a corpus is simply a collection of texts or documents.
At this point, we have a collection of documents. Can you imagine if you needed to find a phrase inside this corpus? You would have to loop through every document in the corpus, checking each one for the phrase.
This is possible but, as I'm sure you can imagine, it is going to be slow.
The larger the corpus the longer this will take to complete.
We need our search engine to be faster!
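To make the slow approach concrete, here is a minimal sketch of looping through every document. The `linear_search` function and the three sample documents are made up for illustration; they are not from any real search engine.

```python
def linear_search(corpus, phrase):
    """Return the ids of documents whose text contains the phrase."""
    phrase = phrase.lower()
    matches = []
    # Every search has to read every document, so the work grows
    # with the size of the corpus.
    for doc_id, text in corpus.items():
        if phrase in text.lower():
            matches.append(doc_id)
    return matches


# Hypothetical corpus, invented for this example.
corpus = {
    "Document 1": "The farmer walks home",
    "Document 2": "The cow says moo",
    "Document 3": "A cow eats grass",
}

print(linear_search(corpus, "the cow"))  # ['Document 2']
```

With three documents this is instant, but the same loop over billions of pages would have to read all of them for every single search.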
An index is a solution to this problem. Instead of looping through all the files, we can build an index to make finding words faster.
But how does an index help?
An index is a lookup table that lists each word alongside links to the documents that contain it. As a simplified example, we could have a lookup table like this:
Word | Documents |
---|---|
the | Document 1, Document 3, Document 4, Document 5, Document 7 |
cow | Document 2, Document 3, Document 4 |
says | Document 5 |
moo | Document 7 |
In the above example, we have created an index that lists the words found in the documents. Next to each word are the documents that contain it.
If we are looking for “the cow”, we can see that the only documents containing both words are Document 3 and Document 4.
This is an index, albeit a straightforward one. The principle is the same, even for the likes of Google's index. Yet, the algorithms used to generate their index are much more complex.
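The lookup table above can be sketched in a few lines of code. This is a simplified illustration, assuming words are split on whitespace and that a query matches a document when the document contains every query word. The names `build_index`, `search`, and `table_index` are made up for this example.

```python
from collections import defaultdict


def build_index(corpus):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Start with the documents for the first word, then keep only
    # those that also appear under each remaining word.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# The lookup table from the example, written out by hand.
table_index = {
    "the": {1, 3, 4, 5, 7},
    "cow": {2, 3, 4},
    "says": {5},
    "moo": {7},
}

print(sorted(search(table_index, "the cow")))  # [3, 4]
```

Notice that `search` never reads the documents themselves. It only looks up two entries in the table and intersects them, which is why an index stays fast as the corpus grows.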
What's the need for indexing by search engines?
As we have shown, the purpose of creating the index is to make searching fast. An index is judged by the speed and accuracy with which it can return relevant web pages.
Google takes your web pages and adds them to the index so that they can return the page for searches. They need to tune and improve the index to make it as fast as possible.
When you add your web pages to Google, you are including them in the world's largest search index.
How does Google add a page to the index? They use a search engine crawler.
What is search engine crawling?
To recap, we need to gather the documents, add them to a corpus, and then build an index.
That sentence should now make sense!
How do you get the documents in the first place?
This is the job of a crawler. Google has a web crawler called Googlebot which will visit your site and look at the content.
It will then take the content and store it in a cache as a document.
It will do this for all the pages on your site and for other sites on the internet. Making sure that your site is discoverable by Googlebot is part of Technical SEO.
Tuning your Technical SEO will improve how your content appears in Google Search. It deals with making sure that crawlers can find all your content.
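The crawl-and-cache loop described above can be sketched as follows. This is a toy illustration and not how Googlebot is implemented: the `crawl` function, the `fetch` callable, and the in-memory `fake_site` are all assumptions made for this example. A real crawler would download pages over HTTP, respect robots.txt, and limit how fast it requests pages.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Visit pages breadth-first and return a {url: html} cache."""
    cache = {}
    queue = deque([start_url])
    while queue and len(cache) < max_pages:
        url = queue.popleft()
        if url in cache:
            continue  # already visited
        html = fetch(url)
        if html is None:
            continue  # page could not be fetched
        cache[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in cache:
                queue.append(link)
    return cache


# A fake three-page site, so the sketch runs without network access.
fake_site = {
    "https://example.com/": '<a href="/about">About</a> <a href="/blog">Blog</a>',
    "https://example.com/about": '<a href="/">Home</a>',
    "https://example.com/blog": "<p>No links here</p>",
}

cache = crawl("https://example.com/", fake_site.get)
print(sorted(cache))
```

The crawler only finds pages that are linked from pages it has already visited. A page with no links pointing to it would never enter the queue, which is one reason discoverability matters.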
Can crawlers find all your important content?
Making sure that Googlebot can visit the pages on your site is a complex subject. Technical SEO covers many aspects of your web pages including:
- Page Speed
- Security
- Robots.txt
- Sitemaps
- WHOIS
- AMP
- Canonical Tags
- Image Optimization
- Broken Links
- Mobile Friendly
- HTML errors/W3C validation
- Site Architecture
- Site Uptime
- Schema.org Usage
- Redirects
The above factors affect whether your page appears and how it appears in the search results.
To discover more on this subject, take a look at this overview of technical SEO.
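As one small example from the list above, a robots.txt file tells crawlers which parts of a site they may visit. The sketch below uses Python's standard `urllib.robotparser` module to check whether a crawler may fetch a URL. The rules and the example.com URLs are made up for illustration, and this parser may not interpret every rule exactly as Googlebot does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Pages outside /private/ may be crawled; pages inside may not.
blog_allowed = parser.can_fetch("Googlebot", "https://example.com/blog/post")
private_allowed = parser.can_fetch("Googlebot", "https://example.com/private/report")

print(blog_allowed)     # True
print(private_allowed)  # False
```

If an important page falls under a `Disallow` rule by mistake, a crawler that respects robots.txt will skip it.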
Can I see how a Googlebot crawler sees my pages?
The easiest way to see if Googlebot can see your pages is to use Google Search Console. This free tool from Google gives you feedback on how your site is being crawled.
Once you have set up Google Search Console you can log in and see things like:
- When Googlebot visits your site
- Any errors with crawling
- The search performance on Google
- Which keywords and pages are ranking
- AMP pages
- Schema markup
Once you have Google Search Console configured, you can do the same for Bing with Bing Webmaster Tools.
How are search results returned from an index?
We have now crawled the pages on the site. Once crawled, these documents go into the corpus, and from this we can create a search index. But what determines the order of the results?
This is the job of search ranking.
We start with a query. In the simple example above, our search query is “the cow”.
A search index returns the documents that contain our query phrase.
Search engines use a ranking algorithm that orders the documents by relevance. Once we get back some documents from the index it is the ranking algorithm that will set the order.
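To show the idea of ordering by relevance, here is a deliberately simple ranking sketch. It assumes that a document mentioning the query words more often is more relevant, which is only a toy scoring rule chosen for illustration and not the algorithm any real search engine uses. The `rank` function and the document contents are made up.

```python
from collections import Counter


def rank(documents, query):
    """Order matching documents by how often they mention the query words."""
    words = query.lower().split()
    scored = []
    for doc_id, text in documents.items():
        counts = Counter(text.lower().split())
        # Only documents containing every query word are results.
        if all(counts[word] > 0 for word in words):
            score = sum(counts[word] for word in words)
            scored.append((score, doc_id))
    # Highest score first; ties are broken by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


# Hypothetical contents for the two documents that matched "the cow".
documents = {
    "Document 3": "the cow",
    "Document 4": "the cow and the other cow",
}

print(rank(documents, "the cow"))  # ['Document 4', 'Document 3']
```

Document 4 comes first because it mentions the query words four times, while Document 3 mentions them twice.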
Google keeps the details of its ranking algorithm secret so that people cannot easily cheat it.
Yet, some reports suggest that it includes over 200 factors.
To get a better ranking, you need to improve your site's SEO.
Wrapping up: What is a search index?
You have learned exactly what a search index is.
You know that an index helps us quickly find documents that contain a search query.
Google's search index is the biggest, and it contains billions of web pages, videos, and audio files.
Google uses a crawler to fetch the content of your web pages and stores it in its corpus.
It then uses this corpus to build the index.
Finally, it uses a search ranking algorithm to order the results so the best results appear at the top of the page.
Technical SEO is the practice of making sure that your web pages appear in the index.
SEO is the practice of improving your ranking in the results.