We use Google daily to find information across billions of pages in under a second.
Do we ever stop to think how magical this is? How does a search index work?
Today we are going to dive into search and look at what a search index is.
In this article we will cover:
Let's get started.
What is a search index?
The job of a search index is to make it fast to find web pages.
Google builds the world's most-used search index, which many of us use every day.
The search index is “the magic” and the reason queries return results so quickly. When you consider that Google can report billions of matching results in less than a second, it is impressive!
Have you ever wondered how they do it?
The search index's job is to store and retrieve documents as fast as possible. These documents are usually web pages, but they can be video, audio, PDFs, etc. We can add any digital file to an index.
Next, let's look at how indexes are made.
How are search indexes made?
Let's pretend that we are going to build a search engine.
At first, this can seem daunting! We don't have the power of Google.
Yet, the concepts of a search engine are easy to understand.
We start with a set of documents. These might be the web pages on a site, but they can be anything, such as Word documents or PDFs.
We then take each document and add it to a corpus.
This is a strange word when you first hear it, but a corpus is simply a collection of texts or documents.
At this point, we have a collection of documents. Imagine that you needed to find a phrase inside this corpus. You would have to loop through every document in the corpus, checking each one for the phrase.
This is possible, but as I'm sure you can imagine, it is going to be slow.
The larger the corpus the longer this will take to complete.
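To see why this is slow, here is a minimal sketch of the loop described above. The corpus contents, the file names, and the function name scan_for_phrase are all made up for illustration; a real corpus would be far larger.

```python
# Hypothetical three-document corpus: maps a document name to its text.
corpus = {
    "page1.html": "How does a search index work",
    "page2.html": "A corpus is a collection of documents",
    "page3.html": "Google builds a search index",
}


def scan_for_phrase(corpus, phrase):
    """Return the names of all documents whose text contains the phrase."""
    phrase = phrase.lower()
    matches = []
    # Every document is visited on every search, so the work
    # grows in step with the size of the corpus.
    for name, text in corpus.items():
        if phrase in text.lower():
            matches.append(name)
    return matches


print(scan_for_phrase(corpus, "search index"))
```

With three short documents this is instant, but the same loop over millions of pages would have to read every page for every query.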
We need our search engine to be faster!
An index is a solution to this problem. Instead of looping through all the files, we can build an index to make finding words faster.
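As a rough sketch of the idea, the code below builds a small index once and then answers word lookups without rescanning the documents. It assumes one common layout, where each word maps to the set of documents containing it (often called an inverted index). The corpus, the file names, and the functions build_index and lookup are hypothetical, and splitting on whitespace is a deliberate simplification.

```python
from collections import defaultdict

# Hypothetical three-document corpus: maps a document name to its text.
corpus = {
    "page1.html": "How does a search index work",
    "page2.html": "A corpus is a collection of documents",
    "page3.html": "Google builds a search index",
}


def build_index(corpus):
    """Map each lowercase word to the set of documents that contain it."""
    index = defaultdict(set)
    for name, text in corpus.items():
        # Naive tokenization: lowercase, then split on whitespace.
        for word in text.lower().split():
            index[word].add(name)
    return index


def lookup(index, word):
    """Return the sorted names of documents containing the word."""
    # A single dictionary lookup replaces the loop over every document.
    return sorted(index.get(word.lower(), set()))


index = build_index(corpus)
print(lookup(index, "search"))
```

The cost of reading every document is paid once, when the index is built. After that, each lookup is a single dictionary access, however many documents the corpus holds.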
But how does an index help?
An index is a lookup table that records the words inside each document alongside a link to that document. As a simplified example, we could have a lookup table like this: