How does a search engine work? A basic guide for beginners
- August 19, 2019
The first search engines on the Internet were relatively simple, yet they made it possible to find the information we needed. When Google launched in 1998, it set a new standard for search, and it still leads the field today despite having about 30 competitors. How does it work?
The article is part of the study materials in the SEO course from Askpavel Academy
It is tempting to think that when we search on Google (or any other search engine), the engine scans every web page in real time. In fact, the process is longer and more involved, but it can be broken down into three easy-to-understand steps: crawling web pages, indexing them, and finally sorting them in the search results.
Step 1 – Scan
At this point the search engine discovers new web pages, or updates made to existing pages. This stage is also called “crawling”, because it is a “spider” bot that actually “crawls” through the pages and links in them. The crawl phase is performed all the time, regardless of when users actually perform a search.
The search engine’s spiders are constantly on the hunt for web pages, whether completely new pages or updates to existing ones. All of these are crawled so that the search engine’s database of web pages stays as up to date as possible.
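The idea of a spider following links from page to page can be sketched in a few lines. This is only a minimal illustration, not how any real search engine is built: the three-page site in `PAGES` is hypothetical and held in memory instead of being fetched over the network, and `crawl` and `LinkExtractor` are names made up for this example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch):
    """Breadth-first crawl from start_url; fetch(url) returns HTML or None."""
    seen = {start_url}
    queue = deque([start_url])
    order = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # the page could not be retrieved
        order.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


# Hypothetical site used instead of real network requests.
PAGES = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.com/a": '<a href="/b">B</a>',
    "https://example.com/b": '<a href="/">Home</a> <a href="/missing">?</a>',
}

print(crawl("https://example.com/", PAGES.get))
```

The `seen` set is what keeps the spider from visiting the same page twice, even though the pages link to each other in a loop.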
What content is crawled in search engines?
In principle, a search engine is supposed to crawl any type of content, but in practice there are differences between content types. The content crawled most efficiently is HTML that includes live text. Images are harder for the engine to interpret, so it relies more heavily on an additional identifying signal, the image’s Alt Text. Videos are probably beyond the current crawling ability of search engines, so it is always best to transcribe them into live text.
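To see why live text and Alt Text matter, here is a rough sketch of what a text-only reader gets out of a page. It assumes a simplified view of crawling in which only text nodes and `alt` attributes are readable; the class name `ContentExtractor` and the sample HTML are invented for the example.

```python
from html.parser import HTMLParser


class ContentExtractor(HTMLParser):
    """Gathers visible text and image alt text from an HTML document."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.text = []
        self.alt_texts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.alt_texts.append(alt)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore whitespace and anything inside <script> or <style>.
        if not self._skip_depth and data.strip():
            self.text.append(data.strip())


html = """
<h1>Running shoes</h1>
<img src="shoe.jpg" alt="Red running shoe">
<img src="banner.jpg">
<p>Lightweight shoes for daily training.</p>
"""

extractor = ContentExtractor()
extractor.feed(html)
print(extractor.text)       # live text that can be read directly
print(extractor.alt_texts)  # the only textual hint available for each image
```

The second image has no `alt` attribute, so in this simplified model it contributes nothing that can be read as text.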
What is the frequency of search engine crawling?
The answer depends on several factors, such as the default crawl frequency set by the engine’s creators and the crawl budget Google has allocated to a given type of site. Together, these parameters determine how often a site is crawled: some sites are visited every few seconds (especially large sites, such as news sites), while others are crawled by Google only every few weeks or even months.
The webmaster or site promoter can influence how often Google crawls the site by optimizing the crawl budget. You can also submit a single web page for crawling using the URL Inspection tool in Google Search Console, or submit a complete sitemap for crawling.
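A sitemap is an XML file listing the URLs you want crawled. The sketch below builds one with Python's standard library, following the public sitemaps.org format; the function name `build_sitemap`, the URLs, and the dates are placeholders for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Return sitemap XML for an iterable of (url, last_modified) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, last_modified in entries:
        node = ET.SubElement(urlset, "url")
        ET.SubElement(node, "loc").text = url
        ET.SubElement(node, "lastmod").text = last_modified
    return ET.tostring(urlset, encoding="unicode")


# Placeholder pages and dates.
sitemap_xml = build_sitemap([
    ("https://example.com/", "2019-08-19"),
    ("https://example.com/blog/", "2019-08-01"),
])
print(sitemap_xml)
```

The resulting text would typically be saved as `sitemap.xml` on the site and then submitted through Search Console.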
How do I make sure Google crawls my site?
It is very simple: type the site: operator in the search box, followed immediately by the domain address (with no space between them):
All the pages that Google displays in response to this query (17,400 pages in this example) are pages that were discovered during crawling and added to the index (we will expand on this in Step 2). To include all pages, sub-domains among them, remove www from the query and check only the root domain. If Google displays no pages at all in response to this query, it has not crawled or indexed any page of the site, and you should find out why (Is the site blocked? Has Google not yet visited the site for the first time?).
To see when Google last crawled a page, run a new query using the cache command:
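One common way a site ends up blocked is a robots.txt rule, which is an assumption here since the article does not name the blocking mechanism. Python's standard `urllib.robotparser` can test a rule set locally; the robots.txt content and URLs below are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real check would read the site's own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

home_allowed = parser.can_fetch("Googlebot", "https://example.com/")
private_allowed = parser.can_fetch(
    "Googlebot", "https://example.com/private/report.html"
)
print(home_allowed, private_allowed)
```

If the home page itself came back as not allowed, that would be a likely explanation for an empty result.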
We now get the page as Google sees it, along with the last crawl date (indicated at the top of the page).
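The site: query from this section can also be built programmatically, for example to check several domains in a row. This is only a convenience sketch: `site_query` and `google_search_url` are hypothetical helper names, and the code builds the search URL without sending any request.

```python
from urllib.parse import quote_plus, urlsplit


def site_query(url, include_subdomains=True):
    """Build a site: query, optionally dropping a leading www."""
    host = urlsplit(url).hostname or url  # accept bare domains too
    if include_subdomains and host.startswith("www."):
        host = host[len("www."):]
    return f"site:{host}"


def google_search_url(query):
    """Return the Google search URL for a query string."""
    return "https://www.google.com/search?q=" + quote_plus(query)


query = site_query("https://www.example.com/shop")
print(query)
print(google_search_url(query))
```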
Step 2 – Index
After the pages are scanned, the search engine sends them for indexing. You could say that the index is simply the directory in which all the web pages scanned by the search engine are stored.
Pages are indexed as they were on the date of the last crawl. If you made changes two minutes after Google crawled the page, you will have to wait for the next crawl before the changes appear in the index (you can check this with the cache command mentioned in the previous step). Alternatively, you can ask Google to re-crawl the page through Search Console (see the previous step).
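A toy version of an index shows why changes only become visible after the next crawl. The sketch uses an inverted index (a map from each word to the pages containing it), which is a standard textbook structure and not a description of Google's actual index; the pages and the `build_index` helper are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of URLs whose stored text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


# Snapshot of two hypothetical pages at crawl time.
crawled = {
    "https://example.com/shoes": "Sneakers and running shoes",
    "https://example.com/hats": "Summer hats",
}
index = build_index(crawled)
before = sorted(index["sneakers"])

# The hats page changes after the crawl; the existing index is unaffected.
crawled["https://example.com/hats"] = "Summer hats and sneakers"
stale = sorted(index["sneakers"])

# Only the next crawl and re-index picks up the change.
index = build_index(crawled)
after = sorted(index["sneakers"])
print(before, stale, after)
```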
Is it possible to determine what will enter and not enter the index?
This can be done using the <meta name="robots" content="noindex, nofollow" /> tag.
This is a meta tag placed in the <head> section at the top of the page. Its noindex directive tells search engines not to add the page to their index (the additional directive, nofollow, applies only to the links on the page and tells engines not to follow them).
The name robots in the tag addresses all search engines. To block indexing by one specific search engine only, replace robots with the name of that engine’s crawler. For example, this is how to block indexing on Google only:
<meta name="googlebot" content="noindex">
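To check which crawlers a page's meta tags would block, the tags can be parsed the same way a crawler would read them. The sketch below is a simplified reading of the rules described above, not Google's implementation; `RobotsMetaParser` and `may_index` are names invented for the example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Stores the comma-separated content of each named <meta> tag."""

    def __init__(self):
        super().__init__()
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        content = attributes.get("content") or ""
        if name:
            values = {v.strip().lower() for v in content.split(",") if v.strip()}
            self.directives.setdefault(name, set()).update(values)


def may_index(html, crawler="googlebot"):
    """Return False if a robots or crawler-specific tag contains noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    general = parser.directives.get("robots", set())
    specific = parser.directives.get(crawler.lower(), set())
    return "noindex" not in (general | specific)


blocked_for_all = '<head><meta name="robots" content="noindex, nofollow" /></head>'
blocked_for_google = '<head><meta name="googlebot" content="noindex"></head>'

print(may_index(blocked_for_all, "bingbot"))
print(may_index(blocked_for_google, "googlebot"))
print(may_index(blocked_for_google, "bingbot"))
```

The last call shows the point of the crawler-specific tag: the page stays available to every crawler other than the one named.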
Is it possible to remove pages that have already been indexed?
There are several ways to do this:
- Make the server return status code 404, 410, or a 5xx code for the page. These codes tell search engines that the page was not found (404), is permanently gone (410), or that the server failed (5xx). After a while, the page will be removed from the index. A 404 code is returned automatically when you delete an existing page.
- Redirect the page to another page (status codes 301, 302, 307). When you redirect page A to page B, page A will eventually disappear from the index.
- Temporarily remove the page using the URL removal tool in the Google Search Console.
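The status codes in the list above can be summarized as a small lookup. This only restates the article's description in code; the function name `index_action` and the returned labels are invented, and real search engines apply more nuanced rules and timing.

```python
def index_action(status_code):
    """Describe what eventually happens to an indexed page, per the list above."""
    if status_code in (404, 410):
        return "remove: page no longer exists"
    if 500 <= status_code <= 599:
        return "remove: server error"
    if status_code in (301, 302, 307):
        return "replace: the redirect target takes the page's place"
    return "keep"


for code in (200, 301, 404, 410, 503):
    print(code, index_action(code))
```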
Step 3 – Sort
The last step is to sort the indexed web pages and display them as search results for a specific query. This is the stage where Google’s algorithms do their work, ranking the web pages according to the query typed.
For example, if someone was looking for “sneakers in Ramat Gan” – this is the process that takes place (in a split second):
- Google consults the index, which stores trillions of web pages that were already crawled in Step 1.
- Based on the search query typed by the user, Google finds the web pages in the index that deal with the topic of the query (i.e., only pages relevant to the phrase “sneakers in Ramat Gan”).
- Google’s algorithms determine which of these pages are the most relevant and of the highest quality for the query, and list them in descending order in the search results. Results are split into pages of ten: the first through the tenth, the 11th through the 20th, and so on.
In this way we get ten results per page, sorted according to the settings and parameters of Google’s search algorithm. Whenever we change an existing page or add a new one, Google re-evaluates the page once it has been crawled and indexed again, and places it in a new position according to the changes.
The position in the search results depends on hundreds (and maybe even thousands) of different parameters: content, links, user behavior, RankBrain (the artificial intelligence component of Google search) and so on. The more we invest in the page itself and in the relevant external signals, the more likely it is to reach a higher position in Google.
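The three steps above (look in the index, keep the relevant pages, sort and split into pages of ten) can be sketched as follows. The scoring here is a simple count of query words, which stands in for Google's real and far more complex algorithm; the index contents, the `STOPWORDS` set, and the `search` function are all assumptions made for the example.

```python
import re

RESULTS_PER_PAGE = 10
STOPWORDS = {"in", "the", "a", "of"}  # assumed list of words to ignore


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search(index, query, page=1):
    """Return one page of URLs relevant to the query, best score first."""
    terms = set(tokenize(query)) - STOPWORDS
    scored = []
    for url, text in index.items():
        words = tokenize(text)
        if terms <= set(words):  # keep only pages containing every query term
            score = sum(words.count(term) for term in terms)
            scored.append((score, url))
    scored.sort(key=lambda item: (-item[0], item[1]))  # descending by score
    start = (page - 1) * RESULTS_PER_PAGE
    return [url for _, url in scored[start:start + RESULTS_PER_PAGE]]


# Hypothetical index: URL -> stored page text. Higher-numbered shops
# mention "sneakers" more often and therefore score higher.
index = {
    f"https://example.com/shop{i}": "sneakers ramat gan " + "sneakers " * i
    for i in range(12)
}
index["https://example.com/hats"] = "hats in ramat gan"

first_page = search(index, "sneakers in Ramat Gan")
second_page = search(index, "sneakers in Ramat Gan", page=2)
print(first_page)
print(second_page)
```

Twelve pages match the query, so the first page holds ten results and the remaining two spill over to the second page; the hats page never appears because it does not mention sneakers.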
Search engines are sophisticated machines, some more and some less. Today, the smartest engine belongs to Google, which invests heavily in sophisticated algorithms and artificial intelligence, in order to provide the best and most relevant results for any search.
The more you know about how search engines work, the more you can tailor your site to the requirements and get higher visibility in the