How does Google work?

Jun 4, 2009

This is something I wrote at work.

Anyone who has spent some time on the internet has used a search engine. But do we really search? Or do we ‘Google’? Google has become synonymous with search. Yet we hardly spare a thought for how Google works or, for that matter, how search works at all. It really doesn’t matter to most of us as long as the likes of Google, Yahoo! and Microsoft can dish out relevant results for our queries.

But this question matters to those who are interested in search marketing, to the inquisitive who would not shy away from some extra knowledge, and to those who would like to appreciate the incredible technology that goes into fetching those results. How does a search engine work? What happens when you enter a query? How does the search engine fetch the relevant results? And how are the results ranked? These are some of the questions that I attempt to answer in this article.

What is a search engine? A search engine is a program that methodically and automatically browses the World Wide Web, stores and indexes the data it finds, and then lets users query that data, returning results that are as relevant as possible.

The entire search begins long before you have even thought of something to search for. It begins with creating an inventory of pages in the search index. The search index maps every keyword to the web pages that contain it. However, to save space, the index does not store the web page URLs themselves, but a unique document ID that identifies each URL in a separate database.
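To make the idea concrete, here is a minimal sketch of such an index in Python. It is only an illustration, not how any real search engine stores its data: the function name `build_index`, the two example URLs and the whitespace tokenization are all my own assumptions.

```python
from collections import defaultdict

def build_index(pages):
    """Map each keyword to the set of document IDs that contain it.

    `pages` is a hypothetical dict of URL -> page text.
    """
    doc_urls = {}             # document ID -> URL (the "separate database")
    index = defaultdict(set)  # keyword -> set of document IDs
    for doc_id, (url, text) in enumerate(pages.items()):
        doc_urls[doc_id] = url
        for word in text.lower().split():
            index[word].add(doc_id)
    return index, doc_urls

# Two made-up pages, purely for illustration.
pages = {
    "http://example.com/cats": "cats chase mice",
    "http://example.com/dogs": "dogs chase cats",
}
index, doc_urls = build_index(pages)
print(index["cats"], doc_urls[1])
```

Note that the index itself holds only small integers; the URL is looked up in `doc_urls` when a result has to be shown.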

The construction of this search index begins with a spider (or crawler, or bot). The spider starts by examining the web pages in a seed list and then discovers other sites on its own by following links. It identifies links by checking the HTML code of the web pages it visits. Thus, theoretically, given enough time, a spider can find every page on the web (at least every page that is linked from at least one other page). But that is purely theoretical. Studies of how much of the web is actually indexed throw up widely divergent numbers, from 0.03% to 16% of the web.
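The link-following loop can be sketched in a few lines of Python. This is a toy under stated assumptions: `TOY_WEB` is a made-up in-memory "web" standing in for real HTTP fetches, and `crawl` and `LinkExtractor` are names I have invented for the illustration. Only `html.parser.HTMLParser` comes from the standard library.

```python
from collections import deque
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag in a page's HTML."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, fetch):
    """Visit pages breadth-first, starting from the seed list."""
    seen = set(seeds)
    queue = deque(seeds)
    visited = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:        # page no longer exists
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

# Hypothetical four-page web; nothing links to orphan.html.
TOY_WEB = {
    "a.html": '<a href="b.html">B</a> <a href="c.html">C</a>',
    "b.html": '<a href="a.html">A</a>',
    "c.html": '<a href="missing.html">gone</a>',
    "orphan.html": "no one links here",
}
found = crawl(["a.html"], TOY_WEB.get)
print(found)
```

The page `orphan.html` is never found because no other page links to it, which is exactly the limitation mentioned above.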

While crawling is probably the most efficient way of discovering web pages, it is definitely not the most efficient way of discovering changes made to a web page. This is simply because there is no guarantee of when the spider will return to a site. By then a web page could have changed dramatically or even ceased to exist. Once the spider has found a web page and added it to the index, it is time for the search engine to analyze the page.

That is just about the simplest description of what a search engine does to build the search index. Crawling, indexing and analysis could very well be the topic of a dedicated article. But that is not the point of this one. So let’s move ahead to find out what happens when you actually enter a query.

Once you have typed in your query and clicked on the search button (or pressed the enter key), the search engine starts matching the search query to pages in the search index. The first step in the process is to analyze the query. The search engine examines each word in the query to find the best-matching web pages in the search index. Query analysis involves finding word variants, correcting spellings, detecting phrases and antiphrases (words such as ‘what’, ‘is’, ‘the’), examining word order and processing search operators.
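A small Python sketch shows what a few of these analysis steps could look like. It covers only phrases in double quotes, antiphrases and word variants. The word list `STOP_WORDS`, the table `VARIANTS` and the function `analyze_query` are assumptions made for this example; real engines use far larger dictionaries and also handle spelling correction and operators.

```python
import re

# Assumed, deliberately tiny lists for illustration only.
STOP_WORDS = {"what", "is", "the", "a", "of", "for"}
VARIANTS = {"mouse": {"mice"}, "mice": {"mouse"}}

def analyze_query(query):
    """Split a query into quoted phrases, kept terms and their variants."""
    lowered = query.lower()
    phrases = re.findall(r'"([^"]+)"', lowered)       # "quoted phrases"
    rest = re.sub(r'"[^"]+"', " ", lowered)           # everything else
    terms = [w for w in re.findall(r"[a-z0-9]+", rest)
             if w not in STOP_WORDS]                  # drop antiphrases
    expanded = {t: {t} | VARIANTS.get(t, set()) for t in terms}
    return {"phrases": phrases, "terms": terms, "expanded": expanded}

result = analyze_query('What is the best "search engine" for a mouse')
print(result)
```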

Once the analysis is done, the next task for the search engine is to decide which results to present. With hundreds of thousands of possibilities this is a tough task. This is where the search index comes into use. The search engine uses the index to locate the matching pages, taking into account not only the query as you entered it but also any word variants (e.g. ‘mouse’ and ‘mice’) and any words to ignore.
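As a rough sketch of that lookup, assume the index is a plain dict from keyword to document IDs and that each query term arrives as a group containing the term and its variants. A page matches if it contains at least one word from every group. The names `search` and `term_groups` and the sample data are hypothetical.

```python
def search(index, term_groups):
    """Return the doc IDs containing at least one word from every group."""
    result = None
    for group in term_groups:
        docs = set()
        for word in group:                 # union over the variants
            docs |= index.get(word, set())
        # intersect across the different query terms
        result = docs if result is None else result & docs
    return result or set()

# Made-up index: keyword -> document IDs.
index = {"mouse": {1, 2}, "mice": {3}, "trap": {2, 3, 4}}
hits = search(index, [{"mouse", "mice"}, {"trap"}])
print(hits)
```

Document 3 is found even though it only says ‘mice’, because the variant group treats ‘mouse’ and ‘mice’ as the same term.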

Now comes the most interesting and challenging phase of the search engine’s job: ranking the matching pages. This is where the ranking algorithm comes into play (the most famous of which is Google’s PageRank™ algorithm). Ranking, very simply put, is just sorting by relevance. A variety of factors are taken into consideration while ranking the matching pages. These include keyword density, keyword proximity, keyword prominence and link popularity. Link popularity has emerged as the most popular factor in ranking since it can act as a surrogate for quality and reliability.
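To see how several factors could be combined into one sort order, here is a toy scoring function in Python. Everything specific in it is an assumption of mine: the weights, the way prominence is reduced to "does the term appear in the title", the way link popularity is squashed into a value between 0 and 1, and the three sample pages. Keyword proximity is left out for brevity. No real search engine publishes its formula.

```python
def keyword_density(words, term):
    """Fraction of the words on the page that are the query term."""
    return words.count(term) / len(words) if words else 0.0

def score(page, term, weights=(0.5, 0.2, 0.3)):
    """Weighted sum of density, prominence and link popularity."""
    words = page["text"].lower().split()
    density = keyword_density(words, term)
    prominence = 1.0 if term in page["title"].lower().split() else 0.0
    popularity = page["inbound_links"] / (1 + page["inbound_links"])
    w_density, w_prominence, w_links = weights
    return w_density * density + w_prominence * prominence + w_links * popularity

# Three hypothetical pages that all match the term "mouse".
pages = [
    {"url": "a", "title": "Mouse care",
     "text": "mouse food and mouse toys", "inbound_links": 1},
    {"url": "b", "title": "Pets",
     "text": "a mouse is a small pet", "inbound_links": 9},
    {"url": "c", "title": "Computer parts",
     "text": "keyboard mouse monitor cables", "inbound_links": 0},
]
ranked = sorted(pages, key=lambda p: score(p, "mouse"), reverse=True)
print([p["url"] for p in ranked])
```

Changing the weights changes the order, which is why the choice of factors and their relative importance is the closely guarded part of any ranking algorithm.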

Sounds simple? This is what Google has to say about their PageRank™ [1]: “We use more than 200 signals, including our patented PageRank™ algorithm, to examine the entire link structure of the web and determine which pages are most important. We then conduct hypertext-matching analysis to determine which pages are relevant to the specific search being conducted. By combining overall importance and query specific relevance, we're able to put the most relevant and reliable results first”.

So now you know what happens in those milliseconds between hitting the enter key and the search engine presenting the results to you. This article is an attempt to introduce as many people as possible to the intricacies of a piece of technology that has become so ubiquitous in our lives.

PS: About Google’s PageRank™ - PageRank™ mainly relies on the ‘democratic nature’ of the web by using its vast link structure as an indicator of an individual page's value. Important, high quality sites receive a higher PageRank™. So, Google interprets a link from page A to page B as a vote, by page A, for page B. But Google looks at a lot more than the sheer volume of links a page receives. For example, it also analyzes the page that casts the vote: votes cast by pages that are themselves important weigh more heavily and help to make other pages important. A site’s rate of link acquisition, the longevity of a link, the text used for the link, whether it is a ‘deep link’ or a link to the homepage, and whether anyone clicks on the link also seem to count.
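The ‘weighted vote’ idea can be sketched with the simplified PageRank iteration that is widely described in public. This is not Google's production system: the damping factor of 0.85, the fixed 50 iterations, the function name `pagerank` and the four-page link graph are all assumptions for illustration, and none of the other signals mentioned above are modelled.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated vote passing.

    `links` maps each page to the list of pages it links to.
    Every link target must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            # A page without outgoing links spreads its vote evenly.
            targets = outlinks or pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share   # a vote, weighted by the voter's rank
        rank = new_rank
    return rank

# Hypothetical link graph: A, C and D all link to B; B links to C.
links = {"A": ["B"], "B": ["C"], "C": ["B"], "D": ["B"]}
ranks = pagerank(links)
print(ranks)
```

In this toy graph B collects the most votes, and C ends up well ahead of A and D even though it has only one inbound link, because that single vote comes from the important page B.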

That’s about all that we know about PageRank™. The rest of the mystery is safely secured in Mountain View.


[1] http://www.google.com/corporate/tech.html

A New Beginning

Apr 25, 2009

Life after ISB is strange. And it has become even more so since I started working. Yeah. Back to the grind :)

No more late night assignments and early morning submissions. No more hanging around at the Class of 08 Lounge partying. No more Goel bashing. No more the comfort of the campus, the wonderful house-keeping support or the AC. No more high-speed, uninterrupted, access-anywhere internet.

But no one can expect to live in paradise forever. When we were leaving the campus there was a banner attached to the main gate. It said:

All the best…The real world awaits you…

Couldn’t have been more correctly put. The real world welcomed me with 40+ °C temperatures and intermittent power cuts in Calcutta, and heavy rains and horrible traffic (60 minutes for 5 km on occasion) in Bangalore.

Meanwhile I started work, or rather joined, last week. I am quite excited about the work. It is in search marketing. The fact that I have no clue about the domain makes the prospect even more exciting. All that is left now is to settle down in Bangalore a second time. Be back soon.

It is Over - Part 2

Apr 5, 2009
The time is gone, the song is over, thought I'd something more to say

It is Over - Part 1

Apr 4, 2009
Finally I am an MBA (or maybe a MBA). One year of great fun and great rigor is finally over. The only thing that is left is the graduation party later tonight.

This is going to be my last post from the campus. Will be heading back home tomorrow to find out what lies ahead. One adventure over, another is just about to begin. Till then adieu.

The End

Apr 3, 2009
This is the end
Beautiful friend
This is the end
My only friend, the end

Of our elaborate plans, the end
Of everything that stands, the end
No safety or surprise, the end
I'll never look into your eyes...again

Can you picture what will be
So limitless and free
Desperately in need...of some...stranger's hand
In a...desperate land


Morrison just about describes how I feel right now. Maybe some people won't agree with me, but still. I am almost an MBA now. The graduation ceremony is all that is left. That is tomorrow.

It is all getting a bit nostalgic now. This has been one awesome year. All the 5 AM assignment submissions, the dunkings, the study group meets, assignments, handouts, blackboard, CP, deadlines. All over. We now go back to our old lives. It is not going to be too easy.

As I write, I am sitting with SD, SH and SM, just enjoying the last few moments of what is left of ISB. This has been one hell of a year. Not to be forgotten ever. I will not go as far as claiming this has been the best yet, since I still have my engineering history to consider. But then I would probably be comparing apples and oranges if I tried to compare engineering and MBA. And that is something I have learnt not to do during the MBA.

We are now more mature than we were during engineering. And hence we are better at choosing who we make friends with. Ergo, I believe that the friends I have made here are probably closer to me than those from engineering. This is in no way to undermine the beautiful friends I have from engineering, who I am sure would do anything I asked of them. But all this is beside the point.

This has just been one of the greatest years of my life. A lot has happened since my last post. But I don't want to spoil my feelings right now by delving into what has been. I would just love to live the rest of my life with the people I have so enjoyed being with. This is a tribute to them.

The core group - SD, SH, SM and YM.
Special mention - DD, AV, RB, RR, AJ and RK.

Interesting Question

Feb 27, 2009
What would you do if you knew you could not fail ?

Delhi 6

Feb 25, 2009

I am a little later than promised with writing about Delhi 6, so sorry. Term 8 has started. Suddenly going back to class after a 17 day break caught me napping. In between I was glued to the TV screen watching Slumdog conquer the Oscars. Now back to Delhi 6.

Coming after the resounding success of RDB, Delhi 6 has to live up to a lot of expectations. Which, I felt, it does quite well. Most of the people I went to watch Delhi 6 with were complaining about the ending of the movie. I agree it was a little too stretched and melodramatic. But then RDB had the same flaws. Otherwise it is not right to compare the two movies. RDB was more radical and youth focused. Hence, it was certain to attract a lot of attention. Delhi 6 is about characters, each of them different, each balancing their own complexities. Such a movie requires great performances more than a great story. And Delhi 6 is great in this respect. The performances of all the characters are great. Not so much Abhishek Bachchan, though. His character does not develop much through the movie and there does not seem to be any conflict within him. Maybe it is the character, or maybe it is his execution.

On the other hand Sonam Kapoor is marvelous. Her stellar performance as a character caught between marrying the groom of her father's choosing and becoming the Indian Idol is the mainstay of the movie. Besides her, the other characters in the movie, especially Waheeda Rehman, Pawan Malhotra and Vijay Raaz, excel in their roles too. Then there's Rishi Kapoor, who is proving to be a safe bet for any kind of character these days, be it the film producer of Luck By Chance or the father's friend in Delhi 6.

However, despite all the great performances in Delhi 6, the best of them does not even appear on screen: A R Rahman. Two Oscars in the bag, but Delhi 6 is way better than Slumdog.

PS: There is this little bit of Harry Potter inspiration in the movie as well. In a scene reminiscent of Harry's conversation with Dumbledore in an empty King's Cross station after Harry's battle with Voldemort in the Deathly Hallows, Abhishek meets his grandfather, played by his father, when he is shot by Mamdu. Was that deliberate? Or did the director think there was not enough overlap between Potter fans and the people who would see Delhi 6?

Color, Black and White

Feb 21, 2009



Sunset in Europe



Ennui

I have nothing to do. Since coming back from Calcutta (was there for a couple of days to attend my cousin's marriage) I have treated myself to nine movies in 3 days.
  • Indiana Jones and the Kingdom of the Crystal Skull
  • Underworld
  • Underworld Evolution
  • The Ruins
  • The Da Vinci Code
  • 30 Days of Night
  • Max Payne
  • Pirates of the Caribbean - At World's End
  • Delhi 6
Please do not censure me for the choice of movies. My choices are limited by whatever is available on the network here before IT takes action. I will write about Delhi 6 tomorrow.

PS: SM and I want to make a movie on Calcutta. A good movie that shows the true Calcutta beyond what is generally accepted. Anyone willing to finance it?? Please leave a comment if you can trust us :)