How Google Works
Chapter 11 has a great deal of information about how Google works. Here are the basics.
An autonomous piece of software, of the kind generally known as a bot or Webbot and specifically called the Google Web crawler (or spider), searches the Web and retrieves Web pages. Google’s Web crawler operates continuously to keep its index up to date.
Meanwhile, the Google indexing software rips through the page and pulls keywords out of it. While the most important function of this software is to throw away words that shouldn’t be indexed, such as articles and prepositions (“a,” “the,” “for,” and so on), it also performs other functions.
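As a rough illustration of the stop-word step described above, here is a minimal sketch in Python. The `STOP_WORDS` list and the `extract_keywords` function are hypothetical names invented for this example; the text does not say which words Google actually discards or how its indexing software is written.

```python
import re

# Hypothetical stop-word list for illustration only; the real list is not given in the text.
STOP_WORDS = {"a", "an", "the", "for", "of", "in", "on", "to", "and"}

def extract_keywords(text):
    """Lowercase the text, split it into words, and drop the stop words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [word for word in words if word not in STOP_WORDS]

keywords = extract_keywords("The guide for the Web crawler")
print(keywords)
```

Only the words that survive this filter would go on to be indexed.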
It can be hard to know whether a Web page is on the up-and-up. Google does its part in helping you with this evaluation by running pages through content analysis software before you ever see them, to help determine what a page is really about.
Google’s fairly intelligent software tries to make sure that Google’s indexing analysis is not skewed by measures such as the use of phony meta tags. This hypertext-matching analysis looks at the full content of a page. It looks at formatting, locations of words, fonts, and the subdivisions on each page to figure out the location of each word. Google even looks at the material of related Web pages to make sure that results are relevant. The engine is smart enough to know that words in larger bold fonts (such as headlines) tend to be more important in determining the content of a document, so these words are given more weight than the fine print.
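The idea that headline words count for more than fine print can be sketched in a few lines of Python. The weights in `FORMAT_WEIGHTS` and the `weigh_words` function are assumptions made up for this example; Google's actual weighting is not described in the text.

```python
# Hypothetical weights: the text says only that headline words count for more,
# not by how much.
FORMAT_WEIGHTS = {"headline": 3.0, "bold": 2.0, "body": 1.0}

def weigh_words(fragments):
    """Score each word by the formatting of every fragment it appears in.

    fragments is a list of (format_name, text) pairs.
    """
    scores = {}
    for format_name, text in fragments:
        weight = FORMAT_WEIGHTS.get(format_name, 1.0)
        for word in text.lower().split():
            scores[word] = scores.get(word, 0.0) + weight
    return scores

page = [
    ("headline", "Tuning Engines"),
    ("body", "engines need regular tuning and oil"),
]
scores = weigh_words(page)
print(scores)
```

In this toy page, "tuning" and "engines" end up with higher scores than "oil" because they also appear in the headline.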
A retrieved page is itself cached, or stored, in the Google document servers (called in Googlese the doc servers, or doc server farm), along with a PageRank. (The PageRank is used as a measurement to sort documents by importance.) With the text of a document stored in the doc servers, the post-analysis keyword content of a Web page is used to populate the Google index servers. Keywords stored in the index servers point to each document that contains the term in the doc server farm.
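The relationship between the index servers and the doc servers resembles what is usually called an inverted index: each keyword points to the documents that contain it. Here is a minimal sketch, assuming documents are identified by made-up IDs such as `doc1`; the `build_index` function and the sample data are hypothetical.

```python
from collections import defaultdict

def build_index(docs):
    """Map each keyword to the sorted IDs of the documents that contain it.

    docs maps a document ID to the list of keywords pulled from that document.
    """
    index = defaultdict(set)
    for doc_id, keywords in docs.items():
        for keyword in keywords:
            index[keyword].add(doc_id)
    return {keyword: sorted(ids) for keyword, ids in index.items()}

# Stand-in for the post-analysis keyword content of two stored pages.
doc_store = {
    "doc1": ["google", "crawler"],
    "doc2": ["google", "pagerank"],
}
index = build_index(doc_store)
print(index)
```

Looking up "google" in this index leads to both documents, while "pagerank" leads only to `doc2`.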
When a user makes a search request, the Google Web server sends it on to software that analyzes the request to strip out words that are not indexed (mostly stripping articles and prepositions). It then sends the keywords in the request, with a proximity rating, on to the index server farm. The index servers, along with the doc servers:
- Determine the documents pointed to by the keywords
- Sort these documents using each one’s PageRank
- Provide links to these documents on the Web
- Provide a link to view the cached version of the document in the doc server farm
- Pull an excerpt from the page, using the cached version of the page, to give a quick idea of what it is about
- Return an initial result set of document excerpts and links, with links to retrieve further result sets of matches, rendered as HTML.
By default, Google returns results in sets of ten matches (as an HTML page). You can change the number of results you want to see on the Google Preferences page.
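The lookup, sort, and paging steps can be sketched as a single Python function. This is a simplified illustration, not Google's implementation: the `search` function, the rule that a document must contain every keyword, and the sample data are all assumptions, and the proximity rating, excerpts, and HTML rendering are left out.

```python
def search(query_keywords, index, pagerank, page_size=10, page=0):
    """Return one result set of document IDs, highest PageRank first.

    index maps a keyword to the IDs of the documents that contain it;
    pagerank maps a document ID to its score.
    """
    matches = None
    for keyword in query_keywords:
        ids = set(index.get(keyword, []))
        # Assumption: a document must contain every keyword in the query.
        matches = ids if matches is None else matches & ids
    if not matches:
        return []
    # Sort by descending PageRank; break ties by ID so the order is stable.
    ranked = sorted(matches, key=lambda doc_id: (-pagerank.get(doc_id, 0.0), doc_id))
    start = page * page_size
    return ranked[start:start + page_size]

sample_index = {"google": ["a", "b", "c"], "crawler": ["b", "c"]}
sample_pagerank = {"a": 0.9, "b": 0.2, "c": 0.5}
print(search(["google", "crawler"], sample_index, sample_pagerank))
```

The `page_size` default of ten mirrors the default result set size, and raising `page` retrieves the next set of matches.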
Google prides itself on the fact that most queries are answered in less than half a second. Considering the number of steps involved in answering a query, you can see that this is quite a technological feat.
Understanding PageRank
The PageRank algorithm is used to sort pages returned by a Google search request.
The underlying idea behind PageRank is an old one that librarians used in the pre-Web past to provide an objective method of scoring the relative importance of scholarly documents. The more citations other documents make to a particular document, the more “important” the document is, the higher its rank in the system, and the more likely it is to be retrieved first.
Let me break it down for you:
Each Web page is assigned a number depending upon the number of other pages that link to the page.
The crucial element that makes PageRank work is the nature of the Web itself, which depends almost solely on the use of hyperlinking between pages and sites. In the system that makes Google’s PageRank algorithm work, links are a Web popularity contest: if Webmaster A thinks Webmaster B’s site has good information (or is cool, or looks good, or is funny), Webmaster A may decide to add a link to Webmaster B’s site. In turn, Webmaster B might return the favor.
A link from Web site A to Web site B is called an outbound link (from A) and an inbound link (to B).
The more inbound links a page has (references from other sites), the more likely it is to have a higher PageRank.
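Counting inbound links is the simplest version of this idea. The sketch below assumes links are given as (source, target) pairs; the `count_inbound` function and the site names are made up for illustration.

```python
from collections import Counter

def count_inbound(links):
    """Count the distinct inbound links each site receives.

    links is a list of (source, target) pairs. Duplicate pairs and links
    from a site to itself are ignored.
    """
    unique_links = set(links)
    return Counter(target for source, target in unique_links if source != target)

links = [("A", "B"), ("C", "B"), ("B", "A"), ("A", "B")]
counts = count_inbound(links)
print(counts["A"], counts["B"], counts["C"])
```

Here site B has two inbound links (from A and C), site A has one, and site C has none.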
However, not all inbound links are of equal weight when it comes to how they contribute to PageRank—nor should they be. A Web page gets a higher PageRank if another significant source (by significant source I mean a source that also receives a lot of inbound links, and thus a higher PageRank) links to it than if a trivial site without traffic provides the inbound link.
A link from a high PageRank page counts for more than a link from a low-ranking page.
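To see how a link from a high-ranking page can count for more, here is a sketch of the simplified, publicly described form of the PageRank calculation. It is not the version Google runs today. The damping value of 0.85, the fixed number of iterations, the handling of pages with no outbound links, and the page names are all assumptions chosen for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate ranks by repeatedly passing each page's rank along its links.

    links maps a page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outbound links spreads its rank to all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank

# "hub" has three inbound links; "small" has none. Each links to one page.
web = {
    "x1": ["hub"], "x2": ["hub"], "x3": ["hub"],
    "hub": ["favored"],
    "small": ["other"],
}
ranks = pagerank(web)
print(ranks["favored"] > ranks["other"])
```

Both "favored" and "other" have exactly one inbound link, but "favored" ends up with the higher score because its link comes from the better-connected "hub" page.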
The more sophisticated version of the PageRank algorithm currently used by Google involves more than simply crunching the number of links to a page and the PageRank of each page that provides an inbound link. While Google’s exact method of calculating PageRank is shrouded in proprietary mystery, PageRank does try to exclude links from so-called link farms, pages that contain only links, and mutual linking (which are individual two-way links put up for the sole purpose of boosting PageRanks).
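Since Google's actual method is proprietary, the following is only a naive illustration of what excluding mutual links could look like. The `drop_mutual_links` function is hypothetical, and it discards every two-way link because the purpose behind a link cannot be read from the link data alone.

```python
def drop_mutual_links(links):
    """Remove every link whose reverse link is also present.

    links is a list of (source, target) pairs. This naive filter cannot tell
    an honest two-way link from one put up only to boost rankings.
    """
    link_set = set(links)
    return [(source, target) for source, target in links
            if (target, source) not in link_set]

filtered = drop_mutual_links([("A", "B"), ("B", "A"), ("C", "A")])
print(filtered)
```

Only the one-way link from C to A survives this filter.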