Traditional search engines have three basic components: a crawler, an indexer, and a user interface. The crawler is a program that starts with a seed or source URL (Uniform Resource Locator), scans the web page at that URL, follows each of the links on the page, and submits each discovered link to the indexer. The crawler then scans each web page at the discovered URLs to find further links, and the process repeats recursively. Crawling continues until a system administrator stops it manually, a predetermined maximum crawl time is reached, or every reachable URL has been traversed. A shortcoming of the crawling process is that web pages not linked to by any other web page are easily overlooked, and therefore never indexed. The crawling process can also take weeks or months.
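The traversal described above can be sketched as a breadth-first walk over a link graph. The sketch below uses a hypothetical in-memory graph in place of real network fetches (the URLs and the `LINK_GRAPH` table are invented for illustration); a real crawler would download each page and extract its links.

```python
from collections import deque

# Hypothetical link graph standing in for fetched pages; a real crawler
# would download each URL and parse out its <a href> links.
LINK_GRAPH = {
    "http://seed.example/": ["http://a.example/", "http://b.example/"],
    "http://a.example/": ["http://c.example/"],
    "http://b.example/": ["http://a.example/"],
    "http://c.example/": [],
    # No page links to this one, so the crawl never discovers it.
    "http://orphan.example/": [],
}

def crawl(seed, max_pages=100):
    """Breadth-first crawl from a seed URL, submitting each page for indexing."""
    visited, frontier = set(), deque([seed])
    while frontier and len(visited) < max_pages:  # stop at the crawl limit
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)                          # "submit to the indexer"
        for link in LINK_GRAPH.get(url, []):      # links found on the page
            if link not in visited:
                frontier.append(link)
    return visited

crawled = crawl("http://seed.example/")
```

Note that `http://orphan.example/` is never reached, illustrating the shortcoming above: a page with no inbound links is invisible to the crawler.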
The indexer is a program that scans the words or other content of the crawled web pages to populate a massive database called an index. The user interface (also known as a search engine) is a program that presents an Internet user, or searcher, with an input medium for entering search criteria, for example keywords or a media type. The search engine program checks the index against the search criteria to return a set of relevant search results. Typically, a search results page (SRP) is returned, listing all the web pages or documents that match the user's search criteria.
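A minimal sketch of this index-then-query pipeline is an inverted index: each word maps to the set of documents containing it, and a keyword query is answered by intersecting those sets. The document collection and identifiers below are invented for illustration.

```python
from collections import defaultdict

# Hypothetical tiny document collection; a real index covers billions of pages.
DOCS = {
    "doc1": "search engines crawl and index the web",
    "doc2": "the crawler follows links on each web page",
    "doc3": "an index maps words to documents",
}

def build_index(docs):
    """Inverted index: each word maps to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return documents matching every keyword in the query (AND semantics)."""
    hits_per_term = [index.get(word, set()) for word in query.lower().split()]
    return set.intersection(*hits_per_term) if hits_per_term else set()

index = build_index(DOCS)
hits = search(index, "web index")  # only doc1 contains both keywords
```

The intersection gives AND semantics; a real engine would also tokenize more carefully and rank the matches rather than just return them.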
First generation full-text search engines rank the search results based on a statistical analysis of word relationships within the matched document, i.e. based only on the content of the document itself. The statistical analysis considers the number of phrases in the document that match the search criteria, the size of the document, the proximity and location of the matching terms relative to one another, and so on. Examples of first generation search engines are AltaVista and Excite.
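As a rough sketch of this kind of content-only ranking, the scoring function below counts term matches and normalizes by document size, so both factors the paragraph mentions influence the score. This is a deliberately crude stand-in for the statistical analysis described, not any particular engine's formula, and the sample documents are invented.

```python
def score(document, query):
    """Content-only score: matching-term frequency normalized by document
    length (a crude stand-in for first-generation statistical ranking)."""
    words = document.lower().split()
    terms = set(query.lower().split())
    if not words:
        return 0.0
    matches = sum(1 for word in words if word in terms)
    return matches / len(words)

# Hypothetical documents, ranked for the query "web search".
docs = [
    "the history of the internet",
    "web search engines rank web pages",
]
ranked = sorted(docs, key=lambda d: score(d, "web search"), reverse=True)
```

The document mentioning "web" twice and "search" once outranks the one with no matching terms, purely from its own content.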
Second generation full-text search engines, for example Google, look beyond the matching document to determine the rank of the search results. Google uses PageRank, which measures how many external web pages link to the matching document. The theory behind PageRank is that more important or relevant documents are referred to, or linked to, more often, particularly by external web pages that themselves have high PageRank. A shortcoming of PageRank is that the links in external web pages can be outdated or obsolete, so they may not reflect the current popularity of the document.
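The idea that a page's importance flows from the importance of the pages linking to it can be sketched with the standard power-iteration form of PageRank. The four-page link graph and the damping factor of 0.85 below are illustrative assumptions, not data from any real crawl.

```python
# Hypothetical link graph: each key links to every page in its value list.
LINKS = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def pagerank(links, damping=0.85, iterations=50):
    """Iteratively redistribute each page's rank across its outbound links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page keeps a small base rank (the "random surfer" term).
        new = {p: (1 - damping) / n for p in pages}
        for page, outbound in links.items():
            share = rank[page] / len(outbound) if outbound else 0.0
            for target in outbound:
                new[target] += damping * share
        rank = new
    return rank

ranks = pagerank(LINKS)
```

Page C, which is linked to by A, B, and D, ends up with the highest rank, while D, which nothing links to, keeps only the base rank, matching the intuition that inbound links from well-ranked pages confer importance.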