Information retrieval systems are used to find relevant information in large data sets. Universities and public libraries use information retrieval systems to provide access to books, journals and other documents, whereas large enterprises use them to provide internal access to their large document collections. Web search engines (e.g. Google) are the most visible information retrieval systems. A typical implementation of an information retrieval system might include 1) a document collection subsystem, 2) an indexing subsystem, and 3) a searching and ranking subsystem.
A typical document collection subsystem (e.g. a web crawler) takes a list of document references (e.g. URLs) and retrieves documents from the locations indicated by these document references. After retrieval, each document is passed on to the indexing subsystem along with its corresponding document reference. The documents are also parsed, and any document references found within them are extracted. These document references are then added to the list which the document collection subsystem uses for retrieving further documents.
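By way of illustration only, the retrieve, parse and extend cycle described above might be sketched as follows. The names `crawl`, `LinkExtractor` and `FAKE_WEB` are hypothetical, and the `fetch` callable is an assumption standing in for a real network retrieval; here it simply reads from an in-memory dictionary so that the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_refs, fetch, max_docs=100):
    """Retrieve documents breadth-first, starting from seed_refs.

    fetch(ref) is assumed to return the document text, or None on failure.
    Returns (reference, document) pairs ready to pass to an indexer.
    """
    frontier = deque(seed_refs)   # references still to be retrieved
    seen = set(seed_refs)         # avoids retrieving a reference twice
    collected = []
    while frontier and len(collected) < max_docs:
        ref = frontier.popleft()
        document = fetch(ref)
        if document is None:
            continue
        collected.append((ref, document))
        # Parse the document and add any newly found references to the list.
        parser = LinkExtractor()
        parser.feed(document)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return collected


# Hypothetical in-memory stand-in for a document store or the web.
FAKE_WEB = {
    "a": '<a href="b">b</a> <a href="c">c</a>',
    "b": '<a href="a">back to a</a>',
    "c": "no links here",
}

pages = crawl(["a"], FAKE_WEB.get)
```

A practical crawler would typically also resolve relative references, respect access policies and limit its request rate; these are omitted here for brevity.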
A typical indexing subsystem takes the documents with their corresponding document references and uses them to create and update a searchable index, in which the associations between the documents and individual words and other data are stored in such a way as to enable efficient lookups. The documents or the words are often ranked in order to determine which documents are the most relevant to a given word.
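One common way to store such associations is an inverted index, which maps each word to the document references in which it occurs. The following is a minimal sketch under stated assumptions: the names `tokenize` and `build_index` are hypothetical, and the per-document occurrence count is used merely as one simple example of ranking information that could be stored alongside each word.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Build an inverted index from (reference, text) pairs.

    The result maps word -> {reference: occurrence count}.
    """
    index = defaultdict(dict)
    for ref, text in documents:
        for word in tokenize(text):
            index[word][ref] = index[word].get(ref, 0) + 1
    return index


docs = [
    ("doc1", "Search engines index documents"),
    ("doc2", "Libraries index books and index journals"),
]
index = build_index(docs)
```

Looking up a word is then a single dictionary access, for example `index["index"]`, which returns the references of both documents together with their occurrence counts.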
A typical searching and ranking subsystem uses the search information (e.g. keywords typed into a Google search box) to perform a lookup in the searchable index, and retrieves and ranks the resulting set of document references. Sometimes the actual documents or extracts of the documents are also included in the results.
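As a hedged illustration of the lookup and ranking step, the sketch below scores each document reference by summing the stored occurrence counts of the query keywords. The function name `search`, the hand-written `INDEX` and the scoring rule are all assumptions chosen for brevity, not a description of any particular search engine.

```python
def search(index, query, limit=10):
    """Look up each query keyword and return document references, best first.

    index is assumed to map word -> {reference: occurrence count}.
    """
    scores = {}
    for term in query.lower().split():
        for ref, count in index.get(term, {}).items():
            scores[ref] = scores.get(ref, 0) + count
    # Sort by descending score; ties are broken by reference for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [ref for ref, _ in ranked[:limit]]


# Hypothetical searchable index, written by hand for this example.
INDEX = {
    "index": {"doc1": 1, "doc2": 2},
    "books": {"doc2": 1},
    "search": {"doc1": 1},
}

results = search(INDEX, "index books")
```

Here `doc2` is placed ahead of `doc1` because it matches both keywords and contains one of them twice.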
Current state-of-the-art information retrieval systems typically focus on ranking the results retrieved by the searching and ranking subsystem using a combination of ranking information stored in the searchable index and ranking algorithms. These algorithms typically use information such as the occurrence of search terms in document names and URLs, the location of search terms within documents, and the popularity of documents (e.g. Google PageRank) to determine which results are most appropriate.
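To make the combination of such signals concrete, the following sketch computes a weighted sum of three of them: whether the search term appears in the URL, how early it appears in the document, and a popularity value. The `Candidate` type, the weights and the example URLs are hypothetical, and the popularity value is assumed to have been computed elsewhere and normalized to the range 0 to 1; actual systems use considerably more elaborate formulas.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    url: str
    text: str
    popularity: float  # assumed precomputed and normalized to [0, 1]


def score(candidate, term, w_url=2.0, w_position=1.0, w_popularity=1.5):
    """Combine three ranking signals into a single weighted score."""
    term = term.lower()
    words = candidate.text.lower().split()
    # Signal 1: the search term occurs in the URL.
    in_url = 1.0 if term in candidate.url.lower() else 0.0
    # Signal 2: earlier occurrences in the document score closer to 1.
    position = 1.0 - words.index(term) / len(words) if term in words else 0.0
    # Signal 3: popularity of the document.
    return w_url * in_url + w_position * position + w_popularity * candidate.popularity


def rank(candidates, term):
    """Return the candidates ordered from highest to lowest score."""
    return sorted(candidates, key=lambda c: score(c, term), reverse=True)


candidates = [
    Candidate("example.org/recipes", "a page that mentions python only once", 0.9),
    Candidate("example.org/python", "python tutorial for beginners", 0.4),
]
ranked = rank(candidates, "python")
```

With these illustrative weights, the less popular page is ranked first because the term appears both in its URL and at the very start of its text.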