1. Field of the Invention
The present invention generally relates to search engines. In particular, the present invention relates to techniques by which search engines generate abstracts to represent the content of documents, such as Web pages, identified during a search.
2. Background
A search engine is an information retrieval system designed to help users find information stored on a computer system. Search engines reduce both the time required to find information and the amount of information that a user must review. The most publicly visible form of search engine is an Internet search engine, which searches for information on the World Wide Web.
A conventional Internet search engine is configured to receive a user query in the form of one or more search terms and to identify relevant Web pages based on the query. A list of the identified Web pages, typically ordered from most relevant to least relevant, is then presented to the user via the user's Web browser. By way of example, FIG. 1 depicts a user interface screen 100 of a conventional Web browser that displays a list of search results associated with the user query “digital camera.”
As shown in FIG. 1, information about each Web page identified during the search is presented to the user in a structured format. The structured format includes a title associated with the Web page, an abstract that summarizes the content of the Web page, and a Uniform Resource Locator (URL) associated with the Web page. For example, as shown in FIG. 1, a particular search result includes a title 102, an abstract 104, and a URL 106, each of which is associated with the same Web page.
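By way of illustration only, the structured format described above can be sketched as a simple record. The names SearchResult and render, and the sample values, are hypothetical and are not drawn from any particular search engine; the sketch merely assumes that each result carries the three fields shown in FIG. 1.

```python
from dataclasses import dataclass


@dataclass
class SearchResult:
    """One entry in a list of search results (hypothetical structure)."""
    title: str     # corresponds to title 102 in FIG. 1
    abstract: str  # corresponds to abstract 104 in FIG. 1
    url: str       # corresponds to URL 106 in FIG. 1


def render(result):
    """Return the three fields as plain text, one field per line."""
    return "\n".join([result.title, result.abstract, result.url])


if __name__ == "__main__":
    example = SearchResult(
        title="Digital Cameras",
        abstract="Compare models and prices.",
        url="http://example.com/cameras",
    )
    print(render(example))
```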
The abstract generated by the search engine is intended to provide a concise summary of the content of a Web page in a manner that focuses on information that is most relevant to the user query. By reading the abstract, the user should be able to determine whether the identified Web page actually includes content in which the user is interested. In contrast, Web page titles and URLs rarely include enough descriptive information to make this determination. Consequently, abstracts form a critical part of the search results, particularly when the user query is very general in nature. The failure to provide a clear and coherent abstract that accurately represents relevant Web page content can significantly impair the user experience associated with a particular search engine.
However, generating a high-quality abstract poses numerous challenges. For example, although abstracts consume a large amount of screen real estate relative to other portions of the search results, they must still be limited in size to ensure that a reasonable number of Web pages can be listed in the browser window. Some search engines, for instance, limit abstracts to approximately 150 characters. Abstract generation algorithms must therefore be programmed to use this limited space intelligently, such that only the information that best summarizes the Web page content and that is most relevant to the user query is presented. This in turn means that the abstract generation algorithm must be able to locate such information within the Web page.
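The task of locating query-relevant text under a size limit can be illustrated with a deliberately naive sketch. It assumes that relevance can be approximated by counting occurrences of the query terms in each sentence, which is a simplification chosen for illustration and not the method of any particular search engine. The function names, the sentence splitter, and the scoring rule are all hypothetical; only the figure of approximately 150 characters comes from the discussion above.

```python
import re

MAX_ABSTRACT_CHARS = 150  # approximate limit mentioned in the text


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def score(sentence, query_terms):
    """Count how many times the query terms occur in the sentence."""
    words = re.findall(r"\w+", sentence.lower())
    return sum(words.count(term) for term in query_terms)


def best_sentence(page_text, query, limit=MAX_ABSTRACT_CHARS):
    """Return the highest-scoring sentence, cut to at most `limit` characters."""
    terms = [t.lower() for t in query.split()]
    sentences = split_sentences(page_text)
    if not sentences:
        return ""
    # On ties, max() keeps the earliest sentence in the page.
    best = max(sentences, key=lambda s: score(s, terms))
    return best[:limit]


if __name__ == "__main__":
    page = ("Welcome to our store. "
            "We sell every digital camera brand and camera accessory. "
            "Contact us today.")
    print(best_sentence(page, "digital camera"))
```

A sketch this simple shows why the problem is hard: a plain term count ignores how well a sentence summarizes the page as a whole, and a hard cut at the character limit can end mid-word.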
Furthermore, once an abstract generation algorithm has located such information within the Web page, it must also assemble that information in a form that is easily understood by the user and that complies with the size constraints imposed by the user interface. This can be difficult, for example, if the content being used to build the abstract is too lengthy or if the content includes disconnected text fragments that are obtained from different portions of the Web page.
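The assembly problem described above can likewise be sketched under stated assumptions. The sketch assumes that the fragments have already been selected and ordered by importance, that disconnected fragments are joined with an ellipsis separator, and that a fragment that is too long is cut at a word boundary. The function names, the separator, and the default limit of 150 characters are illustrative choices, and the limit is assumed to be longer than the ellipsis marker.

```python
def truncate_at_word(text, limit, ellipsis="..."):
    """Shorten text to at most `limit` characters without splitting a word."""
    if len(text) <= limit:
        return text
    cut = text[: limit - len(ellipsis)]
    # Back up to the last space so that a word is not cut in half.
    if " " in cut:
        cut = cut[: cut.rfind(" ")]
    return cut.rstrip() + ellipsis


def assemble_abstract(fragments, limit=150, separator=" ... "):
    """Join fragments in order, stopping before the size limit is exceeded."""
    abstract = ""
    for fragment in fragments:
        candidate = fragment if not abstract else abstract + separator + fragment
        if len(candidate) > limit:
            break
        abstract = candidate
    # If even the first fragment is too long, fall back to truncating it.
    if not abstract and fragments:
        abstract = truncate_at_word(fragments[0], limit)
    return abstract


if __name__ == "__main__":
    parts = ["Compact digital camera", "12 megapixel sensor"]
    print(assemble_abstract(parts, limit=50))
    print(assemble_abstract(parts, limit=30))
```

Joining fragments this way respects the size constraint, but it does nothing to make disconnected fragments read coherently, which is the difficulty noted above.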
Additionally, search engines typically generate abstracts at run time so that the abstract generation algorithm can take into account the search terms included in the query. Since abstract generation occurs at run time, it must be performed in a fast and efficient manner. This imposes a significant constraint on the complexity of the abstract generation algorithm used by the search engine.
What is needed, then, is an abstract generation algorithm for a search engine that is capable of generating an abstract that accurately represents relevant Web page content. To this end, the desired abstract generation algorithm should be able to locate information within a Web page that best summarizes the Web page content and that is also most relevant to a user query. The desired abstract generation algorithm should also be able to assemble such information in a form that is easily understood by the user and that complies with size constraints imposed by a user interface. Finally, the desired abstract generation algorithm should operate in a fast and efficient manner that satisfies the run-time constraints associated with the search engine.