Search or information retrieval systems are common tools that enable users to find desired information relating to a topic. Search engines and other search systems are often employed so that users can submit user-crafted queries to find that information. Unfortunately, this often leads to frustration when many unrelated files are retrieved, either because users are unsure how to craft a particular query or because the query is ambiguous. Users are then often forced to modify their queries repeatedly in order to refine the retrieved search results to a reasonable number of files.
As an example of this dilemma, it is not uncommon to type a word or phrase into a search system's query input field and retrieve several thousand files (or, in the case of the Internet, millions of web sites) as potential candidates. In order to make sense of the large volume of retrieved candidates, the user will often experiment with other word combinations to further narrow the list, since many of the retrieved results may share common elements, terms or phrases yet have little or no contextual similarity in subject matter. This approach is inaccurate and time consuming for both the user and the system performing the search. The inaccuracy is evident in the retrieval of thousands, if not millions, of unrelated files or sites in which the user has no interest. Time and system processing resources are also wasted when massive databases are searched and return possible yet unrelated files.
It is important to ensure that the documents displayed to a user are ordered according to relevance, with the most relevant displayed first. In some applications involving search over large collections of documents, such as search within a company's corporate domain, human editors review the most common search terms and select documents that should be displayed in the future in response to those query terms (e.g., using keyphrases). For example, the human editors might select solutions to common problems experienced by users. As can be appreciated, manual processing over hundreds or thousands of terms can be time consuming and inefficient.
A user's understanding of a collection of documents can be greatly enhanced by a summary of the contents of subsets of the collection. The collection of documents can include, for example, word processing documents, emails and/or web pages. The summary can identify the contents of subsets of the collection with one or more keyphrases.
Conventional methods for generating lists of keywords from documents have operated on a single document at a time. Further, conventional methods have been trained on specific domains and hence do not translate well to different domains.
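By way of illustration only, the idea of summarizing a subset of a collection with keyphrases, as opposed to operating on a single document at a time, can be sketched as follows. This is a minimal sketch under stated assumptions and is not the method of any particular system: the function names `tokenize` and `subset_keyphrases`, the sample documents, and the scoring rule (term frequency pooled over the subset, weighted by the logarithm of inverse document frequency over the whole collection) are all hypothetical, and single words stand in for keyphrases to keep the example short.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase alphabetic tokens (an assumed, simplistic tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def subset_keyphrases(subset, collection, top_n=3):
    """Rank terms that are frequent in `subset` but rare across `collection`.

    Assumes `subset` is drawn from `collection`. The scoring rule is an
    illustrative choice, not a prescribed one.
    """
    # Count, for each term, how many documents of the whole collection contain it.
    doc_freq = Counter()
    for doc in collection:
        doc_freq.update(set(tokenize(doc)))

    # Count term occurrences pooled over every document in the subset.
    term_freq = Counter()
    for doc in subset:
        term_freq.update(tokenize(doc))

    n_docs = len(collection)
    scores = {
        term: tf * math.log(n_docs / doc_freq[term])
        for term, tf in term_freq.items()
        if doc_freq[term]  # skip terms never seen in the collection
    }
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_n]]


# Hypothetical sample data: two subsets of a five-document collection.
printer_docs = ["printer paper jam", "printer jam again"]
email_docs = ["email password reset", "email login password", "email server outage"]
collection = printer_docs + email_docs

print(subset_keyphrases(printer_docs, collection, top_n=2))
print(subset_keyphrases(email_docs, collection, top_n=1))
```

Because the scores depend on the whole collection rather than on one document in isolation, a term that appears in every document of the collection receives a weight of zero, however often it occurs within the subset.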