1. Field of the Invention
The present invention relates to indexing and retrieving electronic documents. More particularly, it relates to systems and methods for identifying the topics of a document using data extracted from the World Wide Web, a hypermedia encyclopedia, or any other hypermedia database.
2. Related Art
Most current information retrieval systems, including Web search engines, index and search documents based on keywords rather than concepts. The term “bears” is a simple example of this distinction. If, as in most Web search engines and other information retrieval systems, capitalization is ignored, “bears” could refer to a variety of concepts: the family of mammals Ursidae, the Chicago Bears professional American football team, or players in the financial markets with a negative outlook. Such groups of terms with the same spelling but different meanings are referred to as homographs.
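As an illustrative sketch only, and not a description of any particular search engine, the following Python fragment shows how a keyword index that ignores capitalization files all three senses of “bears” under a single entry. The function name build_keyword_index and the three sample documents are hypothetical and were chosen for this example.

```python
from collections import defaultdict

def build_keyword_index(docs):
    """Map each case-folded token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for raw in text.split():
            # Strip surrounding punctuation and ignore capitalization.
            token = raw.strip(".,;:!?\"'()").casefold()
            if token:
                index[token].add(doc_id)
    return index

# Hypothetical documents, one per sense of the homograph.
docs = {
    "zoology": "Brown bears belong to the family Ursidae.",
    "football": "The Chicago Bears play American football.",
    "finance": "Bears expect the market to fall.",
}

index = build_keyword_index(docs)
# A keyword query for "bears" cannot tell the three concepts apart.
matches = sorted(index["bears"])
```

Under these assumptions, a query for the keyword returns all three documents, although each concerns a different concept.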
A deeper problem is that the many concepts that occur in a document have varying degrees of importance. A document about the history of American space exploration, for example, might reference President John F. Kennedy's speech in 1962 committing the nation to the goal of a manned landing on the moon. However, the document is less about Rice University, where the speech was given, than it is about President Kennedy, and less about President Kennedy than it is about concepts like space exploration, NASA, specific programs like the Apollo program, and so forth. In other words, conventional solutions have failed to recognize and evaluate the relative importance of concepts appearing in a document.
Topic identification is the problem of addressing these and other issues in order to identify the concepts that are the topics of a document. It is an aspect of information retrieval, an interdisciplinary field of computer and information sciences relating to storing, locating, searching, and selecting relevant data on a given subject.
Topic identification has a number of valuable applications. Indexing and searching Web pages and other documents by topic, preferably in combination with conventional keyword-based indexing and searching, can improve the quality of search results. Topic identification directly supports automatically tagging documents in an information management system or similar application, and it supports more intelligent filtering of information streams such as email messages, RSS and other syndication channels, and social media feeds.
Prior work in the area of topic identification has a number of significant limitations. Some approaches rely on linguistic techniques, supported with little or no semantic knowledge of concepts. Many approaches are also domain-specific (for example, specialized for documents in a domain such as law, medicine, or finance) or language-dependent (for example, only capable of processing documents in the English language), and many require ongoing training and tuning.
Accordingly, an improved system and method for identifying the concepts that are the topics of a document is highly desirable.
What is needed is an improved method of identifying a topic of a document that overcomes shortcomings of conventional solutions.