Identifying the focus of a text document such as a Web page, a news article, or an email is beneficial in many situations. One such situation is in data mining systems, in which information is searched for across a large number of documents. A means of determining the focus of a document automatically, to enable a search by document topic for example, would be extremely useful.
Knowledge organized in hierarchical structures is useful for text analysis. In computer applications this knowledge is often modeled by graphs, that is, networks of interconnected concepts. For example, geographical locations lend themselves to being organized into a hierarchical tree (a “concept tree”) in which each city has a parent country, which in turn has a parent continent, and so on, all the way up to a common root element. Similarly, employees in an organization can be arranged into a hierarchical management structure, in which managers themselves have managers and a list of subordinates.
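As an illustration only, a concept tree of the kind described above can be sketched as a child-to-parent mapping. The place names, the `PARENT` table and the `path_to_root` helper are hypothetical and chosen for this example; they are not taken from any particular system.

```python
# Hypothetical concept tree: each concept maps to its parent; the root maps to None.
PARENT = {
    "World": None,
    "Europe": "World",
    "North America": "World",
    "Ireland": "Europe",
    "USA": "North America",
    "Dublin (Ireland)": "Ireland",
    "Galway (Ireland)": "Ireland",
    "Dublin (Ohio)": "USA",
}


def path_to_root(concept):
    """Return the concepts from `concept` up to the common root, inclusive."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


print(path_to_root("Dublin (Ireland)"))
# ['Dublin (Ireland)', 'Ireland', 'Europe', 'World']
```

Storing only the parent link keeps the structure compact; walking upward from any city reaches the country, the continent and finally the root.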
The example of geographic focus is used throughout this document to illustrate a clearly defined type of focus that can be expressed in hierarchical form. It is merely an example of a type of focus and should not be construed as limiting the scope of this disclosure. The types of focus are wide-ranging and include any topic that can be expressed in a hierarchy, for example an employee's reporting structure.
To determine the focus of a document, an understanding of the topics in the document is needed. This is usually inferred from an analysis of the words used in the document, performed by some form of Natural Language Processing. However, words are ambiguous, and the same word or term may refer to different concepts. In the case of geographic topics, confusion can arise if there exist several places in the world with the same name, or if a place name is also a common word or an individual's name. Consider, for example, finding the geographic focus of a document that contains the term ‘Dublin’. There are multiple locations in the world named ‘Dublin’, so the term is ambiguous and the ambiguity needs to be resolved; that is, the term needs to be disambiguated.
To do this, a data mining algorithm parses a document and maps each term in the document to a pre-existing concept tree in order to find the focus of the document. A graph clustering algorithm establishes the central concept of the document, for example that the central concept or focus of the document is of a geographical nature. Next, any ambiguous terms, such as occurrences of ‘Dublin’ or ‘Galway’, must be resolved: do the terms ‘Dublin’ and ‘Galway’ refer to the cities in Ireland or to those in the U.S.A.? The step of resolving ambiguous terms based on a metric of their theoretical similarity to the document's focus is called term disambiguation.
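The two steps just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of this disclosure or of the cited prior art: the tree, the `CANDIDATES` table and all function names are hypothetical, a simple majority vote over ancestors stands in for the graph clustering step, and plain tree distance stands in for the similarity metric.

```python
from collections import Counter

# Hypothetical concept tree (child -> parent) and term -> candidate concepts.
PARENT = {
    "World": None,
    "Europe": "World", "North America": "World",
    "Ireland": "Europe", "USA": "North America",
    "Dublin (Ireland)": "Ireland", "Galway (Ireland)": "Ireland",
    "Cork (Ireland)": "Ireland", "Dublin (Ohio)": "USA",
}
CANDIDATES = {
    "Dublin": ["Dublin (Ireland)", "Dublin (Ohio)"],
    "Galway": ["Galway (Ireland)"],
    "Cork": ["Cork (Ireland)"],
}


def ancestors(concept):
    """Yield the concept itself and then each ancestor up to the root."""
    while concept is not None:
        yield concept
        concept = PARENT[concept]


def depth(concept):
    """Number of edges between the concept and the root."""
    return sum(1 for _ in ancestors(concept)) - 1


def find_focus(terms):
    """Assumed stand-in for clustering: deepest node covering most known terms."""
    votes = Counter()
    known = [t for t in terms if t in CANDIDATES]
    for term in known:
        covered = set()
        for candidate in CANDIDATES[term]:
            covered.update(ancestors(candidate))
        votes.update(covered)  # each term votes at most once per node
    eligible = [c for c, v in votes.items() if 2 * v > len(known)]
    if not eligible:
        return None
    return max(eligible, key=lambda c: (depth(c), votes[c]))


def tree_distance(a, b):
    """Number of edges on the path between two concepts in the tree."""
    path_a, path_b = list(ancestors(a)), list(ancestors(b))
    common = set(path_a) & set(path_b)
    up_a = next(i for i, c in enumerate(path_a) if c in common)
    up_b = next(i for i, c in enumerate(path_b) if c in common)
    return up_a + up_b


def disambiguate(terms):
    """Resolve each known term to the candidate closest to the focus."""
    focus = find_focus(terms)
    return {
        t: min(CANDIDATES[t], key=lambda c: tree_distance(c, focus))
        for t in terms if t in CANDIDATES
    }


terms = ["Dublin", "Galway", "Cork"]
print(find_focus(terms))              # Ireland
print(disambiguate(terms)["Dublin"])  # Dublin (Ireland)
```

In this toy document the unambiguous terms ‘Galway’ and ‘Cork’ pull the focus to Ireland, which in turn resolves ‘Dublin’ to the Irish city rather than the one in Ohio.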
A number of prior art methods are known for finding the focus of a document and for providing term disambiguation. However, different methods are normally suggested for the two tasks; see Wu and Palmer, 1994 (“Verb semantics and lexical selection”, 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, N. Mex., 1994, pp. 305-332) and Leacock and Chodorow, 1998 (“Combining local context and WordNet similarity for word sense identification”, in C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, MIT Press, 1998, pp. 265-283).
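For orientation, the similarity measure associated with the Wu and Palmer reference is commonly stated as 2 × depth(LCS) / (depth(a) + depth(b)), where LCS is the deepest ancestor shared by concepts a and b. The sketch below applies that commonly stated form to a hypothetical tree; the tree, the function names and the convention that the root has depth 1 are assumptions made for this example.

```python
# Hypothetical concept tree: each concept maps to its parent; the root maps to None.
PARENT = {
    "World": None,
    "Europe": "World", "North America": "World",
    "Ireland": "Europe", "USA": "North America",
    "Dublin (Ireland)": "Ireland", "Galway (Ireland)": "Ireland",
    "Dublin (Ohio)": "USA",
}


def path_to_root(concept):
    """Return the concepts from `concept` up to the root, inclusive."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def wu_palmer(a, b):
    """Commonly stated Wu-Palmer similarity, with depth counted in nodes (root = 1)."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    on_path_b = set(path_b)
    # The first node on a's upward path that also lies on b's path is the
    # deepest shared ancestor (LCS).
    lcs = next(c for c in path_a if c in on_path_b)
    return 2 * len(path_to_root(lcs)) / (len(path_a) + len(path_b))


print(wu_palmer("Dublin (Ireland)", "Galway (Ireland)"))  # 0.75
print(wu_palmer("Dublin (Ohio)", "Galway (Ireland)"))     # 0.25
```

Concepts that share a deep ancestor (two Irish cities) score higher than concepts that meet only at the root.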
The time complexity of these suggested algorithms limits their utility because of the increased time it takes to find the focus of a document and then perform term disambiguation. This increased time complexity prohibits use of these algorithms in important industrial applications, where the number of nodes on the concept tree may be large and thus the calculation of the central concepts of even a single document may become infeasible within a reasonable time.
Thus there is a need in the art for an algorithm that can find the central concept of a document and perform term disambiguation in a time-efficient manner, without the complexities of the prior art.