Knowledge organized into structures is useful for text analysis. In computer applications this knowledge is often modeled by graphs: networks of nodes (also called vertices), which represent concepts, interconnected by links (sometimes called arcs or edges), which represent relationships between the concepts. Such concept graphs, sometimes called ontologies, already exist for knowledge in many domains, for example geography, medicine and social networks (e.g., people networks in corporations), as well as lexico-semantic resources.
Key terms in a document can usually be recognized by some form of Natural Language Processing. However, many words or expressions are ambiguous, with a single word or term having a number of different possible meanings. For example, the term ‘Dublin’ may refer to one of a number of possible locations throughout the world; or a mention of the word ‘Baker’ may refer to the word's common English meaning, but may also be used as a surname. Graph mining techniques are often used to try to disambiguate such terms.
Typically, such techniques map mentions of terms from a clause, sentence, paragraph or the whole text of a document to one or more concepts represented by a concept network. Then a graph clustering algorithm is used to establish the central concept, also known as the focus, of the text. The focus is then used to disambiguate any ambiguous term references in the text. An example of geographical disambiguation may help to explain. Suppose a text is found to mention “Dublin,” “London,” “Athens,” and “Greece”. It is unclear whether “Dublin” refers to Dublin/Ireland/Europe or to Dublin/CA/USA. Likewise, “London” and “Athens” are also ambiguous (London/OH/USA or London/UK/Europe and Athens/GA/USA or Athens/Greece/Europe). If an analysis of clusters in the graph based on the concepts mentioned in the text shows that the strongest cluster is centered around Europe, this is deemed to be the focus concept. If there is only a single geographical focus in the text (as in this example), all ambiguous geographical terms are disambiguated relative to this focus, as they are most likely related to it. In the case of this example, the result is that Dublin/Ireland/Europe, London/UK/Europe and Athens/Greece/Europe are chosen (disambiguated) as the concepts to which the geographically ambiguous terms “Dublin,” “London,” and “Athens” refer.
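The focus-based disambiguation in the example above can be illustrated with a minimal Python sketch. It is not the clustering algorithm of any cited technique: a simple voting heuristic stands in for graph clustering, and the candidate table `CANDIDATES` and the function names `find_focus` and `disambiguate` are hypothetical names introduced only for illustration.

```python
from collections import defaultdict

# Hypothetical candidate table: each term maps to its possible concept paths,
# written from the most specific concept up to the top-level region.
CANDIDATES = {
    "Dublin": [("Dublin", "Ireland", "Europe"), ("Dublin", "CA", "USA")],
    "London": [("London", "UK", "Europe"), ("London", "OH", "USA")],
    "Athens": [("Athens", "Greece", "Europe"), ("Athens", "GA", "USA")],
    "Greece": [("Greece", "Europe")],
}

def find_focus(mentions):
    """Return the top-level region with the strongest support.

    Assumption: each mention splits one vote evenly across the distinct
    regions of its candidates, so unambiguous mentions count the most.
    """
    votes = defaultdict(float)
    for term in mentions:
        regions = {path[-1] for path in CANDIDATES[term]}
        for region in regions:
            votes[region] += 1.0 / len(regions)
    # Sorting first makes the choice deterministic if two regions tie.
    return max(sorted(votes), key=votes.get)

def disambiguate(mentions):
    """Resolve every mention to the candidate that lies under the focus."""
    focus = find_focus(mentions)
    resolved = {}
    for term in mentions:
        under_focus = [p for p in CANDIDATES[term] if p[-1] == focus]
        # Fall back to the first candidate when none lies under the focus.
        chosen = under_focus[0] if under_focus else CANDIDATES[term][0]
        resolved[term] = "/".join(chosen)
    return focus, resolved

focus, resolved = disambiguate(["Dublin", "London", "Athens", "Greece"])
```

With the four mentions from the example, the unambiguous mention of “Greece” tips the vote towards Europe, so the three ambiguous terms resolve to their European candidates.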
Whilst the example of geographical disambiguation is used throughout this document to illustrate an application of graph mining techniques, this should not be construed as limiting the scope of this disclosure. Such techniques may also be used for other types of term disambiguation, including word sense disambiguation, as well as for other applications, such as semantic tagging, and social network analysis.
Many different methods have been suggested for finding document keywords and providing term disambiguation. Many use graph clustering algorithms in which nodes are allocated to clusters according to some similarity metric. These usually require a pair-wise comparison of nodes in order to determine a conceptual similarity between nodes. The time and processing complexity of such algorithms can be extremely high, sometimes being proportional to the square of the number of nodes in the concept network. This presents a scalability problem which prohibits use of these algorithms in many important industrial applications, where the number of nodes in the concept network may be large and the calculation of the central concepts of even a single document may therefore become infeasible within a reasonable time.
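The quadratic growth mentioned above follows from the number of node pairs that must be compared. The short Python sketch below makes this concrete; the function names and the `similarity` callback are hypothetical and stand in for whatever similarity metric a particular clustering algorithm uses.

```python
from itertools import combinations

def pairwise_similarities(nodes, similarity):
    """Compare every unordered pair of nodes once (quadratic in len(nodes))."""
    return {(a, b): similarity(a, b) for a, b in combinations(nodes, 2)}

def comparison_count(n):
    """Number of unordered pairs among n nodes: n * (n - 1) / 2."""
    return n * (n - 1) // 2

# Ten nodes already need 45 comparisons; the count grows with the square of n.
scores = pairwise_similarities(range(10), lambda a, b: abs(a - b))
```

Doubling the number of nodes roughly quadruples the number of comparisons, which is why such methods become impractical for large concept networks.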
In some concept networks, information about nodes and/or about the relationship between two concepts may be included by way of attributes of the node and/or the link between the two nodes representing those concepts. Graph mining techniques have been developed which use such additional information to favor certain properties. For instance, in a geographical disambiguation, geographical locations with larger populations could be given a greater preference. This would favor London/UK/Europe (population 7.5 million) over London/OH/USA (population 9,000). Similarly, different types of links between nodes can be favored or disregarded entirely, depending on the application.
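As a rough illustration of how node and link attributes might steer the choice, the Python sketch below prefers the candidate with the larger population and scores a path by per-type link weights. The population figures come from the example above; the link type names, the weights, and the function names are assumptions made only for this sketch.

```python
# Node attribute from the example above: population per candidate concept.
POPULATION = {
    "London/UK/Europe": 7_500_000,
    "London/OH/USA": 9_000,
}

# Hypothetical link types and weights; a weight of 0.0 disregards a link type.
LINK_WEIGHTS = {"located_in": 1.0, "twinned_with": 0.0}

def prefer_by_population(candidates, population):
    """Pick the candidate with the largest population (unknown counts as 0)."""
    return max(candidates, key=lambda concept: population.get(concept, 0))

def link_score(link_types, weights):
    """Sum the weights of the links on a path; unknown types contribute 0."""
    return sum(weights.get(kind, 0.0) for kind in link_types)

best = prefer_by_population(["London/OH/USA", "London/UK/Europe"], POPULATION)
score = link_score(["located_in", "twinned_with", "located_in"], LINK_WEIGHTS)
```

Here `best` is the London/UK/Europe candidate, and the `twinned_with` link contributes nothing to `score` because its weight is zero.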
Such techniques are described in “Web-a-Where: Geotagging Web Content” Proceedings SIGIR'04, Sheffield, UK, and the applicant's co-pending U.S. patent application Ser. No. 11/611,871. These techniques overcome some of the time complexity problems referred to earlier and allow the use of some additional information attached to nodes. However, they use hierarchical clustering algorithms which rely on a tree-structured topology in order to simplify the required processing.
Whilst some types of knowledge, such as geographical locations or organizational structure, lend themselves to being organized into a hierarchical tree of parent and child nodes, where each city/employee has a parent country/manager, which has a parent continent/manager, and so on, all the way up to a common root element, many ontologies exist which cannot be represented by such a hierarchy. One example is an organizational structure which, in addition to identifying the management structure, also links employees according to other relationships such as membership of certain departments, locations etc. Another example is a semantic network which represents the semantic relations between a huge number of concepts or entities. In addition to the subset/superset relation between entities, which might be expressed as a parent/child relation in a tree-like structure, there are many other semantic relations, for example the ‘part of’ relation, which indicates that entity A is a part of entity B. This means that entities may have more than one ‘parent’. As well as these relations, ontologies often include additional types of relation that further refine the semantics they model, and which may be domain-specific.
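The point that an entity may have more than one ‘parent’ can be shown with a small Python sketch. The entities, the relation names `is_a` and `part_of`, and the function name are hypothetical; the sketch simply lists the nodes that prevent a network from being a single-parent hierarchy.

```python
# Hypothetical semantic network: each entity lists (relation, parent) pairs.
PARENTS = {
    "wheel": [("is_a", "round_object"), ("part_of", "car")],
    "car": [("is_a", "vehicle")],
    "vehicle": [],
}

def nodes_with_multiple_parents(parents):
    """Return, in sorted order, the entities linked to more than one parent."""
    return sorted(
        node
        for node, links in parents.items()
        if len({parent for _relation, parent in links}) > 1
    )

multi = nodes_with_multiple_parents(PARENTS)
```

In this sketch `wheel` has two distinct parents, one through each relation type, so the network cannot be laid out as a tree in which every node has exactly one parent.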
When moving from structured hierarchies to generic graphs, complications such as loops (where an outgoing link refers to a node which refers back to the original node) are introduced. Such looping behavior is common in semantic ontologies and social networks, and cannot be handled by the prior art.
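To make the notion of a loop concrete, the following Python sketch checks whether a directed graph contains one, using a standard iterative depth-first search. The graph representation (a dictionary mapping each node to its successors) and the example data are assumptions made for illustration only; the sketch shows how a loop can be detected, not how any particular technique handles it.

```python
def has_loop(graph):
    """Return True if the directed graph (node -> successors) has a cycle."""
    UNSEEN, ACTIVE, DONE = 0, 1, 2
    state = {}
    for start in graph:
        if state.get(start, UNSEEN) != UNSEEN:
            continue
        state[start] = ACTIVE
        stack = [(start, iter(graph.get(start, ())))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                seen = state.get(nxt, UNSEEN)
                if seen == ACTIVE:
                    # A link back to a node still on the current path: a loop.
                    return True
                if seen == UNSEEN:
                    state[nxt] = ACTIVE
                    stack.append((nxt, iter(graph.get(nxt, ()))))
                    break
            else:
                # All successors explored; the node is finished.
                state[node] = DONE
                stack.pop()
    return False

# A strict hierarchy has no loops; a mutual social link does.
hierarchy = {"Europe": ["Ireland", "UK"], "Ireland": ["Dublin"], "UK": ["London"]}
social = {"alice": ["bob"], "bob": ["alice"]}
```

For the two example graphs, `has_loop(hierarchy)` is false and `has_loop(social)` is true, since the two people link to each other.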
Thus, there is a need in the art to provide a method and system which addresses these and other problems.