The present embodiments relate to information relation generation. In particular, a word space of related words and/or relationships among named entities are generated.
Information retrieval (IR) may be used for monitoring vast amounts of information. For example, the entire web is monitored for information related to a subject, such as terror threat prediction or money-laundering detection. To find all relevant documents and avoid retrieving irrelevant documents, it is important to have a list of search terms that are both precise and comprehensive.
Generation of search terms may be challenging. For example, money laundering is a complex concept that involves many different and seemingly independent processes, such as a crime of some sort and a monetary investment involving the same set of persons or organizations. Creating relevant search terms is even more challenging for applications that forecast rare events (e.g., plant failures or terror threats), since it is futile to search for the event itself.
Manual creation of a list of search terms by a human domain expert may be tedious, time-consuming, error-prone, and expensive. To automate the process, word relatedness or similarity may be derived from lexicons and word ontologies. These relatedness measures are based on distances (edges or relations) between words in human-generated word ontologies, such as WordNet or MeSH. Corpus-based methods have been used for finding similarity between words based on collocation (a word's usage in a given dataset). Combinations of lexicon-based similarity and corpus-based similarity have also been proposed. Corpora and thesauri with precomputed similarity of all word pairs in the corpus enable users to query the corpus with a single word and get all words that are similar (or related), along with the similarity scores.
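As an illustration of the corpus-based approach described above, the following minimal sketch scores word similarity from collocation. It assumes whitespace tokenization, a fixed context window, and cosine similarity over co-occurrence counts; these choices and the function names (`cooccurrence_vectors`, `cosine`, `most_similar`) are assumptions made for the example and are not drawn from any particular prior method.

```python
from collections import Counter, defaultdict
from math import sqrt


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()  # assumed tokenizer: whitespace split
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return dict(vectors)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def most_similar(word, vectors, top_n=3):
    """Return up to `top_n` (other_word, score) pairs, highest score first."""
    target = vectors.get(word, Counter())
    scores = [(other, cosine(target, vec)) for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]
```

Querying `most_similar` with a single word returns related words together with their similarity scores, which mirrors the single-word query over precomputed similarities mentioned above. A practical system would typically precompute and store the pairwise scores rather than recompute them per query.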
Building word spaces for real-world applications still faces a number of challenges. Automated methods may result in very large lists that are noisy (e.g., words less related to a concept are included). Inspection of the entire list to remove the noise may require O(n) time, where n is the number of terms in the expanded word space. Setting a single score threshold across all seed term expansions may not work, as not all seed words are equally related to a concept. For example, for the concept of money laundering, “crime” and “investment” are seeds. “Crime” and the terms expanded from it are closer to money laundering than “investment” and its expansions. A single threshold may therefore remove words that are more relevant to the concept than some of the words it retains.
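The threshold problem can be made concrete with a small sketch. All numbers below are invented for illustration, and the product of seed relevance and term similarity is only an assumed proxy for how relevant an expanded term is to the concept; it is not a measure taken from the embodiments.

```python
# Hypothetical values, invented purely for illustration.
SEED_RELEVANCE = {"crime": 0.9, "investment": 0.5}  # assumed closeness of each seed to the concept
EXPANSIONS = {
    "crime": [("fraud", 0.80), ("bribery", 0.55)],       # (term, similarity to its seed)
    "investment": [("fund", 0.75), ("dividend", 0.60)],
}


def filter_by_global_threshold(expansions, threshold):
    """Split expanded terms into kept and dropped lists using one shared cutoff."""
    kept, dropped = [], []
    for seed, terms in expansions.items():
        for term, sim in terms:
            (kept if sim >= threshold else dropped).append((term, seed, sim))
    return kept, dropped


def concept_relevance(seed, sim, seed_relevance):
    """Assumed proxy: the seed's relevance to the concept times the term's similarity to the seed."""
    return seed_relevance[seed] * sim


def misordered_pairs(kept, dropped, seed_relevance):
    """List (dropped_term, kept_term) pairs where the dropped term is the more relevant one."""
    pairs = []
    for d_term, d_seed, d_sim in dropped:
        for k_term, k_seed, k_sim in kept:
            if concept_relevance(d_seed, d_sim, seed_relevance) > concept_relevance(k_seed, k_sim, seed_relevance):
                pairs.append((d_term, k_term))
    return pairs


kept, dropped = filter_by_global_threshold(EXPANSIONS, threshold=0.6)
pairs = misordered_pairs(kept, dropped, SEED_RELEVANCE)
```

With these made-up scores, the cutoff of 0.6 drops "bribery" (expanded from "crime") while keeping "dividend" (expanded from "investment"), even though "bribery" has the higher assumed relevance to money laundering.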
Relationships also exist between entities in a document. A named entity (NE) is an object with a name. For example, persons, organizations, and locations are entities with specific names. Mining relations between named entities may be useful for constructing knowledge bases with information about the named entities. Relation mining can enable a Question-Answering (QA) system to answer questions such as “who is X married to?” by looking for a spouse relation between X and other named entities of type “person.” Relation mining between NEs may also provide social graph structures. Risk associated with a person can be calculated using the person's primary or higher-order associations with risky persons or organizations. However, effectively capturing relationships between named entities may be difficult.
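A minimal sketch of how mined relations might be stored and queried follows. It assumes relations are kept as (subject, relation, object) triples and that the order of an association is the fewest hops between two entities in the resulting graph. The entity names, relation labels, and function names are placeholders invented for the example.

```python
from collections import defaultdict, deque

# Hypothetical relation triples (subject, relation, object); all names are placeholders.
TRIPLES = [
    ("Alice", "spouse", "Bob"),
    ("Bob", "employee_of", "Acme Corp"),
    ("Acme Corp", "partner_of", "Shell Co"),
]
ENTITY_TYPES = {
    "Alice": "person",
    "Bob": "person",
    "Acme Corp": "organization",
    "Shell Co": "organization",
}


def related(entity, relation, entity_type, triples=TRIPLES, types=ENTITY_TYPES):
    """Entities of `entity_type` linked to `entity` by `relation`, in either direction."""
    found = []
    for subj, rel, obj in triples:
        if rel != relation:
            continue
        if subj == entity and types.get(obj) == entity_type:
            found.append(obj)
        elif obj == entity and types.get(subj) == entity_type:
            found.append(subj)
    return found


def association_order(entity, risky, triples=TRIPLES):
    """Fewest hops from `entity` to any other entity in `risky` (1 = primary); None if unreachable."""
    graph = defaultdict(set)
    for subj, _, obj in triples:  # treat every relation as an undirected edge
        graph[subj].add(obj)
        graph[obj].add(subj)
    queue, seen = deque([(entity, 0)]), {entity}
    while queue:  # breadth-first search visits nodes in order of hop count
        node, hops = queue.popleft()
        if node in risky and node != entity:
            return hops
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return None
```

Here `related("Alice", "spouse", "person")` answers the "who is X married to?" style of question, and `association_order` distinguishes a primary association (one hop) from higher-order ones. How hop counts would translate into a risk value is left open, since the text above does not specify it.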