The detection and identification of entity aliases are often needed in the fields of security, business analysis and scientific research. The term “entity”, as used herein, refers to specific entities, objects or events for which information is stored, for example, a person's name, a place's name, an organization's name, a product's name, etc. An entity often has several “aliases,” which refer to additional assumed names of the entity, for example, the legacy names, the abbreviations, or the commonly misused names of the entity. For example, the organizational entity “Beijing Scientific and Technology University” has aliases including “Beijing College of Iron and Steel Technology” (legacy name), “Bei Ke Da” (abbreviation), “Steel College” (abbreviation of legacy name), “Capital Scientific and Technology University” (misused name), etc. In an ideal entity dataset, it is desirable that all of these aliases are identified and merged into one group, so that such an entity dataset can better serve various applications, such as building a data warehouse, performing client relationship management (CRM) and fraud detection. The detection and identification of entity aliases are becoming increasingly important.
The existing solutions to the entity alias problem focus on discriminating the identity of entities, that is, on determining from the available clues whether two or more entities are identical. These solutions can be divided into two categories based upon whether they use a reference dataset. In the first category, all the input entities are matched to existing reference entities, and those entities matching the same reference entity are regarded as aliases. In the second category, the input entities are directly matched to each other. In either category, the core matching method relies on computing the morphologic, orthographic, phonetic, or semantic similarities of the tokens associated with the entities. For example, the phonetic-based Soundex algorithm encodes an English word by retaining its first letter, removing the vowels, and representing the remaining consonants with six phonetic classifications of human speech sounds (bilabial, labiodental, dental, alveolar, velar, and glottal). The “edit distance” algorithm assumes that the difference between two strings can be measured by the number of editing operations of three kinds (insertion, deletion, and substitution) needed to transform one string into the other. The “behavior-based” algorithm asserts that two entities are connected if they share similar semantic links in a dataset (e.g., if two email IDs have the same patterns of inbound and outbound emails, then they are most probably owned by the same entity).
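By way of illustration only, the two string-matching techniques named above may be sketched as follows. The sketch assumes the common American variant of Soundex (first letter followed by three digits) and the standard dynamic-programming form of edit distance; the function names `soundex` and `edit_distance` are hypothetical and are not taken from any of the solutions discussed herein.

```python
def soundex(word: str) -> str:
    """Encode a word as its first letter plus three digits (American Soundex).

    Assumption: this is the common four-character variant, used here only
    to illustrate phonetic matching.
    """
    # Six consonant classes; vowels and h, w, y receive no code.
    codes = {}
    for group, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in group:
            codes[ch] = digit

    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""

    result = letters[0].upper()
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        if ch not in "hw":  # h and w do not separate letters with equal codes
            prev = digit
    return (result + "000")[:4]  # pad or truncate to four characters


def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(soundex("Robert"), soundex("Rupert"))
    print(edit_distance("kitten", "sitting"))
```

In such a sketch, two names would be treated as candidate aliases when their Soundex codes coincide or when their edit distance falls below a chosen threshold; the threshold itself is an application-specific assumption.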
Both categories of existing solutions discriminate among a number of entities (the “input entities”) that have already been obtained by some means, but neither addresses how these input entities are to be obtained in the first place. Therefore, these solutions cannot solve the problem of obtaining a collection of all the possible aliases for a specific entity.
On the other hand, with the flourishing of Web 2.0 technology, social tags for Web objects are easily available from social tag websites. In a network-socialized environment, authors and readers are allowed to select their preferred “tags” for a Web object (e.g., a Web page, an image, or a video segment), i.e., keywords or terms associated with that Web object, and to share them with others. The social tags, as metadata conferred by the public on Web objects, make possible the social sharing of network information. For example, when a reader is interested in a Web object, the reader can get a list of tags added by others to the Web object from the social tag websites, so that the reader may quickly determine the property and usage of the object.
However, social tags, as a tool for enabling the social sharing of network information, have not yet been utilized in collecting entity aliases.
Accordingly, what is needed is a method and system that overcome the above-described limitations of the prior art to provide for improved collection of entity aliases.