1. Field of the Invention
The present invention relates to computers and computer networks. More particularly, the invention relates to comparing semi-structured data records.
2. Background of the Related Art
Structured data items, such as data records in a database, generally have schematic information (e.g., a database schema) describing the structure of the underlying data fields and a detailed definition of each item in the data fields, which allows computerized comparisons. In contrast, semi-structured data may have some structure (e.g., tags or key-value markers) that allows semantic separation of its alphanumeric components, but each component does not have a well-defined format and is therefore difficult to compare automatically. Examples of semi-structured data include network management messages, online social network (OSN) data records, emails, and voice-over-IP (VoIP) headers and transcripts.
In network management, for instance, vast amounts of network data, including device logs, traps, and alarms, are produced by different devices from different vendors, each in its own format. While there are rough standards for data formulation (e.g., IETF standards), they contribute only to separating fields inside device messages (e.g., XML (eXtensible Markup Language) of SNMP (Simple Network Management Protocol)), leaving the real job of comparing and analyzing the messages to be done manually by network operators.
A social network is a social structure (e.g., a community) made of members (e.g., persons) connected by social relationships such as friendship, kinship, or relationships of beliefs, knowledge, prestige, culture, etc. Members of a social network often share interests and activities relating to such social relationships. For example, individual computers linked electronically could form the basis of computer-mediated social interaction and networking within a social network community, referred to as an online social network (OSN). A social network service focuses on building online communities of people who share interests and/or activities, or who are interested in exploring the interests and activities of others. Most social network services are web-based and provide a variety of ways (e.g., e-mail, instant messaging service, etc.) for users (or members) to interact socially.
Matching profiles of users across OSNs is a problem of great interest. Generally, only partial user profile information is available in a single OSN. By exploiting the overlap in profile information between different OSNs, profiles belonging to the same user can be concatenated to present a more complete profile, which can benefit personalized marketing, analysis of users' online behavior, etc. A number of previous works assess the feasibility of matching profiles across OSNs. These methods typically require a large number of man-hours or machine-hours to be practical, or are restrictive in looking for matches. As a result, the growing size of today's information networks poses a scalability challenge to the schemes analyzing them. While general similarity and distance measures such as edit distance and n-gram provide simple and clear ways to parse out the textual information for a small number of data records, the growing number of string comparisons on networks with millions of profiles becomes a limiting factor for these methods. Further, even if the comparisons can be carried out somehow, the non-contextual, blind comparison leads to poor profile matching accuracy. For example, the user names “Mary” and “Mark” are considered very similar under the edit distance measure, while “Bill” and “William” are not.
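The user name example above can be illustrated with a minimal sketch of the edit distance measure. The sketch assumes the common Levenshtein formulation (unit-cost insertion, deletion, and substitution, case-sensitive comparison); the function name `edit_distance` is a hypothetical helper chosen for illustration and is not part of any method described herein.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn `a` into `b`.
    (Assumed formulation; other edit distance variants exist.)"""
    # prev[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        prev = curr
    return prev[-1]


# Different names that look alike score as close ...
print(edit_distance("Mary", "Mark"))     # 1
# ... while a name and its common short form score as far apart.
print(edit_distance("Bill", "William"))  # 4
```

Each comparison costs time proportional to the product of the two string lengths, and a naive all-pairs match over N profiles needs on the order of N squared such comparisons, which suggests why this approach scales poorly to networks with millions of profiles.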
Extracting and matching personal profiles of email senders and VoIP callers poses challenges similar to those of OSN data records.