The present disclosure relates generally to the field of segmenting social media users (such as users of a social media network) by means of life event detection (such as based upon social media messages and/or postings) and entity matching. In various embodiments, systems, methods and computer program products are provided.
Social Media Networks (“SMN”), such as TWITTER and FACEBOOK, engage thousands of people that post, on a daily basis, a huge amount of content represented by texts, images, videos, etc. (see Ehrlich, K., and Shami, N. S. Microblogging inside and outside the workplace, in ICWSM (2010); and Kwak, H., Lee, C., Park, H., and Moon, S. What is twitter, a social network or a news media? in Proceedings of the 19th international conference on World wide web (New York, N.Y., USA, 2010), WWW '10, ACM, pp. 591-600). Often the content can be intimately related to the person that publishes it, in such a way that the content can expose behavioral traits and/or events that are happening in the individual's life. As a consequence, the proper exploration of this type of content not only can be a way to better understand the users on SMNs, but also can leverage many applications that require adequate user profiling (for instance, credit risk analysis, marketing campaigns, and personalized product and/or service offers).
One way to find potential customers for services and/or products is by detecting life events from public user activities on SMNs (in particular, on microblogs). Generally, a life event can be defined as something important that happened, is happening, or will happen in a particular individual's life, such as getting married, getting divorced, graduating from school, having a baby, someone dying, buying a house, traveling, or a birthday (or any other person-specific and/or seasonal event or moment). That is, if a life event is properly detected, a product and/or service can be offered to someone even before he or she looks for it (anticipating his or her needs). For instance, if a person posts on the SMN that her marriage will be happening in a few days (or weeks or months), a loan or an insurance policy (for the honeymoon trip, for example) can be offered to her in advance. Furthermore, as stated in Eugenio, B. D., Green, N., and Subba, R. Detecting life events in feeds from twitter. 2012 IEEE Sixth International Conference on Semantic Computing 0 (2013), 274-277, marketers know that people mostly shop based on habits, but that among the most likely times to break those habits is when a major life event happens.
For this reason, embodiments described herein focus on mechanisms that can detect life events from textual posts on SMNs, and that can match the corresponding users with an existing database (e.g., entity matching with current clients), using basic information such as, for example, the name and the location available on the SMN. Entity matching is important to understand whether a given user of a SMN is already a customer or not, and adapt the way the person can be approached.
Both life event detection and entity matching are complex tasks that are the subject of ongoing research in fields such as artificial intelligence, machine learning (see Eugenio, B. D., Green, N., and Subba, R. Detecting life events in feeds from twitter, 2012 IEEE Sixth International Conference on Semantic Computing 0 (2013), 274-277), natural language processing and large scale analysis of unstructured data, popularly known as Big Data (Lin, J., and Dyer, C. Data-Intensive Text Processing with MapReduce; Claypool Publishers, 2010). Performing natural language processing on microblog posts presents several challenges, such as dealing with the short and asynchronous nature of the messages (making it difficult to extract contextual information), and dealing with a highly unnormalized vocabulary (due to the frequent use of slang, acronyms, abbreviations, and informal language, often with misspellings) (see Atefeh, F., and Khreich, W. A survey of techniques for event detection in twitter, Computational Intelligence (2013), n/a-n/a; Felt, A. P., and Wagner, D. Phishing on mobile devices, in W2SP (2011); and Liu, F., Weng, F., and Jiang, X. A broad-coverage normalization system for social media language, in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1 (Stroudsburg, Pa., USA, 2012), ACL '12, Association for Computational Linguistics, pp. 1035-1044). Nonetheless, one study that supports the possibility of detecting life events from textual posts has been presented in De Choudhury, M., Counts, S., and Horvitz, E. Major life changes and behavioral markers in social media: Case of childbirth, in Proceedings of the 2013 Conference on Computer Supported Cooperative Work (New York, N.Y., USA, 2013), CSCW '13, ACM, pp. 1431-1442.
In that work, the authors conducted a study on the behavior of mothers during pregnancy, and they observed that these mothers can be distinguished by linguistic changes captured by shifts in a relatively small number of words in their social media posts.
In light of this, described and evaluated herein are various solutions to tackle the life event detection problem (along with subsequent entity matching). For the first task, described is a hybrid system combining rules and machine learning (“ML”). In contrast to the system specifically focused on life event detection presented in Eugenio, B. D., Green, N., and Subba, R. Detecting life events in feeds from twitter. 2012 IEEE Sixth International Conference on Semantic Computing 0 (2013), 274-277, which uses only ML, various embodiments disclosed herein allow for dealing with the life event classes independently.
In one example, the rule-based phase acts as a mechanism to filter most posts that do not contain life events (since all those posts not matching the desirable rules are eliminated). Then, binary classifiers (e.g., one for each type of life event) are applied to validate the possible life events. For entity matching, a combination of string distance functions is used in this example to compare the names and locations of the users.
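The two-phase flow of this example can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the rule patterns, the class names, and the stand-in classifier below are all hypothetical placeholders chosen for illustration, and a trained binary classifier per life event class would take the place of the stand-in.

```python
import re
from typing import Callable, Dict, List

# Hypothetical rules: one regular expression per life event class.
# A real rule set would be far richer than these two patterns.
RULES: Dict[str, "re.Pattern[str]"] = {
    "wedding": re.compile(r"\b(wedding|married|marriage)\b", re.IGNORECASE),
    "travel": re.compile(r"\b(trip|travel|vacation)\b", re.IGNORECASE),
}


def rule_filter(post: str) -> List[str]:
    """Phase 1: return the life event classes whose rule matches the post.

    Posts matching no rule yield an empty list and are thereby eliminated.
    """
    return [event for event, pattern in RULES.items() if pattern.search(post)]


def first_person_stand_in(post: str) -> bool:
    """Stand-in for a trained binary classifier (an assumption of this sketch).

    It accepts a post only if it contains a first-person word.
    """
    return re.search(r"\b(i|my|we|our)\b", post, re.IGNORECASE) is not None


# One binary classifier per life event class (all stand-ins here).
STAND_IN_CLASSIFIERS: Dict[str, Callable[[str], bool]] = {
    "wedding": first_person_stand_in,
    "travel": first_person_stand_in,
}


def detect_life_events(
    post: str, classifiers: Dict[str, Callable[[str], bool]]
) -> List[str]:
    """Phase 2: validate each rule-matched candidate with its own classifier."""
    candidates = rule_filter(post)
    return [
        event
        for event in candidates
        if event in classifiers and classifiers[event](post)
    ]
```

Because each class has its own rule and its own classifier, a class can be added, retrained, or removed without touching the others, which is one way to read "dealing with the life event classes independently."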
Since various embodiments described herein comprise a hybrid solution including an ML-based classifier that is integrated with an entity matching solution, additional discussion of background and related work is presented separately for each as follows.
More particularly, with respect first to life event detection (as already mentioned) a life event can be defined as something important regarding the user's life in one or more SMNs. In this regard, it is important to differentiate such a life event from some related work which uses the expression “event detection” to refer to the problem of detecting an unexpected event exposed by several users in one or more SMNs (like a rumor, a trend, or emergent topic). In contrast, in the case of various embodiments of the present disclosure, detection means are provided to classify a short post (like TWITTER'S or FACEBOOK'S status messages) into one of the life event categories (which could be considered, for instance, topics). Therefore, as related work, any approach of topic classification of short messages could be considered (for example, Eugenio, B. D., Green, N., and Subba, R. Detecting life events in feeds from twitter. 2012 IEEE Sixth International Conference on Semantic Computing 0 (2013), 274-277). Regarding ML-based solutions, other supervised or unsupervised methods for topic classification are also related, although they are typically applied to long documents rather than short messages. And regarding semantic-rule-based solutions, Annotated Query Language (AQL) rules combined with dictionaries are known approaches for topic classification with the usage of templates. Ontologies have also been applied for long documents.
With respect now to entity matching, in SMNs there are two problems that entity matching solutions address. In the first, given one set containing user features on SMNs (like user information and activities) and another set containing information about real people, the goal is to match the users across both sets. In the second, given two sets containing user features on two different SMNs, the goal is to find corresponding users, i.e., the largest possible number of social profiles that refer to the same person across both social networks. The latter can also be called the entity resolution (ER) problem, and in the past few years some work has been proposed to solve this problem. For instance, Peled, O., Fire, M., Rokach, L., and Elovici, Y., Entity matching in online social networks, in Social Computing (SocialCom), 2013 International Conference on (September 2013), pp. 339-344 proposed supervised learning techniques and extracted features to build different classifiers, which were then trained and used to rank the probability that two user profiles from two different online social networks (OSNs) belong to the same individual.
The former problem can be considered a subset of the latter if the fact that the second set contains real people information rather than SMN profiles is ignored. And generally, as summarized by Raad, E., Chbeir, R., and Dipanda, A., User profile matching in social networks, in Network-Based Information Systems (NBiS), 2010 13th International Conference on (September 2010), pp. 297-304, there are two approaches for handling this: (i) syntactic-based similarity approaches (providing exact or approximate lexicographical matching of two values); and (ii) semantic-based similarity approaches (used to measure how two values, lexicographically different, are semantically similar). For instance, the Foaf-o-matic (http://www.foaf-o-matic.org/) and OKKAM (http://www.okkam.org/) projects aim at social profile integration by means of formal FOAF (Friend-of-a-friend) semantics.
Regarding syntactic-based similarity approaches, summarized here are certain ones typically used for Uniform Resource Identifiers (URIs), numeric-based attributes and, in the context of SMNs, two users' full names. Levenshtein or Edit Distance (see Levenshtein, V. Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physics Doklady 10 (1966), 707) is defined to be the smallest number of edit operations (inserts, deletes, and substitutions) required to change one string into another. In addition, Jaro is an algorithm commonly used for name matching in data linkage systems. A similarity measure is calculated using the number of common characters (i.e., same characters that are within half the length of the longer string) and the number of transpositions. Winkler (or Jaro-Winkler) improves upon Jaro's algorithm by applying ideas based on empirical studies which found that fewer errors typically occur at the beginning of names (see Cohen, W. W., Ravikumar, P., and Fienberg, S. E. A comparison of string distance metrics for name-matching tasks, pp. 73-78; and Bilenko, M., Mooney, R., Cohen, W., Ravikumar, P., and Fienberg, S., Adaptive name matching in information integration, IEEE Intelligent Systems 18, 5 (September 2003), 16-23).
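For illustration only, the three measures just described can be sketched in Python as follows. This is a textbook-style sketch rather than the implementation of any embodiment. It assumes the commonly used Jaro match window of half the longer string's length minus one, and the commonly used Winkler parameters (a prefix scale of 0.1 and a prefix of at most four characters); the function and parameter names are likewise assumptions of this sketch.

```python
def levenshtein(a: str, b: str) -> int:
    """Smallest number of inserts, deletes, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def jaro(a: str, b: str) -> float:
    """Jaro similarity in [0, 1], from common characters and transpositions."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    # Assumed window: characters count as common only within this distance.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):
        start = max(0, i - window)
        end = min(len(b), i + window + 1)
        for j in range(start, end):
            if not b_matched[j] and b[j] == ca:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Common characters in order of appearance; out-of-order pairs are
    # counted and halved to obtain the number of transpositions.
    a_seq = [c for c, m in zip(a, a_matched) if m]
    b_seq = [c for c, m in zip(b, b_matched) if m]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3.0


def jaro_winkler(a: str, b: str, prefix_scale: float = 0.1,
                 max_prefix: int = 4) -> float:
    """Jaro score boosted for strings that share a common prefix."""
    score = jaro(a, b)
    prefix = 0
    for ca, cb in zip(a[:max_prefix], b[:max_prefix]):
        if ca != cb:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1.0 - score)
```

With these definitions, `levenshtein("kitten", "sitting")` evaluates to 3, and the name pair "MARTHA" / "MARHTA" scores roughly 0.944 under Jaro and roughly 0.961 under Jaro-Winkler, the higher value reflecting the shared three-character prefix.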
Another approach is the n-gram name similarity, in which n-grams are sub-strings of length n and an n-gram similarity between two strings is calculated by counting the number of n-grams in common (i.e., n-grams contained in both strings) and dividing by either the number of n-grams in the shorter string (called Overlap coefficient), or the number of n-grams in the longer string (called Jaccard similarity), or the average number of n-grams in both strings. 2-grams and 3-grams have been used to calculate the similarity between the two users' full names. Finally, the Vector Name Matching (VMN) similarity approach (proposed by Vosecky, J., Hong, D., and Shen, V., User identification across multiple social networks, in Networked Digital Technologies, 2009. NDT '09. First International Conference on (July 2009), pp. 360-365) was designed for full and partial matches of names consisting of one or more words. VMN supports the case of swapped names and the cases of partial matches.
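The n-gram similarity and its three normalizations can be sketched as below. This is a minimal illustration under stated assumptions: n-grams are taken over characters, repeated n-grams are counted with multiplicity, and the function names and the `norm` keywords ("shorter", "longer", "average") are hypothetical labels for the three denominators described above.

```python
from collections import Counter
from typing import List


def ngrams(text: str, n: int) -> List[str]:
    """All character sub-strings of length n, in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_similarity(a: str, b: str, n: int = 2, norm: str = "shorter") -> float:
    """Count of common n-grams divided by the chosen normalizer.

    norm="shorter": number of n-grams in the shorter string
    norm="longer":  number of n-grams in the longer string
    norm="average": average number of n-grams in both strings
    """
    a_grams, b_grams = ngrams(a, n), ngrams(b, n)
    if not a_grams or not b_grams:
        return 0.0  # at least one string is shorter than n
    # Multiset intersection: each n-gram counts as often as it occurs in both.
    common = sum((Counter(a_grams) & Counter(b_grams)).values())
    if norm == "shorter":
        denominator = min(len(a_grams), len(b_grams))
    elif norm == "longer":
        denominator = max(len(a_grams), len(b_grams))
    elif norm == "average":
        denominator = (len(a_grams) + len(b_grams)) / 2
    else:
        raise ValueError(f"unknown norm: {norm!r}")
    return common / denominator
```

For example, "abcd" and "abc" share two 2-grams ("ab" and "bc"), so the three normalizations give 2/2, 2/3, and 2/2.5 respectively. In practice, names would likely be case-folded and whitespace-normalized before comparison; that preprocessing is left out of the sketch.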