1. Field
The present disclosure relates generally to concept-level user intent profile extraction and applications to monetization and user-engagement enhancement in large-scale social media platforms.
2. Related Art
In the online world there is a major need to understand users and to create temporally evolving profiles of them, including how they interact with various institutions and activities, both online and in the real world. If such understanding and profiling can be achieved, both at the individual user level and at the collective level of groups of users, then the various service providers (e.g., social media sites, online advertisers, offline stores and organizations) can use automated algorithms to serve the right information, content, and services to every individual and organization (i.e., group of users) in the right context and at the right time. The only kinds of information available online are individual user actions and the structured data that users voluntarily share with the social media and other sites with which they register. The structured data shared (e.g., one's place of residence, education level and degrees obtained, professional credentials, explicitly stated friends, email contact lists, and followers on social media and news sites) is easy to categorize and collect, and is being stored, heavily utilized, and mined by various online entities such as social networking and media sites, including Facebook, Twitter, LinkedIn, and Google+.
The majority of user actions, however, are unstructured and, when aggregated, comprise billions of atomic or elementary actions per day, such as (i) users' Votes or Likes for articles, posts, or other users' posts and activities, (ii) searches done at major search engines and at individual sites, (iii) articles and web pages browsed, and (iv) posts on social media and networking sites and other interactions among friends on such sites. For example, not all friends are created equal, and one shares different types of information and activities with different sets of friends and colleagues. Such preferences are not explicitly expressed or defined; rather, they can only be inferred from the content of the posts shared and liked and from the locations visited together, and they can evolve over time.
One computationally challenging problem is how to make sense of individual users, and of groups of users collectively, from the billions of such seemingly diverse elementary actions and the available structured data. Is it possible to create a unified informational and functional view of individual users and groups of users that is granular enough to capture all aspects of behavior and preferences, and that can evolve over time to track a user's changing needs and interests? Others have attempted this task at different levels of granularity and with varying success, but a comprehensive and computationally scalable solution has not been proposed.
For example, in the existing art, detailed structured databases are created based on the explicitly stated attributes of users. These attributes may include age, gender, place of residence, education and schools attended, favorite institutions such as sports teams, favorite TV shows, music and music artists, celebrities, preferred types of food, etc. This information is valuable, but the expressive capabilities of such explicitly stated categories are known to be very limited in characterizing a user's intent and profile accurately. Moreover, such information is often outdated or incorrectly entered, making it highly noisy. Once entered in a database, it cannot be easily updated or corrected.
The main way to deal with unstructured activities has been to use taxonomies with predefined categories organized in various data structures, such as a tree. For example, if a person visits a sports page discussing the Los Angeles Lakers, then that activity could be categorized as an activity related to Sports/Basketball/Lakers. These categories are then aggregated to create user profiles. The major drawbacks of such an approach are two-fold: (i) Taxonomies have to be defined manually and can comprise only a limited number of categories. The manual nature of the process makes it less expressive, and user actions cannot be captured comprehensively and at the right granularity by such necessarily limited sets of categories. (ii) Every action and item of content has to be classified as belonging to one of the categories in a taxonomy, and this process of classification is highly error prone. The only ways to achieve such classification are (a) extensive training, which means providing examples of known pages or content for each category, and (b) providing a set of keywords or terms for each category, with classification based on how many or which sets of such keywords appear in a document. Both of these methods are highly manual and have computational problems associated with them, including: (1) the accuracy of the underlying classification engine is only as good as the training sets provided to it, and it can quite easily be overtrained, resulting in poor generalization on new content; (2) the bigger the taxonomy, the larger the manual and supervised part of the training process; (3) keywords are notoriously ambiguous and lead to highly inaccurate classifications; and finally (4) documents or content often belong to multiple categories at the same time, and training for such cases, which involves classifying documents as belonging to more than one category simultaneously, leads to a combinatorially intractable problem.
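By way of illustration only, the keyword-based classification described above can be sketched as follows. This is a minimal sketch and not a description of any particular existing system: the taxonomy contents, the keyword sets, and the function name `classify` are all hypothetical and chosen for the example. It scores a document against each category by counting keyword occurrences, and it shows how a single ambiguous keyword can match two unrelated categories equally.

```python
import re
from collections import Counter

# Hypothetical hand-built taxonomy: category path -> manually chosen keywords.
# Every entry here is an assumption made for illustration.
TAXONOMY = {
    "Sports/Basketball/Lakers": {"lakers", "basketball", "nba"},
    "Sports/Football/Jaguars": {"jaguars", "jaguar", "nfl", "football"},
    "Autos/Luxury/Jaguar": {"jaguar", "sedan", "engine"},
}


def classify(text):
    """Return (category, hit_count) pairs, highest keyword count first.

    Categories with no keyword hits are omitted. Ties are ordered
    alphabetically by category path, which is an arbitrary choice.
    """
    # Lowercase the text and count each alphabetic token.
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    # Score a category by summing occurrences of its keywords.
    scores = {
        category: sum(tokens[keyword] for keyword in keywords)
        for category, keywords in TAXONOMY.items()
    }
    ranked = [(c, s) for c, s in scores.items() if s > 0]
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    # Unambiguous page: only the Lakers category matches.
    print(classify("The Los Angeles Lakers won the NBA game"))
    # Ambiguous keyword: the car and the football team tie.
    print(classify("The new Jaguar is fast"))
```

In this sketch, a page mentioning only "Jaguar" ties between an automotive category and a sports category, and content that matches no keyword receives no category at all, which reflects the ambiguity and limited coverage noted above.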