This application relates to social network analysis with prior knowledge and non-negative tensor factorization.
Social networking is a concept that has been around much longer than the Internet or even mass communication. People have always been social creatures; our ability to work together in groups, creating value that is greater than the sum of its parts, is one of our greatest assets. The social networking model has recently adapted to the World Wide Web. The Web model has changed from top-down to bottom-up creation of information and interaction, made possible by new Web applications that give power to users. While in the past there was a top-down paradigm of a few large media corporations creating content on the Web for the consumers to access, the production model has shifted so that individual users now create content that everyone can share. While the Web has functioned as an information repository, the advent of social networks is turning the Web into a tool for connecting people.
One issue in running social networks is the size of the data generated. Data in many applications are polyadic, i.e., they have multiple dimensions. Data in social networks are one example: data in the blogosphere may have a people dimension (the author of a blog post), a content dimension (the body of the post), and a time dimension (the timestamp of the post). Documents in a digital library are another example: a scientific paper can be described by its authors, keywords, references, publication date, publication venue, etc. To analyze such polyadic data, an important task is to extract significant characteristics from the different data dimensions, where the extracted characteristics can be used either directly for data summarization and visualization or as features for further data analysis. In the blog example, the extracted characteristics can take the form of salient communities among bloggers, coherent topics in blog posts, and noteworthy temporal trends of these topics and communities. Because these data dimensions affect each other jointly, approaches that either analyze each data dimension independently or consider only pairwise relations between data dimensions cannot accurately capture the data characteristics.
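As an illustration of the polyadic blog data described above, the following minimal sketch arranges (author, keyword, day) records into a three-way count tensor. The records, the choice of a single keyword to stand in for the content dimension, and the helper name `build_tensor` are hypothetical and serve only to make the idea concrete.

```python
import numpy as np

# Hypothetical blog records: (author, keyword, day index). Made-up values.
posts = [
    ("alice", "tensor", 0),
    ("alice", "graph", 0),
    ("bob", "tensor", 1),
    ("carol", "graph", 1),
    ("bob", "tensor", 1),
]


def build_tensor(records):
    """Count co-occurrences of (author, keyword, day) in a 3-way tensor."""
    authors = sorted({a for a, _, _ in records})
    keywords = sorted({k for _, k, _ in records})
    days = sorted({d for _, _, d in records})

    # Map each label to its position along the corresponding tensor mode.
    a_idx = {a: i for i, a in enumerate(authors)}
    k_idx = {k: i for i, k in enumerate(keywords)}
    d_idx = {d: i for i, d in enumerate(days)}

    X = np.zeros((len(authors), len(keywords), len(days)))
    for a, k, d in records:
        X[a_idx[a], k_idx[k], d_idx[d]] += 1.0
    return X, authors, keywords, days


X, authors, keywords, days = build_tensor(posts)
print(X.shape)
```

Each entry of `X` is indexed jointly by all three dimensions, which is the information lost when the dimensions are analyzed independently or only in pairs.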
Several multi-dimensional tensor models have been proposed to capture the higher-order correlations (beyond second-order correlations) among the various data dimensions. These tensor-based approaches can be categorized into two groups. Approaches in the first group decompose polyadic data by using higher-order linear decompositions, which are extensions of the matrix singular value decomposition. Approaches in the second group decompose polyadic data by using non-negative tensor factorizations (NTFs), which are extensions of non-negative matrix factorization (NMF).
Non-negative tensor factorization is a relatively new technique that has been successfully used to extract significant characteristics from polyadic data, such as data in social networks. Because these polyadic data have multiple dimensions, NTF fits naturally and extracts data characteristics jointly from the different data dimensions. In the standard NTF, all information comes from the observed data, and end users have no control over the outcomes. In many applications, however, end users often have prior knowledge and therefore prefer that the extracted data characteristics be consistent with that knowledge.
Approaches based on NTFs decompose data into sums of non-negative components and therefore have several advantages over those based on linear decompositions. These advantages include ease of interpretation of the extracted characteristics, a close connection to probabilistic models, and no orthogonality requirement among the different data characteristics.
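The decomposition into non-negative components can be sketched as follows for a three-way tensor. This is a generic illustration only: it assumes a rank-R sum of outer products fitted under a Euclidean (least-squares) loss with multiplicative updates of the kind commonly used for NMF, which the text above does not specify. The function name `ntf_cp`, the rank, the iteration count, and the synthetic data are all assumptions.

```python
import numpy as np


def ntf_cp(X, rank, n_iter=200, eps=1e-9, seed=0):
    """Approximate X[i,j,k] by sum_r A[i,r] * B[j,r] * C[k,r] with A, B, C >= 0.

    Assumed loss: squared Euclidean error, minimized by multiplicative updates.
    """
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    # Strictly positive random initialization keeps the updates well defined.
    A = rng.random((I, rank)) + eps
    B = rng.random((J, rank)) + eps
    C = rng.random((K, rank)) + eps

    errors = []
    for _ in range(n_iter):
        # Each factor is rescaled by a ratio of non-negative terms,
        # so non-negativity is preserved without any projection step.
        A *= np.einsum("ijk,jr,kr->ir", X, B, C) / (A @ ((B.T @ B) * (C.T @ C)) + eps)
        B *= np.einsum("ijk,ir,kr->jr", X, A, C) / (B @ ((A.T @ A) * (C.T @ C)) + eps)
        C *= np.einsum("ijk,ir,jr->kr", X, A, B) / (C @ ((A.T @ A) * (B.T @ B)) + eps)

        X_hat = np.einsum("ir,jr,kr->ijk", A, B, C)
        errors.append(np.linalg.norm(X - X_hat))
    return A, B, C, errors


# Synthetic non-negative tensor built from two hidden components (made-up data).
rng = np.random.default_rng(42)
X_demo = np.einsum(
    "ir,jr,kr->ijk", rng.random((6, 2)), rng.random((5, 2)), rng.random((4, 2))
)
A, B, C, errors = ntf_cp(X_demo, rank=2)
print(f"reconstruction error: first={errors[0]:.4f}, last={errors[-1]:.4f}")
```

Under the blog example, column r of `A`, `B`, and `C` would be read together as one community, its topic, and its temporal trend. Because the columns are non-negative and are not forced to be orthogonal, they can be inspected directly as weights.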
In an approach based on the standard NTF for extracting data characteristics, the extracted characteristics can take arbitrary forms, and end users have no control over them. Such an approach has one benefit: it is simple, because it requires no input other than the observed data. However, this simplicity is also a weakness: end users have no channel through which to incorporate their prior knowledge into the process of characteristic extraction.