The Internet abounds with dyadic data, which continues to grow rapidly as new websites come online and existing websites add new content. Dyadic data consists of measurements on dyads, where a dyad is a pair of elements drawn from two sets. A well-known example on the Internet is the term-by-document representation of the web corpus, where the measurement on the dyad (term, document) can be the number of times the term appears in the document, or a transformed value such as the TF-IDF (term frequency-inverse document frequency) score.
In general, dyadic data shares four characteristics: high dimensionality, sparsity, non-negativity, and dynamicity. In the term-by-document matrix, for example, the dimensions are usually very large (e.g., millions to billions), and the measurements are sparse relative to all possible dyads, since a typical term appears in only a small fraction of the documents. Most measurements on web dyadic data are also non-negative, because they are based on event observations (e.g., impressions and clicks), which are positive if observed and zero otherwise. Finally, as new words are invented and new webpages are published every day, the term-by-document dyadic data grows continually in both the number of observed dyads and the dimensionality.
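As an illustration of the term-by-document representation and of its sparsity and non-negativity, the following minimal Python sketch stores only the observed dyads of a toy corpus. The function names (`term_document_counts`, `tf_idf`, `density`), the whitespace tokenizer, the three example documents, and the particular TF-IDF variant (count times log(N/df)) are all assumptions made for this example and are not taken from the text.

```python
import math
from collections import Counter


def term_document_counts(docs):
    """Map each observed dyad (term, doc_id) to its raw count."""
    counts = {}
    for doc_id, text in enumerate(docs):
        # Assumption: lowercase whitespace tokenization, for illustration only.
        for term, n in Counter(text.lower().split()).items():
            counts[(term, doc_id)] = n
    return counts


def tf_idf(counts, num_docs):
    """Reweight counts as count * log(N / df), one common TF-IDF variant."""
    # df[t] = number of documents in which term t is observed.
    df = Counter(term for term, _ in counts)
    return {
        (term, doc_id): n * math.log(num_docs / df[term])
        for (term, doc_id), n in counts.items()
    }


def density(counts, num_docs):
    """Fraction of all possible (term, document) dyads that are observed."""
    num_terms = len({term for term, _ in counts})
    return len(counts) / (num_terms * num_docs)


# Hypothetical toy corpus.
docs = ["web data grows", "web pages link to web pages", "sparse data"]
counts = term_document_counts(docs)
weights = tf_idf(counts, len(docs))

print(counts[("web", 1)])          # count for the dyad ("web", document 1)
print(density(counts, len(docs)))  # share of possible dyads actually observed
```

Unobserved dyads are simply absent from the dictionary, which is the sense in which the data is sparse; because the document frequency of a term never exceeds the number of documents, every weight produced by this variant is non-negative.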
A commonly used tool for extracting the underlying structure of dyadic data is matrix factorization. However, applying matrix factorization to real-world web dyadic data poses a serious scalability challenge for available tools.