The present invention generally relates to ranking data items, and more specifically, to identifying a set of data items based on both relevance and diversity.
It is now widely recognized that diversity is a highly desired property in many data mining tasks, such as expertise and legal search, recommendation systems, blog filtering, document summarization, and others. It is a powerful tool for addressing uncertainty and ambiguity and/or for covering the different aspects of an information need. Diversity is also positively associated with personnel performance and job retention rates in large organizations.
Diversified ranking on graphs is a fundamental mining task and has a variety of high-impact applications. Two important questions remain open in diversified ranking on large graphs. The first challenge is the measure: for a given top-k ranking list, how can we quantify its goodness? Intuitively, a good top-k ranking list should capture both relevance and diversity. For example, given a task that typically requires a set of different skills, if we want to form a team of experts, not only should the people in the team have relevant skills, but they should also be ‘different’ from each other so that the whole team can benefit from the diversified, complementary knowledge and social capital. However, no such goodness measure for graph data exists in the literature. Most existing work on diversified ranking on graphs is based on heuristics. One exception is described in a paper by Mei, et al. (Q. Mei, J. Guo, and D. R. Radev, "DivRank: the interplay of prestige and diversity in information networks," in KDD, pages 1009-1018, 2010). In that paper, the authors take an important step toward this goal by providing an optimization explanation, which is achieved by defining a time-varying objective function at each iteration. It remains unclear, however, what overall objective function the algorithm tries to optimize.
The second challenge lies in the algorithmic aspect: how can we find an optimal, or near-optimal, top-k ranking list that maximizes the goodness measure? Bringing diversity into the design objective implies that we need to optimize at the set level. In other words, the objective function for a subset of nodes is usually not equal to the sum of the objective functions of the individual nodes. Such set-level optimization is usually very hard to perform. For instance, a straightforward method would need to enumerate an exponential number of candidate subsets to find the exact optimal solution, which is infeasible even for medium-sized graphs. This, together with the fact that real graphs are often large, reaching billions of nodes and edges, poses the challenge for the optimization algorithm: how can we find a near-optimal solution in a scalable way?
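The two points above, non-additivity of a set-level objective and the cost of exhaustive enumeration, can be illustrated with a minimal sketch. The candidate names, the skill sets, and the skill-coverage objective below are hypothetical and chosen only for illustration; they are not the goodness measure of the present invention.

```python
from itertools import combinations
from math import comb

# Hypothetical toy data: each candidate expert covers a set of skills.
SKILLS = {
    "a": {"ml", "stats"},
    "b": {"ml", "stats"},
    "c": {"law"},
    "d": {"databases", "ml"},
}

def coverage(team):
    """Set-level objective: number of distinct skills covered by the team."""
    covered = set()
    for member in team:
        covered |= SKILLS[member]
    return len(covered)

def best_team_brute_force(k):
    """Score every size-k subset; there are C(n, k) of them."""
    return max(combinations(sorted(SKILLS), k), key=coverage)

# Not additive: "a" and "b" overlap completely, so the pair scores 2,
# while the individual scores sum to 4.
print(coverage({"a", "b"}), coverage({"a"}) + coverage({"b"}))

# Exhaustive search is fine for 4 candidates ...
print(best_team_brute_force(2))

# ... but the number of subsets to score grows very quickly.
print(comb(1_000, 10))
```

Even for 1,000 candidates and a list of size 10, the number of subsets to score is on the order of 10^23, which is why exhaustive enumeration does not scale.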
In recent years, set-level optimization has played a very important role in many data mining tasks. Many set-level optimization problems are NP-hard. Therefore, it is difficult, if not impossible, to find globally optimal solutions. However, if the function is monotone and submodular, with a function value of 0 for the empty set, a greedy strategy can lead to a provably near-optimal solution. This powerful strategy recurs in many different settings, e.g., immunization, outbreak detection, blog filtering, sensor placement, influence maximization and structure learning.
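The generic greedy strategy can be sketched as follows. This is an illustrative sketch only: the function names, the node labels, and the topic-coverage objective are assumptions made for the example, and it is not presented as the algorithm of the present invention. A coverage function is used because it is monotone and submodular and equals 0 for the empty set; for such functions under a size constraint, the classical result is that greedy selection achieves at least a (1 - 1/e) fraction of the optimal value.

```python
# Hypothetical toy data: each node is associated with a set of topic ids.
TOPICS = {
    "n1": {1, 2, 3, 4},
    "n2": {1, 2, 3},
    "n3": {5, 6},
    "n4": {4, 7},
}

def covered(nodes):
    """Monotone submodular objective: number of distinct topics covered."""
    seen = set()
    for node in nodes:
        seen |= TOPICS[node]
    return len(seen)

def greedy_select(candidates, objective, k):
    """Pick up to k items, each time adding the one with the largest marginal gain."""
    chosen = []
    remaining = set(candidates)
    for _ in range(min(k, len(remaining))):
        base = objective(chosen)
        # sorted() makes tie-breaking deterministic (first name in sorted order wins).
        best = max(sorted(remaining), key=lambda c: objective(chosen + [c]) - base)
        chosen.append(best)
        remaining.remove(best)
    return chosen

picked = greedy_select(TOPICS, covered, 2)
print(picked, covered(picked))

# Ranking by individual score alone would pick the two largest sets,
# which overlap heavily and cover fewer topics in total.
by_score = sorted(TOPICS, key=lambda n: len(TOPICS[n]), reverse=True)[:2]
print(by_score, covered(by_score))
```

In this toy example, greedy selection skips the second-highest-scoring node because its topics are already covered, which is the behavior a diversity-aware objective is meant to induce. Each greedy step evaluates the objective once per remaining candidate, so the total work is roughly k times the number of candidates rather than exponential.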