The present disclosure relates generally to the field of semi-supervised clustering.
Data clustering is an important task that has found numerous applications in many domains, including information retrieval, recommender systems, and computer vision. However, data clustering is inherently a challenging and ill-posed problem due to its unsupervised nature. Semi-supervised clustering addresses this issue by exploiting the available side information, which is often cast in the form of pairwise constraints: must-links for pairs of data points that belong to the same cluster, and cannot-links for pairs of data points that belong to different clusters. There are two major categories of approaches to semi-supervised clustering: (i) constrained clustering approaches, which employ the side information to restrict the solution space and find only solutions that are consistent with the pairwise constraints, and (ii) distance metric learning based approaches, which first learn a distance metric from the given pairwise constraints and then perform data clustering using the learned distance metric.
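By way of illustration only, the pairwise constraints described above can be represented as lists of index pairs and checked against a candidate partition. The following is a minimal sketch, not a method of the present disclosure; the function name `violated_constraints` and the sample data are hypothetical and are chosen solely for this example.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[int, int]  # a pair of data point indices


def violated_constraints(
    labels: Sequence[int],
    must_links: Sequence[Pair],
    cannot_links: Sequence[Pair],
) -> Tuple[List[Pair], List[Pair]]:
    """Return the must-links and cannot-links that a partition violates.

    labels[i] is the cluster assigned to data point i (hypothetical layout).
    """
    # A must-link is violated when its two points fall in different clusters.
    broken_must = [(i, j) for i, j in must_links if labels[i] != labels[j]]
    # A cannot-link is violated when its two points share a cluster.
    broken_cannot = [(i, j) for i, j in cannot_links if labels[i] == labels[j]]
    return broken_must, broken_cannot


# Illustrative data: four points, two must-links, two cannot-links.
must_links = [(0, 1), (1, 2)]
cannot_links = [(0, 3), (2, 3)]

# This partition breaks must-link (1, 2) and cannot-link (2, 3).
print(violated_constraints([0, 0, 1, 1], must_links, cannot_links))

# This partition is consistent with every constraint.
print(violated_constraints([0, 0, 0, 1], must_links, cannot_links))
```

In this sketch, a constrained clustering approach in the sense of category (i) would accept only partitions for which both returned lists are empty, whereas an approach in category (ii) would instead use the same pairs to learn a distance metric before clustering.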
One issue that is often overlooked by existing semi-supervised clustering approaches is how to efficiently update the data partitioning when the pairwise constraints are dynamic, i.e., when new pairwise constraints are generated sequentially. This is a natural situation in various real-world applications. For example, in social networks, user attributes, such as gender, school, major, nationality, and interests, can be considered as features, and social connections, such as friendship and common community membership, can be considered as the side information. Hence, the task of grouping users is essentially a semi-supervised clustering problem. However, since the connections in a social network change frequently, one needs a semi-supervised clustering algorithm that is able to cope with dynamic pairwise constraints. In addition, crowdsourcing applications often require soliciting contributions that are collected dynamically from a large group of human workers. In the framework of semi-supervised crowd clustering, a set of objects should be partitioned based on their features as well as on manual annotations collected through crowdsourcing, where those annotations (e.g., pairwise constraints) are provided sequentially by a set of online users rather than by a single oracle.
Active semi-supervised clustering aims to select the pairwise constraints or queries in an active way to improve the clustering performance. Thus, active clustering also needs efficient updating under dynamic pairwise constraints. However, neither constrained clustering nor distance metric learning based semi-supervised clustering can efficiently update the partitioning results when the pairwise constraints change, since each approach needs to re-optimize its objective function over all the data points, which makes both approaches computationally infeasible for large-scale datasets.
Moreover, in social network analysis, the problem of updating user communities based on their connections has attracted growing attention. However, the existing work on social network partitioning differs from the setting of dynamic semi-supervised clustering, mainly for two reasons: (i) the existing work often uses only link information to guide the clustering and ignores the important feature information of the data points (i.e., users), and (ii) the existing work needs to observe all the link information between users, while semi-supervised clustering assumes that only a small portion of the pairwise constraints is observed to guide the partition.