The daily lives of information workers abound with clustering and categorization problems. Faced with a large number of documents, data records, or other items, they must group those items into categories. Such grouping arises in applications as varied as categorizing bugs, sorting user feedback, grouping machines in an organization, and sorting job applications. Users faced with such tasks are often in need of machine assistance, especially when the number of items is large. Applying traditional clustering algorithms to such tasks is generally not helpful, because to be effective such algorithms require an appropriate “distance metric” that determines how items are related to one another in terms of their features. Users may not know what that metric should be or, at the least, may find it difficult to express. Furthermore, even if an appropriate distance metric can be determined, applying it may not yield the desired clustering in all cases.
Recently, there has been a steady stream of work in the literature concerning “interactive clustering,” in which a user assists a learning algorithm in automatically clustering items. The proposed methods typically obtain input from users in the form of “must-link” and “cannot-link” constraints, which specify whether two items belong in the same cluster or in different clusters. These constraints are then used to learn a distance metric, after which traditional clustering mechanisms such as k-means can be used to group the items. It has been observed, however, that specifying such “must-link” and “cannot-link” constraints is not a natural part of users' behavior when performing a clustering task. Rather, users typically prefer to create semantically meaningful clusters and incrementally add items to them. The methods proposed in the literature do not leverage this user behavior to perform interactive clustering. Furthermore, the distance metrics derived using these methods may still lead to disappointing results.
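By way of illustration only, the constraint-based approach described above may be sketched as follows. The sketch is not taken from any particular published method: the function names (`learn_feature_weights`, `weighted_sq_dist`, `kmeans`), the per-feature weighting heuristic, the deterministic centroid initialization, and the sample data are all assumptions introduced here. It learns a simple diagonal distance metric from “must-link” and “cannot-link” pairs and then applies k-means under that metric.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]
Pair = Tuple[int, int]  # indices of two items in the point list


def learn_feature_weights(points: List[Point], must_link: List[Pair],
                          cannot_link: List[Pair], eps: float = 1e-6) -> List[float]:
    """Assumed heuristic: weight each feature by how strongly it separates
    cannot-link pairs relative to how much it varies within must-link pairs."""
    def mean_sq_diff(pairs: List[Pair], d: int) -> float:
        if not pairs:
            return 1.0
        return sum((points[i][d] - points[j][d]) ** 2 for i, j in pairs) / len(pairs)

    dims = len(points[0])
    return [(mean_sq_diff(cannot_link, d) + eps) / (mean_sq_diff(must_link, d) + eps)
            for d in range(dims)]


def weighted_sq_dist(a: Point, b: Point, weights: Sequence[float]) -> float:
    """Squared Euclidean distance with one weight per feature."""
    return sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b))


def kmeans(points: List[Point], k: int, weights: Sequence[float],
           iterations: int = 20) -> List[int]:
    """Plain k-means under the weighted distance. The first k points seed
    the centroids so that the result is deterministic (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: weighted_sq_dist(p, centroids[c], weights))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical data: feature 0 is irrelevant to the user, feature 1 is what matters.
points = [(0.0, 0.0), (10.0, 0.2), (0.5, 5.0), (9.5, 5.1)]
must_link = [(0, 1), (2, 3)]    # user: these pairs belong together
cannot_link = [(0, 2), (1, 3)]  # user: these pairs belong apart

weights = learn_feature_weights(points, must_link, cannot_link)
learned_labels = kmeans(points, k=2, weights=weights)   # [0, 0, 1, 1]
plain_labels = kmeans(points, k=2, weights=[1.0, 1.0])  # [0, 1, 0, 1]
```

In this small example, the unweighted metric groups the items by the feature the user does not care about, whereas the learned weights recover the intended grouping. The sketch also makes the burden on the user apparent: every constraint must be supplied as an explicit pair of items, which is the unnatural interaction noted above.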