The number of electronic documents (emails, images, web pages, texts, etc.) that a user has to manage is often large and is constantly growing. One well-known method of constructing a tree of clusters of documents is entirely manual: a user who has a number of documents to classify creates a tree of directories into which the documents are inserted as and when required. This method has the advantages of conforming to the wishes of the user and of facilitating manual modification of the tree. It can nevertheless become very tedious if there are many documents to be listed.

Other methods offer totally automatic classification of electronic documents, whereby documents are described by attributes (for example the name, type, and size of a file, word counts for text documents, colorimetry measurements for images, etc.). The attributes of each document form a vector describing that document. It is then possible to define a distance between two vectors, and thus a metric for the proximity of the documents. Taking account of the distances between documents, these classification methods construct clusters that can be structured in tree form (for example the “Ascendant Hierarchical Classification” method) or in some other form (for example the “k-means” method).

The drawback of such automatic classification methods is that they do not always correspond to the organization required by the user. No correction is possible: the user is obliged either to accept the clusters obtained or to restart the whole process with different initial parameters (for example the required number of clusters) to obtain a different result.
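The attribute-vector representation and the resulting proximity metric can be illustrated by a minimal sketch (the documents, attributes, and values here are purely hypothetical; a Euclidean distance is assumed, though any metric over the vectors would serve):

```python
import math

# Hypothetical attribute vectors: (file size in KB, word count, image count)
docs = {
    "report.txt":  (120.0, 3400.0, 0.0),
    "summary.txt": (110.0, 3100.0, 0.0),
    "photo.jpg":   (2048.0, 0.0, 1.0),
}

def distance(u, v):
    """Euclidean distance between two attribute vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# The two text documents are closer to each other than either is to the image,
# so a distance-based method would tend to cluster them together.
d_texts = distance(docs["report.txt"], docs["summary.txt"])
d_mixed = distance(docs["report.txt"], docs["photo.jpg"])
assert d_texts < d_mixed
```

A hierarchical method such as Ascendant Hierarchical Classification would repeatedly merge the closest clusters under such a distance; k-means would instead partition the vectors around a fixed number of centroids.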
Automatic “supervised” or “semi-supervised” learning methods take into consideration criteria fixed a priori by users to implement a learning mechanism. In supervised classification, users must apply labels to some of the documents that they wish to cluster: two documents having the same label must be in the same cluster, and two documents having different labels must be in different clusters. A supervised learning algorithm constructs a model that assigns each unlabelled document an appropriate label, as a function of its description. A supervised method therefore assumes that users know all possible labels for the documents to be classified, and hence the final cluster organization. Users rarely have this a priori knowledge of the classification structure for their data, and the initial knowledge necessary for using such algorithms greatly restricts their use for managing documents.
An example of a semi-supervised method in which users specify objects as being similar or different is described in the document “Distance metric learning, with application to clustering with side-information”, Eric P. Xing, Andrew Y. Ng, Michael I. Jordan, and Stuart Russell (NIPS 15, 2003). From this information, the system learns a weighting of the descriptive attributes of documents that favors certain attributes and penalizes others, yielding a new distance metric between documents to be adopted for the classification.
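The effect of such an attribute weighting can be sketched with a diagonal metric (the weights below are invented for illustration; the cited method actually learns a full metric from the similar/different pairs):

```python
import math

def weighted_distance(u, v, w):
    """Distance under a diagonal metric: per-attribute weights favour
    some attributes and penalise others."""
    return math.sqrt(sum(wi * (a - b) ** 2 for wi, a, b in zip(w, u, v)))

u, v = (1.0, 0.0), (0.0, 1.0)

# Unweighted distance vs. a hypothetical learned weighting that downplays
# the first attribute and emphasises the second.
plain = weighted_distance(u, v, (1.0, 1.0))
learned = weighted_distance(u, v, (0.1, 4.0))
assert learned > plain  # the second attribute now dominates the metric
```

Once learned, this weighted distance simply replaces the original one in whatever clustering method follows.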
Another example of semi-supervised classification is proposed in the document “Constrained k-means clustering with background knowledge”, Kiri Wagstaff, Claire Cardie, Seth Rogers, and Stefan Schroedl (ICML 2001). This method assigns constraints to pairs of documents to specify that they must belong to the same cluster or that they must not belong to the same cluster. The prior art k-means method is then used to cluster the documents, attempting to satisfy the pre-assigned cluster-membership constraints as closely as possible. That method works only for non-hierarchical classifications. Moreover, it offers no solution for modifying, deleting, or moving documents in the classification obtained, and it is liable to fail if it proves impossible to comply with the set constraints (under such circumstances, no classification is effected).
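The constraint check at the heart of this approach can be sketched as follows (document identifiers and constraints are illustrative; the full algorithm would run k-means and apply this check at each assignment, failing when no cluster is valid for some document):

```python
# Pairwise constraints over document ids: must-link pairs belong to the
# same cluster, cannot-link pairs must be separated.
must_link = {("d1", "d2")}
cannot_link = {("d1", "d3")}

def violates(doc, cluster_members, assigned):
    """True if putting `doc` into a cluster with `cluster_members` would
    break a constraint, given the set of already-assigned documents."""
    for a, b in must_link:
        other = b if a == doc else a if b == doc else None
        if other is not None and other in assigned and other not in cluster_members:
            return True  # must-link partner sits in a different cluster
    for a, b in cannot_link:
        other = b if a == doc else a if b == doc else None
        if other is not None and other in cluster_members:
            return True  # cannot-link partner already in this cluster
    return False

# d2 may join d1's cluster; d3 may not.
assert not violates("d2", {"d1"}, {"d1"})
assert violates("d3", {"d1"}, {"d1"})
```

If every candidate cluster violates some constraint for a document, the algorithm has no valid assignment, which is the failure mode noted above where no classification is effected.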