The exemplary embodiment relates to object classification. It finds particular application in connection with multi-modality one-class classification of a large corpus of documents based on extracted features and, in one particular case, where only a small corpus of labeled documents may be available.
In a world where information becomes available in ever-increasing quantities, document classification plays an important role by preselecting which documents are to be reviewed by a person and in what order. Applications of document selection range from search engines to spam filtering. However, more specialized tasks, such as document review in large corporate litigation cases, can be approached with the same techniques.
During the pre-trial discovery process, the parties are requested to produce relevant documents. In cases involving large corporations, document production involves reviewing and producing documents which are responsive to the discovery requests in the case. The number of documents under review may easily run into the millions.
The review of documents by trained personnel is both time-consuming and costly. Additionally, human annotators are prone to errors. Inaccuracy and lack of consistency between annotators can be a problem. It has been found that both the speed and the accuracy of reviewers can be improved dramatically by grouping and ordering documents.
Systems have been developed to support human annotators by discovering structure in the corpus and presenting documents in a natural order. Usually the software that organizes the documents takes into account only the textual content of the documents.