The subject development relates to active learning methods and classifying systems for data items such as structured document systems and especially to such systems for adjusting a classifier for document systems wherein the documents or portions thereof can be characterized and classified for improved automated information retrieval. The development relates to a system and method for annotating document elements or adjusting classes of the classifier for the document data elements so the document and its contents can be more accurately categorized and stored, and thereafter better accessed upon selective demand.
In legacy document systems comprising substantial databases, such as where an entity endeavors to maintain an organized library of semi-structured documents for operational, research or historical purposes, the document files often have been created over a substantial period of time and are stored primarily for visual representation, to facilitate their rendering to a human reader. There are often no corresponding annotations to the documents to facilitate their automated retrieval by some characterization or classification system sensitive to a recognition of the different logical and semantic constituent elements.
Accordingly, the foregoing deficiencies evidence a substantial need for an improved system for logical recognition of content and semantic elements in semi-structured documents, for better reactive presentation of the documents and better response to retrieval, search and filtering tasks.
Concept models for annotating such systems usually start with a training set of annotations that can identify element instances in the document or data item being classified, for example, element instances such as author, title or abstract. Such annotations correspond to identification of distinctive features that can be determined to collectively define a class of the element instance which in turn can be interpreted to suggest the appropriate annotation. The training set originates from an annotator/expert involved in the classifying of the data items.
As the complexity and volume of documents or data collections increase, the difficulties in accurately and quickly classifying the data items in the collections, as well as elements in the documents, also increase. Better models for the annotating process need to be developed; obtaining them only through the manual efforts of the annotator/expert would result in highly undesirable inefficiencies in evolving the annotating model. Accordingly, there is a need for a better machine-implemented active learning method for evolving a classifier.
The subject development thus also relates to machine training of a classifying system. A wide range of machine learning techniques has also been applied to document classification. Examples of these classifiers are neural networks, support vector machines [Joachims, Thorsten, “Text categorization with support vector machines: Learning with many relevant features”, Machine Learning: ECML-98. 10th European Conference on Machine Learning, p. 137-42 Proceedings, 1998], genetic programming, Kohonen type self-organizing maps [Merkl, D., “Text classification with self-organizing maps: Some lessons learned”, Neurocomputing Vol. 21 (1-3), p. 61-77, 1998], hierarchical Bayesian clustering, Bayesian networks [Lam, Wai and Low, Kon-Fan, “Automatic document classification based on probabilistic reasoning: Model and performance analysis”, Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, Vol. 3, p. 2719-2723, 1997], and the Naïve Bayes classifier [Li, Y. H. and Jain, A. K., “Classification of text documents”, Computer Journal, 41(8), p. 537-46, 1998]. The Naïve Bayes method has proven its efficiency, in particular when using a small set of labeled documents and in semi-supervised learning, where the class information is learned from both labeled and unlabeled data [Nigam, Kamal; McCallum, Andrew Kachites; Thrun, Sebastian and Mitchell, Tom, “Text Classification from labeled and unlabeled documents using EM”, Machine Learning Journal, 2000].
Active learning refers to a framework where the learning algorithm selects the instances to be labeled and then included in the training set. It often allows a significant reduction in the amount of training data needed to train a supervised learning method. Instead of annotating random instances to produce the training set, active learning suggests annotating those instances that are expected to maximally benefit the supervised learning method.
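The selection step described above can be illustrated with a minimal sketch. It assumes, for illustration only, that the benefit of annotating an instance is approximated by the entropy of the classifier's predicted class distribution (uncertainty sampling); the function names, instance identifiers and probability values are hypothetical and are not taken from the subject development.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_for_annotation(predictions, k=1):
    """Return the ids of the k unlabeled instances the model is least sure about.

    predictions maps an instance id to the list of class probabilities
    produced by the current classifier (assumed to be given).
    """
    ranked = sorted(predictions, key=lambda i: entropy(predictions[i]), reverse=True)
    return ranked[:k]


# Hypothetical predictions over three classes (e.g. author, title, abstract).
predictions = {
    "elem-1": [0.98, 0.01, 0.01],  # confident prediction, little to gain
    "elem-2": [0.40, 0.35, 0.25],  # near-uniform, most uncertain
    "elem-3": [0.70, 0.20, 0.10],
}

chosen = select_for_annotation(predictions, k=2)
print(chosen)
```

Other selection criteria are possible; entropy is used here only because it is simple to state and compute.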
The conventional principle of active learning assumes a predefined and fixed concept definition, where the concept refers to the set of classes and their interpretation. The most traditional situation is one of evaluation testbeds, where the concept is predefined and given by a set of classes and a fully annotated collection of examples. Such testbeds are used in different domains in order to test, compare, and eventually improve existing machine learning techniques.
Concept evolution for the annotating model is a change in the class set or a shift in the interpretation of the classes. Current systems disallow any concept evolution, since any change renders the previous concept inconsistent, along with the associated learning model and training sets. Such a change often requires restarting the training process or, in the best case, revising the part of the training set affected by the change.
On the other hand, the ability to evolve a concept is very important in real applications. The need often originates from the complexity of input collections and a certain flexibility or even fuzziness in the task definition. For example, in the domain of meta-data extraction from digital and scanned documents and the semantic annotation of Web pages, the design of a learning model starts with some initial “idea” and often goes through a sequence of different corrections and adjustments. Such evolution of the concept may be critical in pursuing the following goals:
1. Refining the problem in a way that better corresponds to given collections, including the discovery of hidden knowledge (new elements, finer interpretation of existing ones, relations between elements, etc.) that can be beneficial for the final application, for faster learning, etc.
2. Better matching quality constraints imposed by pricing and contracting causes. It is often preferable to recognize instances of a sub-class AA with 98% accuracy than instances of a super-class A with 70% accuracy.
3. Unifying the efforts and models of different partners that share the modeling task but follow (slightly) different concept definitions, which might impose something similar to a concept change.
4. Adapting to a change in the deployment of extracted data due to external reasons, such as an update of the domain ontology.
Unfortunately, any concept change makes a part of the annotations inconsistent. If the investment in annotated samples has been substantial, retraining a model from scratch represents a significant cost. To avoid restarting the process or re-annotating the training examples that have become inconsistent, an active learning principle can assist a designer in pivoting the system toward the new concept definition and tuning the associated learning model.
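The effect of a concept change on an existing training set can be illustrated with a minimal sketch. It assumes, purely for illustration, that a change is expressed as a set of class labels whose definition has been altered (for example, a class split into finer sub-classes), and that only annotations carrying those labels become inconsistent. The function name and sample data are hypothetical.

```python
def split_annotations(annotations, changed_classes):
    """Partition (item_id, label) pairs into still-valid and stale annotations.

    An annotation is treated as stale when its label belongs to a class
    whose definition changed; all other annotations are kept as-is.
    """
    still_valid, stale = [], []
    for item_id, label in annotations:
        target = stale if label in changed_classes else still_valid
        target.append((item_id, label))
    return still_valid, stale


# Hypothetical training annotations for document elements.
annotations = [
    ("e1", "author"),
    ("e2", "title"),
    ("e3", "author"),
    ("e4", "abstract"),
]

# Hypothetical concept change: the class "author" is redefined.
still_valid, stale = split_annotations(annotations, changed_classes={"author"})
print(len(still_valid), "annotations reusable;", len(stale), "need review")
```

Under this assumption, only the stale portion is a candidate for re-annotation, and an active learning step could rank those items so that the annotator reviews the most informative ones first.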
Accordingly, there is a need for improved methods and systems for retraining a maximum entropy classifier when incremental changes are made to the definitions of the classes it must detect. The retraining should occur in an active learning framework in which the system may choose new instances to be annotated, in order to increase classification accuracy with less effort from human annotators.