A portion of this disclosure, including Appendices, is subject to copyright protection. Limited permission is granted to facsimile reproduction of the patent document or patent disclosure as it appears in the U.S. Patent and Trademark Office (PTO) patent file or records, but the copyright owner reserves all other copyright rights whatsoever.
1. Field of the Invention
The present invention relates to systems and methods for automated classification. More specifically, the invention relates to automated systems and methods for classifying concepts (such as legal concepts, including points of law from court opinions) according to a topic scheme (such as a hierarchical legal topic classification scheme).
2. Related Art
Document classification has long been recognized as one of the most important tasks in text processing. Classification of documents provides for quality document retrieval, and enables browsing and linking among documents across a collection. The benefits of such easy access are especially apparent in slowly-evolving subject domains such as law. The generally stable vocabularies and topics of the legal domain ensure long-term return on any classification work.
There are two broad document classification approaches: unsupervised learning and supervised learning. The approaches are differentiated by whether a pre-defined classification scheme is used.
Unsupervised learning is a data-driven classification approach, based on the assumption that documents can be well organized by a natural structure inherent to the data. Those familiar with the data should be able to follow this natural structure to locate their information. A large body of information retrieval literature has focused on this approach, mostly related to document clustering [Borko 1963, Sparck Jones 1970, van Rijsbergen 1979, Griffiths 1984, Willett 1988, Salton 1990]. More recently some machine learning techniques have been applied to this classification task [Farkas 1993]; the term “unsupervised learning” was coined to describe this approach. The following patents are associated with this approach: U.S. Pat. No. 5,182,708 and U.S. Pat. No. 5,832,470.
Opposite to the unsupervised learning approach to document classification is supervised learning. With this approach, a pre-defined “topic scheme” is given, along with the classified documents for each topic in the scheme. The topic scheme may be a simple list of discrete topics, or a complex hierarchical topic scheme. Supervised learning technology focuses on the task of feeding a computer meaningful topical descriptions so that it can learn to classify a document of unknown type.
When a topic scheme includes a simple list of discrete topics (one without complex hierarchical relationships among the topics), the document classification becomes mere document categorization. Many machine learning techniques, including the retrieval technique of relevance feedback, have been tried for this task [Buckley 1994, Lewis 1994, and Mitchell 1997]. In addition to the effectiveness of the learning methods themselves, the success of automatic categorization depends on the number of topics in the scheme, on the amount of quality training documents, and on the degree to which the topics are mutually exclusive. An example is disclosed in U.S. Pat. No. 5,675,710.
The more difficult document classification centers on classifying documents using a hierarchical topic scheme. In this task, one has to consider horizontal relationships among the sister topics, which tend to be close to each other and are thus confusing to a computer. Moreover, one must also be concerned with vertical inheritance relationships.
Many machine learning techniques have trouble accommodating these two semantic relationships simultaneously in their learning or training, and thereafter have difficulty in classifying documents effectively. The task becomes more challenging if the topic scheme is very large, if the training documents are not topically exclusive, if the size of documents is small, or if the documents lack descriptive information.
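The two relationships described above, horizontal (sister topics) and vertical (inheritance), can be illustrated with a small data structure. The following is only a sketch under stated assumptions: the topic names, the `PARENT` mapping, and the helper functions `ancestors` and `sisters` are hypothetical and are not taken from any actual topic scheme or from the cited patents.

```python
# Hypothetical hierarchical topic scheme, stored as child -> parent.
# All topic names are illustrative assumptions, not an actual scheme.
PARENT = {
    "Contracts": None,
    "Contracts/Formation": "Contracts",
    "Contracts/Remedies": "Contracts",
    "Torts": None,
    "Torts/Negligence": "Torts",
}


def ancestors(topic):
    """Vertical relationship: topics from which `topic` inherits, nearest first."""
    chain = []
    parent = PARENT[topic]
    while parent is not None:
        chain.append(parent)
        parent = PARENT[parent]
    return chain


def sisters(topic):
    """Horizontal relationship: other topics sharing the same parent."""
    parent = PARENT[topic]
    return sorted(t for t, p in PARENT.items() if p == parent and t != topic)


remedies_ancestors = ancestors("Contracts/Remedies")
formation_sisters = sisters("Contracts/Formation")
```

A classifier working over such a scheme must separate a topic from its sisters, which tend to share vocabulary, while staying consistent with the ancestors the topic inherits from.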
To face these challenges, some techniques (U.S. Pat. No. 5,204,812) have relied on human intervention. Others (U.S. Pat. No. 5,794,236) use simple but insightful pattern matching. Still others (U.S. Pat. Nos. 5,371,807 and 5,768,580) turn to linguistic knowledge to combat the ambiguity introduced in the hierarchical scheme.
However, these techniques can only handle small, domain-specific classification work. They have difficulty in scaled processing, either because of their simplicity in pattern recognition or because of the daunting demand of building expensive lexicons to support the linguistic parsing.
Thus, there is a need in the art to develop an economic, scalable machine learning process that can perform document classification with high accuracy using a large, hierarchical topic scheme. It is to meet this need that the present invention is directed.
Non-Patent References mentioned above:
Borko, H. and Bernick, M. 1963. “Automatic document classification.” Journal of the Association for Computing Machinery, pp. 151-161.
Sparck Jones, K. 1970. “Some thoughts on classification for retrieval.” Journal of Documentation, pp. 89-102.
van Rijsbergen, C. J. 1979. Information Retrieval, 2nd edition, Butterworths, London.
Griffiths, A. and others. 1984. “Hierarchic agglomerative clustering methods for automatic document classification.” Journal of Documentation, pp. 175-205.
Willett, P. 1988. “Recent trends in hierarchic document clustering: A critical review.” Information Processing and Management, pp. 577-598.
Salton, G. and Buckley, C. 1990. “Flexible text matching for information retrieval.” Technical Report 90-1158, Cornell University, Ithaca, N.Y.
Farkas, J. 1993. “Neural networks and document classification.” Canadian Conference on Electrical and Computer Engineering, pp. 1-4.
Buckley, C. and others. 1994. “Automatic routing and ad-hoc retrieval using SMART: TREC-2.” The 2nd Text Retrieval Conference, edited by Donna Harman, NIST Special Publication 500-215, pp. 45-55.
Lewis, D. D. and Gale, W. A. 1994. “A sequential algorithm for training text classifiers.” Proceedings of the 7th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pp. 3-12, London.
Mitchell, T. 1997. Machine Learning, McGraw Hill, New York.
The inventive system and method provide an economic, scalable machine learning process that performs document classification with high accuracy using large topic schemes, including large hierarchical topic schemes. More specifically, the inventive system and method suggest one or more highly relevant classification topics for a given document to be classified.
The invention provides several features, including novel training and concept classification processes. The invention also provides novel methods that may be used as part of the training and/or concept classification processes, including: a method of scoring the relevance of features in training concepts, a method of ranking concepts based on relevance score, and a method of voting on topics associated with an input concept.
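The three methods named above (scoring feature relevance, ranking concepts by relevance score, and voting on topics) can be pictured with a minimal sketch. This passage does not specify the formulas, so everything below is an assumption made for illustration: an inverse-document-frequency weight stands in for the relevance score, summed weights of shared terms stand in for the ranking, and a score-weighted top-k vote stands in for the voting method. The function names and the toy training data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def feature_weights(training):
    """Assumed relevance score: smoothed inverse document frequency per term."""
    n = len(training)
    df = Counter()
    for terms, _topic in training:
        df.update(set(terms))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def rank_concepts(query_terms, training, weights):
    """Rank training concepts by the summed weight of terms shared with the input."""
    query = set(query_terms)
    scored = []
    for terms, topic in training:
        score = sum(weights[t] for t in query & set(terms))
        if score > 0:
            scored.append((score, topic))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


def vote_topics(ranked, k=3):
    """Let the top-k ranked concepts vote for their topics, weighted by score."""
    votes = defaultdict(float)
    for score, topic in ranked[:k]:
        votes[topic] += score
    return sorted(votes.items(), key=lambda pair: pair[1], reverse=True)


# Toy training concepts: (terms, topic). Purely illustrative data.
training = [
    (["contract", "breach", "damages"], "Contracts/Remedies"),
    (["contract", "offer", "acceptance"], "Contracts/Formation"),
    (["negligence", "duty", "breach"], "Torts/Negligence"),
    (["damages", "breach", "contract", "mitigation"], "Contracts/Remedies"),
]

weights = feature_weights(training)
ranked = rank_concepts(["breach", "contract", "damages"], training, weights)
suggestions = vote_topics(ranked, k=3)
```

In this toy run the two highest-ranked training concepts share a topic, so that topic collects the most votes and is suggested first. A real system would need tokenization, a much larger training set, and a tuned choice of `k`.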
In a preferred embodiment, the invention is applied to the legal (case law) domain, classifying legal concepts (such as rules of law) according to a proprietary legal topic classification scheme (a hierarchy of areas of law).