This invention relates generally to machine processing of text, and more particularly to text classification systems and methods for the analysis and management of text. As used herein, text classification refers to the automated grouping of textual or partial textual entities for document retrieval, categorization, routing, filtering, and clustering, as well as text processing tasks such as tagging, disambiguation, and domain-specific information extraction.
Traditional approaches to automated information retrieval have typically used key words and statistical techniques to locate and retrieve relevant documents. Statistical techniques have the advantage of being easily automated, but require a large collection of text to process and, in general, have not demonstrated a high degree of accuracy and precision. In addition, indexing and classification of documents must be done largely by humans for each text entity in the collection. Linguistic approaches also have several limitations. Different words and phrases can be used to express the same concept, and differences in vocabulary can pose significant problems in text processing. For example, synonyms for the term cancer include neoplasm, tumor, malignancy, carcinoma, and sarcoma. Key word searches that do not include synonyms for terms could miss relevant documents. This is particularly true in fields such as medicine and engineering, where many different words and phrases can express the same concept. One approach to handling this problem is to use a dictionary that is specific to a particular subject area. Typically, the dictionary must be created manually. This is a time-consuming task, and a dictionary that is inaccurate or incomplete can still cause many relevant documents to be missed. Another problem is that words may have multiple meanings, even in the same text. The word nail, for example, can refer to a metal fastener or to the nail on a person's finger or toe, which the hammer may strike if it misses the "nail." Similarly, the word post may be a noun referring to a vertical support, as for a fence, or to a newspaper, and may be a verb referring to entering transactions into a ledger or sending a message. Since the meanings of words and phrases can vary depending upon context, lexical ambiguity limits the accuracy with which traditional automated approaches can process text.
Text classification systems enable documents to be indexed and classified by subject or area of interest and are helpful in improving the ability to retrieve information. A number of classification systems have been developed for this purpose. These include, for example, the Standard Industrial Classification (SIC) and the Medical Subject Headings (MeSH) systems. The MeSH system is a hierarchical system of headings or classifications which are assigned to documents based upon context. There are approximately 100 top-level MeSH categories, and approximately 32,000 nodes or clusters in the hierarchical classification tree. One or more MeSH classification codes are assigned to documents in the medical field depending upon subject and context. The classification codes often overlap, however, and the association between codes may derive from different, unrelated, or obscure contexts. It is difficult for a human being to keep all of the various classifications in mind, which makes classification a complex and time-consuming operation requiring a high level of expertise. Therefore, it is highly desirable to have an automated system for performing this task.
Several statistical approaches to classification have been used. One approach, known as the vector space model, determines the occurrence of each word in a group of documents comprising n words, and constructs a vector having n elements with values representing the weight of each word. Weight may be determined, for example, as log(1 + 1/f), where f represents the frequency of occurrence of the word, so that higher frequencies result in lower weights. The relationship of a document to a cluster (classification category) is determined by the cosine of the angle between a vector which characterizes the document and a vector which characterizes the cluster. A high cosine value indicates that the document belongs to the cluster.
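By way of illustration only, the weighting and cosine comparison described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of any particular prior system: f is taken to be the raw count of a word in the text being vectorized, words that do not occur are given a weight of zero, and the vocabulary and sample texts are hypothetical.

```python
import math
from collections import Counter


def weight_vector(tokens, vocabulary):
    """Build an n-element vector with weight log(1 + 1/f) per vocabulary word.

    Assumption: f is the raw count of the word in `tokens`; a word that
    does not occur gets weight 0.0 (the text does not specify this case).
    """
    counts = Counter(tokens)
    return [math.log(1 + 1 / counts[w]) if counts[w] else 0.0 for w in vocabulary]


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # no shared basis for comparison
    return dot / (norm_u * norm_v)


# Hypothetical vocabulary and texts, for illustration only.
vocabulary = ["tumor", "biopsy", "fence", "post"]
document = weight_vector("tumor biopsy tumor".split(), vocabulary)
medical_cluster = weight_vector("tumor biopsy biopsy tumor tumor".split(), vocabulary)
building_cluster = weight_vector("fence post fence".split(), vocabulary)

similarity_medical = cosine(document, medical_cluster)
similarity_building = cosine(document, building_cluster)
```

In this sketch the document shares no words with the hypothetical building cluster, so that cosine is zero, while its cosine with the medical cluster is close to one. What counts as a "high" cosine value is a threshold choice that the description above leaves open.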
Another approach represents each document of a plurality of documents as a vector in n-dimensional space, and determines a point corresponding to the centroid of the space. The distance between a document to be classified and the centroid is then measured, and a weighting algorithm is employed to determine the relationship of the document to the cluster. According to another similar approach, instead of measuring the distance between a centroid and a document, a cloud of points in space is constructed which represents a cluster of related documents. A new document is classified in a particular cluster if its point in space is within the cloud for that cluster. Additional approaches include using decision trees, linear classifiers, and probabilistic methods.
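The centroid approach can likewise be sketched in a few lines. The sketch below makes several assumptions that are not taken from the description above: the distance is Euclidean, the unspecified weighting algorithm is replaced by simply choosing the nearest centroid, and the cluster names and three-element vectors are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_cluster(doc_vector, centroids):
    """Return the name of the cluster whose centroid is closest.

    Assumption: plain nearest-centroid assignment stands in for the
    weighting algorithm, which the text does not specify.
    """
    return min(centroids, key=lambda name: euclidean(doc_vector, centroids[name]))


# Hypothetical document vectors grouped by cluster, for illustration only.
clusters = {
    "oncology": [[3.0, 1.0, 0.0], [2.0, 2.0, 0.0]],
    "carpentry": [[0.0, 0.0, 3.0], [0.0, 1.0, 2.0]],
}
centroids = {name: centroid(vectors) for name, vectors in clusters.items()}
label = nearest_cluster([2.0, 1.0, 0.0], centroids)
```

The cloud variant would replace the nearest-centroid test with a membership test against the region occupied by each cluster's points; that region's shape is not specified above and is therefore not sketched here.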
The problem with such known techniques is that they do not work very well because the vectors are "noisy", i.e., they contain many words that are not relevant to a particular classification or cluster. Decisions based on the number of occurrences of words do not account for the fact that noise produces random fluctuations in word frequency. Furthermore, many such documents contain a large number of words which are common to different clusters and may have little relevance to a particular classification. For example, words such as hospital, patient, and surgery may appear commonly in documents related to medicine which are classified in a large medical database, but such words, being common throughout the medical field, would not be particularly relevant or helpful in classifying a document in a particular area such as breast neoplasm. To cope with this problem, some approaches have attempted to give common words a lower weight. This has not been successful, primarily because weights have been assigned across a whole collection in advance of knowing the significance of a word to a subcollection of documents.
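The effect of assigning weights across a whole collection rather than a subcollection can be illustrated with a small sketch. It assumes, for illustration only, that f is the fraction of documents containing a word and reuses the log(1 + 1/f) form mentioned earlier; the documents are hypothetical.

```python
import math


def weights(documents):
    """Map each word to log(1 + 1/f).

    Assumption: f is the fraction of `documents` that contain the word.
    """
    document_counts = {}
    for doc in documents:
        for word in set(doc.split()):
            document_counts[word] = document_counts.get(word, 0) + 1
    total = len(documents)
    return {w: math.log(1 + total / c) for w, c in document_counts.items()}


# Hypothetical documents, for illustration only.
medical = [
    "patient surgery breast neoplasm",
    "patient hospital fracture",
    "patient surgery cardiac",
]
other = [
    "fence post nail",
    "ledger post transaction",
    "newspaper post",
    "hammer nail",
    "fence hammer",
]

whole_collection = weights(medical + other)  # weights fixed in advance
medical_only = weights(medical)              # weights for the subcollection

patient_whole = whole_collection["patient"]
patient_medical = medical_only["patient"]
```

Across the whole collection, patient occurs in only three of eight documents and so keeps a moderately high weight. Within the medical subcollection it occurs in every document and receives the lowest weight of any word, which reflects its lack of value for distinguishing one medical topic from another.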
As a result of problems such as the foregoing, known automated text classification systems and methods have suffered from a lack of accuracy and precision. It is desirable to provide machine-implemented text processing systems and methods which overcome such problems of accuracy and precision and which provide efficient, inexpensive, and rapid text classification that rivals the performance of human experts. It is to these ends that the present invention is directed.