The growing volume of publicly available, machine-readable textual information makes it increasingly necessary for businesses to automate the handling of such information to stay competitive. By automating the handling of text, businesses can decrease costs and increase quality in performing tasks that require access to textual information.
A commercially important class of text processing applications is text classification systems. Automated text classification systems identify the subject matter of a piece of text as belonging to one or more categories from a potentially large predefined set of categories. Text classification includes a class of applications that can solve a variety of problems in the indexing and routing of text.
Routing of text is useful in large organizations where a large volume of individual pieces of text must be sent to specific persons (e.g., technical support specialists inside a large customer support center). Indexing is useful for attaching topic labels to information and for partitioning the information space to aid information retrieval. For databases that include information such as news articles, federal regulations, etc., indexing can facilitate retrieval based upon the contents of text rather than upon Boolean keyword searches.
A number of different approaches have been developed for automatic text processing. One approach is based upon information retrieval techniques utilizing Boolean keyword searches. This approach, however, has problems with accuracy. A second approach borrows natural language processing from artificial intelligence technology to achieve higher accuracy. While natural language processing improves accuracy based upon an analysis of the meaning of input text, speed of execution and range of coverage become problematic when such techniques are applied to large volumes of text.
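The accuracy problem of the first approach can be illustrated with a minimal Python sketch. The documents, the query, and the helper names `tokens` and `matches_all` are all hypothetical and chosen only for illustration; they are not taken from any particular retrieval system.

```python
import re

def tokens(text):
    """Lowercase a text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def matches_all(text, keywords):
    """Boolean AND query: true only if every keyword occurs in the text."""
    return set(keywords) <= tokens(text)

# Hypothetical documents, invented for this illustration.
docs = {
    "loan_story": "The lender approved the mortgage after reviewing the application.",
    "river_story": "The loan of a boat let them reach the far bank of the river.",
}

query = ["bank", "loan"]  # intended topic: bank lending
hits = [name for name, text in docs.items() if matches_all(text, query)]
print(hits)
```

In this sketch the query retrieves only the river story, which is off topic, and misses the lending story, which is on topic but uses none of the query words. Matching on surface keywords alone takes no account of what the text means.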
Others have recognized the foregoing shortcomings and have attempted to reach a middle ground between information retrieval techniques and natural language/knowledge-based techniques, seeking acceptable accuracy without sacrificing speed of execution or range of coverage. These attempts have predominantly taken the form of rule-based systems that parse the input text using natural language morphology techniques, attempt to recognize concepts in the text, and then use a rule base to map from the identified concepts to categories.
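The three stages of such a rule-based system (morphological normalization, concept recognition, and rule-based mapping to categories) can be sketched in a few lines of Python. This is a toy sketch under stated assumptions, not a description of any actual prior system: the suffix list, the concept lexicon `CONCEPTS`, the rule base `RULES`, and all category names are hypothetical.

```python
import re

# Crude stand-in for morphological analysis (assumed suffix list).
SUFFIXES = ("ing", "ed", "es", "s")

def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical concept lexicon: word stem -> concept.
CONCEPTS = {
    "acquir": "ACQUISITION",
    "merger": "ACQUISITION",
    "profit": "EARNINGS",
    "dividend": "EARNINGS",
    "lawsuit": "LITIGATION",
    "court": "LITIGATION",
}

# Hypothetical rule base: a category applies when every required
# concept has been recognized in the text.
RULES = [
    ({"ACQUISITION"}, "mergers-and-acquisitions"),
    ({"EARNINGS"}, "financial-results"),
    ({"LITIGATION"}, "legal"),
    ({"ACQUISITION", "LITIGATION"}, "antitrust"),
]

def classify(text):
    """Return the categories whose rules fire for the given text."""
    stems = {stem(word) for word in re.findall(r"[a-z]+", text.lower())}
    concepts = {CONCEPTS[s] for s in stems if s in CONCEPTS}
    return [category for required, category in RULES if required <= concepts]

print(classify("The court blocked the company from acquiring its rival."))
print(classify("Profits rose and dividends were raised."))
```

Even in this toy form, all of the classification knowledge lives in the hand-written lexicon and rule base. Covering a new subject area means editing both and checking that the rules still interact as intended.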
Text classification systems that rely upon rule-based techniques also suffer from a number of drawbacks. The most significant is that such systems require a substantial amount of knowledge engineering to develop a working system appropriate for a desired text classification application. Developing an application with a rule-based system is more difficult because all the requisite knowledge is placed into a rule base. As a result, a knowledge engineer must spend a significant amount of time tuning and experimenting with the rules to arrive at a correct set of rules that work together properly for the desired application.
Another shortcoming of the foregoing systems is that there is no built-in mechanism that allows the knowledge base portion of a text classification system to learn from the input text over time and thereby increase system accuracy. The addition of a learning component would therefore be desirable to improve the accuracy and performance of the system over time.