The present invention relates to a text classification method and apparatus.
Many teams are working to establish natural language technology for various business applications. However, no system can yet understand English text in general, and even simple parsing is not reliable.
If the target text is well constrained, however, it is not difficult to parse it and understand its content. A good example is a computer language, which is fully parsed and understood by a computer. Another example is troubleshooting knowledge written in short English form. The problems of a copier, for instance, are well categorized, and the corresponding recommendations are expressed as:
CATEGORY; Image Quality
PROBLEM; Image is Reduced
RECOMMENDATION;
(1) Check Optical System
(2) Check/replace ROM
Such a record is not difficult for a computer to parse and interpret. Of course, even for this simple example, an English understanding system requires a large vocabulary dictionary and grammar knowledge, both of which differ depending on the application.
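To illustrate why such constrained text is easy to process, the following minimal Python sketch parses one troubleshooting record of the kind shown above. It is an illustration only, not part of the described system. The function name parse_record, the use of the field labels as dictionary keys, and the assumptions that each label ends with a semicolon and that steps are numbered "(1)", "(2)", and so on are all hypothetical.

```python
import re

# Sample record in the constrained "FIELD; value" form (taken from the example above).
RECORD = """CATEGORY; Image Quality
PROBLEM; Image is Reduced
RECOMMENDATION;
(1) Check Optical System
(2) Check/replace ROM"""


def parse_record(text):
    """Parse one troubleshooting record into a dictionary.

    Assumes (hypothetically) that only the three field labels below occur
    and that recommendation steps are introduced by "(n)" markers.
    """
    record = {"CATEGORY": "", "PROBLEM": "", "RECOMMENDATION": []}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        field, sep, value = line.partition(";")
        if sep and field.strip() in record:
            current = field.strip()
            if current != "RECOMMENDATION":
                record[current] = value.strip()
                continue
            # Steps may follow the label on the same line.
            line = value.strip()
        if current == "RECOMMENDATION" and line:
            # Split "(1) ... (2) ..." into individual steps.
            steps = re.split(r"\(\d+\)", line)
            record["RECOMMENDATION"].extend(s.strip() for s in steps if s.strip())
    return record


if __name__ == "__main__":
    print(parse_record(RECORD))
```

A fixed set of labels and a fixed step marker are sufficient here, and no dictionary or grammar is needed. This holds only because the format is constrained. The same approach would not extend to free English text.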
Although it is almost impossible for a computer to understand text written by human beings, classifying English text is very helpful for text retrieval because it reduces the search area.
There are some known technologies for identifying the author of a text [Tankard]. However, no description of text classification has been available.
As mentioned above, it is important to classify English text by a simple algorithm. The traditional approach, which relies on a large vocabulary dictionary and grammar, may provide classified text after appropriate processing. However, this approach is too complex and is therefore not appropriate for quickly searching a large body of text.