The present invention relates to an automatic text classification system, and more specifically to a system for automatically classifying texts in terms of each of a plurality of qualities in a manner such that the classified texts can be automatically retrieved based on a specified one or more of the plurality of qualities. The invention also relates to a retrieval system using the plurality of qualities.
A variety of methods are known for automatically classifying and/or analyzing text, including keyword searching, collaborative filtering, and natural language parsing.
Keyword searching methods operate by simply looking for one or more keywords in a text and then classifying the text based on the occurrence (or non-occurrence) of the keywords. Keyword searching methods, however, suffer from the drawbacks that the main concept of a given text may be unrelated to the keywords being searched, and/or that a particularly relevant text may not contain the keywords being searched.
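By way of illustration only, the two drawbacks noted above can be reproduced with a minimal keyword classifier. The function name `keyword_classify`, the keyword list and both sample texts are hypothetical and are not taken from any known system.

```python
import re

def keyword_classify(text, keywords):
    """Return True if any keyword occurs as a whole word in the text."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return any(k.lower() in words for k in keywords)

# Hypothetical texts: the first describes a comedy without using the keywords;
# the second uses a keyword but is not itself a comedy.
relevant = "A hapless clown stumbles through one hilarious mishap after another."
unrelated = "A grim documentary on why comedy clubs are closing."

missed = keyword_classify(relevant, ["comedy", "funny"])       # relevant text is missed
false_hit = keyword_classify(unrelated, ["comedy", "funny"])   # unrelated text is matched
```

In this sketch the relevant text is rejected because it contains neither keyword, while the unrelated text is accepted merely because the word "comedy" appears in it.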
Collaborative filtering methods work by attempting to make recommendations and/or classifications based on matching overlapping results. For example, if a collaborative filtering system were used to analyze a series of questionnaires asking people to name their favourite musicians, the system would analyze the questionnaires by looking for an overlap in one or more of the musicians named in respective questionnaires. If an overlap were found between two questionnaires, the other musicians named by the author of the first questionnaire would be recommended to the author of the second questionnaire, and vice versa. The drawback of collaborative filtering, however, is that it assumes that people's tastes that are similar in one respect are also similar in other respects. That is, collaborative filtering methods fail to take into account the underlying qualities that define people's tastes.
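The questionnaire example may be sketched as follows. This is a simplified illustration under stated assumptions, not a description of any particular collaborative filtering product; the function `recommend` and the respondent and musician names are hypothetical.

```python
def recommend(questionnaires):
    """For each respondent, suggest musicians named by any other respondent
    whose questionnaire overlaps with theirs in at least one musician."""
    recs = {}
    for person, favs in questionnaires.items():
        suggested = set()
        for other, other_favs in questionnaires.items():
            if other != person and favs & other_favs:
                # Overlap found: recommend the other respondent's remaining musicians.
                suggested |= other_favs - favs
        recs[person] = suggested
    return recs

# Hypothetical questionnaires.
questionnaires = {
    "first": {"Musician A", "Musician B"},
    "second": {"Musician B", "Musician C"},
    "third": {"Musician D"},
}
recs = recommend(questionnaires)
```

Here the first and second respondents overlap on "Musician B", so each receives the other's remaining musician, whereas the third respondent receives no recommendation. Nothing in the procedure examines why any musician was liked, which is the limitation described above.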
Natural language parsing methods operate by performing semantic or lexical analysis based on rules of grammar and lexicons. To date, however, computers have been unable to fully understand natural language, and known natural language parsing methods too often misinterpret the actual meaning of text.
The above described drawbacks of keyword searching, collaborative filtering, and natural language parsing have created a need for more accurate and more meaningful text classification methods.
Recently, a company called Autonomy, Inc. has developed technology that is capable of analyzing text and identifying and ranking main ideas. As disclosed in the “Autonomy Technology Whitepaper” (available at www.autonomy.com), Autonomy's technology can analyze text and identify key concepts based on a statistical probability analysis of the frequency and relationships of terms in the text that give the text meaning. Once the key concepts have been extracted from a text, “Concept Agents” are created to seek out similar ideas in any other texts such as websites, news feeds, email archives or other documents. In addition, the “Autonomy Technology Whitepaper” discloses that the “Concept Agents” can be used to create specific user profiles based on an analysis of the texts that a particular user reads, or that the “Concept Agents” can be used to make users aware of others with similar interests. Still further, the “Autonomy Technology Whitepaper” discloses that the “Concept Agents” can be used to automatically sort documents into predefined categories.
Indeed, by identifying key concepts based on a statistical probability analysis of the frequency and relationships of terms in a text that give the text meaning, Autonomy's technology represents a significant advance over other known text searching techniques. However, by focusing on key concepts or “Concept Agents”, Autonomy's technology fails to identify the underlying qualities of the subject matter described in the text.
For example, if Autonomy's technology were used to analyze a textual film synopsis, the extracted key concept would be films, and the film might even be classified into a predefined category such as comedy, romance, action/adventure or science fiction. However, Autonomy's technology would fail to identify whether the text relates to, for example, a happy or sad film, a funny or serious film, a beautiful or repulsive film, a tame or sexy film, and/or a weird or conventional film. In this connection, it is pointed out that a romantic film, for example, can independently be happy or sad, funny or serious, beautiful or repulsive, tame or sexy, and weird or conventional. Accordingly, if a user were to access a database of textual film synopses classified using Autonomy's technology, the user would only be able to search for a desired film within the static, predefined categories into which the films were classified. Thus, if a user wanted to find a film that is each of happy, funny, repulsive, sexy and weird, Autonomy's technology would be of little help.
It is an object of the present invention to provide a system for automatically classifying texts in terms of each of a plurality of qualities that are determined based on a statistical probability analysis of the frequency and relationships of words in the text.
It is also an object of the present invention to provide a system for automatically classifying texts in a manner such that the classified texts can be automatically retrieved using a “fuzzy logic” retrieval system capable of identifying a best match based on a specified one or more of a plurality of qualities.
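One way such best-match retrieval might be sketched is shown below. The matching rule is an assumption made purely for illustration, since this section does not specify it: each classified text is taken to carry a score between -1 and 1 on each quality axis, and the best match is taken to be the text with the smallest Euclidean distance to the requested values over only those axes the user specifies. The function `best_match`, the axis names and the film titles are hypothetical.

```python
import math

def best_match(catalogue, wanted):
    """Return the title whose scores lie closest to the wanted values,
    comparing only the axes that the user actually specified."""
    def distance(scores):
        return math.sqrt(sum((scores.get(axis, 0.0) - target) ** 2
                             for axis, target in wanted.items()))
    return min(catalogue, key=lambda title: distance(catalogue[title]))

# Hypothetical classified texts; +1 and -1 stand for the two ends of each axis.
catalogue = {
    "Film X": {"happy_sad": 0.8, "funny_serious": 0.6, "weird_conventional": -0.5},
    "Film Y": {"happy_sad": -0.7, "funny_serious": 0.9, "weird_conventional": 0.4},
    "Film Z": {"happy_sad": 0.5, "funny_serious": 0.7, "weird_conventional": 0.9},
}

all_three = best_match(catalogue, {"happy_sad": 1.0, "funny_serious": 1.0,
                                   "weird_conventional": 1.0})
funny_only = best_match(catalogue, {"funny_serious": 1.0})
```

Because a closest match is returned rather than an exact one, a query yields a result even when no text satisfies every requested quality perfectly, and the user may specify any subset of the qualities.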
An automatic text classification system is provided which extracts words and word sequences from a text or texts to be analysed. The extracted words and word sequences are compared with training data comprising words and word sequences together with a measure of probability with respect to the plurality of qualities. Each of the plurality of qualities may be represented by an axis whose two end points correspond to mutually exclusive characteristics. Based on the comparison, the texts to be analysed are then classified in terms of the plurality of qualities. In addition, a fuzzy logic retrieval system and a system for generating the training data are provided.
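A minimal sketch of the classification step summarised above follows. Several details are assumptions made for illustration only and are not stated in this section: word sequences are limited to pairs of adjacent words, each training entry stores the probability that a term indicates the first-named end of an axis, and a text's score on an axis is the mean of (2p - 1) over its matching terms, giving a value between -1 and 1. The names `extract_terms` and `classify`, the axis names and the training entries are hypothetical.

```python
import re
from collections import defaultdict

def extract_terms(text, max_len=2):
    """Extract single words and sequences of up to max_len adjacent words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    terms = []
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            terms.append(" ".join(tokens[i:i + n]))
    return terms

def classify(text, training):
    """Score the text on every axis for which a matching training term exists."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for term in extract_terms(text):
        for axis, p in training.get(term, {}).items():
            totals[axis] += 2.0 * p - 1.0   # map probability [0, 1] onto [-1, 1]
            counts[axis] += 1
    return {axis: totals[axis] / counts[axis] for axis in totals}

# Hypothetical training data: term -> {axis: probability of the first-named end}.
training = {
    "hilarious": {"funny_serious": 0.9},
    "tragic": {"happy_sad": 0.1},
    "feel good": {"happy_sad": 0.95},
}
scores = classify("A hilarious, feel good road movie.", training)
```

In this sketch the sample text receives a positive score on both the funny/serious and happy/sad axes, so a single text is classified on several independent qualities at once rather than being placed in one predefined category.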