Identifying the topics of text has been an area of study for several years, and identifying topics in unconstrained speech has been an area of growing interest. The latter, however, appears to be more difficult, since much of the information conveyed in speech is never actually spoken and since utterances are frequently less coherent than written language.
The standard method of electronically searching for a document related to a particular topic is by using keywords. In a keyword search, a user selects a small set of words (i.e., the keywords) which may be expected to occur in documents related to the topic of interest. The documents are then searched for occurrences of the keywords. Documents containing the keywords are then presented to the user. A disadvantage of this method is that relevant documents that do not include the keywords will not be retrieved.
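The keyword search just described can be illustrated with a minimal sketch. The function name keyword_search, the sample documents, and the choice to return a document when it contains at least one keyword are assumptions made for illustration only; the description above does not prescribe any particular implementation.

```python
def keyword_search(documents, keywords):
    """Return the names of documents that contain at least one keyword.

    Assumption: matching is case-insensitive and on whole
    whitespace-separated words.
    """
    wanted = {k.lower() for k in keywords}
    hits = []
    for name, text in documents.items():
        words = set(text.lower().split())
        if words & wanted:  # any keyword present in this document
            hits.append(name)
    return hits


# Hypothetical sample collection.
documents = {
    "doc1": "the carpet in the hallway was replaced",
    "doc2": "new floor covering installed in the hallway",
    "doc3": "quarterly sales figures",
}

result = keyword_search(documents, ["carpet"])
```

In this sample, doc2 concerns the same subject as doc1 but does not contain the keyword, so it is not returned, which is the disadvantage noted above.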
The keyword search method has been improved by the inclusion of Boolean operations. In 1847 George Boole developed the fundamental ideas of using mathematical symbols and operations to represent statements and to solve problems in logic. Boolean operations include "and", "or", etc. The user may use Boolean operators in conjunction with keywords to search for documents that contain the keywords in the relationship established between them by the Boolean operator (e.g., keyword1 "and" keyword2).
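A Boolean keyword search of the form keyword1 "and" keyword2 may be sketched as follows. The function names, the restriction to two keywords and the two operators "and" and "or", and the sample documents are assumptions for illustration only.

```python
def contains(text, keyword):
    """True if the keyword occurs as a whole word in the text."""
    return keyword.lower() in text.lower().split()


def boolean_search(documents, keyword1, operator, keyword2):
    """Return names of documents satisfying keyword1 <operator> keyword2.

    Assumption: only the operators "and" and "or" are supported.
    """
    hits = []
    for name, text in documents.items():
        first = contains(text, keyword1)
        second = contains(text, keyword2)
        if operator == "and":
            matched = first and second
        elif operator == "or":
            matched = first or second
        else:
            raise ValueError("unsupported operator: " + operator)
        if matched:
            hits.append(name)
    return hits


# Hypothetical sample collection.
docs = {
    "a": "carpet cleaning service",
    "b": "carpet and rug sale",
    "c": "rug repair",
}

both = boolean_search(docs, "carpet", "and", "rug")
either = boolean_search(docs, "carpet", "or", "rug")
```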
A further improvement to the keyword search is the use of "stemming." Stemming, or truncated stemming, is the process of shortening a keyword by removing one or more letters from the end of the keyword or identifying a keyword with a linguistic base form word. Each unique modification of a keyword is called a "stem." A keyword may have more than one stem. By using stemming, the user searches for documents containing either the keyword or the stems of the keyword. The disadvantage of truncated stemming is that a stem may not have the same meaning as the keyword because a stem may not be a true base form, or root, of the keyword. For example, the truncated stems of the word "carpet" include the words "car", "carp", and "carpe." The first two stems have a different meaning than the keyword and the third stem may be confused with a word in a different language.
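The "carpet" example can be reproduced with a short sketch of truncated stemming. The function name truncated_stems and the min_length parameter, which sets the shortest stem kept, are assumptions for illustration; the description above does not state a minimum stem length.

```python
def truncated_stems(keyword, min_length=3):
    """Return the stems obtained by removing letters from the end.

    Assumption: stems shorter than min_length are discarded, and the
    full keyword itself is not counted as a stem.
    """
    return [keyword[:n] for n in range(min_length, len(keyword))]


stems = truncated_stems("carpet")
```

Nothing in this procedure checks whether a stem is a true base form of the keyword, which is why unrelated words such as "car" and "carp" appear among the stems.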
A variant of stemming is the use of "N-grams." An N-gram of a keyword is a sequence of consecutive keyword characters having length "N." A keyword may result in more than one N-gram. The list of N-grams is generated by sliding an N-long window through the keyword one character location at a time and recording the N-gram contained in the window at each slide position. For example, the word "carpet" has the following 3-grams: "car", "arp", "rpe", and "pet." The N-grams generated from the keywords are then used to search for other documents containing the N-grams. N-grams may be used on multiple keywords or a section of text. N-gram statistics (e.g., N-gram type and frequency) are obtained for the query document (e.g., keywords or text) and the documents being searched. Search documents that are statistically similar to the query document are returned. The disadvantage of this method is that it cannot be used to generate a topical description of a document or return documents that use different words to discuss the topic of interest.
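The sliding-window generation of N-grams and the comparison of N-gram statistics may be sketched as follows. The description above does not name a particular similarity statistic; cosine similarity between N-gram frequency counts is used here purely as an assumed stand-in, and all function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(text, n):
    """Slide an n-character window through the text, one position at a time."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_profile(text, n):
    """N-gram types and their frequencies for a piece of text."""
    return Counter(ngrams(text, n))


def cosine_similarity(profile_a, profile_b):
    """Assumed similarity measure between two N-gram frequency profiles."""
    dot = sum(count * profile_b[gram] for gram, count in profile_a.items())
    norm_a = math.sqrt(sum(c * c for c in profile_a.values()))
    norm_b = math.sqrt(sum(c * c for c in profile_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


carpet_3grams = ngrams("carpet", 3)
same = cosine_similarity(ngram_profile("carpet", 3), ngram_profile("carpet", 3))
different = cosine_similarity(ngram_profile("carpet", 3), ngram_profile("floor", 3))
```

The word "floor" shares no 3-grams with "carpet", so the two profiles have zero similarity even though the words may be used to discuss the same topic.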
A further improvement to keyword searching is the inclusion of training. Training may take the form of rules for determining the context of words. That is, the context of keywords in a query may be determined so that documents containing the keywords in the same context as the query are returned while documents that do not contain the keywords and documents that contain the keywords in a different context from the query are ignored. The disadvantage of this method is that it requires human intervention to establish the rules and is not fully automatic.
U.S. Pat. Nos. 5,265,065, entitled "METHOD AND APPARATUS FOR INFORMATION RETRIEVAL FROM A DATABASE BY REPLACING DOMAIN SPECIFIC STEMMED PHASES IN A NATURAL LANGUAGE TO CREATE A SEARCH QUERY," and 5,418,948, entitled "CONCEPT MATCHING OF NATURAL LANGUAGE QUERIES WITH A DATABASE OF DOCUMENT CONCEPTS," disclose a method of automatically generating a search query from a user-generated natural language input by parsing the input, removing stop words, stemming the remaining words, and combining those words that are commonly collocated. U.S. Pat. Nos. 5,265,065 and 5,418,948 do not disclose a method of automatically generating a topic description for text and searching and sorting text by topic using the same as does the present invention. U.S. Pat. Nos. 5,265,065 and 5,418,948 are hereby incorporated by reference into the specification of the present invention.
U.S. Pat. No. 5,369,577, entitled "TEXT SEARCHING SYSTEM," discloses a method of searching text where a user provides a first word, the method automatically generates a series of words that are lexically related to the first word, and searches a collection of words to detect the occurrence of any of the words that were automatically generated. U.S. Pat. No. 5,369,577 does not disclose a method of automatically generating a topic description for text and searching and sorting text by topic using the same as does the present invention. U.S. Pat. No. 5,369,577 is hereby incorporated by reference into the specification of the present invention.
U.S. Pat. No. 5,440,481, entitled "SYSTEM AND METHOD FOR DATABASE TOMOGRAPHY," discloses a method of identifying word phrases by counting phrases of any length in the text, sorting the phrases by frequency of occurrence, and selecting those phrases that are above a user-definable threshold. U.S. Pat. No. 5,440,481 does not disclose a method of automatically generating a topic description for text and searching and sorting text by topic using the same as does the present invention. U.S. Pat. No. 5,440,481 is hereby incorporated by reference into the specification of the present invention.
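The general idea of counting phrases, sorting them by frequency, and keeping those above a threshold may be sketched as follows. This is an illustration of the steps as summarized above, not the method of the cited patent itself. The function name, the max_words cap on phrase length (the summary says "any length"), and the alphabetical tie-breaking are assumptions.

```python
from collections import Counter


def frequent_phrases(text, max_words=3, threshold=1):
    """Count word phrases, sort by frequency, keep those above a threshold.

    Assumptions: phrases are runs of 1 to max_words consecutive words,
    and ties in frequency are broken alphabetically.
    """
    words = text.lower().split()
    counts = Counter()
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(phrase, count) for phrase, count in ranked if count > threshold]


phrases = frequent_phrases("data base search data base query")
```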
U.S. Pat. No. 5,576,954, entitled "PROCESS FOR DETERMINATION OF TEXT RELEVANCY," discloses a method of document retrieval by determining the meaning of each word in a query and each word in the documents, making adjustments for words in the query that are not in the documents, calculating weights for the semantic components in the query and in the documents, multiplying the weights together, adding the products to determine a real-valued number for each document, and sorting the documents in sequential order. U.S. Pat. No. 5,576,954 does not disclose a method of automatically generating a topic description for text and searching and sorting text by topic using the same as does the present invention. U.S. Pat. No. 5,576,954 is hereby incorporated by reference into the specification of the present invention.
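The multiply, add, and sort steps summarized above may be sketched as follows. This is an illustration only and not the method of the cited patent: the determination of word meanings and the adjustment for missing query words are omitted, the semantic-component weights are supplied by hand, and all names and values are hypothetical.

```python
def relevance(query_weights, doc_weights):
    """Multiply matching component weights and add the products."""
    return sum(
        weight * doc_weights.get(component, 0.0)
        for component, weight in query_weights.items()
    )


def rank_documents(query_weights, documents):
    """Score every document and sort from highest to lowest score."""
    scored = [
        (name, relevance(query_weights, weights))
        for name, weights in documents.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical semantic-component weights.
query = {"flooring": 1.0, "textile": 0.5}
collection = {
    "d1": {"flooring": 0.8, "textile": 0.4},
    "d2": {"fishing": 0.9},
    "d3": {"textile": 1.0},
}

ranking = rank_documents(query, collection)
```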