1. Field of the Invention
The invention generally relates to extracting informative phrases from unstructured text and, more particularly, to natural language processing of text documents for extracting phrases that best characterize the subject or set of documents being analyzed.
2. Description of the Related Art
Recently, there has been rapid growth in on-line discussion groups and news websites on the World Wide Web (WWW). Determining what topics are being discussed on such websites, and how those topics are being discussed, could prove valuable (e.g., to companies investigating market reactions to their own or a rival company's products, to politicians, etc.). However, manually tracking such information across the large corpus of documents contained on the Web is laborious. Therefore, there is a need for a computer-implemented method for automatically extracting, from the corpus of documents contained on the Web, a set of phrases that best characterizes a selected subject or a selected set of documents.
The challenge is both to extract these phrases quickly and to extract phrases that are meaningful and useful. For example, many Web documents have pages containing the phrase “home page”; however, this phrase is not likely to be useful in characterizing the selected subject or the selected document set. Similarly, the sequence of words “navigate>>next” may occur frequently near the selected subject or in the selected set of documents; however, this sequence is also not likely to be meaningful for most types of analysis. Therefore, the extracted phrases should be limited to phrases that provide information as to how the selected subject is being discussed, or that provide information as to what language is being used in the selected set or subset of documents that differs from the rest of the corpus.
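By way of illustration only, the difficulty with a phrase such as “home page” can be sketched in a few lines of Python. The sketch below is not the method of the invention; it is a simple baseline, assumed here for the purpose of the example, that scores each two-word phrase by how much more often it appears in the selected documents than in the corpus as a whole. The function names (`bigrams`, `doc_frequency`, `distinctive_phrases`) and the scoring rule are hypothetical, and the selected documents are assumed to be a subset of the corpus.

```python
from collections import Counter


def bigrams(text):
    """Return the two-word phrases of a text, lowercased and split on whitespace."""
    words = text.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]


def doc_frequency(docs):
    """Count, for each phrase, the number of documents that contain it."""
    counts = Counter()
    for doc in docs:
        counts.update(set(bigrams(doc)))
    return counts


def distinctive_phrases(selected, corpus, top_n=3):
    """Rank phrases by (share of selected docs) minus (share of all docs).

    Assumption for this sketch: `selected` is a subset of `corpus`.
    A phrase present in every document everywhere scores 0.
    """
    selected_df = doc_frequency(selected)
    corpus_df = doc_frequency(corpus)
    scores = {
        phrase: count / len(selected) - corpus_df[phrase] / len(corpus)
        for phrase, count in selected_df.items()
    }
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores, key=lambda phrase: (-scores[phrase], phrase))
    return ranked[:top_n]
```

Under this baseline, a phrase that occurs in every document of the corpus receives a score of zero however often it occurs in the selected set, whereas a phrase concentrated in the selected set ranks higher. A frequency comparison of this kind does not by itself establish that a phrase is meaningful, which is the second half of the challenge described above.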