1. Field of the Invention
The present invention relates generally to natural language processing, and more particularly to systems and methods for processing and retrieving natural language text using probabilistic modeling of words and documents.
2. Description of the Related Art
With the expanding use of the Internet, there has been an increase in the number of people having access to large databases containing textual information. This has increased the need for systems for analyzing data in large databases to assist in the retrieval of desired information. The sheer size of the available databases makes it difficult to avoid retrieving extraneous information. Many typical text search and retrieval systems are top-down systems in which the user formulates a search request but does not have access to the actual textual data, so the user must guess at the proper request to obtain the desired data. One conventional top-down system for retrieving textual data is a keyword search system. In a keyword search query, the user enters one or more keywords, and a search of the database is then conducted using the keywords. If the user knows the exact keywords that will retrieve the desired data, then the keyword search may provide useful results. However, most users do not know the exact keyword or combination of keywords that will produce the desired data. In addition, even when specifically focused keywords retrieve the desired data, they may also retrieve a large amount of extraneous data that happens to contain the keywords. The user must then sift through all of the extraneous data to find the desired data, which may be a time-consuming process.
Another problem with conventional keyword-based searches is related to the inherent properties of human language. A keyword selected by the user may not match the words within the text, or may retrieve extraneous information, for two reasons. First, different people will likely choose different keywords to describe the same object. For example, one person may call a particular object a "bank" while another person may call the same object a "savings and loan". Second, the same word may have more than one distinct meaning. In particular, the same word may have different meanings when used in different contexts or by different people. For example, the keyword "bank" may retrieve text about a riverbank or a savings bank when only articles about a savings bank are desired, because the keyword does not convey information about the context of the word.
To overcome these and other problems in searching large databases, considerable research has been done in the area of Statistical Natural Language Processing, also referred to as Text Mining. This research has focused on the generation of simplified representations of documents. Simplifying the document representation makes it easier to find desired information among a large number of documents. One common simplification is to ignore the order of words within documents. This is often called a "bag of words" representation. Each document is represented as a vector consisting of the words, regardless of the order of their occurrence. However, with this approach, information relating to the context and meaning of the words due to their order is lost, and the ability to discriminate desired information is sometimes lost with it.
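By way of illustration only, the following minimal sketch shows how a "bag of words" representation discards word order. The function name bag_of_words, the whitespace tokenization, and the two sample sentences are assumptions made for this example and are not taken from any particular prior system.

```python
from collections import Counter


def bag_of_words(text):
    """Return a mapping from each word to its count, ignoring word order.

    Assumption for this sketch: words are lowercased and split on whitespace.
    """
    return Counter(text.lower().split())


# Two hypothetical documents with the same words in a different order.
doc_a = "the dog bit the man"
doc_b = "the man bit the dog"

bag_a = bag_of_words(doc_a)
bag_b = bag_of_words(doc_b)

# The texts differ, yet their bag-of-words representations are identical,
# so the information carried by word order has been lost.
same_bag = bag_a == bag_b
print(same_bag)
```

In this sketch the two sentences describe opposite events, yet they map to the same representation, which is the loss of discriminating ability noted above.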
Other models have been developed for modeling language that do take sequences of words into account. However, such models are quite specialized and can become quite complicated. Hence they are not very useful for general text mining.
Thus, there is a need for improved techniques to assist in searching large databases. To this end, there is also a need for improvements in Statistical Natural Language Processing that overcome the disadvantages of both the models that take the sequence of words into account and those that do not take the sequence of words into account.
The present invention has carefully considered the above problems and has provided the solution set forth herein.