The World Wide Web (WWW) comprises an expansive network of interconnected computers upon which businesses, governments, groups, and individuals throughout the world maintain inter-linked computer files known as web pages. Users navigate these pages by means of computer software programs commonly known as Internet browsers. Due to the vast number of WWW sites, many web pages contain redundant information or closely resemble one another in function or title. This has increased the need for data analysis systems that assist in the retrieval of information from large databases. The sheer size of such databases makes it difficult to avoid retrieving extraneous information.
In many conventional search and retrieval systems, the user formulates a search request without having access to the actual textual data, so the user must guess at the proper request to obtain the desired data. One such conventional system for retrieving textual data is a keyword search system. A keyword search system requires the user to enter one or more keywords, and the system then searches the database using these keywords.
If the user knows the exact keywords that will retrieve the desired data, then the keyword search may provide useful results. However, users do not necessarily know the exact keyword or combination of keywords that will produce the desired data. Even when the search and retrieval system retrieves the desired data, the system may also retrieve a large amount of extraneous data that happens to contain the keywords. The user must then sift through all of the extraneous data to find the desired information, typically a time-consuming process.
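The behavior described above can be illustrated with a minimal sketch of a keyword search. The function name `keyword_search`, the toy database, and the rule that a document matches only if it contains every keyword are all assumptions made for illustration; they are not taken from any particular retrieval system.

```python
def keyword_search(documents, keywords):
    """Return the ids of documents whose text contains every keyword."""
    wanted = [k.lower() for k in keywords]
    hits = []
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        if all(k in words for k in wanted):
            hits.append(doc_id)
    return hits


# Hypothetical toy database, invented for illustration.
documents = {
    "d1": "apple pie recipe with cinnamon",
    "d2": "apple releases a new phone",
    "d3": "cherry pie recipe",
}

# A user looking for recipes who enters only "apple" also retrieves d2.
print(keyword_search(documents, ["apple"]))            # ['d1', 'd2']

# Only the right combination of keywords isolates the desired document.
print(keyword_search(documents, ["apple", "recipe"]))  # ['d1']
```

In this sketch the extraneous result disappears only once the user happens to supply a second keyword that the desired document contains, which is the guesswork the passage describes.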
Another difficulty with conventional keyword-based searches relates to the inherent properties of human language. A keyword selected by the user may fail to match the words within the text, or may retrieve extraneous information, because different people can choose different words to describe the same object. For example, one person may call a particular object a “bank” while another person may call the same object a “savings and loan institution.”
In addition, the same word may have more than one distinct meaning depending on its context. Because the keyword does not convey information about the desired context of the word, using the keyword “bank” may retrieve text about both a riverbank and a savings bank when only articles about a savings bank are desired.
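Both language problems, differing word choice and multiple word meanings, can be seen in a small sketch using the “bank” example. The helper `contains_keyword` and the three sample texts are hypothetical and serve only to illustrate the point.

```python
def contains_keyword(text, keyword):
    """Return True if the keyword appears as a whole word in the text."""
    return keyword.lower() in text.lower().split()


# Hypothetical sample texts, invented for illustration.
documents = {
    "river": "the boat drifted toward the bank of the river",
    "finance": "the bank approved the home loan",
    "synonym": "the savings and loan institution approved the mortgage",
}

hits = [doc_id for doc_id, text in documents.items()
        if contains_keyword(text, "bank")]
print(hits)  # ['river', 'finance']
```

The keyword retrieves the riverbank text, which is extraneous to a user interested in finance, and misses the text that describes the same kind of institution in different words.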
To overcome these and other problems in searching large databases, considerable research has been done in the area of statistical natural language processing, also referred to as “text mining.” This research has focused on the generation of simplified representations of documents to facilitate the ability to find desired information among a large number of documents.
One common simplification, called a “bag of words” representation, ignores the order of words within documents. Each document is represented as a vector of the words it contains, regardless of the order in which they occur. However, this approach loses all information about word context and word order. The ability to discriminate desired information based on context, and thus to respond appropriately to keyword searches, is therefore also lost.
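The loss of word order can be shown with a minimal sketch of a bag-of-words representation built on Python's standard `collections.Counter`. The function name `bag_of_words`, the whitespace tokenization, and the two sample sentences are assumptions chosen for illustration.

```python
from collections import Counter


def bag_of_words(text):
    """Count each word in the text, discarding the order of occurrence."""
    return Counter(text.lower().split())


a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")

# The two sentences mean different things but have identical bags of words.
print(a == b)  # True

# The same bag expressed as a vector over a shared, sorted vocabulary.
vocabulary = sorted(set(a) | set(b))
vector_a = [a[word] for word in vocabulary]
print(vocabulary)  # ['bit', 'dog', 'man', 'the']
print(vector_a)    # [1, 1, 1, 2]
```

Because both sentences map to the same vector, no search performed on this representation alone can tell them apart.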
Other language models have been developed that account for sequences of words. However, such models are quite specialized and their implementation can become complicated. Consequently, these models are not very useful for general text mining of large databases or the Internet.
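As one simple illustration of a sequence-aware representation, the sketch below counts adjacent word pairs (bigrams). The text above does not name a specific model, so the choice of bigrams, the function name `bigrams`, and the sample sentences are assumptions made only to show the trade-off.

```python
from collections import Counter


def bigrams(text):
    """Return the adjacent word pairs of the text, in order."""
    words = text.lower().split()
    return list(zip(words, words[1:]))


a = Counter(bigrams("the dog bit the man"))
b = Counter(bigrams("the man bit the dog"))

# Unlike a bag of single words, bigram counts separate the two sentences.
print(a == b)  # False

# The cost: a vocabulary of V distinct words allows up to V * V bigrams.
vocabulary_size = len({"the", "dog", "bit", "man"})
print(vocabulary_size ** 2)  # 16
```

Even this small step toward preserving order squares the number of possible features, which hints at why richer sequence models can become unwieldy on large collections.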
Thus, there is a need for improved techniques to assist in searching large databases. To this end, there is also a need for improvements in statistical natural language processing that overcome the disadvantages of search models used for general text modeling. Currently, models that account for word sequence to keep contextual information lose both search flexibility and the ability to quickly search large databases, while models that can easily and quickly search large databases lose contextual information. The need for such a system has heretofore remained unsatisfied.