1. Field of Invention
This invention relates to systems and methods for analyzing strings over a language.
2. Description of Related Art
There are many conventional systems and methods for storing and indexing each level of the inner structure of a string over a language having a vocabulary and a grammar. The most familiar are the various “search engines” that are available for use on the Internet. Conventional search engines typically allow a user to input a search string. The search string is then compared against the text of an entire document or a summary of its contents. Unfortunately, accurately searching the entire text of a large number of documents is extremely resource intensive. Therefore, conventional search engines are limited by two major design constraints: accuracy and speed. As a result, for example, many search engines available for searching Internet web pages typically return far fewer than all of the possible web pages that match the search string, in order to cut down on the time required for the search.
Equally disadvantageous, many of the conventional search engines return results in which only the vocabulary of the search string is matched. Typically, the conventional search engine returns documents containing the same words or string of words; however, the grammatical relationship between the words in the search string is ignored. As a result, many documents returned as a match may contain a random combination of the terms in the search string, but in a totally unrelated context.
Some search engines have tried to preserve a crude representation of the grammatical relationships in the search string by returning only those documents in which the words of the search string are separated by no more than a user-defined number of words, for instance, ten words. However, this approach requires complex operators within the search string to define the acceptable distance between words of the search string in the document. Furthermore, it does not preserve the actual grammatical relationship between the words in the search string, but only very roughly approximates a grammatical relationship based on the proximity of the words.
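The limitation of such proximity searching can be illustrated with a minimal sketch. The function name `proximity_match`, the parameter `max_gap`, and the sample documents are hypothetical and chosen only for illustration; they do not correspond to any particular search engine. The sketch assumes whitespace tokenization and treats a document as a match when every search term occurs in a chain of hits in which adjacent hits are separated by no more than `max_gap` intervening words.

```python
def proximity_match(text, terms, max_gap):
    """Return True if all terms occur in text, with at most max_gap
    other words between each pair of adjacent term occurrences."""
    tokens = text.lower().split()          # assumption: whitespace tokenization
    wanted = {t.lower() for t in terms}
    seen = set()                           # terms found in the current chain
    last_pos = None                        # position of the previous hit
    for pos, tok in enumerate(tokens):
        if tok not in wanted:
            continue
        # Start a new chain when the previous hit is too far away.
        if last_pos is not None and pos - last_pos - 1 > max_gap:
            seen = set()
        seen.add(tok)
        last_pos = pos
        if seen == wanted:
            return True
    return False


doc_a = "the dog chased the cat"
doc_b = "the cat chased the dog"
doc_c = "the dog slept all day while far across town a cat watched birds"

match_a = proximity_match(doc_a, ["dog", "cat"], max_gap=3)
match_b = proximity_match(doc_b, ["dog", "cat"], max_gap=3)
match_c = proximity_match(doc_c, ["dog", "cat"], max_gap=3)
```

In this sketch, `doc_a` and `doc_b` both match even though the roles of "dog" and "cat" are reversed, while `doc_c` is rejected purely on distance. Proximity therefore filters on word spacing and carries no information about which word is the subject and which is the object.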
Other conventional and somewhat more complex indexing systems use “tokens” to define an axis in a multi-vector environment. An example of one such system is the SMART system. In SMART, a “token” is a single word or multi-word expression, and each token defines an axis in a multi-vector environment. A document is first decomposed into a list of its tokens. The tokens are then used to define a vector that specifies the position of that document in the multi-axis representation. Each non-zero value of a vector corresponds to a token that is actually present in the document. A value of a vector is computed on the basis of the number of times a token occurs in a document compared to the number of times that token occurs in all of the indexed documents. When a query is matched against a set of documents during a search, a specific vector is computed for the query. Then, the cosine of the angle between the query's vector and each of the document vectors is computed as an approximation of the proximity between the query and the document.
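The vector comparison described above can be sketched as follows. This is not the SMART implementation; it is a minimal illustration under stated assumptions. Tokens are taken to be single whitespace-separated words, each weight is assumed to be the simple ratio of a token's count in the document to its count across all indexed documents, and the names `build_index`, `weigh`, and `cosine` are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: a token is a single lowercase word.
    return text.lower().split()


def build_index(documents):
    """Count how often each token occurs across all indexed documents."""
    totals = Counter()
    for doc in documents:
        totals.update(tokenize(doc))
    return totals


def weigh(text, totals):
    """Map each token to (count in text) / (count in all documents).
    Tokens absent from the index are skipped (an assumption)."""
    counts = Counter(tokenize(text))
    return {t: c / totals[t] for t, c in counts.items() if t in totals}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


documents = [
    "the dog bit the man",
    "the man bit the dog",
    "stocks fell sharply today",
]
totals = build_index(documents)
doc_vectors = [weigh(doc, totals) for doc in documents]
query = weigh("dog bit man", totals)
scores = [cosine(query, vec) for vec in doc_vectors]
```

Under these assumptions, the first two documents receive the same score for the query because they contain the same tokens the same number of times, even though they describe opposite events. The third document shares no token with the query and scores zero.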
The primary disadvantage of these vector-based systems is that the comparison approximating the similarity between a query and documents is made through a global calculus. Once that global calculus is made, each document that was determined to be similar must be individually reanalyzed to determine which sub-parts match. Furthermore, the vector systems do not maintain the grammatical relationship between the words of a document, but rely on token phrases to approximate grammatical relationships. Frequently, these token phrases do not accurately represent the specific grammatical relationship of the search string.