1. Field of the Invention
The present invention relates to a method and system for searching for information in a data set, and, in particular, to methods and systems for syntactically indexing and searching data sets to achieve greater search result accuracy.
2. Background
It is often desirable to search large sets of data, such as collections of millions of documents, only some of which may pertain to the information being sought. In such instances it is difficult either to identify a subset of data to search or to search all of the data yet return only meaningful results. Several search techniques have been used to support searching large sets of data, none of which has been able to attain a high degree of accuracy of search results due to their inherent limitations.
One common technique is that implemented by traditional keyword search engines. Data is searched and results are generated based on matching one or more words or terms designated as a query. The results are returned because they contain a word or term that matches all or a portion of one or more keywords that were submitted to the search engine as the query. Some keyword search engines additionally support the use of modifiers, operators, or a control language that specifies how the keywords should be combined in a search. For example, a query might specify a date filter to be used to filter the returned results. In many traditional keyword search engines, the returned results are ordered by the number of matches found within the data. For example, a keyword search against Internet websites typically returns a list of sites that contain one or more of the submitted keywords, with the sites having the most matches appearing at the top of the list. Accuracy of search results in these systems is thus presumed to be associated with frequency of occurrence.
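The match-count ordering described above can be illustrated with a minimal sketch. It is not taken from any particular search engine: the function names `tokenize` and `keyword_search`, the sample documents, and the choice of exact whole-word matching (rather than partial-keyword matching) are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_search(documents, query):
    """Return (doc_id, match_count) pairs, most matches first.

    Hypothetical helper: a document is returned if it contains at least
    one query keyword, and it is ranked only by how often the keywords
    occur in it.
    """
    keywords = set(tokenize(query))
    scored = []
    for doc_id, text in documents.items():
        counts = Counter(tokenize(text))
        matches = sum(counts[k] for k in keywords)
        if matches > 0:
            scored.append((doc_id, matches))
    # Sort by descending match count; break ties by document id.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored


# Made-up sample data for illustration only.
documents = {
    "site_a": "Argentina imports machinery. Imports of machinery rose.",
    "site_b": "Machinery prices fell last year.",
    "site_c": "The weather was mild.",
}
results = keyword_search(documents, "imports machinery")
print(results)
```

In this sketch the ranking depends only on frequency of occurrence, so a document that repeats the keywords is ranked above one that mentions them once, regardless of what either document actually says.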
One drawback to traditional keyword search engines is that they do not return data that fails to match the submitted keywords, even though that data may be relevant. For example, if a user is searching for information on what products a particular country imports, data that refers to the country as a “customer” instead of using the term “import” would be missed if the submitted query specifies “import” as one of the keywords but does not specify the term “customer” (e.g., the sentence “Argentina is a customer of the Acme Company” would be missed). Ideally, a user would be able to submit a query in the form of a question and receive back a set of results that is accurate based on the meaning of the query, not just on the specific terms used to phrase the question.
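The missed-result drawback can be reproduced with a short sketch. It assumes, for illustration only, that the keywords of a query are combined so that every keyword must appear as a whole word in the text; the helper name `matches_all_keywords` is hypothetical.

```python
import re


def matches_all_keywords(text, keywords):
    """Return True only if every keyword occurs as a whole word in text.

    Assumption for illustration: keywords are combined with a logical AND
    and compared case-insensitively, with no synonym handling.
    """
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return all(k.lower() in tokens for k in keywords)


sentence = "Argentina is a customer of the Acme Company"

# The sentence is relevant to what Argentina imports, but it never uses
# the word "import", so a literal keyword match rejects it.
missed = not matches_all_keywords(sentence, ["Argentina", "import"])

# The same sentence is found only if the user happens to guess "customer".
found = matches_all_keywords(sentence, ["Argentina", "customer"])

print(missed, found)
```

The matcher has no notion that being a customer of a company implies importing from it, which is the gap between term matching and meaning that the paragraph above describes.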
Natural language parsing provides technology that attempts to understand and identify the syntactical structure of a language. Natural language parsers have been used to identify the part of speech of each term in a submitted sentence to support the use of sentences as natural language queries. They have also been used to identify text sentences in a document that follow a particular part-of-speech pattern; however, these techniques fall short of producing meaningful results when the documents do not follow such patterns. The probability of a sentence falling into a class of predefined sentence templates, or of a phrase occurring literally, is too low to provide meaningful results. Failure to account for semantic and syntactic variations across a data set, especially a heterogeneous data set, has led to disappointing results.