Often, it is desirable to search large sets of data, such as collections of millions of documents, only some of which may pertain to the information being sought. In such instances, it is difficult either to identify a subset of data to search or to search all of the data yet return only meaningful results. The techniques that have traditionally been applied to support searching large sets of data have fallen short of expectations because inherent limitations have prevented them from achieving a high degree of accuracy in search results.
One common technique, implemented by traditional keyword search engines, matches words expected to be found in a set of documents through pattern matching techniques. Thus, the more that is known in advance about the documents, including their content, format, layout, etc., the better the search terms that can be provided to elicit a more accurate result. Data is searched and results are generated based on matching one or more words or terms that are designated as a query. Results, such as documents, are returned when they contain a word or term that matches all or a portion of one or more keywords that were submitted to the search engine as the query. Some keyword search engines additionally support the use of modifiers, operators, or a control language that specifies how the keywords should be combined when performing a search. For example, a query might specify a date filter to be used to filter the returned results. In many traditional keyword search engines, the results are returned in an order based on the number of matches found within the data. For example, a keyword search against Internet websites typically returns a list of sites that contain one or more of the submitted keywords, with the sites having the most matches appearing at the top of the list. Accuracy of search results in these systems is thus presumed to be associated with frequency of occurrence.
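The keyword matching and match-count ordering described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of any particular search engine: the function names `tokenize` and `keyword_search`, the exact-token matching rule, and the sample documents are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def keyword_search(query, documents):
    """Return (doc_id, match_count) pairs, most matches first.

    Hypothetical helper: a document is returned only if it contains at
    least one query keyword as an exact token; partial-word matching,
    operators, and filters are deliberately left out.
    """
    keywords = set(tokenize(query))
    scored = []
    for doc_id, text in documents.items():
        counts = Counter(tokenize(text))
        matches = sum(counts[k] for k in keywords)
        if matches > 0:
            scored.append((doc_id, matches))
    # Order by descending match count; break ties by document id.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored


# Made-up sample documents for illustration only.
documents = {
    "a": "Gas imports rose while other imports fell.",
    "b": "Gas prices were stable.",
    "c": "The weather was mild.",
}
results = keyword_search("gas imports", documents)
print(results)
```

Document "c" is never returned because it shares no token with the query, and document "a" ranks first purely because it contains more matching tokens, which is the frequency-based notion of accuracy noted above.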
One drawback to traditional keyword search engines is that they do not return data that fails to match the submitted keywords, even though the data may be relevant. For example, if a user is searching for information on what products a particular country imports, data that refers to the country as a “customer” instead of using the term “import” would be missed if the submitted query specifies “import” as one of the keywords but does not specify the term “customer.” Thus, a sentence such as “Argentina has been the main customer for Bolivia's natural gas” would be missed, because no forms of the word “import” are present in the sentence. Ideally, a user would be able to submit a query and receive back a set of results that were accurate based on the meaning of the query, not just on the specific keywords used in submitting the query.
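The missed sentence in this example can be reproduced with a short Python sketch. The helper `matches_keyword` is hypothetical, and its prefix test is only a crude stand-in for matching forms of a word (it would also match unrelated words such as "important").

```python
import re


def matches_keyword(sentence, keyword):
    """Return True if any word in the sentence begins with the keyword.

    Hypothetical helper: prefix matching is a rough approximation of
    matching word forms such as "imports" or "imported".
    """
    words = re.findall(r"[a-z']+", sentence.lower())
    return any(word.startswith(keyword.lower()) for word in words)


sentence = "Argentina has been the main customer for Bolivia's natural gas"

found_import = matches_keyword(sentence, "import")
found_customer = matches_keyword(sentence, "customer")
print(found_import, found_customer)
```

The sentence is retrieved only when the query happens to contain "customer", even though it answers a question about what Argentina imports.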
Natural language parsing provides technology that attempts to understand and identify the syntactical structure of a language. Natural language parsers (“NLPs”) have been used to identify the parts of speech of each term in a submitted sentence to support the use of sentences as natural language queries against data. However, systems that have used NLPs to parse and process queries against data, even when the data is highly structured, suffer from severe performance problems and extensive storage requirements.
Natural language parsing techniques have also been applied to extracting and indexing information from large corpora of documents. By their nature, such systems are highly inefficient in that they require excessive storage and intensive computer processing power. The ultimate challenge with such systems has been to find solutions that reduce these inefficiencies in order to create viable consumer products. Several systems have sought to reduce inefficiencies by limiting the amount of information that is extracted and subsequently retained as structured data (that is, by extracting only a portion of the available information). For example, NLPs have been used with Information Extraction engines that extract particular information from documents that follow predetermined grammar rules, or when a predefined term or rule is recognized, in the hope of capturing and providing a structured view of potentially relevant information for the kinds of searches that are expected on that particular corpus. Such systems typically identify text sentences in a document that follow a particular part-of-speech pattern or other patterns inherent in the document domain, such as “trigger” terms that are expected to appear when particular types of events are present. The trigger terms serve as “triggers” for detecting such events. Other systems may use other formulations for specified patterns to be recognized in the data set, such as predefined sets of events or other types of descriptions of events or relationships based upon predefined rules, templates, etc. that identify the information to be extracted. However, these techniques may fall short of being able to produce meaningful results when the documents do not follow the specified patterns or when the rules or templates are difficult to generate.
The probability of a sentence falling into a class of predefined sentence templates or the probability of a phrase occurring literally is sometimes too low to produce the desired level of recall. Failure to account for semantic and syntactic variations across a data set, especially heterogeneous data sets, has led to inconsistent results in some situations.
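The trigger-term style of extraction described above, and the recall problem it creates, can be illustrated with a minimal Python sketch. It is not drawn from any particular Information Extraction engine: the `TRIGGERS` table, the event type names, the `extract_events` function, and the sample text are all assumptions chosen for the example.

```python
import re

# Hypothetical trigger terms mapped to the event type they signal.
TRIGGERS = {
    "acquired": "acquisition",
    "imports": "trade",
    "imported": "trade",
}


def extract_events(text):
    """Return one record per (sentence, trigger) pair found in the text.

    Only sentences containing a predefined trigger term are retained;
    everything else is discarded, which keeps storage small but loses
    any sentence that is phrased differently.
    """
    events = []
    # Naive sentence split on terminal punctuation (a simplifying assumption).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    for sentence in sentences:
        words = set(re.findall(r"[a-z']+", sentence.lower()))
        for trigger, event_type in TRIGGERS.items():
            if trigger in words:
                events.append(
                    {"type": event_type, "trigger": trigger, "sentence": sentence}
                )
    return events


text = (
    "Bolivia imports machinery. "
    "Argentina has been the main customer for Bolivia's natural gas."
)
events = extract_events(text)
for event in events:
    print(event["type"], "<-", event["sentence"])
```

Only the first sentence yields a record. The second sentence describes the same kind of trade relationship but contains no trigger term, so it is never extracted and cannot be found by a later search of the structured data.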