1. Technical Field
The disclosed embodiments are related to search technology and more particularly to natural language processing.
2. Background Information
Traditionally, computer programs have used a structured language for input. For example, a conventional search engine may parse Boolean-style syntax in a search query. The search query “college OR university” may return results with “college,” results with “university,” or results with both, while the search query “college XOR university” may return results with “college” or results with “university,” but not results with both.
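The difference between the two operators can be illustrated with a minimal sketch. The function names `matches` and `search`, the sample documents, and the whitespace-based tokenization are assumptions made purely for illustration and do not describe any particular search engine.

```python
def matches(doc: str, left: str, op: str, right: str) -> bool:
    """Evaluate a two-term Boolean query against a single document."""
    words = set(doc.lower().split())  # naive whitespace tokenization (assumption)
    has_left = left in words
    has_right = right in words
    if op == "OR":
        # Inclusive: either term, or both.
        return has_left or has_right
    if op == "XOR":
        # Exclusive: exactly one of the two terms.
        return has_left != has_right
    raise ValueError(f"unsupported operator: {op}")


def search(docs: list[str], left: str, op: str, right: str) -> list[str]:
    """Return the documents that satisfy the query, in their original order."""
    return [d for d in docs if matches(d, left, op, right)]


# Hypothetical documents, invented for this example.
DOCS = [
    "college admissions guide",
    "university rankings",
    "college and university directory",
    "high school calendar",
]

or_results = search(DOCS, "college", "OR", "university")
xor_results = search(DOCS, "college", "XOR", "university")
print(or_results)   # the first three documents
print(xor_results)  # only the first two documents
```

In this sketch the document containing both terms is returned for the OR query but excluded for the XOR query, which is the behavior described above.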
More recently, there has been an effort to develop systems for natural language processing (NLP) to allow input using a natural language. As natural language systems become ubiquitous, users of such systems expect progressively higher quality in their conversational interfaces. Because users utilize these systems for real-time information retrieval, there is also a significant performance requirement, particularly for systems deployed at web scale.
Most current NLP solutions are therefore implemented with machine learning techniques, which are run over large corpora of training data, both for general queries and for domain-specific queries. (If domain-specific corpora are used, categorizers can be trained to first detect the domain of the query, and then interpret the query based on the domain.) This creates two specific problems. First, adding natural language solutions to a particular domain requires a large data science team with access to large sets of historical queries. This makes NLP interfaces very exclusive: they are offered only by providers with such data teams, and only for domains with large corpora.
The situation described above has two distinct “long tails.” There is a long tail for groups with private data or small data sets that cannot afford the efforts of a full data science team. Any group without an existing corpus, either because the data and queries are private or because utilization is too low, will therefore be excluded from this potentially rich interface. There is a similar long tail for queries and constructions with frequencies so low that they are not captured by such techniques. With domains that are progressively more complex (and thus require more precise understanding), such as queries over a relational database system, the percentage of queries that fall into this long tail goes up precipitously.
To give an example in a domain such as email search, consider the variations of semantically identical ways to say “email from john”:
“email from john”
“email john received”
“john's email”
“email received by john”
“email that john got”
“email that was received by john”
As these phrases become more complex, they become more awkward, but they are still obviously semantically identical to the most common base: “email from john”. However, techniques that attempt to identify a person name along with an email object (and ignore the prepositions, verbs, and other supporting words) will be confounded by cases where a recipient and a sender are both specified, particularly in situations where the instances do not directly appear in a corpus.
One possible solution is defaulting to a “From” interpretation, which might seem a good tradeoff in a situation where a person name in an email query specifies “From” semantics 95% of the time. However, in the cases where “To” semantics are explicitly specified, such a system would be wrong 100% of the time, and such technology would not be extensible to other domains with less skewed semantics. As the bar for conversational interfaces rises, this becomes a less acceptable tradeoff.
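The failure mode described above can be made concrete with a small sketch. The `KNOWN_PEOPLE` set, the `naive_interpret` function, and the sample queries are hypothetical and are introduced only to illustrate a technique that finds person names, ignores the supporting words, and defaults every name to a “From” interpretation. They are not taken from any actual system.

```python
import re

# Hypothetical list of recognizable person names (assumption for illustration).
KNOWN_PEOPLE = {"john", "mary"}


def naive_interpret(query: str) -> list[tuple[str, str]]:
    """Find person names, ignore prepositions and verbs, default to 'From'."""
    tokens = re.findall(r"[a-z]+", query.lower())
    people = [t for t in tokens if t in KNOWN_PEOPLE]
    # Every name receives the same default role, regardless of context.
    return [("From", person) for person in people]


# Correct when the default happens to match the intended meaning.
print(naive_interpret("email from john"))

# Wrong whenever "To" is explicitly specified: the preposition is discarded.
print(naive_interpret("email to john"))

# Confounded when a sender and a recipient are both specified:
# both names are labeled "From".
print(naive_interpret("email from john to mary"))
```

Under these assumptions, the second and third queries are always misinterpreted, regardless of how often the default is right on average.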
Thus, it would be beneficial to gain comprehensive and exhaustive grammatical (and even not-quite-grammatical) coverage over domains with sparse or non-existent corpora. Solving this problem would extend the capability of NLP systems to domains where corpora are non-existent, too sparse, or too difficult to obtain or process.