1. The Field of the Present Invention
The present invention relates generally to an apparatus, system, and method for a semantic editing and search engine.
2. General Background
Traditional keyword-based information retrieval (“search engine”) applications are widely used for retrieving documents in large document collections. Information retrieval applications create indexes that record the terms that occur in the documents. Additional metadata, such as the locations of these terms in the documents and document categories, may also be stored with the documents. Information retrieval applications then match user query terms against these indexes and rank the resulting matches to provide a list of documents that best match the user's query.
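The indexing and term-matching process described above can be sketched as follows. This is a minimal illustration only: the document texts, identifiers, and the simple AND-match are assumptions for demonstration, and a production search engine would add stemming, relevance ranking, and metadata such as term positions.

```python
from collections import defaultdict

# Illustrative document collection (contents are assumptions for this sketch).
documents = {
    "doc1": "Lincoln was assassinated on April 14, 1865",
    "doc2": "Lincoln delivered the Gettysburg Address in 1863",
}

# Build an inverted index: each term maps to the set of documents containing it.
index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.lower().split():
        index[term.strip(",.")].add(doc_id)

def search(query):
    """Return the documents containing every query term (simple AND match)."""
    term_sets = [index.get(t.lower(), set()) for t in query.split()]
    return set.intersection(*term_sets) if term_sets else set()

print(search("Lincoln assassinated"))  # -> {'doc1'}
```

In practice, matches would then be ranked (for example by term frequency) rather than returned as an unordered set.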
The resulting set of documents may also be filtered or ranked based on additional criteria such as the categories or term locations. Further refinements to query analysis, for instance, determining what kind of information the user's query is about, can be used to filter or rank retrieval results, or even to identify the passage that is most relevant to the user's information need. For example, a linguistic pattern underlying the query “when was Lincoln assassinated” could be matched against the text of an article on Abraham Lincoln that states that “Lincoln was assassinated on Apr. 14, 1865”. Various linguistic enhancements to information retrieval, commonly described as “question answering” technologies (see the Wikipedia article, http://en.wikipedia.org/wiki/Question_answering, for more details), have been developed. These systems typically analyze the structure and content of queries and retrieve the best matching results from a data store of similarly structured data. This data store may be a repository of organized facts or structured data extracted from document collections using the same techniques as used to analyze the queries (or a combination of both).
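The “when was Lincoln assassinated” example above can be illustrated with a small sketch. The single regular-expression pattern and the sample passage are assumptions for demonstration; real question-answering systems use far richer linguistic analysis than one hard-coded pattern.

```python
import re

# Sample passage taken from the example in the text.
passage = "Lincoln was assassinated on Apr. 14, 1865."

def answer_when_assassinated(query, text):
    """Map a 'when was X assassinated' query to a date found in the text."""
    m = re.match(r"when was (\w+) assassinated", query, re.IGNORECASE)
    if not m:
        return None  # query does not fit this linguistic pattern
    person = m.group(1)
    # Look for "<person> was assassinated on <date>" in the passage.
    hit = re.search(
        rf"{person} was assassinated on ([\w.]+ \d+, \d+)", text, re.IGNORECASE
    )
    return hit.group(1) if hit else None

print(answer_when_assassinated("when was Lincoln assassinated", passage))
# -> Apr. 14, 1865
```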
Information retrieval and related question answering technologies are sometimes used to find information in customer or technical support knowledge bases, especially if these data are constructed as documents. Alternatively, uncomplicated, specialized answer-specific knowledge bases are sometimes created specifically to answer important user questions (such as “frequently asked questions” or “FAQs”).
These natural language processing (“NLP”) and FAQ-based approaches to finding information in answer-specific knowledge bases have well-understood weaknesses. Information retrieval approaches allow the user to find documents, but their answers are often imprecise. If a frequent word is used in a query, too many documents are returned. Sometimes no answer is returned at all. If a term not found in the document collection is used, no relevant documents may be found. Finally, if a term is ambiguous, irrelevant documents may be returned.
Question-answering systems use complex and not entirely reliable NLP techniques. These techniques do not always extract relevant or useful data from document collections, and do not always correctly analyze queries. They require considerable specialist expertise to develop; are computationally demanding; and nonetheless remain fragile, unreliable, and very difficult to tune or adapt.
FAQs are a simple approach to supplying authoritative answers to users' questions. They can often provide precise answers to user questions; however, FAQs are typically searched as if they were just documents. In other words, the FAQ-based approach is often just information retrieval performed against written text organized around a relatively small number of user questions.
However, the most significant weakness of all these approaches is their lack of feedback: the queries users make and the answers to these queries are not exploited or stored for future use. For instance, if a user's query does not match the terminology in the document base and nothing useful is retrieved, there is no recourse in these approaches. There is no mediating agent to interpret the query, identify answers for the query, and record these for future use.
This is all the more remarkable since it is well recognized in information retrieval and question answering that many identical (or essentially identical) queries are submitted repeatedly. Moreover, the distribution of queries is Zipfian. (A Zipfian distribution is a highly skewed distribution in which a small number of very frequent queries accounts for much of the frequency mass of all queries, while a very large number of queries occur only rarely.) Each common query (and its synonymic or near-synonymic variants) has associated answers and semantic categories. Being able to record and reuse these queries, answers and categories in a semantic search engine means that queries can be answered consistently with the very best answers. Since the most frequent queries appear early in the query stream, the vast majority of queries are dealt with quickly. In short order, the knowledge work shifts to handling infrequent and more complicated queries that are not handled well by automation.
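The recording and reuse of queries and their answers can be sketched as a simple stored mapping. The function names and the whitespace-and-case normalization step are assumptions for illustration; an actual semantic search engine would also map synonymic and near-synonymic variants onto a shared canonical query.

```python
# Store mapping canonical query forms to their curated best answers.
answer_store = {}

def normalize(query):
    """Crude canonical form so essentially identical queries share one entry."""
    return " ".join(query.lower().split())

def record_answer(query, answer):
    """Record the best answer for a query for future reuse."""
    answer_store[normalize(query)] = answer

def lookup(query):
    """Return a previously recorded answer, or None if the query is new."""
    return answer_store.get(normalize(query))

record_answer("When was Lincoln assassinated?", "Apr. 14, 1865")

# A repeated, essentially identical query reuses the stored best answer.
print(lookup("when was  lincoln assassinated?"))  # -> Apr. 14, 1865
```

Because the query distribution is Zipfian, even a small store of this kind covers a large share of incoming queries, leaving only the infrequent long-tail queries for manual handling.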
These frequency considerations apply also to user queries in customer and technical support knowledge bases. The described mapping of queries and their variants to answers is highly desirable here because customer and technical support personnel need to have at their fingertips the most appropriate, timely and up-to-date responses to users' questions.