Field of the Subject Disclosure
The subject disclosure relates to natural language processing. Specifically, the subject disclosure relates to systems and methods for processing documents and queries using natural language processing to construct tuples, wherein each part of a tuple is linked to an entity of a given knowledge base.
Background of the Subject Disclosure
Natural language has been used for thousands of years to store and transfer knowledge between human beings. Documents in natural languages are the most important and popular source for information retrieval systems and search engines. In such systems, natural-language queries may be a more user-friendly way to search for information than keyword-based queries employed in current search engines.
Several methods have been developed for extracting meaning from natural-language queries and documents. One of the most popular methods is Latent Semantic Indexing. A natural-language document is analyzed to extract its main keywords. Each keyword is reduced to its root form and weighted by a statistical measure, e.g., term frequency/inverse document frequency (TF/IDF). The resulting vector of weighted keywords represents the document in downstream applications. In a search engine, for instance, documents whose keywords match the queried keywords can be returned as search results. In an information retrieval system, the similarity of two documents is represented by the distance between their representative vectors. Despite the wide usage of Latent Semantic Indexing, this method discards or fails to consider several meaningful features of the analyzed document.
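The keyword weighting and vector comparison described above can be illustrated with a minimal sketch. It assumes the documents have already been tokenized and reduced to root forms, uses one common TF/IDF variant (relative term frequency multiplied by the logarithm of the inverse document frequency), and compares vectors by cosine similarity. The function names and sample documents are hypothetical and are not taken from the subject disclosure; the sketch covers only the keyword weighting step and does not perform any dimensionality reduction.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists; tokens are assumed to be
    already reduced to their root forms (an assumption of this sketch).
    """
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample documents, already tokenized and stemmed.
docs = [
    ["engine", "drive", "wheel"],
    ["engine", "drive", "shaft"],
    ["query", "search", "document"],
]
vecs = tf_idf_vectors(docs)
```

In this sketch the first two documents share keywords and therefore score a higher similarity with each other than either does with the third. Word order and sentence structure play no part in the comparison, which is one example of the information such a representation discards.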
In another approach, Natural Language Processing (NLP) has been used to extract more syntactic information from natural-language documents. Each sentence of a natural-language document is parsed and linguistically processed to extract Subject-Action-Object (SAO) triples or extended SAO (eSAO) tuples. Each part of an SAO triple or an eSAO tuple may be a text phrase. However, this method fails to address the complexity of linguistics, such as nested clauses within a sentence.
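As a purely illustrative sketch of the SAO representation described above, the following shows a triple whose parts are plain text phrases. The class name, field names, and example sentences are hypothetical; no parser is shown, and the triples are written by hand to show the shape of the data only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SAOTriple:
    """A Subject-Action-Object triple whose parts are plain text phrases."""
    subject: str
    action: str
    obj: str


# "The pump moves the fluid."
simple = SAOTriple("the pump", "moves", "the fluid")

# "The engineer said that the pump moves the fluid."
# The nested clause ends up as an opaque phrase in the object slot,
# so its own subject, action, and object are not represented.
nested = SAOTriple("the engineer", "said", "that the pump moves the fluid")
```

Because each part is only a text phrase, the inner clause in the second example is stored as a single string, which illustrates the difficulty with nested clauses noted above.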