The present application relates generally to an improved data processing apparatus and method and, more specifically, to mechanisms for performing natural language processing using a transaction-based knowledge representation.
Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is often involved with natural language understanding, i.e., enabling computers to derive meaning from human or natural language input, and with natural language generation.
NLP mechanisms generally perform one or more types of lexical or dependency parsing analysis, including morphological analysis, syntactical analysis or parsing, semantic analysis, pragmatic analysis, or other types of analysis directed to understanding textual content. In morphological analysis, the NLP mechanisms analyze individual words and punctuation to determine the part of speech associated with the words. In syntactical analysis or parsing, the NLP mechanisms determine the sentence constituents and the hierarchical sentence structure using word order, number agreement, case agreement, and/or grammars. In semantic analysis, the NLP mechanisms determine the meaning of the sentence from clues extracted from the textual content. Because many sentences are ambiguous, the NLP mechanisms may look to the specific actions being performed on specific objects within the textual content. Finally, in pragmatic analysis, the NLP mechanisms determine the actual meaning and intention of the text in context (e.g., the context of the speaker or of the previous sentence). These are only some aspects of NLP mechanisms. Many different types of NLP mechanisms exist that perform various types of analysis in an attempt to convert natural language input into a machine-understandable set of data.
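By way of illustration only, the morphological and syntactical stages described above may be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any particular NLP mechanism: the lexicon `LEXICON`, the tag names, and the functions `morphological_analysis` and `syntactic_analysis` are hypothetical and introduced solely for this example, and the number agreement check stands in for a full parser.

```python
# Hypothetical toy lexicon: word -> (part of speech, grammatical number).
# Real systems use far larger lexicons or learned models; this is an assumption.
LEXICON = {
    "the": ("DET", None),
    "dog": ("NOUN", "sg"),
    "dogs": ("NOUN", "pl"),
    "barks": ("VERB", "sg"),
    "bark": ("VERB", "pl"),
}


def morphological_analysis(sentence):
    """Split the sentence into words and look up each word's tag and number."""
    tokens = sentence.lower().rstrip(".").split()
    # Unknown words receive the placeholder tag "UNK" and no number.
    return [(tok, *LEXICON.get(tok, ("UNK", None))) for tok in tokens]


def syntactic_analysis(tagged):
    """Take the first noun and the first verb and check number agreement."""
    subject = next((t for t in tagged if t[1] == "NOUN"), None)
    verb = next((t for t in tagged if t[1] == "VERB"), None)
    if subject is None or verb is None:
        return None  # no subject/verb pair could be identified
    return {
        "subject": subject[0],
        "verb": verb[0],
        "agrees": subject[2] == verb[2],
    }


if __name__ == "__main__":
    tagged = morphological_analysis("The dog barks.")
    print(tagged)
    print(syntactic_analysis(tagged))
```

In this sketch, the output of the morphological stage (words with parts of speech) is the input to the syntactical stage, which mirrors the layering of analyses described above. Semantic and pragmatic analysis are omitted because they depend on context that a toy lexicon cannot capture.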
Modern NLP algorithms are based on machine learning, especially statistical machine learning. The machine-learning paradigm differs from that of most prior attempts at language processing. Prior implementations of language-processing tasks typically involved the direct hand coding of large sets of rules, whereas the machine-learning paradigm calls instead for using general learning algorithms (often, although not always, grounded in statistical inference) to automatically learn such rules through the analysis of large corpora of typical real-world examples. A corpus (plural, “corpora”) is a set of documents (or sometimes, individual sentences) that have been hand-annotated with the correct values to be learned.
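The contrast between hand-coded rules and rules learned from an annotated corpus may be illustrated with the following minimal sketch. It is an assumption-laden example rather than a representative modern algorithm: the three-sentence `CORPUS`, the tag names, and the functions `rule_based_tag`, `train`, and `learned_tag` are hypothetical, and the learner is a simple most-frequent-tag baseline chosen for brevity.

```python
from collections import Counter, defaultdict

# Hypothetical hand-annotated corpus: each sentence is a list of
# (word, correct tag) pairs, i.e., the "correct values to be learned".
CORPUS = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]


def rule_based_tag(word):
    """Hand-coded approach: a person writes each rule explicitly."""
    if word in {"the", "a"}:
        return "DET"
    if word.endswith("s"):
        return "VERB"
    return "NOUN"


def train(corpus):
    """Learned approach: count how often each tag was annotated for each
    word, then keep the most frequent tag per word."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def learned_tag(model, word, default="NOUN"):
    """Tag a word with the learned model; unseen words get a default tag."""
    return model.get(word, default)


if __name__ == "__main__":
    model = train(CORPUS)
    for w in ["the", "dog", "sleeps"]:
        print(w, rule_based_tag(w), learned_tag(model, w))
```

Here the learned model contains no hand-written linguistic rules; its behavior is derived entirely from the annotations, so enlarging or correcting the corpus changes the tagger without changing the code.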