Rapid hardware and software advances have made it possible to process massive amounts of data since the 1980s. However, these advances work well only when processing “well-described” and “well-structured” data.
Software that reads and “comprehends” complex, free-form, unstructured data has been researched extensively since the advent of computers in the 1960s. This area has been a subset of Artificial Intelligence research referred to as NLP (Natural Language Processing) and NLI (Natural Language Inference).
More recently, researchers have pursued “The Semantic Web” (dubbed “Web 3.0”), which would use the distributed global data sharing capabilities of the Internet for meaningful understanding and usage of widespread information. That goal is as yet unfulfilled, because it requires information to be published with additional, complicated technical information.
Gartner, a leading IT analyst firm, has estimated that 40 Exabytes (4×10^19 bytes) of new information was generated in 2009, with 80% of that being unstructured, making it hard for current computers to understand and use. The raw/unstructured information on the Internet is expected to keep growing at a very fast pace for the foreseeable future. Much of it requires human processing because computers cannot comprehend the meaning imparted by text.
An intelligent computer program that applies advances in comprehension (understanding of meaning) to process this huge amount of text data would provide benefits in finding, using, restructuring, auditing and saving important information in a context-specific and reason-specific manner.
The current state of the art in automated text processing is limited to techniques such as:
(1) keyword matching for searches;
(2) synonym matching;
(3) part-of-speech (syntax) determination—tagging verbs, nouns, adjectives, etc.;
(4) taxonomy, which may be considered categorizing text topics in one or more hierarchies for easier retrieval;
(5) ontology, which may be considered the determination of concepts mentioned in the text and creating a concept link map;
(6) proximity-based heuristic or probabilistic interpretation of “logic” operators, i.e., “not,” “and,” “or,” which generally does not work well with complex sentences; and
(7) various combinations of the well-known NLP lexical analysis approaches, e.g., tokenization, sentence boundary detection, abbreviation expansion, normalization, part-of-speech (POS) tagging, noun phrase extraction, concept extraction, named entity recognition, relation extraction, quantifier detection and anaphora resolution.
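As an illustration of item (6), the following minimal Python sketch shows one way a proximity-based heuristic for the operator “not” might be written, and how it can misfire on a more complex sentence. The function names, the negator list and the window size are hypothetical choices made for this illustration only; they do not describe any particular existing system.

```python
import re

# Hypothetical list of negation words used only for this illustration.
NEGATORS = {"not", "no", "never"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def negated_by_proximity(text, target, window=5):
    """Guess that `target` is negated if a negator occurs within
    `window` tokens before its first occurrence (no syntax is used)."""
    tokens = tokenize(text)
    if target not in tokens:
        raise ValueError(f"{target!r} does not occur in the text")
    i = tokens.index(target)
    preceding = tokens[max(0, i - window):i]
    return any(tok in NEGATORS for tok in preceding)


simple_sentence = "The patient does not have a fever."
complex_sentence = "It is not the fever but the cough that worries the doctor."

# Simple sentence: the heuristic matches the intended meaning.
fever_negated = negated_by_proximity(simple_sentence, "fever")

# Complex sentence: "cough" is asserted by the author, yet the heuristic
# flags it as negated because "not" happens to fall inside the window.
cough_negated = negated_by_proximity(complex_sentence, "cough")

print(fever_negated, cough_negated)
```

In the second sentence the author negates “fever” and asserts “cough,” but the window-based rule treats both words the same way, which is the kind of failure on complex sentences noted in item (6).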
Many universities and corporations have devoted significant effort to building intelligent software that can provide “semantic” or “meaning-based” language processing capabilities with trained-user-like accuracy, to no avail so far. Currently the human brain can easily surpass even the most sophisticated computer in “understanding” the meaning of even medium-complexity text in a natural language.
There are at least eight basic deficiencies in today's natural language processing. Examples of these deficiencies include:
(1) Computers today are unable to link the meanings of the various words in a complex sentence to each other so as to derive the complete, multi-faceted meaning of the sentence in an accurate and reliable manner. Part of this inability lies in the complexity involved in “disambiguating” the correct meaning out of multiple possible meanings of a word or phrase and “linking” that meaning to the correct other words in the sentence. Disambiguating, or more specifically, disambiguation of text, involves the ability to determine the author-intended meaning of a word or phrase among all possible meanings of that word or phrase. Linking, or more specifically, linking of text, involves, for each word or phrase in a sentence, the ability to determine the other words and phrases in the current, prior or following sentences to which the word or phrase relates as intended by the author/originator of the text, and to determine the type of author-intended semantic (i.e., intended-meaning) relationships among such words. This feature is referred to herein as “Advanced Disambiguation and Linking”;
(2) Computers today have no framework for semantic placement of meaning of words, phrases and sentences, so that various combinations of meanings can be compared with each other to understand complex references to already mentioned objects and to logically answer questions, e.g., a computer does not know that “the hill to the left of the river” may also be referred to as “the hill having the river to its right” in a subsequent sentence. Computers today also do not understand conditionality in natural language, e.g., IF, ELSE, OTHERWISE, ONLY IF, etc. This feature is referred to herein as “Logical Comprehension”;
(3) Computers today cannot derive even simple real-life inferences that humans take for granted, e.g., there is no automatic way for a computer to know that when a car travels faster, it takes less time to reach its destination. This feature is referred to herein as “Inference from Laws of Nature” or “Common Sense”;
(4) Computers today cannot understand or comprehend the meaning of multiple complex sentences (referred to as “discourse intelligence” in linguistics) and apply it to answer complex questions in an accurate and reliable manner. This feature is referred to herein by its common linguistics name of “Discourse Intelligence”;
(5) Computers today cannot “learn” new (i.e., not previously encountered) words and phrases, understand them and start using them to answer questions. For example, a computer that has not encountered the word “Alaska” does not know that Alaska is a state in the United States of America, nor does it know Alaska's other descriptive attributes. A human can readily look up the definition of the word “Alaska” in a dictionary and use that additional information for a useful purpose, but a computer cannot absorb and use new words from their definitions in electronic dictionaries. This feature is referred to herein as “Vocabulary Learning”;
(6) Computers today cannot “learn” new (i.e., not previously encountered) descriptive information about words and phrases that they do know. For example, a computer that does know the phrase “Lunar Eclipse” cannot look up additional information about “Lunar Eclipse” from a reliable source like an electronic encyclopedia. An example of an electronic encyclopedia is the website wikipedia.org, which describes a “Lunar Eclipse” as when “the Moon passes directly behind the Earth into its umbra (shadow). This can occur only when the Sun, Earth, and Moon are aligned exactly, or very closely so, with the Earth in the middle. Hence, a lunar eclipse can only occur the night of a full moon.” A human can readily read about a lunar eclipse and use that additional information for a useful purpose, but a computer cannot absorb and use new information. This feature is referred to herein as “World Knowledge Learning”;
(7) Computers today cannot easily understand the context or environment under which any given natural language text is to be evaluated. For example, the sentence “John is strong” may mean a good thing if John is our friend, but a bad thing if John is our enemy. Similarly, computers today cannot adapt to different contexts on demand to evaluate the importance of information in natural language text. Natural language text is generally defined here as any text content (e.g., words, phrases, clauses, sentences, paragraphs, chapters, pages, books, articles, etc.) composed by a person or a machine in any spoken or written natural language to document, express or convey a description of past, present or future thoughts, events, emotions, experiences or beliefs regarding any topic, the topics being real or imaginary, tangible or intangible, from various contexts. This feature is referred to herein as “Context Switching”; and
(8) Computers today cannot comprehend and answer complex questions asked in natural language by users or other computer systems, because that in turn requires Advanced Disambiguation and Linking, Logical Comprehension, Inference from Laws of Nature, Discourse Intelligence, Vocabulary Learning, World Knowledge Learning and Context Switching. This feature is referred to herein as “Advanced Answering.”
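To make the “hill” and “river” example in item (2) concrete, the following minimal Python sketch compares two spatial descriptions by rewriting each into a canonical form. It assumes the sentences have already been reduced to (subject, relation, object) triples, which is itself the difficult step described above; the relation names and function names are hypothetical and chosen only for this illustration.

```python
# Hypothetical table of spatial relations and their inverses.
INVERSE = {
    "left_of": "right_of",
    "right_of": "left_of",
    "above": "below",
    "below": "above",
}

# One relation from each inverse pair is chosen as the canonical form.
CANONICAL = {"left_of", "above"}


def canonical(subject, relation, obj):
    """Rewrite a triple so that it always uses a canonical relation."""
    if relation not in INVERSE:
        raise ValueError(f"unknown relation: {relation!r}")
    if relation in CANONICAL:
        return (subject, relation, obj)
    # Swap the arguments and use the inverse relation instead.
    return (obj, INVERSE[relation], subject)


def same_fact(a, b):
    """Return True if two triples describe the same spatial arrangement."""
    return canonical(*a) == canonical(*b)


# "the hill to the left of the river"
first = ("hill", "left_of", "river")
# "the hill having the river to its right"
second = ("river", "right_of", "hill")
# A different arrangement: the hill to the right of the river.
third = ("hill", "right_of", "river")

print(same_fact(first, second), same_fact(first, third))
```

Placing both descriptions in one canonical form lets them be compared directly, which is a very small instance of the semantic placement of meaning that item (2) says is missing.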
Therefore, it would be advantageous to provide new and improved methods, systems and computer programs for natural language processing that overcome at least one of the aforementioned problems.