The present invention relates generally to natural language processing, and more particularly to an improved method and system to adapt to the users' need for accuracy and efficiency in resolving ambiguities in natural language.
Natural language is the most common method of communication between humans, used in conversations, emails, etc. Using natural language as a common medium, people exchange information and knowledge via a well-formed but complex process. A speaker or writer first produces a message via the process of generation, whereby words are chosen and organized to best represent some information or knowledge. The listeners or readers then process the message to extract the meaning intended by the originator, completing the information transfer.
Understanding this process of information transfer is the central goal of the field of natural language processing (NLP). Such an understanding would enable us to recreate the process in intelligent systems, giving computers the means to extract and operate on information and knowledge represented in natural languages. A reliable NLP model can then be used to improve tasks such as human-computer interfaces, information retrieval, information extraction, machine translation, and question answering.
NLP is a challenging problem because of the ambiguities prevalent in all natural languages. These ambiguities occur at multiple levels, including the word level, sentential level and discourse level.
A word can belong to multiple part-of-speech (POS) categories. For example, “check” can be both a noun and a verb. A word can also have multiple senses, that is, be polysemous, such as the word “bank”, which can mean “financial bank” or “river bank”. Even punctuation can be ambiguous: a period can denote the end of a sentence, an abbreviation, a decimal point, and others. Proper nouns can also be ambiguous, such as “Washington”, which can refer to a person, a state, a university, and others. Determining the correct meaning of a word is referred to as word sense disambiguation (WSD). Determining the correct type of a proper noun is the task of Named Entity Recognition (NER).
At the sentential level, structural ambiguities are the most common. The famous joke “I shot the elephant in my pajamas; how it got into my pajamas I'll never know” is ambiguous. That is, the phrase “in my pajamas” can modify the verb “shot”, as in “attended in his nicest suit”, or it can modify “elephant”, as in “the clown in floppy shoes”. Resolving structural ambiguities is done by a sentential or syntactic parser.
References made by pronouns and determiners are often ambiguous and can exist at either the sentential or the discourse (cross-sentence) level. In the example “Mary wanted to ask Jane before she changes her mind.”, the pronouns “she” and “her” can refer to either Mary or Jane. This is the problem of anaphora resolution.
Resolving these ambiguities is important in reliably determining the meaning and extracting knowledge represented using natural languages. Ignoring these ambiguities leads to reduced accuracy and effectiveness of systems working with natural languages. An example of this is a search engine returning documents about wristwatches when the query was “global security watch”.
One of the main challenges of accurate NLP is the combinatorial explosion if all possible combinations of ambiguities are exhaustively evaluated. This is a well-known problem in NLP and various approaches have been proposed.
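The growth described above can be made concrete with a small calculation. The following is a minimal sketch, assuming that each word carries an independent number of candidate readings; the counts in `readings_per_word` are hypothetical values chosen for illustration and are not taken from any lexicon.

```python
from math import prod

# Hypothetical number of candidate readings per word in a seven-word
# sentence (illustrative values only).
readings_per_word = [1, 3, 1, 2, 2, 1, 1]

def count_interpretations(readings):
    """Number of joint interpretations if every combination is enumerated."""
    return prod(readings)

small = count_interpretations(readings_per_word)  # 3 * 2 * 2
large = count_interpretations([2] * 20)           # 2 ** 20

print(small)  # 12
print(large)  # 1048576
```

Even with only two readings per word, a twenty-word sentence yields over one million combinations under this assumption, which is why exhaustive evaluation quickly becomes impractical.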
A common one is referred to as a rule-based approach, where a set of rules in various forms such as grammars, first-order logic, and common-sense knowledge is used to accept and conversely reject interpretations. Given sufficient rules, an NLP model could eliminate wrong interpretations and produce only plausible ones. However, these rules and knowledge are usually authored by humans and therefore require monumental engineering efforts. The richness and constantly evolving nature of languages also mean that this task is never complete.
A different approach is referred to as data-driven, in that machine-learning algorithms are trained on annotated data that contain disambiguated annotations. A good illustration of these two approaches is the task of email filtering, which automatically identifies unsolicited commercial email. A rule-based system would require a person to write a set of rules for detecting junk email, such as one that detects a subject line containing the pattern “lose ### pounds in ### days!”. However, because of the variability of natural languages in expressing the same concept, this rule can be easily circumvented. Examples include “get rid of ### pound . . . ”, “make ### pounds disappear . . . ”, and “be ### pounds lighter . . . ”. One can see the engineering effort needed to capture just a single concept.
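The brittleness of a single hand-written rule can be sketched as follows. This is an illustration only: the pattern `RULE` is one assumed encoding of the subject-line rule, with “###” taken to mean any number, and the entries in `subjects` are hypothetical subject lines modeled on the paraphrases above.

```python
import re

# One hand-written rule for the subject-line pattern; "###" is assumed
# to stand for any sequence of digits.
RULE = re.compile(r"lose \d+ pounds in \d+ days", re.IGNORECASE)

# Hypothetical subject lines expressing the same concept in different words.
subjects = [
    "Lose 20 pounds in 30 days!",
    "Get rid of 20 pounds",
    "Make 20 pounds disappear",
    "Be 20 pounds lighter",
]

# True where the rule fires, False where the junk email slips through.
flags = [bool(RULE.search(s)) for s in subjects]

for subject, flagged in zip(subjects, flags):
    print(f"{flagged!s:5}  {subject}")
```

Only the first subject line is flagged; each paraphrase would need its own rule, which is the engineering burden described above.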
An alternative is to gather a collection, or corpus, of emails, with the added annotation of whether each email is considered junk or not. A machine-learning algorithm can then be trained to reproduce this decision as accurately as possible. This requires minimal engineering effort, and the resulting model can adapt to changing solicitations by being continually retrained with new junk emails. Machine-learning algorithms are also steadily improving in accuracy, making automatic disambiguation more reliable.
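A minimal sketch of the data-driven alternative follows. The text does not name a particular learning algorithm; a naive Bayes classifier with add-one smoothing is used here purely as an assumed stand-in, and the six-item `corpus` is a hypothetical annotated collection far smaller than anything usable in practice.

```python
import math
from collections import Counter

# Hypothetical annotated corpus: (subject line, is_junk).
corpus = [
    ("lose 20 pounds in 30 days", True),
    ("make 20 pounds disappear", True),
    ("be 20 pounds lighter", True),
    ("meeting agenda for monday", False),
    ("lunch on friday", False),
    ("draft of the quarterly report", False),
]

def train(examples):
    """Count words per label; these counts are the whole model."""
    word_counts = {True: Counter(), False: Counter()}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, label_counts, vocab

def is_junk(text, model):
    """Compare log-probabilities of the two labels under naive Bayes."""
    word_counts, label_counts, vocab = model
    scores = {}
    for label in (True, False):
        total = sum(word_counts[label].values())
        score = math.log(label_counts[label] / sum(label_counts.values()))
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        scores[label] = score
    return scores[True] > scores[False]

model = train(corpus)
print(is_junk("get rid of 20 pounds", model))     # True
print(is_junk("lunch meeting on monday", model))  # False
```

No rule was written for “get rid of”, yet the unseen paraphrase is still classified as junk because it shares words with the annotated junk examples. Adapting to new solicitations amounts to appending examples to the corpus and retraining.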
However, data-driven approaches using powerful machine-learning algorithms still cannot escape the combinatorial explosion mentioned earlier. To limit the computational complexity, assumptions are often made to simplify the task.
One such simplification is to treat an input document as a “bag of words”, in that words are either present or absent, irrespective of how these words are arranged to compose sentences. This approach improves efficiency and has been shown to improve performance on certain tasks such as text classification and information retrieval. However, it makes a very strong assumption about linguistic structure, as illustrated by the following four sentence fragments:
“painting on the wall”
“on painting the wall”
“on the wall painting”
“the wall on painting”
A “bag of words” approach would treat these four fragments all as equivalent, since they contain the same four words. However, a human reader knows that each has a different meaning.
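The loss of word order can be verified directly. The sketch below assumes the simplest presence-or-absence representation, a set of whitespace-separated tokens; the helper name `bag_of_words` is introduced here for illustration.

```python
fragments = [
    "painting on the wall",
    "on painting the wall",
    "on the wall painting",
    "the wall on painting",
]

def bag_of_words(text):
    """Presence/absence representation: word order is discarded."""
    return frozenset(text.split())

bags = [bag_of_words(f) for f in fragments]

# All four fragments collapse to one representation.
distinct_bags = len(set(bags))
# As ordered strings they remain four different inputs.
distinct_fragments = len(set(fragments))

print(distinct_bags, distinct_fragments)  # 1 4
```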
Another simplification is to make a Markov assumption, which states that a decision at the current word depends only on the n previous words of the same sentence. This approach is often applied to tagging, which is the task of associating a tag with each word, such as its part-of-speech tag. This approach has been shown to be very effective for a number of NLP tasks, such as POS tagging, shallow parsing, and word sense disambiguation.
However, the Markov assumption makes the strong simplification that dependencies are local, although long-distance dependencies within natural language are well known. The following sentences illustrate this property:
“Apple fell”
“Shares of Apple fell”
“The man who bought shares of Apple fell”
If only local context is used for “fell”, it would appear that “Apple” fell in all three sentences. In actuality it is “shares” that fell in the second sentence and “the man” in the third.
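The effect of a limited context window can be sketched as follows. The helper `local_context` is hypothetical and simply returns the n words preceding a target word, which is all that a decision made under a Markov assumption of order n would see.

```python
sentences = [
    "Apple fell",
    "Shares of Apple fell",
    "The man who bought shares of Apple fell",
]

def local_context(sentence, target, n=1):
    """Return the n words immediately preceding the target word."""
    words = sentence.split()
    i = words.index(target)
    return words[max(0, i - n):i]

contexts = [local_context(s, "fell", n=1) for s in sentences]
print(contexts)  # [['Apple'], ['Apple'], ['Apple']]
```

With n = 1 the three sentences are indistinguishable at “fell”. Widening the window to reach “The man” in the third sentence would require n = 7, and no fixed n covers arbitrarily long modifiers.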
These long-distance dependencies can be recovered via full sentential parsing, where syntactic units such as “who bought shares of Apple” are identified as a modifier of “The man”. Parsing can thereby recover that “The man”, and not “Apple” or “shares”, is the subject of “fell”. Unfortunately, parsing is a very complex process with a potentially exponential time complexity with respect to the sentence length. Reliable parsing can also involve semantics, high-level knowledge, and even reasoning and inferential processes.
Even with the most efficient parsing algorithms that make certain independence assumptions, parsing has a cubic time complexity, as opposed to linear for Markovian (tagging) processes. Therefore, it can become a severe bottleneck for NLP systems and often precludes them from practical applications or large-scale deployments. For example, most users would probably consider it unacceptable for an email filter program to take one minute or more to process each piece of incoming email.
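The gap between cubic and linear growth can be illustrated with simple arithmetic. The sketch below counts abstract steps only, assuming a cost of exactly n for tagging and n³ for parsing with all constant factors ignored; it does not measure any real tagger or parser.

```python
def tagging_steps(n):
    """Linear cost model for a Markovian tagger (constants ignored)."""
    return n

def parsing_steps(n):
    """Cubic cost model for a parser (constants ignored)."""
    return n ** 3

lengths = [10, 20, 40]
ratios = [parsing_steps(n) // tagging_steps(n) for n in lengths]

for n, r in zip(lengths, ratios):
    print(f"{n:3d} words: parsing costs {r}x tagging")
```

Under this model, doubling the sentence length doubles the tagging cost but multiplies the parsing cost by eight, so the ratio between the two grows from 100 to 400 to 1600 across the three lengths.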
However, it is not inconceivable that in certain domains where accuracy is paramount, such as the processing of legal or medical documents, one would devote the resources needed for in-depth analysis. What is currently lacking is a method for natural language processing that is adaptive to the needs of the user, striking a balance between accuracy and available resources. Such a system would scale with constantly increasing computational power, as well as with improvements in NLP technologies. An adaptive and scalable method is thus more amenable to sustainable, large-scale deployments, because it adjusts the tradeoff between accuracy and efficiency to best match the changing needs of users and advancing technologies.