A natural language (NL) is a scripted (written) or a vocalized (spoken) language having a form that is employed by humans primarily for communicating with other humans or with systems having a natural language interface.
Natural language processing (NLP) is a technique that facilitates exchange of information between humans and data processing systems. For example, one branch of NLP pertains to transforming human readable or human understandable content into machine usable data. NLP engines are presently usable to accept input content, such as a newspaper article or human speech, and to produce structured data from the given content, such as an outline of the input content, its most significant and least significant parts, a subject, a reference, dependencies within the content, and the like.
An NL input is an input constructed using a grammar of a natural language and presented in a suitable form, including but not limited to text, audio, and forms thereof, such as a transcription of audio speech or machine-generated audio from text. A unit of an NL input is the shortest meaningful portion of the input. For example, in the English language, a unit would be a word; and words form other larger structures such as phrases, sentences, and paragraphs in the NL input. A unit of an NL input is also referred to herein as a token.
Algorithms are presently available to enable machines to understand NL inputs. An essential part of understanding the NL input is repeatedly and reliably selecting the correct choice from the many likely machine-interpretations of an NL token. For example, a machine should be able to conclude that “tow-mah-tow” and “tuh-may-tow” are simply different ways of saying “tomato”, and that when “tow-mah-tow” is presented as an NL token, the correct selection or choice for that token is “tomato”.
The illustrative embodiments recognize that machine-understanding of a token is sensitive to a number of factors. In some cases, an emphasis placed on a token or a portion thereof can cause an incorrect selection corresponding to the token. In some other cases, a dialect, an accent, or a locality of the NL input affects the meaning of the token. Additionally, there can be multiple valid choices corresponding to a token, but only one of them is correct based on the factors involved.
The factors contemplated by the illustrative embodiments are related to the phonetic variations of a token as described herein. As such, the factors contemplated by the illustrative embodiments, which affect machine-understanding of NL tokens, are distinct from misspellings and typographical errors, which also affect correct token recognition. Presently, techniques exist to help an NLP machine select the correct choice when misspelled tokens are encountered in textual NL inputs. Several misspelled tokens are mapped to the same correct word, e.g., misspellings such as “tirminate”, “termate”, and “termenate” are mapped to the correct selection, “terminate”, to assist the NLP machine in making the correct selection when a misspelled token is encountered.
Some presently used NLP algorithms build large caches of misspellings mapped to correct spellings. Such caches can be large, but they are still far from exhaustive. For example, just for the English language cache, a single eight-character word can theoretically have 26^8 (208,827,064,576) possible variations. Some algorithms in this class optimize the cache, e.g., by including only the most common misspellings. Still, the cache of mappings remains far from complete, is not scalable, and handles only a limited type of issue, namely misspellings in textual inputs.
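The following is a minimal sketch of such a misspelling cache, offered only as an illustration and not as the implementation used by any particular NLP engine. The count quoted above corresponds to 26 raised to the 8th power, i.e., every eight-letter string over a 26-letter alphabet. The names `MISSPELLING_CACHE` and `lookup` are hypothetical, and the cache entries are the example misspellings given above.

```python
# Number of eight-letter strings over a 26-letter alphabet.
ALPHABET_SIZE = 26
WORD_LENGTH = 8
variations = ALPHABET_SIZE ** WORD_LENGTH

# Hypothetical cache: only a handful of common misspellings are stored,
# so most of the possible variations are not covered.
MISSPELLING_CACHE = {
    "tirminate": "terminate",
    "termate": "terminate",
    "termenate": "terminate",
}


def lookup(token):
    """Return the cached correction for a token, or None on a cache miss."""
    return MISSPELLING_CACHE.get(token.lower())


if __name__ == "__main__":
    print(f"{variations:,}")
    print(lookup("termate"))
    print(lookup("terminatte"))  # not in the cache, so no correction is found
```

A misspelling that is absent from the cache, such as the last lookup, simply produces no selection, which illustrates why a cache of this kind cannot be exhaustive.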
Fuzzy matching is another class of algorithms used to map an NL token to a choice or selection from a set of selections. A fuzzy matching algorithm is a string matching algorithm that uses variations of edit distance algorithms to find similarities between a given token string from textual input and an available selection string in a set of selections. Fuzzy matching algorithms also operate on textual NL inputs, and are presently configured for correctly understanding misspelled character strings.
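As one possible illustration of edit-distance-based fuzzy matching, the sketch below uses the classic Levenshtein distance, which is only one of several edit distance variants. The function names `edit_distance` and `fuzzy_match`, the `max_distance` threshold, and the sample selection set are assumptions made for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def fuzzy_match(token, selections, max_distance=2):
    """Return (distance, selection) pairs within max_distance, closest first."""
    scored = [(edit_distance(token, s), s) for s in selections]
    return sorted(pair for pair in scored if pair[0] <= max_distance)


if __name__ == "__main__":
    # Hypothetical selection set for illustration only.
    print(fuzzy_match("termenate", ["terminate", "tomato", "germinate"]))
```

With a permissive threshold, the token “termenate” matches both “terminate” and “germinate”: the correct selection is retrieved, but so is an irrelevant one.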
Presently, fuzzy matching algorithms are designed to have a high recall at the cost of sacrificing precision. Recall is the fraction of relevant instances that are retrieved, and precision is the fraction of retrieved instances that are relevant. Precision can be seen as a measure of exactness or quality, whereas recall is a measure of completeness or quantity. Maximum precision indicates no false positives, and maximum recall indicates no false negatives.
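The two measures can be computed as in the following sketch. The function name `precision_recall` and the sample data are hypothetical, and returning 0.0 when a set is empty is a convention assumed here for simplicity, since the measure is strictly undefined in that case.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for retrieved versus relevant items."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    # Assumed convention: report 0.0 when the denominator would be zero.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # A matcher returns two candidates, only one of which is correct.
    print(precision_recall(["terminate", "germinate"], ["terminate"]))
```

In this example the correct selection is retrieved, so recall is at its maximum, while the extra false positive halves the precision. This is the high-recall, lower-precision trade-off described above.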
The illustrative embodiments recognize that factors other than misspellings in textual inputs are responsible for precision of understanding NL tokens. Such factors are dependent upon the tonal or phonetic characteristics of the token rather than the correctness or incorrectness of the textual spelling of the token.
The illustrative embodiments recognize that a method is needed by which the phonetic variations of tokens can be represented in NLP so that a fuzzy matching application increases in precision while keeping its recall unchanged when making selections corresponding to NL inputs. The illustrative embodiments recognize that the presently available fuzzy matching algorithms have to be modified to be able to use phonetic characteristics of tokens as additional inputs in determining the correct selection corresponding to the token.