The present disclosure relates to data processing, and more specifically, to automated comprehension of natural language via constraint-based processing.
The ubiquity of electronic devices and communication connectivity (e.g., via wired and wireless networks including the Internet) has propelled two historic trends: a hyperbolic increase in the volume of natural language material that is being created and/or made available to the public in electronic form, and a shift in human communication away from the spoken and printed word toward electronic communication media (e.g., electronic documents, chat, texting, video, email, streaming, blogs, web sites, etc.).
This explosion in the volume of natural language material available in electronic form has created a technological problem that did not heretofore exist, namely, a need to digest this “ocean” of electronically formatted material to distill out information relevant to a particular individual, group of individuals, enterprise or entity. Parsing may be utilized in an attempt to identify the relevant information.
As utilized herein, parsing is defined as the analysis of a text string by decomposing the text string into its syntactic components, such as words, phrases and parts of speech. Automated parsing of artificial languages, such as programming languages and scripts, can be easily implemented in computer systems given the rigorously defined syntax employed by most programming languages and scripts. Automated parsing of communication in natural (human) languages has proven to be a greater technological challenge for a variety of reasons.
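By way of non-limiting illustration only, the relative ease of parsing a rigorously defined artificial language may be sketched as follows. The toy grammar, the function names (tokenize, parse_expr, parse_term, parse) and the tuple-based tree representation are assumptions made solely for this example and are not drawn from the present disclosure.

```python
import re

# Hypothetical toy grammar (illustrative only, not from the disclosure):
#   expr := term (('+' | '-') term)*
#   term := NUMBER | '(' expr ')'
TOKEN_RE = re.compile(r"\s*(\d+|[()+\-])")


def tokenize(text):
    """Decompose the input string into number, operator and parenthesis tokens."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse_expr(tokens, i=0):
    """Parse an expr starting at tokens[i]; return (tree, next_index)."""
    tree, i = parse_term(tokens, i)
    while i < len(tokens) and tokens[i] in ("+", "-"):
        op = tokens[i]
        right, i = parse_term(tokens, i + 1)
        tree = (op, tree, right)  # left-associative nesting
    return tree, i


def parse_term(tokens, i):
    """Parse a term (a number or a parenthesized expr) starting at tokens[i]."""
    if i >= len(tokens):
        raise SyntaxError("unexpected end of input")
    tok = tokens[i]
    if tok.isdigit():
        return int(tok), i + 1
    if tok == "(":
        tree, i = parse_expr(tokens, i + 1)
        if i >= len(tokens) or tokens[i] != ")":
            raise SyntaxError("expected ')'")
        return tree, i + 1
    raise SyntaxError(f"unexpected token {tok!r}")


def parse(text):
    """Parse a complete string, rejecting any trailing tokens."""
    tokens = tokenize(text)
    tree, i = parse_expr(tokens)
    if i != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[i]!r}")
    return tree


print(parse("1 + (2 - 3)"))  # ('+', 1, ('-', 2, 3))
```

Because every valid input in such a grammar has exactly one structure and every invalid input can be rejected outright, a few dozen lines suffice. No comparably compact rule set is known to cover natural language.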
For example, some natural languages, such as English, have irregular grammar with many exceptional conditions, idioms, multi-word concepts and other irregularities. In the prior art, it has been difficult to program a parser to identify and distinguish between all such irregularities. Additionally, in some natural languages, such as English, a given spelling of a word may have as many as fifteen or twenty unique meanings, often spanning multiple parts of speech. Further, it is not uncommon for spoken and written natural language to be characterized by broken grammatical and spelling rules, ill-chosen words, incomplete fragments, and varied writing and speaking styles. In particular, natural human language frequently includes idioms, phrases with non-grammatical structure, plays on words, implied sentence subjects or objects, and implied or misplaced prepositions. Moreover, written or spoken conversations often communicate a complete thought using sentence fragments containing no subject, a subject and no verb, a prepositional phrase (especially in reply to a question), or even a non-word vocalization.
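By way of non-limiting illustration only, the effect of a single spelling spanning multiple parts of speech may be sketched as follows. The five-word lexicon, its part-of-speech labels and the function name candidate_taggings are hypothetical and are chosen solely for this example. An actual lexicon would be far larger and would list many more senses per word.

```python
from itertools import product

# Hypothetical mini-lexicon (illustrative only): each spelling maps to the
# parts of speech it may take.
LEXICON = {
    "time": ["noun", "verb"],
    "flies": ["noun", "verb"],
    "like": ["verb", "preposition"],
    "an": ["determiner"],
    "arrow": ["noun"],
}


def candidate_taggings(sentence):
    """Return every part-of-speech assignment the lexicon permits.

    Raises KeyError for any word that is absent from the lexicon.
    """
    words = sentence.lower().split()
    options = [LEXICON[word] for word in words]
    return [list(zip(words, tags)) for tags in product(*options)]


taggings = candidate_taggings("Time flies like an arrow")
print(len(taggings))  # 2 * 2 * 2 * 1 * 1 = 8 candidate readings
```

Even with only two parts of speech for each ambiguous word, the number of candidate readings grows multiplicatively with sentence length, which suggests why enumerating and distinguishing all such readings by hand-written rules has been difficult.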
Despite these departures from regular grammar, a human reader or listener can usually intuitively comprehend the meaning intended by a human writer or speaker, for example, by the word choice, context and ordering of the words, and, if the words are spoken, by the tone, inflection and pacing of the words. However, in practice, it has proven difficult for automated parsing to achieve the same degree of success in identifying the meaning of natural language communication.