Natural languages evolved to document and communicate human experiences and thoughts, and have both visual and auditory forms. They have evolved to support the physical aspects of human experience in the external world and the conceptual aspects of human thought in the internal world of the mind. They have also evolved to support rich expressiveness and economy of expression, which combine to promote high semantic density in communication. Another feature of human communication in the real world is termed semantic compounding. Semantic compounding emphasizes or highlights the significance of what is conveyed in an utterance, often by compounding that utterance with another. The compounding second utterance may relate to conditionality, cause and effect, concurrency, the presence of contrary, contradictory or unfavourable circumstances, etc. Natural language utterances involve a combinatorial explosion driven by semantic context. Semantic compounding and combinatorial explosion cannot be adequately captured in grammar representations. Natural languages have also evolved to support imprecise or impressionistic descriptions, where such a description is adequate for a more precise meaning to be inferred implicitly from knowledge about the population of objects in the context of the description. Semantics is the core of natural language utterances and is largely experiential. Semantics therefore constitutes a major reason for the limited success achieved in the processing of natural language utterances by computer programs.
Use of grammar in natural languages typically involves syntax, which implies a specified sequence or order in which words are to be used to make up a valid utterance or sentence. However, natural languages typically do not enforce a rigid or inflexible syntax. For example, Sanskrit, an ancient language from India, is an open-syntax language (one with no prescribed word order at all), or at least a highly flexible one, in that the words in a Sanskrit utterance can, in most cases, occur in any order, and the different utterances obtained by merely changing the order of the words are semantically equivalent. Most other natural languages are hybrids in that the order can be changed only in parts of the syntax.
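The word-order independence described above can be illustrated with a minimal sketch. It assumes a hypothetical hand-written lexicon (`ROLE_OF`) in which each inflected word form carries its own grammatical role, uses a simplified ASCII transliteration of the familiar example sentence meaning "Rama goes to the forest", and ignores sandhi and every other real-world complication. It is an illustration only, not a description of how any actual Sanskrit analyser works.

```python
from itertools import permutations

# Hypothetical lexicon: the role is carried by the word form itself,
# not by the position of the word in the utterance.
ROLE_OF = {
    "ramah": "agent",        # nominative form
    "vanam": "destination",  # accusative form
    "gacchati": "action",    # finite verb, "goes"
}

def meaning(utterance: str) -> frozenset:
    """Return an order-independent set of (role, word) pairs."""
    return frozenset((ROLE_OF[word], word) for word in utterance.split())

# Every reordering of the three words maps to the same representation.
orderings = [" ".join(p) for p in permutations(ROLE_OF)]
distinct_meanings = {meaning(u) for u in orderings}
print(len(orderings), "orderings,", len(distinct_meanings), "distinct meaning")
```

A position-based grammar would need a separate rule for each of the six orderings, whereas the role-based representation treats them as one.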
For ease of learning and ease of use, natural languages have evolved to support the derivation of new words from existing words. Different case, number, gender and tense forms are derived from base words. Adjectives are derived from nouns and nouns from adjectives; nouns and adjectives are derived from verbs; verbs are derived from nouns and adverbs from adjectives, etc. Such derivation happens, for example, by adding suffixes and prefixes and by multiple other ways of morphing words. Typically, every natural language has morphing rules. As natural languages originated for oral communication, ease of pronunciation and the need not to be harsh on the ear have played a part in shaping the morphing rules by contributing exceptions and sub-rules. Natural languages have also evolved to support multiple communication needs. They may be used to describe the structure and behaviour of objects and classes of objects, to describe actions and events, and for multiple kinds of interpersonal interaction, for example, to command and to prescribe, to prohibit and to permit, to assert and to negate, to challenge and to defy, to forecast and to speculate, etc.
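As an illustration of morphing rules with exceptions and sub-rules, the following minimal sketch derives English plural forms from base nouns. The rule set and the `IRREGULAR` table are hypothetical and deliberately incomplete; they are assumptions made for this example only.

```python
# Hypothetical, deliberately incomplete table of exceptions.
IRREGULAR = {"child": "children", "foot": "feet", "mouse": "mice"}

def pluralize(noun: str) -> str:
    """Derive a plural form using suffix rules, sub-rules and exceptions."""
    if noun in IRREGULAR:
        return IRREGULAR[noun]
    # Sub-rule: sibilant endings take "-es", which keeps the form pronounceable.
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    # Sub-rule: a consonant followed by "y" becomes "-ies".
    if noun.endswith("y") and len(noun) > 1 and noun[-2] not in "aeiou":
        return noun[:-1] + "ies"
    # General rule: add the suffix "-s".
    return noun + "s"

for base in ["book", "box", "city", "day", "child"]:
    print(base, "->", pluralize(base))
```

Even this small rule set needs one general rule, two sub-rules and a table of exceptions, and real nouns such as "potato" or "leaf" would still be handled incorrectly.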
Natural language communications typically tend to consist of a small number of extended discourses with a marked level of semantic coherence. If and when the discourse context changes, the context switch is typically slow, gradual, and smooth, and is facilitated by connecting objects.
The main problems that arise in processing text in any natural language are the analysis and understanding of sentences in the text, the construction of sentences or natural language generation, conversion between visual and auditory forms, that is, text to speech conversion and vice versa, and translation from one natural language to another. Natural language text processing typically involves three approaches, namely, the algorithmic approach, the statistical approach, and the artificial neural network approach. The algorithmic approach is one of the earliest approaches and processes natural languages in a way similar to processing programming languages. That is, the algorithmic approach treats the processing of natural language text as similar to the processing of structured computer programs written in programming languages. The algorithmic approach has, in general, been inspired by or based on the approach used in parsing computer programs written in programming languages and uses the concept of a syntax tree or a parsing tree. The algorithmic approach has remained largely unexplored in contemporary work on natural language processing owing to the limited success achieved through it to date. The reasons for the limited success include, for example, the peculiarities of natural languages as compared to programming languages.
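The syntax-tree idea behind the algorithmic approach can be sketched with a toy recursive-descent parser. The grammar (S -> NP VP, NP -> Det N, VP -> V NP) and the `LEXICON` table are hypothetical and cover only a handful of English words; they are assumptions made for illustration only.

```python
# Hypothetical lexicon mapping each word to a single word category.
LEXICON = {
    "the": "Det", "a": "Det",
    "dog": "N", "cat": "N",
    "sees": "V", "chases": "V",
}

def parse_sentence(words):
    """Parse S -> NP VP, NP -> Det N, VP -> V NP into a nested-tuple tree."""
    pos = 0

    def expect(category):
        # Consume the next word if it belongs to the expected category.
        nonlocal pos
        if pos >= len(words) or LEXICON.get(words[pos]) != category:
            raise ValueError(f"expected {category} at position {pos}")
        leaf = (category, words[pos])
        pos += 1
        return leaf

    def noun_phrase():
        return ("NP", expect("Det"), expect("N"))

    def verb_phrase():
        return ("VP", expect("V"), noun_phrase())

    tree = ("S", noun_phrase(), verb_phrase())
    if pos != len(words):
        raise ValueError("unexpected trailing words")
    return tree

print(parse_sentence("the dog chases a cat".split()))
```

A parser of this kind accepts only the word order fixed by its grammar: calling it on "chases the dog a cat" raises `ValueError`, which hints at why a rigid syntax tree fits flexible word order poorly.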
The statistical approach comprises building a model of a natural language from a large corpus of text in that language and applying the model so built in processing new text being added to the corpus. Although the statistical approach has been quite successful, it requires the deployment of substantial computing resources for substantial periods of time to study the large corpus and build an internal model separately for each natural language to be supported. Further, the model built statistically may be large in size, requiring a large runtime memory for processing. Moreover, the results of the statistical approach have a large number of errors initially and can improve only with time and usage. Furthermore, the statistical approach may not offer an integrated approach to processing and generation, that is, to reading and writing in natural languages. Finally, the statistical approach is not, per se, integrated with semantics.
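The idea of building a model from a corpus and applying it to new text can be sketched with a bigram (word-pair) model. This is one simple, well-known statistical technique chosen for illustration, not necessarily the kind of model the passage has in mind. The three-sentence corpus and the function names are assumptions made for this example.

```python
from collections import Counter, defaultdict

def train_bigram_model(corpus):
    """Count word-to-next-word transitions over a list of sentences."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def bigram_probability(counts, prev, nxt):
    """Maximum-likelihood estimate of P(nxt | prev); 0.0 for unseen contexts."""
    total = sum(counts[prev].values()) if prev in counts else 0
    return counts[prev][nxt] / total if total else 0.0

# Toy corpus; a realistic model would need a very large one.
corpus = ["the dog barks", "the dog sleeps", "the cat sleeps"]
model = train_bigram_model(corpus)
print(bigram_probability(model, "the", "dog"))
print(bigram_probability(model, "dog", "purrs"))
```

The model stores only co-occurrence counts, so it assigns zero probability to any word pair absent from the corpus and carries no representation of meaning, which reflects the corpus-size and semantics limitations noted above.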
The neural network approach uses artificial neural networks to process text. The neural network approach has been found to be successful in handling errors and omissions, for example, errors in spelling and in the construction of sentences, omissions of words, etc. However, the techniques and technology involved in the neural network approach for text generation or natural language generation are less advanced than those for text processing. The technologies and tools available for natural language generation are domain-specific.
Automation of natural language processing has three broad areas: understanding of natural language text (abbreviated as NLU); speech processing, comprising speech recognition and speech to text conversion (abbreviated as STT) and text to speech conversion (abbreviated as TTS); and natural language text generation (abbreviated as NLG), in either spoken or text form.
When a human speaks in a natural language or writes natural language text, he or she is encoding, in that natural language, a certain chunk of his or her own knowledge or world-view contained within his or her brain. That knowledge may have been acquired by direct use of the sense organs or by reading text in some natural language, which may have been different from the one in which speaking or writing is being done. In natural language text generation by a computer system, the objective is to generate text, in a specified natural language, that encodes a chunk from a repository of structured knowledge or information available to the computer system.
Significant progress has been achieved in the areas of NLU and speech processing. Comparable progress has not been achieved in the area of NLG. The text generation systems that have been invented and put to use are generally domain-specific, language-specific, limited in purpose (such as producing weather reports from weather data) and template-driven. There has been a long-felt but unfulfilled need for a more versatile method and system for the automated generation of textual descriptions of any knowledge encoded in a structured format. As a consequence, progress in the areas of machine translation and abstractive summarization has also not been realized to the desired extent. This patent application describes a method and system for template-less, domain-independent and language-neutral automated text generation, which can be put to use in a wide variety of situations. This invention enables a substantial advance in automatic abstractive summarization and machine translation capabilities.
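For contrast, the template-driven style of text generation mentioned above can be sketched as follows. The template wording, the record fields and the function name are all hypothetical; the sketch illustrates the general prior-art pattern, not any particular existing system and not the method of this application.

```python
# Hypothetical template: fixed English wording with slots for weather data.
WEATHER_TEMPLATE = (
    "{city} will be {condition} with a high of {high_c} degrees Celsius."
)

def weather_report(record: dict) -> str:
    """Fill the fixed template from a structured weather record."""
    return WEATHER_TEMPLATE.format(**record)

record = {"city": "Pune", "condition": "cloudy", "high_c": 31}
print(weather_report(record))
```

The sentence structure, the language and the expected record fields are all fixed in the template, so supporting another domain or another natural language means writing new templates by hand.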