Natural languages were developed to document and communicate human experiences and thoughts. Natural languages have both visual and auditory forms. Natural languages have evolved to support physical aspects of human experiences in the external world and conceptual aspects of human thoughts in the internal world of the human mind. Natural languages have also evolved to support rich expressiveness and economy of expression. Rich expressiveness and economy of expression combine to promote high semantic density in communication. Another feature of human communication in the real world is termed semantic compounding. Semantic compounding emphasizes or highlights the significance of what is conveyed in an utterance. Often such an utterance is compounded with another utterance. The compounding second utterance may relate to conditionality, cause and effect, concurrency, presence of contrary, contradictory or unfavourable circumstances, etc. Natural language utterances involve a combinatorial explosion driven by semantic context. Semantic compounding and combinatorial explosion cannot be adequately captured in grammar representations. Natural languages have also evolved to support an imprecise or impressionistic description where such a description is adequate to implicitly infer a more precise meaning of the description based on knowledge about a population of objects in context of the description. Semantics is the core of natural language utterances and is largely experiential. Therefore, semantics constitutes a major reason for the limited success achieved in processing of natural language utterances by computer programs.
Programming languages are domain neutral or domain agnostic. Programming languages per se have no large vocabulary and have only minimal intrinsic domain semantics, at a meta-level. Each computer program, or set of related computer programs, written in a programming language introduces a vocabulary specific to that computer program or set of computer programs, creates semantics, or reflects a domain. Natural languages, however, are not domain neutral or domain agnostic. A natural language incorporates a universal domain. In an example, a computer program may comprise the following steps: create or retrieve a virtual world of objects with a certain state or states; execute certain transactions as permitted on those objects; and handle the new state of that virtual world, displaying it to a user and/or persisting it. Computer programs, typically, create the framework for the orchestration described in the steps above. Therefore, natural language fiction may be comparable to computer programs. Natural language non-fiction, however, describes a commonly shared real world state of the objects, events, actions, transactions, etc.
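The three steps above can be illustrated with a minimal sketch. The Account class, the transfer transaction, and the sample data are hypothetical and chosen only for illustration; they are not taken from any particular program.

```python
from dataclasses import dataclass


@dataclass
class Account:
    """A hypothetical object in the program's virtual world."""
    owner: str
    balance: int


def load_world():
    # Step 1: create (or retrieve) a virtual world of objects with some state.
    return {"alice": Account("alice", 100), "bob": Account("bob", 20)}


def transfer(world, source, target, amount):
    # Step 2: execute a transaction, but only if the objects permit it.
    if world[source].balance < amount:
        raise ValueError("insufficient balance")
    world[source].balance -= amount
    world[target].balance += amount


def render(world):
    # Step 3: handle the new state, here by formatting it for display.
    return [f"{account.owner}: {account.balance}" for account in world.values()]


world = load_world()
transfer(world, "alice", "bob", 30)
report = render(world)
print(report)
```

The vocabulary (Account, balance, transfer) exists only inside this program, which is the sense in which each program introduces its own domain.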
Programming languages are ontology description languages that facilitate description of classes, or categories, or groups of objects and their behavioural patterns or types of the programming languages. Computer programs written in a programming language create their own ontologies where a specific structure, state, and behaviour are described and prescribed for each class or group of objects. Computer program runs create and maintain populations of objects that constitute the knowledge and semantic repositories known to and usable by the computer programs. The programming language, the computer program, and the population of objects created by the computer program runs, use and live in different and isolated vocabularies, meta-levels, and semantic spheres, and there is no language level facility to document and maintain the behavioural history of the objects. In the case of natural languages, the aforementioned semantic walls collapse to merge the three spheres, that is, the vocabularies, the meta-levels, and the semantics into one common whole sphere and there is a language level support to maintain the behavioural history of objects as part of the knowledge.
Another aspect of programming language communication is that each object in the population of objects has a name at the time at which such an object is active, a name by which a computer program and other objects can identify and work with such an object. In natural language communication, the large number of objects makes it impossible for all the objects to be named or for every object participating in the communication to know the name of every other object in the scene. Humans, while conversing in a natural language, typically talk about unnamed and/or unknown objects and strive to facilitate identification of these objects unambiguously by describing some aspect of such objects. While doing so, humans seek to select that aspect which is significant, meaningful, and relevant in context of the conversation, and also seek to avoid redundancy, verbosity, and awkwardness by combining such a description with the main utterance about the object, which in turn, leads to a combinatorial explosion of probable meanings that can be derived from the conversation by a system that processes the natural language conversation. Another aspect of a natural language is a facility for selective omission of contextually unimportant information, for example, grammar or a syntax support for passive voice. For example, consider the sentence “The guide explained to us the working of the machine.” Here, if contextually the guide is unimportant, the sentence can be rephrased into passive voice as follows: “The working of the machine was explained to us.” or “We were explained the working of the machine.”
Use of grammar in natural languages typically involves syntax which implies a specified sequence or an order in which words are to be used to make up a valid utterance or a valid sentence. However, natural languages typically do not enforce a rigid or an inflexible syntax. For example, Sanskrit, an ancient language from India, is an open syntax or a highly flexible language in that the words in an utterance in Sanskrit can, in most cases, occur in any order, and different utterances obtained by merely changing the order of the words are semantically equivalent. Most other natural languages are hybrid in that the order can be changed in parts of the syntax.
The origin of programming languages can be traced to the need and the desire for automating a repeated execution of the same computation task with different input values each time. Thus, the names in computer programs are symbolic or are merely place holders that hold different values or objects during different runs of the computer program. There are no semantics associated per se with any name in any computer program. When programming evolved to mimic real world processes and the data processed by a computer program mimicked populations of objects in the real world, the semantics of real world objects were essentially associated with the objects in off-program persistence. During computer program runs, a symbolic name within the computer program assumes the semantics of the particular object that is loaded in from persistence, that is, a database, and bound to the symbolic name. However, in natural languages and natural language utterances describing the real world, names have enduring semantics bound to them. The computer programs with symbolic names are similar to fiction in natural languages, as far as semantic association or binding is concerned.
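The binding of a symbolic name to an object loaded from persistence can be sketched as follows. The in-memory dictionary standing in for a database, the record fields, and the greet function are all assumptions made for illustration.

```python
# A dictionary stands in for off-program persistence (a database).
database = {
    1: {"name": "Asha", "city": "Pune"},
    2: {"name": "Ravi", "city": "Delhi"},
}


def greet(customer):
    # 'customer' is only a place holder: it carries no meaning of its own and
    # takes on the semantics of whichever record is bound to it on this call.
    return f"Hello {customer['name']} from {customer['city']}"


# The same symbolic name is bound to a different object on each run.
greetings = [greet(database[key]) for key in sorted(database)]
print(greetings)
```

By contrast, a name in a natural language utterance about the real world keeps its referent from one utterance to the next.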
The number of objects in the real world is too large for a name to be assigned to every object, and even more so for every human to know the name of every object in every context in which the human needs to interact and communicate. As a result, natural language utterances abound in references to nameless and unknown objects, but, at the same time, eliminate ambiguity. For example, natural language phrases that refer to nameless objects are typically constructed as “the man in the brown suit”, “roads in London”, “the show at 7.00 pm”, “the painter from Paris”, “the book shop on Church Street”, “the third boy in the fourth row from the back”, etc. The mechanism employed in eliminating ambiguity is the use of semantics, where unique space and time properties associated with different objects, structures of objects, unique values for the attributes of the objects, etc., are all harnessed. Such a description of nameless objects in natural language utterances can be mapped to a programming language syntax, for example, “man{? suit.colour=brown}”, “rows[4].boys[3]”, etc.
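The mapping from a description of a nameless object to a program-level lookup can be sketched as below. The Person class, the find helper, and the sample scene are hypothetical. The sketch also assumes that the index notation in the example above is one-based, so the indices are reduced by one for Python's zero-based lists.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Person:
    kind: str                          # e.g. "man" or "woman"
    suit_colour: Optional[str] = None  # None when no suit is worn


def find(scene, kind, **attributes):
    """Return every object of the given kind whose attributes all match."""
    return [
        obj for obj in scene
        if obj.kind == kind
        and all(getattr(obj, name) == value for name, value in attributes.items())
    ]


scene = [Person("man", "grey"), Person("man", "brown"), Person("woman", "brown")]

# "the man in the brown suit": the description is unambiguous only if
# exactly one object in the scene satisfies it.
matches = find(scene, "man", suit_colour="brown")

# "the third boy in the fourth row from the back": rows are listed from the
# back, and the one-based ordinals become zero-based indices.
rows_from_back = [[f"row{r}-boy{b}" for b in range(1, 5)] for r in range(1, 6)]
third_boy = rows_from_back[4 - 1][3 - 1]

print(matches, third_boy)
```

The first lookup identifies an object by an attribute value and the second by its position in a structure, the two mechanisms named in the paragraph above.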
For ease of learning and ease of use, natural languages have evolved to support a derivation of new words from existing words. Different case, number, gender and tense forms are derived from base words. Adjectives are derived from nouns and nouns from adjectives; nouns and adjectives are derived from verbs; verbs from nouns and adverbs from adjectives, etc. Such a derivation happens, for example, by adding suffixes and prefixes and multiple ways of morphing words. Typically, every natural language has sets of morphing rules. As natural languages originated for oral communication, ease of pronunciation and the need to be not harsh on the ears have played a part in shaping the morphing rules by contributing exceptions and sub rules. Natural languages have also evolved to support multiple communication needs. Natural languages are used to describe a structure and a behaviour of objects and classes of objects, to describe actions and events, and for multiple ways of interpersonal interactions, for example, to command and to prescribe, to prohibit and to permit, to assert and to negate, to challenge and to defy, to forecast and to speculate, etc. Computer program runs or sessions typically involve a large number of short discourses, each of which is largely independent, where a context switch is abrupt and total. Natural language communications, typically, tend to have a small number of extended discourses with a marked level of semantic coherence. If and when a discourse context changes, the context switch is typically slow, gradual, and smooth, and is facilitated by connecting objects.
Processing of natural language text by computer systems has been a field of active research for a few decades. Conventional natural language processing typically results in text that expands due to addition of annotations or tags while processing the text. The main problems that arise in processing text in any natural language are analysis and understanding of sentences in the text, construction of sentences or natural language generation, conversion between visual and auditory forms, that is, speech to text conversion and vice versa, and translation from one natural language to another natural language. Natural language text processing typically involves three approaches, namely, an algorithmic approach, a statistical approach, and an artificial neural network approach. The algorithmic approach is one of the earliest approaches and processes natural languages in a way similar to programming languages. That is, the algorithmic approach considers processing of natural language text as similar to computational problems solved by structured computer programs written in programming languages. The algorithmic approach has remained unexplored in contemporary work involving natural language processing owing to the limited success achieved to date. The reasons for the limited success include, for example, the peculiarities of natural languages as compared to programming languages. The algorithmic approach has, in general, been inspired by or based on an approach used in parsing computer programs written in programming languages and uses the concept of a syntax tree or a parsing tree.
The statistical approach comprises building a model of a natural language from a large corpus of text in the natural language and applying the built model in processing new text being added to the corpus. Although the statistical approach has been quite successful, the statistical approach requires substantial computing resources to be deployed for substantial periods of time to study the large corpus and build an internal model separately for each natural language to be supported. Further, the model built statistically may be large in size, requiring a large runtime memory for processing new text, whereas a grammar rule based model is compact and requires a small runtime memory. Moreover, the results of the statistical approach have a large number of errors initially and can improve only with time and usage. Furthermore, the statistical approach may not have an integrated approach to processing and generation, that is, reading and writing of natural languages. Furthermore, the statistical approach per se is not integrated with semantics.
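As a rough illustration of building a model from a corpus and applying it to new text, the following sketch counts word pairs (a bigram model, chosen here as one simple example of a statistical model and not described in the text above). The three-sentence corpus and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def build_bigram_model(corpus):
    """Count, for each word, how often each other word follows it."""
    model = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for previous, following in zip(words, words[1:]):
            model[previous][following] += 1
    return model


def most_likely_next(model, word):
    """Return the most frequent follower of 'word', or None if unseen."""
    followers = model.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


corpus = [
    "the guide explained the machine",
    "the guide left",
    "a machine stopped",
]
model = build_bigram_model(corpus)
prediction = most_likely_next(model, "the")
print(prediction)
```

Even in this toy form the model stores one counter per distinct word, which hints at why models built from a large corpus can need a large runtime memory, and the counts carry no representation of meaning.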
The neural network approach uses artificial neural networks to process text. The neural network approach has been found to be successful in handling errors and omissions, for example, errors in spelling and construction of sentences, omission of words, etc. However, techniques and technology involved in the neural network approach for text generation or natural language generation are less advanced than those for text processing. Technologies and tools available for natural language processing are domain specific.
Hence, there is a long felt but unresolved need for a method and system that processes textual data in multiple natural languages. Moreover, there is a need for a method and system that processes textual data in a domain independent way by targeting semantics of textual data, and that extracts structured information that can be used for disparate computational purposes comprising, for example, business applications, answering questions, translations, etc.