Natural language processing (NLP) systems are computer-implemented methods for taking natural language input (for example, computer-readable text) and operating on the input so as to generate output from which computers can derive meaning. Examples of NLP system applications include spell checkers/grammar checkers, machine translation systems, and speech-to-text systems. Increasingly, there is interest in developing methods for machines to more intelligently interpret human language input data (such as text) for the purpose of directing the computer as if it were another person who could understand speech. One application for such methods is search engines that receive a typed query from a person and perform web searches to attempt to generate a set of meaningful answers to the query. An important subclass of NLP systems is NLP parsers, especially grammatical parsers such as part-of-speech (POS) taggers, constituency parsers, and dependency parsers, and shallow semantic parsers such as SRL (Semantic Role Labeling) systems. Their role is to preprocess text and attach additional information to words to prepare the text for further usage. Current NLP systems are mostly built on top of NLP parsers and rely heavily on the information produced by these parsers to provide features and accuracy. The quality of the information delivered by these parsers is strongly correlated with the efficiency of NLP systems.
FIG. 1 is a block diagram of prior art NLP parsers 100. The input is a text, which is processed by an NLP parser (102) consisting of machine learning techniques (104) trained on a manually annotated corpus (105). The parser (102) produces the output (103), which is then used by other systems/applications (106).
All current parsers are dependent on corpora and therefore on the context in which those corpora were written. Typically, corpora consist of correctly written, grammatically correct sentences with common syntactic structures, which are manually annotated by humans. The system is then trained using this corpus.
This is one reason that traditional NLP parsers are most accurate on the same type of content they were trained on (the same corpus). That is why always-changing language, such as user-generated content (e.g. reviews, tips, comments, tweets, social media content), presents a challenge for NLP parsers built with machine learning techniques. Such content often includes grammatically incorrect sentences and non-standard usage of language, as well as emoticons, acronyms, strings of non-letter characters and so on. This content is constantly changing and expanding with different words or syntactic structures. All of this content carries meaningful information and is easy for humans to understand, but it is still difficult for NLP applications to extract meaning from it.
One way in which current NLP parsers can be updated or improved (for better accuracy or for extracting additional information) is to modify the existing corpus, create a new corpus, or re-annotate an existing one, and then retrain the system with it to understand new content. However, this is a tedious, costly and time-consuming process. For example, all current NLP parsers, especially those that use machine-learning algorithms, use as training data a corpus annotated by linguists with predefined tags (e.g. Penn Treebank).
If there were a need to distinguish the pronominal and adjectival uses of “that” (giving them different POS tags in different contexts), linguists would need to manually re-annotate every sentence in the whole corpus that contains the word “that”, taking into account the context of each usage, and then retrain the parser.
Building a particular application on top of an NLP parser requires building a module to transform the NLP parser output into usable data. The application using the parser's output could be coded in a programming language, use a rule-based system, be trained with machine learning techniques, or be created with a combination of any of these methods.
Using NLP parsers can be challenging because of the need to understand the structure of the output and its parameters, which requires expert knowledge. One of the challenges for NLP parsers is to provide a consistent structure of information. In addition, the output of NLP parsers depends on the quality of the input text data.
For example, consider these sentences:
1. John likes math.
2. John likes to learn.
3. John likes learning math in the evening.
A grammar parser yields a different notation in each case for the object that John likes.
In constituency parsers, the number of levels (depth) in a parse tree depends on the length and the grammatical structure of the processed sentence. In the example above, the first sentence has 3 levels, the second sentence has 5 levels and the third sentence has 6 levels in a tree representation.
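The variation in depth can be illustrated with a minimal Python sketch. The bracketings below are hand-written in an assumed Penn Treebank style rather than taken from any particular parser, and the function name phrase_depth and the convention of counting only phrase-level nodes (not POS tags or words) are assumptions made for illustration.

```python
# Trees are nested tuples: (label, child, child, ...); words are plain strings.
# The bracketings are hand-written, Penn Treebank style assumptions,
# not the output of a real parser.

def phrase_depth(node):
    """Count phrase-level nodes on the longest path from this node down.

    POS-tag nodes (a label directly over a word) and words count as 0.
    """
    if isinstance(node, str):
        return 0
    children = node[1:]
    if all(isinstance(child, str) for child in children):
        return 0  # preterminal: a POS tag directly above a word
    return 1 + max(phrase_depth(child) for child in children)

JOHN = ("NP", ("NNP", "John"))

S1 = ("S", JOHN,
      ("VP", ("VBZ", "likes"),
       ("NP", ("NN", "math"))))

S2 = ("S", JOHN,
      ("VP", ("VBZ", "likes"),
       ("S", ("VP", ("TO", "to"),
              ("VP", ("VB", "learn"))))))

S3 = ("S", JOHN,
      ("VP", ("VBZ", "likes"),
       ("S", ("VP", ("VBG", "learning"),
              ("NP", ("NN", "math")),
              ("PP", ("IN", "in"),
               ("NP", ("DT", "the"), ("NN", "evening")))))))

for name, tree in [("S1", S1), ("S2", S2), ("S3", S3)]:
    print(name, phrase_depth(tree))
```

Under these assumptions the three trees have depths of 3, 5 and 6, so code that walks the tree to a fixed depth to find what John likes cannot be reused across the three sentences.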
In state-of-the-art dependency parsers the structure of the output and number of levels in the dependency tree representation also vary. Adding even one word in the sentence can alter the grammatical information of all the other words.
The example sentences about John would each produce a different structure. The first sentence requires extracting the dependents of the “dobj” relation connected to the word “likes”; the second requires all dependents of the “xcomp” relation connected to the word “likes”; and the third requires analyzing all governors connected to dependents of “xcomp” related to the word “likes”.
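The following Python sketch shows why a single extraction rule does not carry over between the three sentences. The dependency edges are hand-written with assumed Stanford-style relation labels, not real parser output, and the helper names (dependents, subtree, rule_dobj, rule_xcomp, rule_xcomp_subtree) are hypothetical.

```python
# Each parse is (tokens, edges); an edge is (governor_index, relation, dependent_index).
# Edges are hand-written with assumed Stanford-style labels, for illustration only.
PARSES = {
    1: (["John", "likes", "math"],
        [(1, "nsubj", 0), (1, "dobj", 2)]),
    2: (["John", "likes", "to", "learn"],
        [(1, "nsubj", 0), (1, "xcomp", 3), (3, "aux", 2)]),
    3: (["John", "likes", "learning", "math", "in", "the", "evening"],
        [(1, "nsubj", 0), (1, "xcomp", 2), (2, "dobj", 3),
         (2, "prep", 4), (4, "pobj", 6), (6, "det", 5)]),
}

def dependents(edges, head, relation):
    """Indices of tokens attached to `head` by `relation`."""
    return [dep for gov, rel, dep in edges if gov == head and rel == relation]

def subtree(edges, head):
    """Indices of `head` and everything below it, in sentence order."""
    nodes = [head]
    for gov, _, dep in edges:
        if gov == head:
            nodes.extend(subtree(edges, dep))
    return sorted(nodes)

def words(tokens, indices):
    return " ".join(tokens[i] for i in indices)

def rule_dobj(tokens, edges):
    """Rule for sentence 1: direct object of 'likes'."""
    return words(tokens, dependents(edges, tokens.index("likes"), "dobj"))

def rule_xcomp(tokens, edges):
    """Rule for sentence 2: open clausal complement of 'likes'."""
    return words(tokens, dependents(edges, tokens.index("likes"), "xcomp"))

def rule_xcomp_subtree(tokens, edges):
    """Rule for sentence 3: the xcomp dependent plus everything it governs."""
    heads = dependents(edges, tokens.index("likes"), "xcomp")
    return words(tokens, subtree(edges, heads[0])) if heads else ""

for number, (tokens, edges) in PARSES.items():
    print(number, repr(rule_dobj(tokens, edges)),
          repr(rule_xcomp(tokens, edges)),
          repr(rule_xcomp_subtree(tokens, edges)))
```

In this sketch the “dobj” rule returns “math” for the first sentence and nothing for the other two, while the plain “xcomp” rule returns only “learning” for the third sentence and loses “math in the evening”. Each sentence therefore needs its own traversal logic.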
For all of these reasons, it is difficult for people, and especially non-linguists (e.g., developers, analysts), to use the parser output and write rules to adjust it to their current needs. For example, to write an information extraction engine that extracts information about product features from reviews, one could use a constituency parser or a dependency parser, but would need to write complex algorithms to search through the parse tree. To move to another domain (e.g. extracting information from Twitter), the algorithms must be redesigned and part of the code rewritten.
To deal with these problems NLP systems use machine learning techniques. This approach has some limitations in terms of accuracy and amount of extracted information.
There are query languages for processing structured data (e.g. SQL for relational databases, Cypher for graph databases, SPARQL for RDF (Resource Description Framework) tagged texts), but there are no languages designed directly to query the structure of natural language (the output of an NLP parser).
It would be desirable to have an efficient framework for storing information decoded from text. It should provide an invariant and consistent way of storing information which would be insensitive to different types of input. Having such a framework, it would be possible for non-experts to write efficient rules on top of the NLP parser's output.
It would be desirable to have a parser for natural language processing that is built fully algorithmically, so that it allows for constant improvement in accuracy and the addition of new features without building or re-annotating any corpus. It would be desirable to have an NLP system that is more capable than current NLP parsers of dealing with non-typical grammatical input, that deals well with constantly changing language on the web, and that produces accurate output which can be stored in an efficient framework of information.
It would also be desirable to have a query language that can be used on the logical layer across different input contexts allowing humans to write efficient rules for extracting information, and that is capable of effectively leveraging many NLP systems.