In our age of information explosion, thousands of journal articles, scientific papers, and other kinds of written documents are generated every day. A particular person may be interested in certain information contained in at least some of these documents, but lack the capacity to read the documents and extract the information of interest.
Biological research regarding the living cell is one example of this problem. The living cell is a complex machine that depends on the proper functioning of its numerous parts, including proteins. Information about protein function and cellular pathways is central to the system-level understanding of a living organism. However, the collective knowledge about protein functions and pathways is scattered throughout numerous publications in scientific journals. Although a number of available databases cover different aspects of protein function, these databases depend on human experts to input data. As a result, these databases rarely store more than a few thousand of the best-known protein relationships and do not contain the most recently discovered facts and experimental details. As the number of new publications in this field grows rapidly, bringing the relevant information together becomes a bottleneck in the research and discovery process. There is, therefore, an urgent need for an automated system capable of accurately extracting information regarding protein function and cellular pathways from available literature. Moreover, such a system can find application in numerous fields other than biological research.
Conventional approaches aimed at solving this problem range from simple statistical methods to advanced natural language processing (NLP) techniques. For example, in the context of biological research, one simple way to extract protein function information is to detect the co-occurrence of protein names in a text. However, by its nature, name co-occurrence detection yields little or no information about the type of relation described and, therefore, the co-occurrence data may be misleading.
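As a minimal sketch of why co-occurrence data may be misleading, the following Python fragment detects pairs of known names appearing in the same sentence. The protein names, the sentences, and the function name cooccurring_pairs are hypothetical and chosen only for illustration; the sketch is not taken from any particular system.

```python
import re
from itertools import combinations

# Hypothetical list of known protein names (illustration only).
PROTEIN_NAMES = {"RAF1", "MEK1", "ERK2"}

def cooccurring_pairs(sentence, names=PROTEIN_NAMES):
    """Return sorted pairs of known names that appear in one sentence."""
    tokens = set(re.findall(r"[A-Za-z0-9]+", sentence))
    found = sorted(tokens & names)
    return list(combinations(found, 2))

# Two sentences with opposite meanings.
positive = cooccurring_pairs("RAF1 activates MEK1 in these cells.")
negative = cooccurring_pairs("RAF1 does not interact with MEK1.")

print(positive)
print(negative)
```

Both sentences yield the same pair of names, although one asserts a relation and the other denies it. The co-occurrence result carries no indication of the relation type.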
Other information extraction approaches rely on the matching of pre-specified templates (i.e., patterns) or rules (e.g., precedence/following rules of specific words). The underlying assumption is that sentences conforming exactly to a pattern or a rule express the predefined relationships between the sentence entities. In some cases, these rules and patterns are augmented with additional restrictions based on syntactic categories and word forms in order to achieve better matching precision. One major shortcoming of this approach is that various syntactic constructs have to be explicitly encoded in the rules and that the number of such rules required for practical information extraction is large. In addition, relationships expressed as coordinated clauses or verbal phrase conjunctions in a sentence cannot be captured.
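The limitation of template matching can be illustrated with a short Python sketch. The single template below and the function name match_template are assumptions made for illustration; practical systems use many such templates, which is the shortcoming noted above.

```python
import re

# Hypothetical template: "<agent> inhibits <target>".
TEMPLATE = re.compile(r"^(?P<agent>\w+) inhibits (?P<target>\w+)\.?$")

def match_template(sentence):
    """Return (agent, target) if the sentence conforms exactly to the template."""
    m = TEMPLATE.match(sentence)
    return (m.group("agent"), m.group("target")) if m else None

exact = match_template("A inhibits B.")
passive = match_template("B is inhibited by A.")          # passive voice
conjunction = match_template("A binds and inhibits B.")   # verb phrase conjunction

print(exact, passive, conjunction)
```

Only the sentence that conforms exactly to the template is matched. The passive construction and the verbal phrase conjunction express the same relationship but would each require an additional, explicitly encoded rule.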
Yet another approach utilizes shallow parsing techniques to process natural language sentences and extract information. Unlike word-based pattern matchers, shallow parsers perform partial decomposition of a sentence structure. They identify certain phrasal components and extract local dependencies between them without reconstructing the structure of an entire sentence. Shallow parsers also rely on regular expressions over the syntactic tags of the sentence lexemes to detect specific syntactic constructs (noun phrases or verb phrases). The drawback of shallow parsing is its inability to account for the global sentence context, as well as its inability to capture long-range syntactic dependencies. The precision and recall rates reported for shallow parsing approaches are 50-80% and 30-70%, respectively.
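The use of regular expressions over syntactic tags can be sketched as follows. The tags are assigned by hand here (a real system would obtain them from a part-of-speech tagger), and the tag names, the noun phrase pattern, and the function name noun_phrases are illustrative assumptions.

```python
import re

# Hand-assigned (word, tag) pairs; a real system would use a POS tagger.
TAGGED = [("the", "DT"), ("active", "JJ"), ("kinase", "NN"),
          ("phosphorylates", "VBZ"), ("a", "DT"), ("receptor", "NN")]

# Noun phrase: optional determiner, any adjectives, one or more nouns.
NP_PATTERN = re.compile(r"\b(?:DT )?(?:JJ )*(?:NN )+")

def noun_phrases(tagged):
    """Find noun phrases by matching a regular expression over the tag sequence."""
    # Each tag is followed by one space, so spaces count token positions.
    tag_string = "".join(tag + " " for _, tag in tagged)
    phrases = []
    for m in NP_PATTERN.finditer(tag_string):
        start = tag_string[:m.start()].count(" ")
        end = tag_string[:m.end()].count(" ")
        phrases.append(" ".join(word for word, _ in tagged[start:end]))
    return phrases

chunks = noun_phrases(TAGGED)
print(chunks)
```

The sketch identifies the two noun phrases as local components, but it does not determine how they relate to the verb or to each other, which reflects the lack of global sentence context described above.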
Another approach, based on full sentence parsing, is more promising. Because this approach deals with the structure of an entire sentence, it is potentially more accurate. One example of this approach is a semantic grammar-based system, which requires a complete redesign of the grammar in order to be tuned to a different domain. Another example of this approach uses a modular architecture that separates NLP and information extraction into different modules. In this two-step information extraction system, the first step is the construction of the sentence argument structure using general-purpose, domain-independent parsers. The second step is domain-specific frame extraction.
Conventional full sentence NLP-based information extraction approaches suffer from several problems. One problem is efficiency, relating to an intermediate step during the processing of a natural language sentence. This intermediate step involves the construction of a semantic tree of the sentence. A semantic tree is represented by a directed acyclic graph (DAG), a data structure consisting of sentence lexemes which are connected with labeled vertices. A label on a vertex indicates the logical relation between words. Syntactic variations can be reduced to a single semantic tree for efficient information extraction. (For example, the following sentences: “A inhibits B”, “B is inhibited by A”, “A was shown to inhibit B”, “B inhibitor A”, “inhibition of B by A”, “A by which B is inhibited”, and so on, are represented by a single semantic tree: inhibit{agent=A, what=B}, where “inhibit”, “A”, and “B” are sentence lexemes, and “agent” and “what” are semantic labels of A and B in the parent lexeme “inhibit”.) Because of the complexities and ambiguities of natural language, numerous semantic trees with different structures may result from a single sentence, each of which has to be processed to extract information. This results in inefficiency. Classical NLP approaches also rely on grammar to construct semantic trees. It is necessary to create and maintain a complete grammar of a language, which is a cumbersome task. Therefore, a method and a system that can process natural language sentences and extract information from the sentences accurately and efficiently is highly desirable. The present invention provides such a method and a system.
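The reduction of syntactic variations to a single semantic tree, as in the “inhibit” example above, can be sketched in Python. The Frame class, the to_frame function, and the surface patterns are hypothetical and serve only to show the target representation; the regular expressions stand in for a real parser and are not the method of any particular system.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Frame:
    """A one-level semantic tree: a head lexeme with labeled arguments."""
    head: str
    agent: str
    what: str

    def __str__(self):
        return f"{self.head}{{agent={self.agent}, what={self.what}}}"

# Illustrative surface patterns only; they stand in for a real parser.
PATTERNS = [
    re.compile(r"^(?P<agent>\w+) inhibits (?P<what>\w+)$"),
    re.compile(r"^(?P<what>\w+) is inhibited by (?P<agent>\w+)$"),
    re.compile(r"^(?P<agent>\w+) was shown to inhibit (?P<what>\w+)$"),
    re.compile(r"^inhibition of (?P<what>\w+) by (?P<agent>\w+)$"),
]

def to_frame(phrase):
    """Map a surface variant to the shared frame, or None if unrecognized."""
    for pattern in PATTERNS:
        m = pattern.match(phrase)
        if m:
            return Frame("inhibit", m.group("agent"), m.group("what"))
    return None

variants = ["A inhibits B", "B is inhibited by A",
            "A was shown to inhibit B", "inhibition of B by A"]
frames = {to_frame(v) for v in variants}

for frame in frames:
    print(frame)
```

All four variants collapse to one frame, so downstream extraction needs to handle only a single structure. The sketch does not address the ambiguity problem noted above, in which one sentence may yield several candidate trees.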