The invention relates to the field of text analytics. In particular the invention relates to an improved text analytics application for a rule based apparatus for modifying word annotations.
Text analytic solutions involve the process of annotating data within natural language documents with information. The annotations allow a text analytics application to scan information written in a natural language in order to extract information and populate a database or search index with the extracted information. Information is extracted from a document according to a set of rules defined in the text analytics application.
Text analytic applications typically comprise two types of rules, which can be categorised as follows:
1. Dictionary Rules—these rules define the annotations that should be applied whenever a specified phrase is encountered. For example, the phrase ‘International Business Machines’ should be annotated as an ‘Organisation’.
2. Grammatical Rules—these rules define the annotations that should be applied whenever a grammatical pattern is encountered. For example:
a. A grammatical pattern comprising the phrase ‘member of’ followed by any ‘Name’. When this pattern is encountered, the ‘Name’ annotation should be changed to an ‘Organisation’ annotation. For example, this grammatical rule would identify IBM as an ‘Organisation’ in the phrase ‘he is a member of IBM's staff’.
b. A grammatical pattern comprising a ‘Name’ followed by a ‘Verb’ followed by a ‘Name’. This pattern can be extracted into a Subject-Predicate-Object triple for use in a semantic knowledge base.
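The two rule types can be illustrated with a minimal Python sketch. It is not taken from any particular text analytics product: the token representation (a list of [text, annotation] pairs), the tokenisation of the example phrase, the ‘Unknown’ label and the function names are all assumptions made for illustration.

```python
def annotate(words, dictionary):
    """Dictionary rule: label each word that appears in the dictionary."""
    # Words not in the dictionary get a placeholder label (an assumption).
    return [[w, dictionary.get(w, "Unknown")] for w in words]


def member_of_rule(tokens):
    """Grammatical rule: 'member of' followed by a Name becomes an Organisation."""
    for i in range(2, len(tokens)):
        if (tokens[i - 2][0] == "member"
                and tokens[i - 1][0] == "of"
                and tokens[i][1] == "Name"):
            tokens[i][1] = "Organisation"
    return tokens


# Hypothetical tokenisation of the phrase "he is a member of IBM's staff".
words = ["he", "is", "a", "member", "of", "IBM", "'s", "staff"]
tokens = member_of_rule(annotate(words, {"IBM": "Name"}))
print(tokens[5])  # ['IBM', 'Organisation']
```

The dictionary rule runs first and labels ‘IBM’ as a ‘Name’; the grammatical rule then changes that annotation because of the surrounding pattern.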
When presented with a test corpus of documents, text analytics applications are designed to identify those parts of each document that will cause a rule to be triggered. For example, a dictionary rule is triggered when the text analytics application scans a document and locates a dictionary term.
The rules used in text analytic applications are developed in a hierarchical arrangement such that, when one rule is triggered, the triggered rule may trigger other rules. Consider, for example, the annotation sequence that would occur for the following sentence: ‘John works for Rob & Benn.’
Using a simple set of dictionaries it is possible to annotate each term in the sentence as shown below:
John -> Name
works -> Verb
for -> Preposition
Rob -> Verb
& -> Conjugate
Benn -> Noun
. -> Punctuation
Rule 1 is a dictionary rule that states the word ‘Benn’ is a ‘Name’.
Rule 2 is a grammatical rule that states any ‘Capitalised Verb’ (i.e. ‘Rob’ in this case) that is followed by a ‘Conjugate’ and a ‘MaleName’ should be annotated as a ‘Name’.
Rule 3 is a grammatical rule that states a phrase comprising a ‘Verb’ followed by a ‘Preposition’ should be annotated as a ‘VerbPhrase’.
Rule 4 is a grammatical rule that states a ‘Name’ preceded by a ‘VerbPhrase’ with a text value of ‘works for’ should be annotated as an ‘Organisation’.
Rule 5 is a grammatical rule that states if an ‘Organisation’ is followed by a ‘Conjugate’ followed by a ‘Name’ followed by a full stop then the ‘Name’ should be included in the ‘Organisation’ annotation.
In this example, each of the rules would trigger and result in the updated annotations below (the example has been designed so that the rules trigger in a specified order; however, a person skilled in the art will realise that a set of rules may trigger in any given order):
John -> Name
works for -> VerbPhrase
Rob & Benn -> Organisation
. -> Punctuation
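The cascade of Rules 1 to 5 can be reproduced with a short Python sketch. This is an illustration only, under stated assumptions: tokens are [text, annotation] pairs, the ‘MaleName’ required by Rule 2 is treated as satisfied by the ‘Name’ annotation that Rule 1 produces, and the rules are simply retried in order until none of them fires. The function names are hypothetical.

```python
def rule1(t):  # dictionary rule: 'Benn' is a Name
    for tok in t:
        if tok[0] == "Benn" and tok[1] != "Name":
            tok[1] = "Name"
            return True
    return False

def rule2(t):  # capitalised Verb + Conjugate + Name -> Name (assumes Name ~ MaleName)
    for i in range(len(t) - 2):
        a, b, c = t[i], t[i + 1], t[i + 2]
        if (a[1] == "Verb" and a[0][0].isupper()
                and b[1] == "Conjugate" and c[1] == "Name"):
            a[1] = "Name"
            return True
    return False

def rule3(t):  # Verb + Preposition -> one VerbPhrase token
    for i in range(len(t) - 1):
        if t[i][1] == "Verb" and t[i + 1][1] == "Preposition":
            t[i:i + 2] = [[t[i][0] + " " + t[i + 1][0], "VerbPhrase"]]
            return True
    return False

def rule4(t):  # Name preceded by VerbPhrase 'works for' -> Organisation
    for i in range(1, len(t)):
        if t[i][1] == "Name" and t[i - 1] == ["works for", "VerbPhrase"]:
            t[i][1] = "Organisation"
            return True
    return False

def rule5(t):  # Organisation + Conjugate + Name + '.' -> one Organisation token
    for i in range(len(t) - 3):
        labels = [tok[1] for tok in t[i:i + 3]]
        if labels == ["Organisation", "Conjugate", "Name"] and t[i + 3][0] == ".":
            text = " ".join(tok[0] for tok in t[i:i + 3])
            t[i:i + 3] = [[text, "Organisation"]]
            return True
    return False

def run(tokens, rules):
    """Retry every rule until a full pass changes nothing."""
    fired, changed = [], True
    while changed:
        changed = False
        for rule in rules:
            if rule(tokens):
                fired.append(rule.__name__)
                changed = True
    return fired

tokens = [["John", "Name"], ["works", "Verb"], ["for", "Preposition"],
          ["Rob", "Verb"], ["&", "Conjugate"], ["Benn", "Noun"],
          [".", "Punctuation"]]
fired = run(tokens, [rule1, rule2, rule3, rule4, rule5])
print(fired)   # ['rule1', 'rule2', 'rule3', 'rule4', 'rule5']
print(tokens)  # [['John', 'Name'], ['works for', 'VerbPhrase'],
               #  ['Rob & Benn', 'Organisation'], ['.', 'Punctuation']]
```

Each rule depends on an annotation produced by an earlier one, so a second pass over the rules is needed only to confirm that nothing further triggers.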
Conventional text analytic applications parse documents looking for patterns that match the rules. So, in the example above, the application will first test ‘rule 1’ against the word ‘John’, then against the word ‘works’, then against the word ‘for’, and so on until the end of the document is reached.
Whenever a rule triggers and an annotation is changed, the cursor must be reset to the earliest point at which the change could impact the likelihood of another rule triggering. The parsing of the document must then be repeated until all matching rules have also triggered.
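The cost of resetting the cursor can be seen in a small Python sketch of such a scan. It is a simplified illustration, not a description of any specific application: only Rules 1 and 2 are modelled, each rule is tested at the cursor position, and the reset distance is assumed to be the length of the longest pattern minus one (the hypothetical constant MAX_PATTERN_LEN). As in the example sentence, ‘Name’ is assumed to satisfy the ‘MaleName’ condition of Rule 2.

```python
MAX_PATTERN_LEN = 3  # longest pattern among the rules below (assumption of this sketch)

def rule1(t, i):  # dictionary rule: 'Benn' is a Name
    if t[i][0] == "Benn" and t[i][1] != "Name":
        t[i][1] = "Name"
        return True
    return False

def rule2(t, i):  # capitalised Verb + Conjugate + Name -> Name
    if i + 2 < len(t):
        a, b, c = t[i], t[i + 1], t[i + 2]
        if (a[1] == "Verb" and a[0][0].isupper()
                and b[1] == "Conjugate" and c[1] == "Name"):
            a[1] = "Name"
            return True
    return False

def parse(tokens, rules):
    """Scan left to right; move the cursor back whenever a rule fires."""
    cursor, visits = 0, 0
    while cursor < len(tokens):
        visits += 1
        fired = any(rule(tokens, cursor) for rule in rules)
        if fired:
            # Earliest position whose pattern could include the changed token.
            cursor = max(0, cursor - (MAX_PATTERN_LEN - 1))
        else:
            cursor += 1
    return visits

tokens = [["John", "Name"], ["works", "Verb"], ["for", "Preposition"],
          ["Rob", "Verb"], ["&", "Conjugate"], ["Benn", "Noun"],
          [".", "Punctuation"]]
visits = parse(tokens, [rule1, rule2])
print(visits, "cursor positions visited for", len(tokens), "tokens")  # 13 ... 7
```

Rule 2 cannot match at ‘Rob’ on the first pass because ‘Benn’ is still a ‘Noun’; it matches only after Rule 1 fires and the cursor is moved back, which is why the seven tokens require thirteen cursor positions to be visited.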
After the test corpus has been parsed by the text analytics application, the impact of any changes in the dictionaries, rules or corpus content can only be determined by re-parsing the entire corpus.
Thus, there is a need in the art to provide an improved text analytics application that eliminates the need to re-parse the entire document corpus.