Field
The present specification generally relates to systems, computer-program products, and methods for annotating documents and, more particularly, to systems, computer-program products, and methods for annotating documents with multiple entities that are extracted from a single compound noun phrase and found in a controlled vocabulary.
Technical Background
Electronic text documents may be annotated with information. Annotations may be provided in metadata, for example. Markup languages, such as XML, may be utilized to provide additional information regarding an electronic text document beyond the original text. In some cases, an electronic text document is annotated with information regarding the subject matter discussed within the electronic text document.
Compound noun phrases are multiple-word phrases that comprise at least one modifier and a head. For example, in the compound noun phrase “thin film,” the word “thin” is the modifier and the word “film” is the head. In some instances, a compound noun phrase may have multiple modifiers, such as “epitaxial thin film,” wherein both “epitaxial” and “thin” are modifiers that modify the head word “film.” Such compound noun phrases may be referred to as interdigitated terms. In the present example, the word “thin” appears between “epitaxial” and “film.” In current systems, a term annotation is disallowed on an electronic text document if meaningful words or tokens intervene between the words of the term. However, multiple phrases may be intended by an interdigitated term; in the present example, both “thin film” and “epitaxial film” may be intended. Because the intervening word “thin” blocks annotation of the latter, electronic text documents are not annotated with information regarding these hidden phrases.
Accordingly, a need exists for alternative methods for extracting information from single compound noun phrases to provide additional annotation information for electronic text documents.
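By way of a non-limiting illustration of the hidden phrases discussed above, the following Python sketch enumerates the candidate phrases that may be intended by a compound noun phrase and keeps those found in a controlled vocabulary. It is not the method of the present specification. The function names, the whitespace tokenization, the assumption that the final token is the head, and the sample vocabulary are all hypothetical and are chosen only to make the “epitaxial thin film” example concrete.

```python
from itertools import combinations


def candidate_phrases(compound: str) -> list[str]:
    """Enumerate phrases formed by each non-empty subset of modifiers plus the head.

    Assumption (illustrative only): tokens are separated by whitespace and the
    last token is the head; every preceding token is a modifier.
    """
    tokens = compound.split()
    if len(tokens) < 2:
        return []  # a single word has no modifier, so it is not a compound phrase
    *modifiers, head = tokens
    phrases = []
    for size in range(1, len(modifiers) + 1):
        # combinations() preserves the original left-to-right modifier order
        for subset in combinations(modifiers, size):
            phrases.append(" ".join(subset + (head,)))
    return phrases


def hidden_entities(compound: str, vocabulary: set[str]) -> list[str]:
    """Keep only the candidate phrases that appear in the controlled vocabulary."""
    return [phrase for phrase in candidate_phrases(compound) if phrase in vocabulary]


# Hypothetical controlled vocabulary used only for this example.
VOCABULARY = {"thin film", "epitaxial film"}

if __name__ == "__main__":
    for entity in hidden_entities("epitaxial thin film", VOCABULARY):
        print(entity)
```

Under these assumptions, the single phrase “epitaxial thin film” yields both “epitaxial film” and “thin film” as annotation candidates, whereas a matcher that requires contiguous words would find only “thin film.”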