The invention relates generally to data processing and, more specifically, to processing misaligned annotations between a source file and an offset annotation file.
In-line annotations are often used in tagged files such as HTML and XML. The annotations are considered ‘in-line’ because they are placed directly within the corresponding text file; the text file, together with its associated tags, can then be parsed and indexed for use in various applications.
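By way of illustration only, the following minimal sketch shows how in-line tags embedded in a text might be parsed into (tag, text) pairs before indexing. It uses the `html.parser` module of the Python standard library. The class name `InlineTagCollector`, the function `inline_spans`, and the sample tag names `person` and `location` are hypothetical and do not come from the source.

```python
from html.parser import HTMLParser


class InlineTagCollector(HTMLParser):
    """Collects (tag, text) pairs for text found inside in-line tags."""

    def __init__(self):
        super().__init__()
        self._open = []  # stack of currently open tag names
        self.spans = []  # collected (tag, text) pairs

    def handle_starttag(self, tag, attrs):
        self._open.append(tag)

    def handle_endtag(self, tag):
        # Pop only when the end tag matches the innermost open tag.
        if self._open and self._open[-1] == tag:
            self._open.pop()

    def handle_data(self, data):
        # Text outside any tag is ignored; tagged text is recorded
        # under the innermost open tag.
        if self._open:
            self.spans.append((self._open[-1], data))


def inline_spans(marked_up):
    """Parse a marked-up string and return its tagged text spans."""
    collector = InlineTagCollector()
    collector.feed(marked_up)
    collector.close()
    return collector.spans


sample = "<person>Alice</person> visited <location>Paris</location>."
print(inline_spans(sample))
```

Because the tags travel with the text, no separate file is needed to recover which span carries which tag.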
Where a document or file does not contain in-line annotations or markup, one can use a tagger (e.g., a named entity tagger or part-of-speech tagger) to tag the document, which is then indexed. Alternatively, offset annotation techniques may be used. An offset annotation file includes annotations that are associated with a source file but are not ‘in-line’ with it. In particular, the source file itself is unmarked; the offset annotation file instead contains the information that tells an analysis engine which tags are associated with which text spans, for subsequent use in indexing or other text processing.
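As a non-limiting sketch of the offset approach, the example below keeps the source text unmarked and stores each annotation separately as a character range plus a tag. The record layout (`start`, `end`, `tag`), the use of an exclusive end offset, and the names `OffsetAnnotation` and `annotated_spans` are assumptions made for illustration; an actual offset annotation file may use a different format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OffsetAnnotation:
    start: int  # index of the first character of the span in the source
    end: int    # index one past the last character (exclusive)
    tag: str    # label associated with the span


def annotated_spans(source, annotations):
    """Pair each tag with the text span it points at in the unmarked source."""
    return [(a.tag, source[a.start:a.end]) for a in annotations]


# The source text carries no markup of its own.
source = "Alice visited Paris."

# Hypothetical contents of an offset annotation file for that source.
annotations = [
    OffsetAnnotation(0, 5, "PERSON"),
    OffsetAnnotation(14, 19, "LOCATION"),
]

print(annotated_spans(source, annotations))
```

Because the annotations refer to the source only by position, any change to the source text that shifts characters would cause the stored offsets to point at the wrong spans.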