The present invention relates generally to the field of data processing, and more specifically to annotating embedded tables for text analytics.
Text analytics systems extract free text from a range of document formats (e.g., plain text, Word, PDF). The extracted text may be treated as a sequence of bytes that is analyzed by the text analytics components. During the extraction process, critical elements such as tabulation or any proprietary tags specific to the source document type (e.g., PDF) are often lost. As a result, the extracted text loses its formatting when it is reintroduced into another document format.
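By way of illustration only, the loss of tabulation described above can be sketched in Python. The table contents and the function names `naive_extract` and `structured_extract` are hypothetical and are not taken from any particular extraction tool. The sketch assumes a table is available as rows of cells and contrasts a flattening extraction with one that preserves column boundaries.

```python
# Hypothetical table as it might exist in a source document (rows of cells).
# The second row has an empty middle cell.
table = [
    ["Name", "Dose", "Frequency"],
    ["Aspirin", "", "daily"],
]


def naive_extract(rows):
    """Flatten rows to free text, dropping empty cells and column boundaries."""
    return "\n".join(" ".join(cell for cell in row if cell) for row in rows)


def structured_extract(rows):
    """Keep column boundaries by joining every cell with a tab, even when empty."""
    return "\n".join("\t".join(row) for row in rows)


flat = naive_extract(table)
kept = structured_extract(table)

# In the flattened text the second row has only two tokens, so it is no longer
# possible to tell whether "daily" belongs under "Dose" or under "Frequency".
print(flat.splitlines()[1].split(" "))

# With the tab delimiters retained, the empty cell still occupies its column.
print(kept.splitlines()[1].split("\t"))
```

In the flattened form, the association between a cell and its column header cannot be recovered from the text alone, which is the kind of information loss that affects downstream text analytics.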