1. Field
This application relates generally to data extraction, and more specifically to a system and method for data extraction from a portable document format (PDF) file.
2. Related Art
PDF is a format for storing, viewing and publishing digital content. A PDF file can include different types of data (e.g. text, bitmaps, and images). A PDF file can be composed of a sequence of pages. Each page can include text elements, graphics objects and external image objects. A text element can include characters, position information and font information. Graphics objects include information about lines and curves. External image objects contain information about rectangular images.
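As an illustration only, the page structure described above can be sketched as a small data model. The class and field names (`Document`, `Page`, `TextElement`, `GraphicsObject`, `ImageObject`) are hypothetical and chosen for this sketch. They are not taken from the PDF specification or from any particular library.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in page coordinates


@dataclass
class TextElement:
    characters: str   # the characters of the element
    position: Point   # where the element is placed on the page
    font: str         # font information, reduced here to a name


@dataclass
class GraphicsObject:
    kind: str             # "line" or "curve" in this sketch
    points: List[Point]   # end points or control points


@dataclass
class ImageObject:
    # Rectangular image area as (x, y, width, height); hypothetical layout.
    bounds: Tuple[float, float, float, float]


@dataclass
class Page:
    text_elements: List[TextElement] = field(default_factory=list)
    graphics_objects: List[GraphicsObject] = field(default_factory=list)
    image_objects: List[ImageObject] = field(default_factory=list)


@dataclass
class Document:
    # A file is modeled as a sequence of pages.
    pages: List[Page] = field(default_factory=list)


# Build a one-page document containing one object of each kind.
page = Page(
    text_elements=[TextElement("Hello", (72.0, 720.0), "Times-Roman")],
    graphics_objects=[GraphicsObject("line", [(72.0, 710.0), (300.0, 710.0)])],
    image_objects=[ImageObject((72.0, 400.0, 200.0, 150.0))],
)
doc = Document(pages=[page])
```

A real PDF file carries considerably more detail per object. The sketch keeps only the attributes named in the paragraph above.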
The content of a PDF file is not guaranteed to be a correct logical representation of the text. For example, the objects in the document are not guaranteed to appear in a user-readable order or in any other logical order. This is because the content can be optimized for efficient rendering on screen or in print rather than for parsing and extraction. For instance, all text in a particular font might be grouped together in the file regardless of where it occurs on the page.
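The ordering problem can be made concrete with a small sketch. The `TextRun` type, the sample strings and the coordinates are all hypothetical. The sketch assumes a single-column page whose coordinate origin is at the bottom-left, so a larger `y` means higher on the page. It shows only how file order and visual order can differ; it is not a general extraction method.

```python
from typing import List, NamedTuple


class TextRun(NamedTuple):
    text: str
    font: str
    x: float  # horizontal position on the page
    y: float  # vertical position; larger means higher on the page


# Hypothetical file order: runs are grouped by font, not by position.
file_order: List[TextRun] = [
    TextRun("Quarterly Report", "Helvetica-Bold", 72.0, 720.0),
    TextRun("Summary", "Helvetica-Bold", 72.0, 640.0),
    TextRun("Revenue rose in the third quarter.", "Times-Roman", 72.0, 700.0),
    TextRun("Costs were flat.", "Times-Roman", 72.0, 620.0),
]


def text_in_file_order(runs: List[TextRun]) -> List[str]:
    """Return the text exactly as a naive reader of the file would see it."""
    return [run.text for run in runs]


def text_in_page_order(runs: List[TextRun]) -> List[str]:
    """Return the text top to bottom, then left to right.

    Assumes a single-column layout; multi-column pages need more than a sort.
    """
    ordered = sorted(runs, key=lambda run: (-run.y, run.x))
    return [run.text for run in ordered]


naive = text_in_file_order(file_order)
visual = text_in_page_order(file_order)
```

Here `naive` lists both headings before either body sentence, while `visual` interleaves each heading with the sentence printed beneath it.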