1. Field
Embodiments of the invention relate to indexing documents using internal index sets.
2. Description of the Related Art
Documents, such as WORD® documents and EXCEL® documents, may have associated document meta data (e.g., who created the document and a creation date) that may be used for indexing documents (WORD and EXCEL are trademarks of Microsoft Corporation in the United States, other countries, or both). However, the available meta data is limited, and it would be useful for a user to customize terms for indexing these documents.
In addition, ADOBE® Portable Document Format (PDF) is a document architecture introduced by Adobe Systems Incorporated in 1993 (ADOBE is a trademark of Adobe Systems Incorporated in the United States, other countries, or both). Originally created for printing, PDF documents are now also found in great numbers on the internet. In fact, PDF has become the de facto standard for internet-based documents.
Because of the internet explosion, companies are quickly moving away from their older proprietary print formats in favor of PDF. This move allows them to produce printed copies of statements (e.g., invoices) as well as host the same version of the statement for viewing on the Web (also known as the World Wide Web or WWW). Prior to this move, documents were converted from the proprietary data type to PDF. As part of this move, companies are uncovering architectural issues with the PDF format as it relates to massive, single PDF documents that include multiple statements. This type of PDF document is called a PDF report document.
For example, in order to access a single statement within a PDF report document, unique pieces of information (i.e., indexes, also sometimes called meta data) are extracted so that a user can search for a particular document. This technique of breaking up the PDF report document into individual documents and extracting indexes for each of the individual documents is called indexing. The typical technique for extracting indexes from a PDF report document is to search the document for text in certain predetermined locations; these predetermined locations are called the bounding boxes of the text in PDF documents.
In order to extract the text, each page of the PDF document is first graphically rendered. Then, each word of each PDF page is examined in order to determine whether the word is inside a bounding box. This technique requires numerous graphics, font, and floating-point operations, which cause it to be slow, especially as PDF documents have become larger. That is, known indexers use graphical techniques to extract data, and these techniques are very resource intensive and prone to errors (e.g., due to font metrics, bounding boxes with rounding errors, etc.).
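The bounding box test described above can be illustrated with a minimal Python sketch. It is not taken from any particular indexer: the names Box, Word, and extract_index, the coordinate values, and the assumption that a renderer has already produced a rectangle for each word are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in page coordinates (hypothetical units)."""
    left: float
    bottom: float
    right: float
    top: float

    def contains(self, other: "Box") -> bool:
        # A word counts as inside only if its whole rectangle fits.
        return (self.left <= other.left and self.bottom <= other.bottom
                and other.right <= self.right and other.top <= self.top)


@dataclass(frozen=True)
class Word:
    text: str
    box: Box  # assumed to come from graphically rendering the page


def extract_index(words: List[Word], region: Box) -> str:
    """Join the text of every word that lies inside the bounding box."""
    return " ".join(w.text for w in words if region.contains(w.box))


# Hypothetical rendered words on one page of a statement.
page_words = [
    Word("Account", Box(72.0, 700.0, 120.0, 712.0)),
    Word("12345", Box(125.0, 700.0, 160.0, 712.0)),
    Word("Total", Box(72.0, 100.0, 100.0, 112.0)),
]

# Hypothetical predetermined location of the account number.
account_region = Box(120.0, 695.0, 200.0, 715.0)
index_value = extract_index(page_words, account_region)

# A word whose computed edge overshoots the region by a tiny amount
# is missed entirely, which illustrates the rounding error problem.
shifted = [Word("12345", Box(125.0, 700.0, 200.001, 712.0))]
missed_value = extract_index(shifted, account_region)
```

In this sketch, every word on every page is compared against the region using floating-point comparisons, so the cost grows with document size, and a small difference in a computed word edge is enough to drop an index value.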
Thus, there is a need for indexing documents using internal index sets.