Recently, businesses, individuals and other entities have increasingly used digitally-based information systems for documentation. By way of example and not by way of limitation, such digitally-based information may include aircraft maintenance manuals, parts catalogs or other documentation employed in lieu of traditional paper or microfilmed manuals or other documents. Many of the digitally-based information systems use the industry-standard Portable Document Format (PDF) for document storage. One benefit of storing documents formatted in PDF is that the documents so stored have a substantially fixed appearance regardless of the device used to display them. Such uniformity of appearance may give documents the look and feel of paper versions of the documents.
However, as is the case when using paper documents, there may be no comprehensive index into the subject matter of a PDF document. This deficiency detracts from the document's overall usability whether the document is configured in paper form or as digitally-based information. A company may have a library that includes many thousands of documents in a digital format. Each document may be configured with its own respective layout and authoring idiosyncrasies. It could be an extremely difficult task to write and maintain all the software needed to extract appropriate data from so many varied digitally-formatted documents to build an index for each of those documents.
By way of example and not by way of limitation, the information needed to create a meaningful index for a maintenance document may be gleaned by examining three general parts of the document: titles, tables, and repeating text patterns. Certain textual characteristics (e.g., layout, capitalization and underlying patterns) of the document parts may be determined by examining a representative sample of their occurrences in an exemplary document configured using substantially the same digital format.
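By way of illustration only, the following minimal Python sketch shows how capitalization and a repeating text pattern might be used to collect index entries. It assumes that the text of each page has already been extracted from the document, and it omits table handling. The names `TITLE_RE`, `PART_RE`, `build_index` and `sample_pages`, the NN-NN-NN reference format, and the sample text are all hypothetical and are not taken from any particular document or tool.

```python
import re
from collections import defaultdict

# Hypothetical patterns; each family of documents would need its own.
# A "title" is assumed here to be a line set entirely in capitals.
TITLE_RE = re.compile(r"^[A-Z][A-Z0-9 ,/&-]{3,}$")
# A "repeating text pattern" is assumed here to be an NN-NN-NN reference.
PART_RE = re.compile(r"\b\d{2}-\d{2}-\d{2}\b")


def build_index(pages):
    """Map each title or reference to the sorted page numbers where it occurs."""
    index = defaultdict(set)
    for page_no, text in enumerate(pages, start=1):
        for line in text.splitlines():
            stripped = line.strip()
            # Capitalization test: the whole line must match the title pattern.
            if TITLE_RE.match(stripped):
                index[stripped].add(page_no)
            # Pattern test: collect every reference that appears in the line.
            for ref in PART_RE.findall(stripped):
                index[ref].add(page_no)
    return {key: sorted(nums) for key, nums in index.items()}


# Invented sample text standing in for two extracted pages.
sample_pages = [
    "LANDING GEAR\nRemove the wheel per 32-41-01.",
    "HYDRAULIC POWER\nSee 32-41-01 and 29-11-00 before servicing.",
]
index = build_index(sample_pages)
for entry in sorted(index):
    print(entry, index[entry])
```

In this sketch the two regular expressions are the only parts specific to a document family, which is why examining a representative sample of titles and patterns may be enough to adapt it.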
It would be useful to have a software tool that automatically extracts and indexes the desired titles, tables, and other text patterns from the document. It would be useful if building such an automatic indexing tool required only some knowledge of pattern recognition and regular expressions, and no specific computer programming skills on the part of a user. It would also be useful if the tool could be constructed automatically from information entered by a user who is merely familiar with the contents of the document and who lacks significant computer programming skills.
There is a need for a method and apparatus for automatically creating an index for a digitally-formatted document.