This invention pertains to apparatus and a method which are useable, in conjunction with the creation of a document data-stream, derived from the initiation of an imaging job, such as a print, copy, scan, fax or e-mail job, to create a storable and reviewable, content-informative audit trail. This audit trail is based upon extraction from such a data-stream of a small quantity of both text and imagery data that are sufficient to furnish a reviewing party with an understanding of the content of the document to which a selected data-stream relates. Audit-trail material is variously referred to herein also as a data-collection content surrogate, and as a data content sub-collection.
Content extraction can take place with respect to each, or fewer than all, of the different pages in a document. It can relate to only portions of one or more pages in a document, to sub-portions of text and imagery content, and in fact to any other content feature of an imaging job document data-stream which will be sufficient to inform a later-reviewing party about the nature of the content of the document. Additionally, stored audit-trail content material may be derived from a selective practice of abstracting different kinds of information drawn from pages in a document, and it may also be based upon later-performed content extraction from previously extracted material in order to minimize required storage space.
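By way of illustration only, the page-selective extraction described above might be sketched as follows. The invention does not prescribe any particular extraction algorithm, so everything here is an assumption made for the sketch: the names `Page`, `PageSurrogate`, `downsample` and `extract_surrogate` are hypothetical, page imagery is modeled as a plain grid of grayscale values, a text excerpt is simply the leading characters of a page, and image reduction is simple subsampling.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Page:
    """Hypothetical stand-in for one page of a document data-stream."""
    text: str
    image: List[List[int]]  # grayscale rows; an assumed, simplified image model


@dataclass
class PageSurrogate:
    """Small text-and-imagery extract kept in the audit trail for one page."""
    page_number: int
    text_excerpt: str
    thumbnail: List[List[int]]


def downsample(image: List[List[int]], step: int) -> List[List[int]]:
    """Keep every `step`-th pixel of every `step`-th row."""
    return [row[::step] for row in image[::step]]


def extract_surrogate(
    pages: Sequence[Page],
    page_numbers: Optional[Sequence[int]] = None,
    max_chars: int = 80,
    step: int = 4,
) -> List[PageSurrogate]:
    """Build a content surrogate from all pages, or from selected pages only.

    `page_numbers` is 1-based; None means every page is sampled.
    """
    selected = range(1, len(pages) + 1) if page_numbers is None else page_numbers
    surrogate = []
    for number in selected:
        page = pages[number - 1]
        surrogate.append(
            PageSurrogate(
                page_number=number,
                text_excerpt=page.text[:max_chars],
                thumbnail=downsample(page.image, step),
            )
        )
    return surrogate
```

Calling `extract_surrogate(pages, page_numbers=[1])` would sample only the first page, while omitting `page_numbers` would sample every page. Other selection and abstraction policies are equally consistent with the description.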
The present invention does not focus attention on any specific algorithm for performing extraction and/or reduction in storage size of extracted data, nor does it depend upon the specific location in a document data-stream and related system from where content extraction takes place. In general terms, such extraction may take place at any point in a system that is functionally downstream from the point at which the relevant data-stream is first created or initiated. Further, and as was just suggested briefly above, the invention contemplates that, while a first-level extraction and storage of reduced-content data may be quite sufficient for initial storage purposes, it may become desirable over time to reduce further the storage space occupied by such extracted material by implementing a practice of time-cyclic re-extraction and further reduction of document data content. Thus, as storage files grow large, the invention contemplates that these files may individually and internally be even further reduced, so long as the reduction “product” is still capable of informing a reviewing party about the nature of the document content from which the extracted information was first drawn.
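The time-cyclic re-extraction just described might, purely as an illustrative sketch, look like the following. The record layout, the 90-day age threshold, the halving policy and the minimum sizes are all assumptions chosen for the example and are not taken from the invention; the names `AuditRecord`, `re_extract` and `run_cycle` are hypothetical. The minimum sizes stand in for the requirement that the reduced "product" remain informative to a reviewing party.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class AuditRecord:
    """Hypothetical stored audit-trail entry for one imaging job."""
    job_id: str
    age_days: int
    text_excerpt: str
    thumbnail: List[List[int]]  # assumed grayscale grid
    passes: int = 0             # number of re-extraction passes applied so far


def re_extract(record: AuditRecord, min_chars: int = 20, min_side: int = 2) -> AuditRecord:
    """Reduce a stored record further, but never below an assumed informative floor."""
    # Halve the text excerpt, keeping at least `min_chars` characters.
    new_len = max(min_chars, len(record.text_excerpt) // 2)
    thumb = record.thumbnail
    # Halve the thumbnail only while both dimensions exceed the floor.
    if len(thumb) > min_side and len(thumb[0]) > min_side:
        thumb = [row[::2] for row in thumb[::2]]
    return replace(
        record,
        text_excerpt=record.text_excerpt[:new_len],
        thumbnail=thumb,
        passes=record.passes + 1,
    )


def run_cycle(records: List[AuditRecord], age_threshold_days: int = 90) -> List[AuditRecord]:
    """One periodic pass: re-extract records at or beyond the age threshold."""
    return [
        re_extract(r) if r.age_days >= age_threshold_days else r
        for r in records
    ]
```

Running `run_cycle` repeatedly shrinks older records until they reach the floor, after which further passes leave their content unchanged.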
Various illustrations are provided and discussed herein to show the breadth of capability offered by the apparatus and method of this invention. These illustrations should be understood to be representative of the practice and the structure of the invention, and not exhaustive or limiting of its scope of implementation.