The following patents or publications are noted, and are hereby incorporated by reference in their entirety:
U.S. Pat. No. 6,581,056 to Rao et al., issued Jun. 17, 2003, describes an information retrieval system and method for conducting a content analysis on a collection of documents; and
U.S. Pat. No. 5,442,778 to Pedersen et al., issued Aug. 15, 1995, teaches a Scatter-Gather browsing tool, and associated method, where a user is presented with descriptions of document groups selected from a document collection. Based upon the descriptions, the user can then select one or more groups for further study, where the selected groups are then recombined or gathered, re-clustered, and presented to the user.
Although the patents above indicate the use of clustering techniques to group documents on a network or database, it is believed that such patents were directed primarily to the textual content of documents, and not to other aspects or characteristics of the information associated with the document, particularly printing characteristics. Moreover, the patents did not proceed so far as to automatically select documents, or pages from documents, but were utilized to assist a user in the identification and review of such documents. The key distinction of the method disclosed herein is that the pages and/or documents are selected using visual appearance criteria—and are not selected as a function of information content.
Disclosed herein is a method for automatically selecting sample documents or pages from a large collection using estimated rendered visual appearance criteria. One particular application of such a method is the selection, for proofing purposes, of one or more representative or extreme pages from a large document scheduled for printing. Other applications can include analysis of a large corpus of data within a knowledge base, as a precursor to preparing the one or more transformations necessary to render the corpus suitable for presentment.
In accordance with one aspect of the method, characteristics of documents or pages are represented in a multidimensional vector space, and clustering techniques are used to group together similar pages. Characteristics may include content metadata such as color encoding descriptors and font names, or page information such as area coverage, font size, image count, etc., that are then represented and grouped. Typical pages can be chosen from the centers of identified clusters, whereas exceptional pages can be selected from cluster extrema. In this manner, both the quality of the majority and the quality risk of the outlier pages can be assessed. The multidimensional vector space can be tailored according to the application requirements. In the case of a pre-press review or proofing system, proofing requirements might dictate that the vector space be representative of printing characteristics (color imagery, layout, etc.) as will be described in more detail below.
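By way of illustration only, and not as a limitation, the selection of typical and exceptional pages described above might be sketched as follows. The sketch assumes that each page has already been reduced to a numeric vector whose components are scaled to a comparable range, and it uses a simple k-means procedure with Euclidean distance as one possible clustering technique; the disclosure does not prescribe a particular algorithm. The function names, the three example features (area coverage, font size, image count), and the sample values are hypothetical.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid_of(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def assign(vectors, centroids):
    """Index of the nearest centroid for every vector."""
    return [
        min(range(len(centroids)), key=lambda j: distance(v, centroids[j]))
        for v in vectors
    ]


def kmeans(vectors, k, iterations=20):
    """Plain k-means with deterministic farthest-point seeding (assumes k <= len(vectors))."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        # Seed the next centroid with the vector farthest from all current seeds.
        farthest = max(vectors, key=lambda v: min(distance(v, c) for c in centroids))
        centroids.append(list(farthest))
    for _ in range(iterations):
        labels = assign(vectors, centroids)
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = centroid_of(members)
    return assign(vectors, centroids), centroids


def select_pages(vectors, k):
    """For each cluster, pick the page nearest its center and the page farthest from it."""
    labels, centroids = kmeans(vectors, k)
    picks = []
    for j in range(k):
        members = [i for i, lab in enumerate(labels) if lab == j]
        if not members:
            continue
        typical = min(members, key=lambda i: distance(vectors[i], centroids[j]))
        exceptional = max(members, key=lambda i: distance(vectors[i], centroids[j]))
        picks.append({"cluster": j, "typical": typical, "exceptional": exceptional})
    return picks


# Hypothetical pages: [area coverage, scaled font size, scaled image count]
pages = [
    [0.10, 0.50, 0.0],  # page 0: mostly text
    [0.12, 0.50, 0.0],  # page 1: mostly text
    [0.11, 0.48, 0.0],  # page 2: mostly text
    [0.30, 0.55, 0.0],  # page 3: text with unusually heavy coverage
    [0.80, 0.20, 0.8],  # page 4: image heavy
    [0.82, 0.22, 0.8],  # page 5: image heavy
    [0.85, 0.20, 1.0],  # page 6: image heavy, most images
]

picks = select_pages(pages, k=2)
for pick in picks:
    print(pick)
```

In this sketch the page nearest a cluster center stands in for the typical page, and the member farthest from that center stands in for the exceptional page. Other distance measures or clustering techniques could be substituted without changing the overall selection scheme.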
It is anticipated that the described method will ease the task of visually proofing large and/or variable information documents, promising substantial savings in time and cost over the current manual proofing of large documents. Because proof prints of every page may not be required, it is reasonable to print only selected pages for proofing purposes; a combination of typical and extreme pages will suffice, and the described method may be employed to select such representative (typical) and exceptional pages. One embodiment of the method described is for production workflows (e.g., DigiPath, DocuSP, Enterprise Output Management System) and other workflow management systems, particularly those related to pre-press or pre-flight analysis, document creation and layout applications, variable imaging applications, and digital front ends.
Accordingly, a method is described for automatically selecting sample pages from a large document for proofing purposes. More specifically, the method characterizes pages in a multidimensional vector space and then uses a cluster-analysis technique to group the pages so that representative pages from at least one group (or a page not within a group) may be automatically selected for proofing.
Disclosed in an embodiment herein is a method for automated document subset selection from a stored body of knowledge, comprising the steps of: accessing the body of knowledge, including a plurality of documents therein; characterizing at least a portion of the body of knowledge in a characterization space; grouping the documents into a plurality of groups; and automatically selecting, based upon said grouping, a subset of the body of knowledge for presentment, including transform preparation for rendering and proofing.
Disclosed in another embodiment herein is a method for automated selection of proofing pages from a print job, comprising the steps of: receiving a multi-page document for printing; characterizing a plurality of pages of the multi-page document in a characterization space; grouping at least some of the plurality of pages into at least one group; and selecting, based upon the grouping, at least one page for presentment as a proof page.
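As a non-limiting illustration of the characterizing step recited above, a page might be mapped into a characterization space as sketched below. The sketch assumes that page information (area coverage, font size, image count) and content metadata (font names, color encoding descriptors) are already available in a simple dictionary; the dictionary keys, the vocabularies, and the scaling limits are hypothetical and are not taken from any particular page description format.

```python
def characterize(page, font_vocab, color_vocab, max_font_size=72.0, max_images=10):
    """Map one page description to a numeric vector with components in [0, 1].

    Numeric page information is scaled by assumed upper limits, and
    categorical metadata is encoded as one indicator per vocabulary entry.
    """
    vector = [
        page["area_coverage"],                        # already a fraction of the page
        min(page["font_size"] / max_font_size, 1.0),  # scaled font size
        min(page["image_count"] / max_images, 1.0),   # scaled image count
    ]
    # One indicator per known font name.
    vector += [1.0 if name in page["fonts"] else 0.0 for name in font_vocab]
    # One indicator per known color encoding descriptor.
    vector += [1.0 if name in page["color_spaces"] else 0.0 for name in color_vocab]
    return vector


# Hypothetical vocabularies and page description.
font_vocab = ["Helvetica", "Times"]
color_vocab = ["CMYK", "RGB"]
page = {
    "area_coverage": 0.25,
    "font_size": 18,
    "image_count": 2,
    "fonts": {"Helvetica"},
    "color_spaces": {"CMYK"},
}

vector = characterize(page, font_vocab, color_vocab)
print(vector)
```

Scaling every component to a comparable range keeps any single characteristic from dominating a distance-based grouping; the relative weight given to each component could be adjusted to suit the proofing requirements of a given application.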
Disclosed in yet another embodiment herein is, in electronic document processing, a method for pre-flight checking at least one electronic document intended for printing including the steps of: characterizing a plurality of pages of the document in a characterization space; grouping at least some of the plurality of pages into at least one group; and selecting, based upon the grouping, at least one page for presentment as a proof page.
The following disclosure will be characterized in connection with a preferred embodiment; however, it will be understood that there is no intent to limit the disclosure or scope to the embodiment described. On the contrary, the intent is to cover all alternatives, modifications, and equivalents as may be included within the spirit and scope of the appended claims.