Present invention embodiments are related to methods, systems and computer program products for identifying common data across documents, grouping contiguous and non-contiguous common data into a larger resource, and compressing the documents.
Correspondence with customers is critical to the operation of any successful company. Much of that correspondence consists of customer statements. Typically, when a company generates customer statements, the statements are grouped into a report. A report can contain millions of documents, one for each customer. Much of the information in the customer statements is duplicate information. The duplicate information may be, for example, a company logo, company contact information, or overlays that give the statement its structure. The duplicate information can be removed from each statement, stored once, and replaced in the statement with an identifier. A large amount of storage space may be saved by performing this operation. On retrieval, the identifier in the statement is removed and the duplicate information is reinserted into the customer statement for presentation to the customer. Unfortunately, for small amounts of duplicate information, the identifier could be larger than the duplicate information it replaces. Further, the logic and processing time required to extract and later reinsert small amounts of duplicate information may be too expensive relative to the amount of storage saved.
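By way of illustration only, the replacement of duplicate information with an identifier, and the size trade-off noted above, may be sketched as follows. The sketch assumes each document has already been split into segments; the names `ID_SIZE`, `make_id`, `compress`, and `restore`, the 8-byte identifier size, and the use of a truncated SHA-256 digest as the identifier are hypothetical choices made for this example and are not taken from any particular embodiment.

```python
import hashlib
from collections import Counter

ID_SIZE = 8  # assumed identifier size in bytes (hypothetical)


def make_id(segment: bytes) -> bytes:
    # Truncated digest used as the identifier; a real system would need
    # to handle collisions, which this sketch ignores.
    return hashlib.sha256(segment).digest()[:ID_SIZE]


def compress(documents):
    """Replace segments shared across documents with identifiers.

    documents: list of documents, each a list of bytes segments.
    Returns (store, compressed), where store maps identifier -> segment.
    """
    # Count how many documents each segment appears in.
    counts = Counter(seg for doc in documents for seg in set(doc))
    store = {}
    compressed = []
    for doc in documents:
        out = []
        for seg in doc:
            # Only replace a duplicate when the identifier is smaller
            # than the data it stands for.
            if counts[seg] > 1 and len(seg) > ID_SIZE:
                key = make_id(seg)
                store[key] = seg
                out.append(("ref", key))
            else:
                out.append(("raw", seg))
        compressed.append(out)
    return store, compressed


def restore(store, doc):
    # On retrieval, swap each identifier back for the stored segment.
    return [store[value] if kind == "ref" else value for kind, value in doc]


documents = [
    [b"ACME Corp logo bytes...", b"Dear Alice", b"OK"],
    [b"ACME Corp logo bytes...", b"Dear Bob", b"OK"],
]
store, compressed = compress(documents)
restored = [restore(store, doc) for doc in compressed]
```

In this sketch the shared logo segment is stored once and referenced from both documents, while the shared two-byte segment `b"OK"` is left in place because an 8-byte identifier would be larger than the data it replaces.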