Some databases store an enormous number of documents. These documents can be stored in a single warehouse or distributed throughout many different repositories. As part of content management, these documents are periodically merged and groomed. In some instances, documents from two different repositories are merged into a larger repository. For example, multiple collections of documents are coalesced to reduce maintenance overhead.
One challenge in content management is identifying documents that duplicate one another. Duplicates emerge when documents or portions of documents are copied and stored again. In other situations, newer or updated versions of documents are stored, but the outdated versions are not deleted from storage.
For many reasons, the proliferation of duplicative documents is undesirable. Redundant copies require extra storage space. Further, duplicative documents burden resources, especially during document searches. If a document is irrelevant or outdated, then it can pollute a list of search results. Reducing the number of duplicate and overlapping documents (that is, documents containing portions of other documents) can reduce the number of documents shown in search results and, thus, enhance productivity.
Situations also exist in which information in documents reaches the end of its retention period. Regulatory compliance, for example, can dictate that certain information must be expunged from a repository. If the content of a document has been copied into other documents, those other documents need to be identified for pruning at the time the original document is deleted.
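
As a rough illustration of the identification problem described above, the following Python sketch flags documents that reuse a large share of another document's text. It is not the method of this disclosure; it is one common approach (word shingling with a containment score), shown only to make the problem concrete. The function names, the shingle size `k`, the `threshold` value, and the sample repository are all hypothetical choices made for the example.

```python
import re

def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Texts shorter than k words yield a single short shingle, or none.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def containment(source, candidate, k=3):
    """Fraction of the source's shingles that also appear in candidate."""
    src = shingles(source, k)
    if not src:
        return 0.0
    return len(src & shingles(candidate, k)) / len(src)

def find_copies(source, repository, threshold=0.5, k=3):
    """Return sorted IDs of documents that reuse at least `threshold` of source."""
    return sorted(
        doc_id
        for doc_id, text in repository.items()
        if containment(source, text, k) >= threshold
    )

# Hypothetical data: one document quotes the expiring text, one does not.
expiring = "Customer records must be deleted after seven years"
repository = {
    "memo": "Reminder: customer records must be deleted after seven years. Thanks.",
    "notes": "Lunch is at noon on Friday.",
}
flagged = find_copies(expiring, repository)
```

In this sketch, `flagged` holds the IDs of the documents that would need review when the expiring document is deleted. Containment is measured relative to the source document rather than symmetrically, so a long document that embeds a short one is still flagged. A real repository would likely need a more scalable index than this pairwise comparison.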