1. Field of the Invention
The present invention relates generally to systems and methods which preserve important data, and especially to data preservation in systems having resource constraints.
2. Background of the Invention
Whereas the determination of a publication, technology, or product as prior art relative to the present invention requires analysis of certain dates and events not disclosed herein, no statements made within this Background of the Invention shall constitute an admission by the Applicants of prior art unless the term “Prior Art” is specifically stated. Otherwise, all statements provided within this Background section are “other information” related to or useful for understanding the invention.
Data retention and storage requirements are growing rapidly in all industries. Information such as customer account information, financial transaction data, online catalogs, literature libraries, and historical records is being stored online for long periods of time. Much of this information is required to be maintained by law, regulation, or policy, such as tax regulations, securities and exchange rules, or even credit card merchant agreements.
Other types of data which are casually stored long term are also increasing in volume at a rapid rate, such as personal (private) and employee retention of electronic messages (“email”) and the wide variety of attachments to those messages (e.g. word processor files, presentation files, and movie files). Such archival storage requirements can be significant when considered over hundreds or thousands of email users.
All of this data must be stored somewhere, such as in a database, or in a file system on a disk drive. In more formal storage environments, a “data warehouse” may be established, with formalized retention policies, defined storage architectures, and personnel allocated to the data maintenance task.
Data maintenance issues emerge when data systems under constant growth reach finite storage limits. Often, the response to this data growth is to defer data management responsibilities, such as by simply increasing the hardware storage capacity so that larger amounts of data can be maintained.
Although data purging and data archiving methods do exist to control data storage consumption, they are often not sophisticated enough to look at data in small units, each with its own persistence priority. Lacking this finer-grained view, such methods delete whole data units, such as entire files. This approach is primitive in terms of both the resources freed and the information integrity maintained.
To illustrate, current techniques for dealing with overloaded email mailboxes involve treating each email message individually, and archiving or deleting it in its entirety based on configurable rules. While these techniques allow the specification of a variety of predicates governing when archiving or deletion is to occur (e.g. a period of time, or filters based on keyword or origin), the actions available are primitive (relocate, archive, delete) and do not take advantage of the inherent nature of the data.
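The rule-based mailbox handling described above can be sketched as follows. This is a minimal illustration only, not any particular email product's implementation: the `Message` fields, the helper names `older_than` and `from_origin`, the `apply_rules` function, and the sample addresses are all hypothetical. It shows that however varied the predicates are, each rule can only act on a message as a whole.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, List, Tuple


@dataclass
class Message:
    sender: str
    subject: str
    received: datetime
    size_bytes: int


# A rule pairs a predicate with one whole-message action (hypothetical model).
Rule = Tuple[Callable[[Message], bool], str]


def older_than(days: int, now: datetime) -> Callable[[Message], bool]:
    """Predicate: message was received more than `days` days before `now`."""
    cutoff = now - timedelta(days=days)
    return lambda m: m.received < cutoff


def from_origin(domain: str) -> Callable[[Message], bool]:
    """Predicate: message was sent from the given domain."""
    return lambda m: m.sender.endswith("@" + domain)


def apply_rules(mailbox: List[Message], rules: List[Rule]):
    """Apply the first matching rule to each message; unmatched messages are kept."""
    kept, archived, deleted = [], [], []
    buckets = {"keep": kept, "archive": archived, "delete": deleted}
    for msg in mailbox:
        action = next((act for pred, act in rules if pred(msg)), "keep")
        buckets[action].append(msg)  # the entire message moves; nothing finer
    return kept, archived, deleted


now = datetime(2024, 1, 1)
mailbox = [
    Message("alice@example.com", "Q3 report", datetime(2023, 12, 20), 2_000_000),
    Message("bob@example.com", "Old minutes", datetime(2022, 5, 1), 50_000),
    Message("news@ads.example", "Sale", datetime(2023, 12, 30), 10_000),
]
rules = [
    (from_origin("ads.example"), "delete"),
    (older_than(365, now), "archive"),
]
kept, archived, deleted = apply_rules(mailbox, rules)
```

In this sketch the large, recent "Q3 report" message matches no rule and so continues to occupy its full size, while the older message is archived in its entirety; no rule can retain only part of a message.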
Another attempt to alleviate long term data storage requirements is to compress data using various compression algorithms. Many of these processes monitor access activity for data units, such as entire folders or individual files, and compress a file or folder when access to it becomes sufficiently infrequent. However, the compressed data units are typically not directly usable by their counterpart application programs. For example, a compressed email file cannot be opened by the originating email program, and a compressed database file cannot be opened by the originating database application. Most of these processes are therefore triggered to decompress a compressed data unit when an application program attempts to access it. This approach has several disadvantages: it severely slows access to compressed data while the application waits for decompression to complete, and it does little to alleviate storage requirements for data units which are accessed occasionally (e.g. often enough to keep compression from being performed). Lossless compression techniques, while they can help to reduce the space occupied by a piece of data, do not ultimately solve the problem, because the data will continue to grow while the reduction achievable by lossless compression is limited.
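The access-triggered compression approach described above can be sketched as follows. This is an assumption-laden illustration rather than any specific product's behavior: the `IdleCompressingStore` class and its method names are hypothetical, Python's standard `zlib` module stands in for whatever lossless algorithm such a process might use, and time is modeled as plain numbers supplied by the caller.

```python
import random
import zlib


class IdleCompressingStore:
    """Hypothetical store that compresses data units left idle too long."""

    def __init__(self, idle_threshold: float):
        self.idle_threshold = idle_threshold
        self._items = {}  # name -> [payload, is_compressed, last_access]

    def put(self, name: str, data: bytes, now: float) -> None:
        self._items[name] = [data, False, now]

    def sweep(self, now: float) -> None:
        """Compress every uncompressed unit idle for at least the threshold."""
        for entry in self._items.values():
            payload, compressed, last_access = entry
            if not compressed and now - last_access >= self.idle_threshold:
                entry[0], entry[1] = zlib.compress(payload), True

    def get(self, name: str, now: float) -> bytes:
        """Return the data, decompressing first if needed (the access penalty)."""
        entry = self._items[name]
        if entry[1]:
            entry[0], entry[1] = zlib.decompress(entry[0]), False
        entry[2] = now
        return entry[0]

    def stored_size(self, name: str) -> int:
        return len(self._items[name][0])


text = b"status report " * 500  # highly repetitive, compresses well
# Seeded pseudo-random bytes: a stand-in for data with little redundancy.
noise = random.Random(0).getrandbits(8 * 7000).to_bytes(7000, "big")

store = IdleCompressingStore(idle_threshold=100.0)
store.put("report", text, now=0.0)
store.put("busy", text, now=0.0)
store.put("noise", noise, now=0.0)

store.get("busy", now=60.0)  # occasional access resets the idle clock
store.sweep(now=120.0)

report_compressed_size = store.stored_size("report")
busy_size = store.stored_size("busy")
noise_size = store.stored_size("noise")
restored = store.get("report", now=130.0)  # forces decompression
```

Under these assumptions, the idle "report" unit shrinks after the sweep, the occasionally accessed "busy" unit is never compressed, the low-redundancy "noise" unit gains essentially nothing from compression, and reading "report" restores it to its full size.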