Computer systems contain large amounts of information. This information includes personal information, such as financial information, customer/client/patient contact information, audio/visual information, and much more. This information also includes information related to the correct operation of the computer system, such as operating system files, application files, user settings, and so on. With the increased reliance on computer systems to store critical information, the importance of protecting information has grown. Traditional storage systems receive an identification of a file to protect, then create one or more secondary copies, such as backup files, containing the contents of the file. These secondary copies can later be used to restore the original data should anything happen to it.
In corporate environments, protecting information is generally part of a routine process that is performed for many computer systems within an organization. For example, a company might back up critical computing systems related to e-commerce, such as databases, file servers, web servers, and so on, as part of a daily, weekly, or monthly maintenance schedule. The company may also protect computing systems used by each of its employees, such as those used by an accounting department, marketing department, engineering department, and so forth.
Although each computer system contains certain unique information, many systems may contain very similar information. For example, although a computing system used by a marketing employee and a computing system used by an engineering employee will generally contain unique information created by each employee in the course of their work, both computing systems will likely have the same operating system installed, with thousands of identical or similar files used by the operating system. Similarly, both computing systems will likely have at least some similar application programs installed, such as a word processor, spreadsheet, Internet browser, and so on. Both systems may also have similar corporate information. For example, each employee may have an electronic copy of an employee manual distributed by the company. Information other than files may also be identical or similar between systems. For example, user settings and preferences may have similar default values on each system and application programs may contain similar templates on each system that are stored as application-specific information. As another example, several employees may have received a copy of the same email, and the email may be stored in each employee's electronic mailbox.
As a result of the amount of redundant information in an organization, secondary copies of an organization's information are often very large and can require the purchase of expensive storage devices and storage media. The restoration of data in the event of data loss is also slowed by the large size of the secondary copies. As the size of secondary copies increases, locating and restoring information requires more actions to be taken. For example, it may be necessary to search many tapes or other media to find the correct secondary copy. The great quantity of storage media, such as tapes, may mean that some secondary storage media have been moved offsite, requiring that they first be retrieved before information can be recovered from them. Each of these factors increases the cost of protecting information and the time required to recover information in the event of data loss. Quick recovery of information is often critical to today's businesses, and any additional delay can affect business operations and customers' satisfaction with the business.
Single instancing in a data management system is the process of attempting to store only a single instance of each file. Some prior systems permit data de-duplication, or single instancing, at a file level or at a block level, but such systems are unable to determine similar blocks of data within a given application. Data objects are often stored in large, monolithic files that are intended to be read only by the application that created them. For example, a Microsoft Exchange email server stores email messages in one or more large data files that typically hold thousands of different users' mailboxes. As another example, a database server often stores tables, forms, reports, and other data objects in one or two large data files that provide persistence for the entire database. Thus, typical data management systems are only able to perform data management operations on the large data file, rather than on the data objects themselves. In the case of the email server, a given electronic mail application may generate multiple email messages that all differ, but which all contain the same attachment. Prior systems may not be able to recognize that these messages share an attachment, and thus each message would be stored with its own copy of the attachment. Further, if two files had different properties or metadata, such prior systems would store both files, even though the data they contain are identical and the files differ only in their metadata.
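The limitation described above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory store (`SingleInstanceStore`) that identifies duplicates by a SHA-256 digest of the bytes it is given; the class, its method names, and the sample messages are all illustrative assumptions and do not describe any particular prior system. When each message is stored as one opaque unit, two messages that differ only in their headers and body share nothing, so the common attachment is stored twice. When the attachment is presented as its own data object, purely for contrast, the second copy is recognized as a duplicate.

```python
import hashlib


class SingleInstanceStore:
    """Hypothetical in-memory store that keeps one copy per unique digest."""

    def __init__(self):
        self.blobs = {}  # digest -> stored bytes
        self.refs = {}   # object name -> digest

    def put(self, name, data):
        """Store data under name; return True only if new bytes were kept."""
        digest = hashlib.sha256(data).hexdigest()
        is_new = digest not in self.blobs
        if is_new:
            self.blobs[digest] = data
        self.refs[name] = digest
        return is_new

    def stored_bytes(self):
        return sum(len(blob) for blob in self.blobs.values())


# Two different messages that carry the same attachment (sample data).
attachment = b"A" * 1000
body1 = b"To: alice\n\nHello\n"
body2 = b"To: bob\n\nHi there\n"

# File-level view: each message is one opaque unit, so nothing is shared.
whole = SingleInstanceStore()
whole.put("msg1", body1 + attachment)
whole.put("msg2", body2 + attachment)

# Object-level view, for contrast: the attachment is its own data object.
per_object = SingleInstanceStore()
per_object.put("msg1/body", body1)
per_object.put("msg1/attachment", attachment)
per_object.put("msg2/body", body2)
second_attachment_is_new = per_object.put("msg2/attachment", attachment)

print(whole.stored_bytes(), per_object.stored_bytes())
```

In this sketch the file-level store keeps both full messages, while the object-level store keeps the two bodies and a single copy of the attachment.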
Another problem with prior single instancing systems is that, although they may work well within a given local environment, they are inefficient when remote clients or devices provide data to a central single instancing system. Each of the various remote clients sends its data to the central single instancing system, even if much of that data is duplicative and is ultimately ignored by the single instancing system. Thus, bandwidth and resources are wasted.
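The wasted bandwidth can be made concrete with a small sketch. It assumes a hypothetical `central_ingest` function standing in for the central single instancing system, digest-based duplicate detection with SHA-256, and made-up client names and payloads; none of these come from a specific prior system. The function counts every byte that clients transmit and compares that total with the bytes the central system actually retains.

```python
import hashlib


def central_ingest(uploads):
    """Receive every payload in full, then keep one copy per unique digest.

    uploads is a list of (client_name, payload_bytes) pairs.
    Returns (bytes_received, bytes_stored).
    """
    stored = {}
    bytes_received = 0
    for _client, payload in uploads:
        # The full payload has already crossed the network at this point.
        bytes_received += len(payload)
        digest = hashlib.sha256(payload).hexdigest()
        # Duplicates are detected only after arrival and are not stored again.
        stored.setdefault(digest, payload)
    bytes_stored = sum(len(p) for p in stored.values())
    return bytes_received, bytes_stored


# Three remote clients each hold the same employee manual (sample data).
manual = b"employee manual " * 100
uploads = [
    ("client-a", manual),
    ("client-b", manual),
    ("client-c", manual),
    ("client-c", b"unique report"),
]

received, kept = central_ingest(uploads)
print(f"received {received} bytes, stored {kept} bytes")
```

Under these assumptions, two of the three copies of the manual are transmitted in full and then ignored, so the bytes received are roughly three times the bytes stored.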
There is a need for a system that overcomes the above problems, as well as one that provides additional benefits.
In the drawings, the same reference numbers and acronyms identify elements or acts with the same or similar functionality for ease of understanding and convenience. To easily identify the discussion of any particular element or act, the most significant digit or digits in a reference number refer to the Figure number in which that element is first introduced (e.g., element 604 is first introduced and discussed with respect to FIG. 6).