1. Field of the Invention
The invention relates to distributed content storage and management, and more particularly, to content signatures for back-up and management of files located on electronic information sources.
2. Background of the Invention
Distributed content storage and management presents a significant challenge for all types of businesses—small and large, service and products-oriented, technical and non-technical. As the Information Age emerges, the need to be able to efficiently manage distributed content has increased, and will continue to increase. Distributed content refers to files that are distributed throughout electronic devices within an organization. For example, an organization may have a local area network with twenty desktop computers connected to the network. Each of the desktop computers will contain files—program files, data files, and other types of files. The business may also have users with personal digital assistants (PDAs) and/or laptops that contain files. These files collectively represent the distributed content of the organization.
Essentially, two disparate approaches to distributed content storage and management have emerged. One approach relates to backing up files, principally for the purpose of being able to restore files if a network or computer crashes. Under the back-up approach, the focus is on preserving the data by copying data and getting the data “far away” from its original location, so that it cannot be accidentally or maliciously destroyed or damaged. Generally, this has meant that back-up files are stored on tape or other forms of detached storage devices, preferably in a separate physical location from the original source of the file. Given the desire to keep the data safe or “far away,” file organization is by file name or volume where the data is stored, and accessing or retrieving files stored in a back-up system is often slow or difficult, and in some cases practically impossible. Furthermore, because the backed-up files are not regularly accessed or used, when a back-up system does fail, often no one will notice and data can potentially be lost.
The other approach to distributed content management relates to content management of files. The content management approach is focused on controlling the creation, access and modification of a limited set of pre-determined files or groups of files. For example, one approach to content management may involve crude indexing and recording of information about user-created document files, such as files created with Microsoft Word or Excel. Within current content management approaches, systems typically require a user to choose to submit a file to the content management system. Requiring such an explicit choice by a user limits the ability of a system to capture all appropriate files and makes it impossible for an organization to ensure that it has control and awareness of all electronic content within the organization.
Neither approach fully meets the growing need to effectively manage distributed content. In user environments where only a back-up system is in place, access to stored files is difficult and access to information about a specific file is often impossible. In user environments where only a content management system exists, many files are left unprotected (i.e., not backed-up) and the indexing and searching capabilities are limited. In user environments where a back-up system and a content management system are both used, cost inefficiencies are introduced through redundancies. Moreover, even when both a back-up system and a content management system of the kinds in use today are in place, the ability to manage and control the electronic content of an organization remains limited.
Patent application Ser. No. '006 addressed these challenges by disclosing a system to cost-effectively store and manage all forms of distributed content and by providing efficient methods to store distributed content that reduce redundant and inefficient storage of backed-up files. Additionally, the '006 Patent Application disclosed efficient methods to gather data related to file content that will spawn further user applications made possible by the sophisticated indexing of the invention.
A further challenge involves determining whether stored content is the same as other sets of stored content. For example, when content is placed into a content storage device, it is very difficult to determine if the content is the same as other sets of content in storage devices. This problem has been addressed in limited environments using checksums. For example, to determine that the bits in a PROM are not corrupt or tampered with, a checksum is calculated on the PROM's content and the result is compared against the known checksum for the PROM. Determining that two files are identical is more complicated because there is little foreknowledge about which files might be identical.
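By way of illustration only, the PROM checksum comparison described above may be sketched as follows. The choice of CRC-32, the function names `checksum` and `is_intact`, and the sample image are assumptions made for this sketch; the source does not name a particular checksum algorithm. CRC-32 is used here only because it is readily available, and it is suited to detecting accidental corruption rather than deliberate tampering.

```python
import zlib


def checksum(image: bytes) -> int:
    """Return the CRC-32 of the image as an unsigned 32-bit integer."""
    return zlib.crc32(image) & 0xFFFFFFFF


def is_intact(image: bytes, known_checksum: int) -> bool:
    """Compare the freshly computed checksum against the known value."""
    return checksum(image) == known_checksum


# Hypothetical 1 KiB PROM image and its recorded ("known") checksum.
golden_image = bytes(range(256)) * 4
known = checksum(golden_image)

# Simulate corruption by flipping a single bit in a copy of the image.
corrupted = bytearray(golden_image)
corrupted[10] ^= 0x01

print(is_intact(golden_image, known))
print(is_intact(bytes(corrupted), known))
```

The comparison works here because the expected checksum is known in advance for one specific device. For arbitrary files no such reference value exists beforehand, which is the difficulty noted above.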
In the past few years, the industry has accepted computer “backup” as a necessary part of computer management. Backup basically involves copying all content from “online” storage to some form of “offline” storage, such as tapes or writeable optical media. Since tape or optical disk mounting is a very slow process, even for an automated jukebox, it has always been preferable to collect all of the files for a particular system together on the same media to facilitate restore. That is, even if it were possible to know that a copy of a file was already stored on some media in the archives, it would be impractical to restore a system from tens or hundreds or even thousands of different tapes or optical disks.
Now that inexpensive disk storage is available, it is possible to rethink computer backup. Rather than moving every “file” to offline media, one can simply copy it to disks in a “near-line” environment. This is becoming common, with devices, for example, from NETWORK APPLIANCES, EMC and others. In this environment it is desirable to recognize common file contents and to store such content only once. Knowing that a file has content identical to file content that has already been saved has tremendous value. However, because finding matching files is so expensive, there are very few operations in modern computing that depend on finding identical files.
Several companies, including, for example, PERMABIT, ARCHIVAS, BAKBONE, COMMVAULT, ROCKSOFT, DATA DOMAIN, UNDOO TECHNOLOGIES and AVAMAR, have attempted to address this challenge. They provide file systems or solutions that are based on recognizing either common blocks or common strings of bits to reduce storage space for files. That is, when a file is stored, any blocks or chunks of data that are common with previously stored files are remembered with pointers. These types of file systems are good for files that are not completely identical (e.g., email, log files, database files, etc.), but they do not automatically recognize file identicality. If all the blocks of a new file match the same set of blocks of an existing file, the files are identical, but this recognition requires additional processing and is not automatic. It is possible that the variable-length matching algorithms can be used to match whole files, but this will be computationally very expensive.
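By way of illustration only, the block-based approach described above may be sketched as follows. This is a minimal sketch and does not represent the implementation of any of the named vendors. The class name `BlockStore`, the fixed block size, and the use of SHA-256 digests as pointers are all assumptions made for this example.

```python
import hashlib

BLOCK_SIZE = 4  # deliberately tiny so the example is easy to follow


class BlockStore:
    """Hypothetical store that keeps each distinct block only once."""

    def __init__(self):
        self.blocks = {}  # digest -> block bytes (each block stored once)
        self.files = {}   # file name -> ordered list of digests ("pointers")

    def put(self, name: str, data: bytes) -> None:
        pointers = []
        for start in range(0, len(data), BLOCK_SIZE):
            block = data[start:start + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            # A block already present is not stored again; only a
            # pointer to it is recorded for the new file.
            self.blocks.setdefault(digest, block)
            pointers.append(digest)
        self.files[name] = pointers

    def identical_to(self, name: str) -> list:
        """Extra pass: find other files whose pointer lists match exactly."""
        target = self.files[name]
        return [other for other, pointers in self.files.items()
                if other != name and pointers == target]


store = BlockStore()
store.put("a", b"AAAABBBBCCCC")
store.put("b", b"AAAABBBBDDDD")  # shares two blocks with "a"
store.put("c", b"AAAABBBBCCCC")  # same content as "a"

print(len(store.blocks))
print(store.identical_to("c"))
```

In this sketch, storing a file saves space for shared blocks, but nothing in `put` reports that file "c" duplicates file "a". That determination requires the separate `identical_to` pass over every stored pointer list, which corresponds to the additional processing noted above.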
There have also been a number of projects that attempt to archive large portions of the Internet, such as the Internet Archive project available at http://archive.org. These projects are limited to archiving web content, as opposed to files generally. Furthermore, in storing the web content they do not use a unique identifier, such as a signature. Additionally, they are not back-up systems or content management systems. Moreover, they are quite limited in their searching ability in that they are not searchable by content or content attributes, but rather only by file location and dates.
What are needed are systems and methods for distributed content storage and management that can effectively and efficiently identify files that have identical content.