Typical enterprise computing environments consist of hundreds to thousands of client machines. Client machines may include desktops, laptops, servers, and other computing devices. With such a large number of client machines, a large amount of data must be protected. Further, clients may have data stored on CDs, tapes, computer servers, and other media. These data also need to be protected. Additionally, new compliance regulations may require that data be retained for long periods of time. This results in an exponential growth of the data that is protected and managed by shared protection servers. In order to provide the ability to locate the data based upon its content, content indexing technology is often utilized.
To reduce redundancy and workload, indexing of global single-instance backup data may be used. That is, a backup operation will maintain only one backup copy of a data item, and the data item may also be indexed using content indexing technology to create an entry in an index database. Any subsequent backup operation of the same data item will not create a duplicate backup copy of that item. However, traditional content indexing for backup data is achieved by traversing a complete backup item (e.g., a whole file) and creating a central content index. Content indexing is a very processor- and memory-intensive operation, and it must be carried out for every backup item received from each client. Additionally, the storage space required for the backed-up data is significant. Further, backup data needs to be transferred back and forth between different machines, so an enormous amount of network bandwidth is consumed by backup and indexing operations.
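The single-instance scheme described above can be sketched roughly as follows. This is a minimal illustration under assumed design choices (a SHA-256 content hash as the item key and a toy in-memory term index), not the specific method discussed in this document: each unique data item is stored and content-indexed exactly once, and repeated backups of the same item are detected by hash and skipped.

```python
import hashlib

class SingleInstanceStore:
    """Illustrative sketch: global single-instance backup with content indexing."""

    def __init__(self):
        self.store = {}  # content hash -> the single backup copy
        self.index = {}  # search term -> set of content hashes

    def backup(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        if key not in self.store:          # first time this item is seen
            self.store[key] = data         # keep exactly one backup copy
            # Content-index the item once, when it is first stored.
            for term in data.decode(errors="ignore").split():
                self.index.setdefault(term.lower(), set()).add(key)
        return key  # subsequent backups of the same item are no-ops

    def search(self, term: str) -> set:
        """Return the keys of stored items containing the given term."""
        return self.index.get(term.lower(), set())
```

For example, backing up the same file from two different clients yields one stored copy and one index pass, which is the redundancy reduction the scheme targets; the remaining cost is that each unique item must still be fully traversed for indexing.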
In view of the foregoing, it may be understood that there are significant problems and shortcomings associated with current methods of indexing backup data.