In general, software applications read data by sending file requests to a file system or file system interface. The application then processes the data in some fashion. Files often contain portions that are duplicates of portions of other files. As a result, applications may read and process duplicate files, or duplicate regions within files, multiple times. Unfortunately, reading and processing duplicate files or regions increases disk usage, processor load, and memory consumption.
Recently, file systems have been used to deduplicate files and content, that is, to detect identical files or identical portions of files. Once identical portions of one or more files are identified, a single copy of the content can be maintained instead of multiple copies of the same content. Thus, duplicate files or regions within files may be reduced to a single footprint instead of multiple footprints, thereby reducing storage requirements. Deduplication has therefore been used to reduce storage requirements (and network data transfers) within a file system. Deduplication occurs at many places in enterprise storage, from online file systems to backup and archival systems.
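The single-copy idea described above can be sketched as a toy content-addressed store. This is only an illustration under stated assumptions: the `DedupStore` class, the fixed `SEGMENT_SIZE`, and the use of SHA-256 fingerprints are hypothetical choices made for the example and are not taken from any particular file system.

```python
import hashlib

SEGMENT_SIZE = 4  # hypothetical fixed segment size, kept tiny for illustration


class DedupStore:
    """Toy content-addressed store: each unique segment is kept once."""

    def __init__(self):
        self.segments = {}  # fingerprint -> segment bytes (single stored copy)
        self.files = {}     # file name -> ordered list of segment fingerprints

    def write(self, name, data):
        fingerprints = []
        for start in range(0, len(data), SEGMENT_SIZE):
            segment = data[start:start + SEGMENT_SIZE]
            fp = hashlib.sha256(segment).hexdigest()
            # Store the bytes only if this content has not been seen before.
            self.segments.setdefault(fp, segment)
            fingerprints.append(fp)
        self.files[name] = fingerprints

    def read(self, name):
        # Reassemble the file from its shared segments.
        return b"".join(self.segments[fp] for fp in self.files[name])


store = DedupStore()
store.write("A", b"aaaabbbbcccc")
store.write("B", b"bbbbccccdddd")

logical_bytes = sum(len(store.read(name)) for name in store.files)
physical_bytes = sum(len(seg) for seg in store.segments.values())
print(logical_bytes, physical_bytes)
```

In this sketch the two files share the segments `bbbb` and `cccc`, so the store holds fewer bytes than the files contain logically. Note that the mapping from files to shared segments lives entirely inside the store.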
There are many applications that need to scan the file system and process the data in bulk. Examples include antivirus scans, keyword search engines, data classification applications for e-discovery, data leak prevention applications, archival applications, backup applications, replication engines, and even plain data migration applications. Today, if the same segment of data is shared by 10 different files, these applications are required to process the segment 10 separate times, even though a data deduplication program has previously detected that the segment is shared by the files.
In other words, file system deduplication processes keep internal the information regarding which files and which file portions are duplicated, and do not provide this information to outside applications. Typically, the solution has been to expect the application to check whether it has already processed the same data and, if so, to skip the data. This is analogous to source-based deduplication, which checks whether the data is already present at the backup target and, if so, does not send the data over the network.
To partially address this, some applications build their own index to track contents that have been read and processed. Unfortunately, each application is required to track this information individually because conventional file system deduplication processes keep internal the information regarding which files and which file portions are duplicated. Requiring each application to independently maintain such an index is burdensome on the application, increases resource usage, and decreases processing efficiency.
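An application-private index of this kind might look like the following minimal sketch. The `scan` function, the in-memory `seen` set, and the SHA-256 fingerprints are assumptions made for illustration. The point it shows is that the application must still read and hash every segment merely to discover that it is a duplicate.

```python
import hashlib


def scan(files, process):
    """Process each distinct segment once, using an application-private index.

    `files` maps a file name to a list of segment byte strings.
    Returns the number of bytes the application had to read and hash.
    """
    seen = set()      # the application's own index of processed fingerprints
    bytes_hashed = 0
    for name, segments in files.items():
        for segment in segments:
            # The segment is read and hashed even when it turns out to be a duplicate.
            fp = hashlib.sha256(segment).digest()
            bytes_hashed += len(segment)
            if fp in seen:
                continue  # already processed under some other file
            seen.add(fp)
            process(name, segment)
    return bytes_hashed


processed = []
files = {"A": [b"alpha", b"shared"], "B": [b"shared", b"beta"]}
total = scan(files, lambda name, seg: processed.append((name, seg)))
print(processed, total)
```

Here the shared segment is processed only once, yet every byte of both files is still read and fingerprinted, and each application would need to maintain its own `seen` index.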
Requiring applications to track their own data usage is not only burdensome but also a poor solution, for the following reason. Consider a keyword indexing engine that scans all the data in a backup or archival image and constructs an index that can be used for keyword search. Assume that it scans a segment S for a file A, finds a list of keywords K in the segment, and enters combinations of (K, A) in an inverted index. Segment S is also shared by file B, and the engine encounters it again when it scans file B. Now, even if the engine knows that it has already processed this segment, it cannot skip the data: it must again read segment S, again find the keywords K, and then enter combinations of (K, B) in the inverted index.
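The keyword-indexing scenario can be made concrete with a small sketch. The function names, the whitespace-based keyword extraction, and the sample file contents are hypothetical and serve only to illustrate the behavior described above, in which a shared segment is read and parsed once per file that contains it.

```python
from collections import defaultdict


def extract_keywords(segment):
    # Hypothetical keyword extraction: lowercase, whitespace-separated tokens.
    return set(segment.lower().split())


def build_index(files):
    """Build an inverted index mapping keyword -> set of file names."""
    index = defaultdict(set)
    segment_reads = 0
    for name, segments in files.items():
        for segment in segments:
            # A shared segment is read and parsed again for every file that holds it,
            # because the (keyword, file) pair must be recorded for each file.
            segment_reads += 1
            for keyword in extract_keywords(segment):
                index[keyword].add(name)
    return index, segment_reads


S = "quarterly revenue report"  # segment shared by files A and B
files = {"A": [S, "draft notes"], "B": ["meeting agenda", S]}
index, segment_reads = build_index(files)
print(sorted(index["revenue"]), segment_reads)
```

In this sketch segment S is read twice, once for A and once for B, even though its keywords are identical both times. Simply skipping the second read would leave the (K, B) entries out of the index.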