In general, software applications read data by sending file requests to a file system or file system interface. Files often contain portions that duplicate portions of other files. As a result, applications may read and process duplicate files, or duplicate regions within files, multiple times. Unfortunately, reading and processing duplicate files or regions increases disk usage, processor load, and memory consumption.
Recently, file systems have used deduplication to detect identical files or identical portions of files. Once identical portions of one or more files are identified, the file system can maintain a single copy of the content instead of multiple copies of the same content. Thus, duplicate files or regions within files may be reduced to a single footprint instead of multiple footprints, thereby reducing storage requirements within the file system.
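By way of illustration only, the single-footprint idea can be sketched as a store that keys each chunk by a hash of its content. The class name `DedupStore`, the fixed 4-byte chunk size, and the use of SHA-256 are assumptions made for this sketch; it does not describe how any particular file system implements deduplication.

```python
import hashlib


class DedupStore:
    """Hypothetical in-memory store that keeps one copy of each distinct chunk."""

    def __init__(self, chunk_size=4):
        # A tiny chunk size is assumed so the example data stays readable.
        self.chunk_size = chunk_size
        self.chunks = {}  # content digest -> chunk bytes (stored once)
        self.files = {}   # file name -> ordered list of content digests

    def write(self, name, data):
        digests = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            # Store the chunk only if this content has not been seen before.
            self.chunks.setdefault(digest, chunk)
            digests.append(digest)
        self.files[name] = digests

    def read(self, name):
        # Reassemble the file from its chunk references.
        return b"".join(self.chunks[d] for d in self.files[name])

    def stored_bytes(self):
        # Physical footprint: each distinct chunk is counted once.
        return sum(len(c) for c in self.chunks.values())


store = DedupStore()
store.write("a.txt", b"AAAABBBBCCCC")
store.write("b.txt", b"AAAABBBBDDDD")
logical_bytes = sum(len(store.read(n)) for n in store.files)
```

In this sketch the two files total 24 logical bytes, but the chunks `AAAA` and `BBBB` are shared, so only four distinct chunks (16 bytes) are held by the store.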
Unfortunately, file system deduplication processes keep the information regarding which files and which file portions are duplicated internal to the file system, and do not provide this information to outside applications. Thus, if an application needs to access two files having duplicate content, the application will read and process the same data multiple times, thereby wasting valuable resources, e.g., processing power, network bandwidth, and disk reads.
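The redundant work described above can be made concrete with a small sketch. The function name `process_without_dedup_info`, the 4-byte chunking, and the sample file contents are hypothetical; the counter simply stands in for an expensive read-and-process step performed by an application that has no view of the file system's duplicate information.

```python
from collections import Counter


def split_chunks(data, size=4):
    """Split data into fixed-size chunks (chunk size assumed for illustration)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def process_without_dedup_info(files):
    """Count how often each chunk is processed when duplicates are invisible."""
    work = Counter()
    for data in files.values():
        for chunk in split_chunks(data):
            # Stand-in for an expensive read + process of this chunk.
            work[chunk] += 1
    return work


files = {"a.txt": b"AAAABBBBCCCC", "b.txt": b"AAAABBBBDDDD"}
work = process_without_dedup_info(files)
total_ops = sum(work.values())          # every chunk of every file
redundant_ops = total_ops - len(work)   # operations repeated on identical content
```

Here six chunk operations are performed although only four distinct chunks exist, so two of the operations repeat work on content that had already been handled.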
For instance, deduplication processes may be performed in order to remove duplicate files from a backup repository. In some instances, failing to track which files or portions thereof contain the same content causes backup applications to read those files or portions again, even though the file system has already read, processed, and deduplicated them. Unnecessarily reading and deduplicating files that are copies wastes valuable system resources.