File systems (such as file servers in data warehouses or other data repositories) can store large numbers of files. Various management and/or processing tasks can be performed with respect to files contained in the file systems. For example, it may be desirable to identify duplicate files or to identify older or obsolete versions of files, which take up valuable storage space. Also, the presence of duplicate or obsolete versions of files tends to increase processing times when files are being retrieved or manipulated. As another example, if the processing task is a search task, then duplicate or obsolete versions of files can be output as search results, which tends to pollute the list of search results and can be troublesome for the end user.
Generally, conventional techniques of identifying duplicate or similar files tend to be inefficient.