Information management in a large enterprise (e.g., a company, educational organization, or government agency) has become increasingly complex due to the explosive growth in the number of electronic documents that are typically stored on various machines in the enterprise. In addition to maintaining electronic documents that are actively used by personnel in the organization, information management also has to address electronic documents that are stored for backup or archival purposes.
In some cases, it may be desirable to identify files that are similar to other files. An enterprise typically includes a relatively large number of client computers and a smaller number of server computers. One or more of the server computers can be designated to perform centralized data collection and processing, including processing to find similar files. This is referred to as a “server-centric approach,” in which files from client computers are provided to the one or more designated server computers for scanning and processing. However, such a server-centric approach can overload the one or more designated server computers, which can result in reduced efficiency. Moreover, providing files from client computers to the designated server computers can create points of vulnerability that increase the likelihood of leakage of sensitive and proprietary information.