1. Technical Field
This invention generally relates to the detection of duplicate documents. Specifically, this invention relates to efficient duplicate detection on web-scale data in supercomputing environments.
2. Description of Background
Identifying duplicate records is typically termed duplicate detection. Duplicate detection is a key operation when dealing with large volumes of data, and especially when integrating data from multiple sources. Web-scale data is data that may reside within Internet resources such as websites, servers, or similar resources. There are multiple reasons for the presence of duplicate data on the Internet, including mirroring, versioning, different formats (e.g., HTML, Portable Document Format, etc.), user copies, backups, and error pages (e.g., soft 404 errors). As a result, a significant portion of the Internet consists of duplicate content.
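By way of illustration only, the simplest form of duplicate detection, in which byte-identical copies such as mirrors and backups are grouped by a content hash, may be sketched as follows. The function name `find_exact_duplicates`, the in-memory mapping of document identifiers to contents, and the sample addresses are hypothetical and are not taken from this disclosure. The sketch does not address near-duplicates such as differing formats or versions, nor does it address web-scale or supercomputing operation.

```python
import hashlib
from collections import defaultdict


def find_exact_duplicates(documents):
    """Group document ids whose content is byte-identical.

    `documents` is a hypothetical mapping of document id -> bytes,
    assumed here purely for illustration.
    """
    groups = defaultdict(list)
    for doc_id, content in documents.items():
        # Identical byte content always yields the same SHA-256 digest.
        digest = hashlib.sha256(content).hexdigest()
        groups[digest].append(doc_id)
    # Keep only digests shared by more than one document.
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


# Hypothetical sample: one page, its mirror, and an unrelated page.
pages = {
    "a.example/index.html": b"<html>hello</html>",
    "mirror.example/index.html": b"<html>hello</html>",
    "b.example/about.html": b"<html>about</html>",
}
duplicates = find_exact_duplicates(pages)
print(duplicates)
```

In this sketch, the page and its mirror are reported as one duplicate group, while the unrelated page is not reported.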