The present disclosure relates to hardware-efficient fingerprinting. In particular, the present disclosure relates to a pipelined hardware architecture for computing fingerprints on high-throughput data.
As data volumes grow rapidly, identifying and reducing redundancy in the storage, transmission, and processing of data has become increasingly important. One common technique for identifying redundant data is to compare sketches of data chunks to find duplication or similarity. For example, Rabin fingerprints have proved effective and are widely used to detect data duplication and similarity. To obtain a sketch for a data chunk using Rabin fingerprints, the data is scanned using a fixed-size window, e.g., 8 bytes long, that advances one byte at each step. The data within the window, called a “shingle,” is used to calculate a Rabin fingerprint. This process continues until the end of the data chunk is reached. During and after the scanning, the fingerprints are sampled to form a sketch for the data chunk. This algorithm is suitable for data deduplication in off-line data backup and archive applications, but demands intense computation when operating at wire speed on streaming data.
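By way of illustration only, the shingle-based procedure described above may be sketched in software as follows. This sketch is not taken from the disclosure. The modulus polynomial POLY (x^31 + x^28 + 1), the function names rabin_fingerprint, shingle_fingerprints, and sketch, and the sampling rule (keeping the SKETCH_SIZE smallest distinct fingerprints) are all assumptions chosen for the example. Each fingerprint is recomputed from scratch for clarity rather than updated incrementally as the window rolls.

```python
# Illustrative sketch only. POLY, WINDOW, SKETCH_SIZE, and the sampling
# rule are assumptions made for this example, not values from the source.

DEGREE = 31
POLY = (1 << 31) | (1 << 28) | 1   # x^31 + x^28 + 1 over GF(2)
WINDOW = 8                         # shingle length in bytes
SKETCH_SIZE = 4                    # number of fingerprints kept per chunk


def rabin_fingerprint(shingle: bytes) -> int:
    """Return the shingle, read as a polynomial over GF(2), modulo POLY."""
    fp = 0
    for byte in shingle:
        for bit in range(7, -1, -1):           # most significant bit first
            fp = (fp << 1) | ((byte >> bit) & 1)
            if fp >> DEGREE:                   # degree reached 31: reduce
                fp ^= POLY
    return fp


def shingle_fingerprints(chunk: bytes, window: int = WINDOW) -> list:
    """Fingerprint every window-sized shingle, advancing one byte per step."""
    return [
        rabin_fingerprint(chunk[i:i + window])
        for i in range(len(chunk) - window + 1)
    ]


def sketch(chunk: bytes, window: int = WINDOW, size: int = SKETCH_SIZE) -> list:
    """Sample the fingerprints: keep the `size` smallest distinct values
    (an assumed sampling rule; the source does not specify one)."""
    return sorted(set(shingle_fingerprints(chunk, window)))[:size]


if __name__ == "__main__":
    a = b"the quick brown fox jumps over the lazy dog"
    b = b"the quick brown fox jumps over the lazy cat"
    shared = set(sketch(a)) & set(sketch(b))
    print(len(shingle_fingerprints(a)), "fingerprints;", len(shared), "shared in sketches")
```

In this form every input byte is processed once for each window position that covers it, i.e., up to 8 times for an 8-byte window, which gives a sense of the per-byte work involved when the scan must keep pace with streaming data.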
With storage devices approaching gigabyte-per-second throughput and sub-millisecond latency, software approaches to fingerprinting are inadequate for real-time data processing unless a large amount of computing power is committed, which may degrade performance and resource utilization. In view of the foregoing, it may be understood that there may be significant problems and shortcomings associated with current technologies for generating fingerprints and deduplicating data.