One or more devices in a network may generate a stream (i.e., a continuous flow) of data records, which are processed and stored in a repository. Typically, each data record associates values with data fields, and the values of one or more of those fields are used to identify the record. Such fields are referred to as keys. A record is a duplicate if another record associates the same values with the keys.
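The notion of a key-based duplicate can be illustrated with a minimal Python sketch. The field names (`device_id`, `sequence_no`, `temp`), the helper names, and the sample records are hypothetical and chosen only for illustration. The sketch also assumes that key values are hashable and that the first record seen with a given key is treated as the original, with later ones flagged as duplicates.

```python
from typing import Any, Dict, Iterable, List, Tuple

Record = Dict[str, Any]

# Hypothetical key fields; a real stream would define its own.
KEY_FIELDS: Tuple[str, ...] = ("device_id", "sequence_no")


def record_key(record: Record, key_fields: Tuple[str, ...] = KEY_FIELDS) -> Tuple[Any, ...]:
    """Return the tuple of key-field values that identifies a record."""
    return tuple(record[field] for field in key_fields)


def find_duplicates(records: Iterable[Record]) -> List[Record]:
    """Return records whose key matches that of an earlier record."""
    seen = set()
    duplicates = []
    for record in records:
        key = record_key(record)
        if key in seen:
            # Same key values as an earlier record: a duplicate copy.
            duplicates.append(record)
        else:
            seen.add(key)
    return duplicates


# Illustrative data: the third record repeats the key of the first.
sample = [
    {"device_id": "a", "sequence_no": 1, "temp": 20.5},
    {"device_id": "a", "sequence_no": 2, "temp": 20.7},
    {"device_id": "a", "sequence_no": 1, "temp": 20.5},
    {"device_id": "b", "sequence_no": 1, "temp": 19.9},
]
```

Only the key fields take part in the comparison, so two records with identical keys are treated as duplicates even if their non-key values differ.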
In existing implementations, a single index is used to filter duplicates for the entire stream of records. Unfortunately, such implementations are inefficient: building and maintaining the index incurs significant cost in time and other resources, yet the index serves no purpose beyond duplicate filtering. Moreover, the repository is usually partitioned into multiple parts, which may make it difficult to construct a single index for the entire stream of records.