1. Field of the Invention
The present invention relates generally to an improved data processing system, and in particular, to a computer-implemented method for removing redundant data. More particularly, the present invention relates to a computer-implemented method, system, and computer-usable program code for parallel data redundancy removal.
2. Description of the Related Art
Diverse application domains, such as telecommunications, online transactions, web pages, stock markets, medical records, and telescope imagery, generate significant amounts of data. Removing redundancy in data assists in resource and computing efficiency for the downstream processing of such data. Many application domains aim to remove redundant data records in real time or near real time from data flowing at high rates. A data record may be considered redundant for a set of data if another data record exists in the set of data which exactly or approximately matches the data record.
For example, each telephone call in a telecommunication network generates call data records that contain details about the call, such as the calling number and the called number. Errors in call data record generation may produce multiple copies of a call data record. Removing duplicate copies of call data records prior to storing them may assist in resource management. Removing duplicate data records using database accesses may be a slow and inefficient process. Pair-wise string comparisons may make real-time or near real-time redundancy removal prohibitive for large numbers of records, because the number of comparisons grows quadratically with the number of records.
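The cost of pair-wise comparison can be illustrated with a minimal sketch. It assumes exact-match redundancy only and represents each call data record as a hypothetical "calling number,called number" string; the function name and the sample records are illustrative and do not come from any particular system.

```python
from typing import List, Tuple


def remove_duplicates_pairwise(records: List[str]) -> Tuple[List[str], int]:
    """Keep the first copy of each record and count string comparisons made."""
    unique: List[str] = []
    comparisons = 0
    for record in records:
        is_duplicate = False
        # Compare the incoming record against every record kept so far.
        for kept in unique:
            comparisons += 1
            if record == kept:
                is_duplicate = True
                break
        if not is_duplicate:
            unique.append(record)
    return unique, comparisons


# Hypothetical call data records; the third is a duplicate of the first.
cdrs = [
    "5550100,5550199",
    "5550101,5550198",
    "5550100,5550199",
    "5550102,5550197",
]
unique_cdrs, num_comparisons = remove_duplicates_pairwise(cdrs)
print(unique_cdrs, num_comparisons)
```

When all n records are distinct, this approach performs n(n-1)/2 comparisons, so 100,000 distinct records would already require roughly five billion string comparisons.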