In systems that operate on streams of data items, in particular data items stored as messages (or parts of messages) in a message stream, it can be important to avoid duplication of data. For example, in financial transactions, duplication of data items can result in a transactional operation being applied multiple times. Where such an operation is not idempotent (that is, where applying the operation more than once does not give the same result as applying it once), undesirable or unintended effects can result. It is therefore essential in such systems that duplication of data items is at least detected.
Existing techniques for identifying duplicates normally involve comparing each new data item received in a stream of data items against a list of all data items previously received. If the new data item does not match any item in the list, it is determined not to be a duplicate and is added to the list. Thus the list must be interrogated for every new data item received in the stream, including data items that are not duplicates. Further, the list of data items grows continually and, consequently, consumes ever more resource. As the list grows, the process of comparing each new data item to the list becomes increasingly resource intensive, because of the resource required to search a continually growing list. Also, since every received data item must be checked against the list, the resource overhead of checking a continually growing list affects each and every data item received.
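The list-based approach described above can be illustrated with a minimal Python sketch. It is offered only as an illustration of the general technique, not as any particular system's implementation; the class name `NaiveDuplicateDetector`, its `check` method, the `comparisons` counter and the `tx-…` identifiers are all hypothetical names introduced for this example.

```python
from typing import Hashable, List


class NaiveDuplicateDetector:
    """Hypothetical sketch: check each new item against every item seen so far."""

    def __init__(self) -> None:
        self.seen: List[Hashable] = []  # grows with every non-duplicate item
        self.comparisons = 0            # counts item-to-item comparisons

    def check(self, item: Hashable) -> bool:
        """Return True if `item` is a duplicate; otherwise record it and return False."""
        for known in self.seen:         # linear scan of the whole list
            self.comparisons += 1
            if known == item:
                return True
        self.seen.append(item)          # not found: the list grows by one
        return False


detector = NaiveDuplicateDetector()
stream = ["tx-1", "tx-2", "tx-1", "tx-3"]  # illustrative item identifiers
flags = [detector.check(item) for item in stream]
print(flags)
print(len(detector.seen), detector.comparisons)
```

In this sketch a non-duplicate item is always compared against the entire list before it is accepted, so the cost of handling each item rises with the number of distinct items already received, and the list itself is never reduced in size.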