The invention is applicable to many data processing and communications systems, including publish/subscribe messaging systems in which subscribers can specify a subset of published messages that they wish to receive, and including publish/subscribe messaging systems in which transmitted messages are retained for retransmission or analysis.
Publish/subscribe communications involve information producers publishing information or events to a publish/subscribe system, and information consumers subscribing to particular categories of information or events and receiving relevant publications from the system. The publish/subscribe system may comprise a message broker, located between publisher and subscriber applications, which delivers published information or events to all interested subscribers.
The publish/subscribe communication paradigm supports many-to-many communications in which individual publishers and subscribers may be anonymous to each other (communicating via an intermediate broker) and can easily be added to and removed from the network without disruption. An example message broker is the IBM® WebSphere® Business Integration Message Broker product available from IBM. (IBM and WebSphere are registered trademarks of International Business Machines Corporation.)
Many publish/subscribe messaging systems are subject-based. In these systems, each message belongs to one of a predefined set of subjects (also known as channels, or topics). Publishers label each message with a subject, and consumers subscribe to all the messages having a particular subject label. For example, a subject-based publish/subscribe system for stock trading may use a defined topic name for each stock issue—publishers post information using the appropriate topic name and subscribers include topic names when specifying which stocks they wish to receive information about.
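By way of illustration only, subject-based delivery as described above can be sketched as follows. The class name TopicBroker, its methods and the topic string "stock/IBM" are hypothetical names chosen for this sketch and are not the interface of any particular product.

```python
from collections import defaultdict


class TopicBroker:
    """Hypothetical in-memory broker for subject-based publish/subscribe."""

    def __init__(self):
        # Maps each topic name to the callbacks subscribed to it.
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        """Register a callback for every message labelled with this topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Deliver the message to all subscribers of the topic.

        Returns the number of subscribers the message was delivered to.
        """
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(message)
        return len(callbacks)
```

In this sketch a publisher and a subscriber share nothing except the topic name, which reflects the anonymity between publishers and subscribers noted earlier.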
An alternative to subject-based publish/subscribe messaging is content-based publish/subscribe messaging, as described in “An Efficient Multicast Protocol for Content-Based Publish-Subscribe Systems” by G. Banavar, T. Chandra, B. Mukherjee, J. Nagarajarao, R. Strom and D. Sturman of IBM T.J. Watson Research Center (and other articles published by IBM Corporation via a Web site at URL www.research.ibm.com/gryphon/). Compared with subject-based systems, content-based systems give publishers greater flexibility and allow subscribers to express a “query” against the content of published messages. Thus, the limitation to predefined subjects that is a feature of subject-based systems can be avoided, at the cost of more complex analysis of message content.
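A minimal sketch of content-based matching follows, assuming that a subscription "query" can be modelled as a predicate function over the message content. The class ContentBroker and the message fields "symbol" and "price" are hypothetical; the sketch does not reproduce the multicast protocol of the cited article.

```python
class ContentBroker:
    """Hypothetical in-memory broker for content-based publish/subscribe."""

    def __init__(self):
        # Each subscription pairs a predicate (the "query") with a callback.
        self._subscriptions = []

    def subscribe(self, predicate, callback):
        """Register a callback for every message that satisfies the predicate."""
        self._subscriptions.append((predicate, callback))

    def publish(self, message):
        """Evaluate every subscription against the message content.

        Returns the number of subscribers the message was delivered to.
        """
        delivered = 0
        for predicate, callback in self._subscriptions:
            if predicate(message):
                callback(message)
                delivered += 1
        return delivered
```

A subscriber could, for example, register the predicate `lambda m: m["symbol"] == "IBM" and m["price"] > 100` and would then receive only matching publications, with no predefined subject involved.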
Some messaging systems provide a replay feature, for example retaining publications for replay to new subscribers (and newly recovered subscribers) so that the new subscribers are able to receive some or all of an earlier message feed. One such system is the CodeStreet ReplayService for Tibco Rendezvous/Java™ Messaging Service (www.codestreet.com). (Java and all Java-based trademarks and logos are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.)
In particular, some messaging systems enable subscribers to request a replay and to specify a sampling interval for the replay. For example, a subscriber may not require all previous messages and may specify a requirement to only receive every Nth (e.g. 10th) message or to only receive a message once every M (e.g. 10) seconds.
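The two sampling requirements mentioned above can be sketched as follows. The function names are hypothetical, and the time-based variant reflects one possible reading of "once every M seconds", namely that a message is delivered only if at least M seconds have elapsed since the previously delivered message.

```python
def every_nth(messages, n):
    """Return every n-th message of the feed (the n-th, 2n-th, and so on)."""
    return messages[n - 1::n]


def one_per_interval(timestamped, m):
    """Return at most one message per m-second interval.

    `timestamped` is a list of (time_in_seconds, message) pairs sorted by time.
    A message is kept only if at least m seconds have passed since the last
    kept message (an assumed interpretation, see the lead-in).
    """
    kept = []
    next_allowed = None
    for t, message in timestamped:
        if next_allowed is None or t >= next_allowed:
            kept.append(message)
            next_allowed = t + m
    return kept
```

For instance, `every_nth(list(range(1, 31)), 10)` returns `[10, 20, 30]`.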
Sampling is particularly useful in situations of data overload, where the amount and frequency of data being transmitted (replayed) is such that it is nearly impossible to process the data quickly enough (whether by computation or by human interpretation). One such example is a ticker tape of stock prices. By way of another example, a recipient of river height data or seismic activity data may be interested in the variance of the data (for instance, data that is above a certain value or where there are large changes in value).
A potential problem with such sampling of messages is that the sample may be unrepresentative of the sampled message feed, and this problem is made worse if the number of messages transmitted in a sample period fluctuates and there are sparse periods. For example, if the sampling method periodically transmits the ‘last received message’ and there is a gap in the message feed spanning multiple sampling intervals, a single ‘last received message’ will be repeatedly sampled. There may be no way for a user to determine whether the repeats are valid results or whether an error has occurred (such as a connection failure or sampling inaccuracy).
This problem is best illustrated using FIG. 1. Arrow t shows the progression of time. A data feed 20 of messages is replayed. Each message 1-10 replayed as data feed 20 is shown above the arrow. A user, however, does not require all messages from the replay, but only a pre-specified sample. For example, samples could be taken at 10 second intervals. FIG. 1 shows each sample a-j of the data feed depicted below the line.
Using a basic “last received message” algorithm, the sequence of messages returned as a result of such sampling would be 2, 3, 3, 6, 6, 6, 7, 7, 9 and 10. It can be seen from this that some messages are sampled two or more times. This occurs when no new messages are received between sampling intervals. For example message 3 is the last message received when sample b is taken and because no new messages are received between sample b and sample c, message 3 is once again the last message received when sample c is taken.
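The behaviour described above can be reproduced with a short sketch. The arrival times below are invented for illustration and are not taken from FIG. 1; they are chosen only so that ten samples taken at 10 second intervals yield the sequence given above. The function name is likewise hypothetical.

```python
from bisect import bisect_right


def sample_last_received(arrivals, sample_times):
    """Return, for each sample time, the last message received by that time.

    `arrivals` is a list of (arrival_time_in_seconds, message) pairs sorted
    by time. A sample taken before any message has arrived yields None.
    """
    times = [t for t, _ in arrivals]
    samples = []
    for s in sample_times:
        # Index of the first arrival strictly later than the sample time.
        idx = bisect_right(times, s)
        samples.append(arrivals[idx - 1][1] if idx else None)
    return samples


# Hypothetical arrival times (seconds) for messages 1 to 10.
ARRIVALS = [(3, 1), (8, 2), (15, 3), (32, 4), (35, 5),
            (38, 6), (62, 7), (82, 8), (88, 9), (95, 10)]
# Samples a to j, taken every 10 seconds.
SAMPLE_TIMES = list(range(10, 101, 10))

print(sample_last_received(ARRIVALS, SAMPLE_TIMES))
# prints [2, 3, 3, 6, 6, 6, 7, 7, 9, 10]
```

With these assumed timings, messages 3, 6 and 7 are each sampled more than once, while messages 1, 4, 5 and 8 never appear in the sample at all.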
Consequently, it should be appreciated that, unless the receiving application is configured to recognise and deal with duplicate data, such a sample distorts the actual picture. Indeed, depending on the messaging transport, it may not even be possible to distinguish between two different messages that happen to carry the same data and a single message delivered twice.
Distorted results are particularly a problem when a recipient application wishes to perform some analysis on the sample. If the sample is unrepresentative of the actual data feed, the results gathered by the recipient application may have very little meaning.
As discussed, sampling of fluctuating data feeds having sparse periods is particularly problematic, but it can also be difficult to obtain meaningful data from a populous data feed.