Computer-based transaction systems generate data relating to transactions performed using those systems. These data relating to transactions are analyzed to identify characteristics of the transactions. From these characteristics, modifications to the transactions and/or associated marketing may be suggested, or other business decisions may be made.
Computer systems for analyzing data relating to transactions generally access the data stored in a database. After the data has been collected for some period of time, the collected data is added to the database in a single transaction. As discussed, data stored in the database is analyzed and results are produced. The results obtained from the analysis typically represent an aggregation of the data stored in the database. These results are then used, for example, as the basis for various business decisions and are also often stored in a database.
In some cases, the raw data relating to transactions are not retained in the database after they are processed. Such processing of data relating to transactions generally is a form of batch processing. In batch processing, results are not output until all the data is processed. If, for example, each record associated with a batch were stored in the database in a separate transaction, a significant amount of overhead would be incurred by a database management system associated with the database. Similarly, a large volume of data is read from the database in a single transaction to permit analysis on the data. In many cases, the time between a transaction occurring and the generation of results using data about the transaction may be days or even weeks.
If the data relating to transactions are generated by the transaction system continuously, or if a desired time frame for receiving the results of analysis is shorter than the time required to perform batch processing, such batch processing techniques cannot be used. Delays in obtaining results of analysis are often undesirable where the behavior of users of the transactions may change frequently. For example, in a database system for tracking system access information in real-time having frequent changes, it may be unacceptable to have periodic availability of access analyses for security or performance reasons.
Given a continuous source of data relating to transactions, the transaction data may be segmented and processed in a data flow arrangement, optionally in parallel, and the data may be processed without storing the data in an intermediate database. Because data is segmented and operated on separately, data from multiple sources may be processed in parallel. The segmentation may also define points at which aggregate outputs may be provided, and where checkpoints may be established. By partitioning data into segments and by defining checkpoints based upon the segmentation, a process may be restarted at each defined checkpoint. In this manner, processing of data may fail for a particular segment without affecting processing of another segment. Thus, if processing of data of the particular segment fails, work corresponding to that segment is lost, but not work performed on other segments. This checkpointing may be implemented in, for example, a relational database system. Checkpointing would enable the relational database system to implement restartable queries, and thus database performance is increased. This is beneficial for database vendors and users whose success relies on their systems' performance. To generalize, if a data stream can be partitioned, then checkpoint processing and recovery can be performed.
These and other advantages are provided by the following.
According to one aspect, a method is provided for processing a continuous stream of data. The method comprises steps of receiving an indication of transactional semantics, applying the transactional semantics to the continuous stream of data to identify segments of the continuous stream of data, processing the data in each segment of the continuous stream of data to produce results for the segment, and after the data of each segment of the continuous stream of data is processed, providing the results produced for that segment.
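By way of illustration only, the method of this aspect may be sketched as follows. The sketch assumes that the transactional semantics are supplied as a function mapping each record to a segment key, and that records arrive in segment order. The names process_stream, segment_key and process_segment are hypothetical and are not taken from the description.

```python
from itertools import groupby
from typing import Any, Callable, Hashable, Iterable, Iterator, Tuple


def process_stream(
    records: Iterable[Any],
    segment_key: Callable[[Any], Hashable],      # the "transactional semantics" (assumed form)
    process_segment: Callable[[Iterable[Any]], Any],
) -> Iterator[Tuple[Hashable, Any]]:
    """Yield (segment key, results) once the data of each segment has been processed."""
    # groupby is lazy: it reads the stream record by record and starts a new
    # group whenever the key changes, so the stream is never stored as a whole.
    for key, segment in groupby(records, key=segment_key):
        yield key, process_segment(segment)
```

For example, with records carrying a time field t in seconds, a key of `lambda r: r["t"] // 60` would identify one-minute segments, and results would be provided for each minute as soon as that minute's records have been consumed.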
According to one embodiment, the data includes a plurality of records, each record includes a plurality of fields, and the transactional semantics are defined by a function of one or more fields of one or more records of the data. According to another embodiment, the method further comprises a step of partitioning the continuous stream of data according to the identified segments. According to another embodiment, the step of partitioning includes a step of inserting a record in the continuous stream of data indicating a boundary between two segments. According to another embodiment, the record is a marker record indicating only a boundary. According to another embodiment, the record is a semantic record including information related to the transactional semantics.
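As a non-limiting sketch of the partitioning embodiments above, a boundary record may be inserted whenever the segment key changes between consecutive records. The Marker class, its info field and the insert_boundaries function are assumptions made for illustration. A Marker with no info stands in for a marker record indicating only a boundary, and a Marker carrying the key of the segment just closed stands in for a semantic record.

```python
from dataclasses import dataclass
from typing import Any, Callable, Hashable, Iterable, Iterator, Optional


@dataclass(frozen=True)
class Marker:
    """Boundary record; info is None for a plain marker record."""
    info: Optional[Any] = None


def insert_boundaries(
    records: Iterable[Any],
    segment_key: Callable[[Any], Hashable],
    semantic: bool = False,
) -> Iterator[Any]:
    """Pass records through, inserting a Marker between adjacent segments."""
    previous = None
    first = True
    for record in records:
        key = segment_key(record)
        if not first and key != previous:
            # The key changed, so the previous segment has ended here.
            yield Marker(info=previous if semantic else None)
        first = False
        previous = key
        yield record
```

No marker is emitted after the last record seen, because in a continuous stream the current segment cannot be known to be complete until a record of the next segment arrives.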
According to another embodiment, the continuous stream of data is a log of information about requests issued to a server, and the step of applying comprises steps of reading information relating to a request from the log; and applying the transactional semantics to the read information. According to another embodiment, the information relating to each request includes a plurality of fields, and wherein the transactional semantics are defined by a function of one or more fields of information relating to one or more requests. According to another embodiment, the information includes a time at which the request was issued to the server and wherein the transactional semantics define a period of time. According to another embodiment, the method further comprises a step of filtering the log to eliminate information relating to one or more requests. According to another embodiment, the step of filtering is performed prior to the step of applying the transactional semantics. According to another embodiment, the step of filtering includes a step of eliminating information relating to requests associated with spiders. According to another embodiment, the method further comprises a step of filtering the continuous stream of data to eliminate data from the continuous stream of data.
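The log-based embodiments may be illustrated, under stated assumptions, by the following sketch. The Request fields, the SPIDER_HINTS substrings used to recognize spiders, and the choice of a one-hour period are all hypothetical; the description does not specify how spiders are identified or what period is used.

```python
from datetime import datetime
from itertools import groupby
from typing import Iterable, Iterator, List, NamedTuple, Tuple


class Request(NamedTuple):
    time: datetime       # time at which the request was issued to the server
    user_agent: str
    path: str


# Assumption: spiders are recognized by substrings of the user agent.
SPIDER_HINTS = ("bot", "spider", "crawler")


def drop_spiders(requests: Iterable[Request]) -> Iterator[Request]:
    """Filter the log before the transactional semantics are applied."""
    for request in requests:
        agent = request.user_agent.lower()
        if not any(hint in agent for hint in SPIDER_HINTS):
            yield request


def hour_of(request: Request) -> datetime:
    """Transactional semantics defining a period of time (here, one hour)."""
    return request.time.replace(minute=0, second=0, microsecond=0)


def requests_per_hour(log: Iterable[Request]) -> List[Tuple[datetime, int]]:
    """Count the non-spider requests in each one-hour segment of the log."""
    return [
        (hour, sum(1 for _ in segment))
        for hour, segment in groupby(drop_spiders(log), key=hour_of)
    ]
```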
According to another embodiment, the method further comprises an additional step of processing the data in each segment of the continuous stream of data to produce the results for the segment, and after the data of each segment of the continuous stream of data is processed during the additional step of processing, providing the results produced for that segment. According to another embodiment, the step of processing comprises steps of partitioning data in each segment as a plurality of parallel partitions; and processing each of the partitions in parallel to provide intermediate results for each partition. According to another embodiment, the method further comprises a step of combining intermediate results of each partition to produce the results for the segment. According to another embodiment, the data in the continuous stream of data has a sequence, and there are multiple sources of the continuous stream of data, and the method further comprises determining whether data in the continuous stream of data is in sequence; and if the data is determined to be out of sequence, interrupting the step of processing, inserting the data in a segment according to the transactional semantics, and reprocessing the segment and continuing the step of processing. According to another embodiment, the method further comprises saving a persistent indication of the segment for which data is being processed; when a failure in the step of processing is detected, discarding any results produced by the step of processing for the selected segment and reprocessing the selected segment corresponding to the saved persistent indication; and when the step of processing completes without failure, providing the outputs produced as an output and selecting the next segment.
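The embodiments concerning parallel partitions and the combining of intermediate results may be sketched as follows. The round-robin partitioning, the per-path counting and the use of a thread pool are assumptions chosen to keep the example short; another partitioning rule or a pool of separate processes could be substituted.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def partition(segment: List[Dict], n: int) -> List[List[Dict]]:
    """Split one segment into n parallel partitions (round-robin)."""
    return [segment[i::n] for i in range(n)]


def count_paths(part: List[Dict]) -> Counter:
    """Intermediate result for one partition: requests per path."""
    return Counter(record["path"] for record in part)


def process_segment(segment: List[Dict], n: int = 4) -> Counter:
    """Process the partitions in parallel, then combine their results."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        intermediates = list(pool.map(count_paths, partition(segment, n)))
    combined = Counter()
    for intermediate in intermediates:
        combined.update(intermediate)   # Counter.update adds the counts
    return combined
```

Combining by addition is valid here because counting is associative and commutative; an aggregate lacking those properties would need a different combining step.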
According to another aspect, a process is provided for checkpointing operations on a continuous stream of data by a processing element in a computer system. The process comprises steps of receiving an indication of transactional semantics; applying the transactional semantics to the data to partition the continuous stream of data into segments for processing by the processing element; selecting one of the segments; saving a persistent indication of the selected segment; processing the selected segment by the processing element to produce results; when a failure of the processing element is detected, discarding any results generated by the processing element for the selected segment and reprocessing the selected segment corresponding to the saved persistent indication; and when processing by the processing element completes without failure, providing the outputs produced by the processing element as an output and selecting the next segment to be processed by the processing element. According to another embodiment, the step of applying includes inserting data in the continuous stream of data indicating boundaries between segments of the data.
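A minimal sketch of the checkpointing process of this aspect follows. It assumes that the segments have already been identified and can be read again on retry, that a failure of the processing element surfaces as an exception, and that the persistent indication is a small JSON file; the function name, the file format and the max_attempts limit are hypothetical.

```python
import json
import os
from typing import Any, Callable, Hashable, List, Sequence, Tuple


def run_with_checkpoints(
    segments: Sequence[Tuple[Hashable, Any]],     # (segment id, segment data)
    process_segment: Callable[[Any], Any],
    checkpoint_path: str,
    max_attempts: int = 3,
) -> List[Tuple[Hashable, Any]]:
    outputs = []
    for segment_id, segment in segments:
        # Save a persistent indication of the selected segment. Writing to a
        # temporary file and renaming avoids leaving a half-written checkpoint.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"segment": segment_id}, f)
        os.replace(tmp_path, checkpoint_path)

        for _ in range(max_attempts):
            try:
                results = process_segment(segment)
            except Exception:
                # Failure: nothing was appended for this segment, so any
                # partial results are discarded and the segment is reprocessed.
                continue
            outputs.append((segment_id, results))  # provide the results
            break                                  # select the next segment
        else:
            raise RuntimeError(f"segment {segment_id!r} failed {max_attempts} times")
    return outputs
```

Because results are appended only after a segment completes, a failure costs at most the work done on the current segment, and a restarted process could read the checkpoint file to learn which segment to resume from.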
According to another aspect, a computer system is provided for checkpointing operations on a continuous stream of data in a computer system. The computer system comprises means for receiving an indication of transactional semantics; means for applying the transactional semantics to the continuous stream of data to partition the data into segments; means for selecting one of the segments; means for saving a persistent indication of the selected segment; a processing element for processing the selected segment to produce results; means, operative after a failure of the processing element is detected, for discarding any outputs generated by the processing element for the selected segment and means for directing the processing element to reprocess the selected segment corresponding to the saved persistent indication; and means, operative after processing by the processing element completes without failure, for providing the results produced by the processing element and selecting the next segment to be processed by the processing element. According to another embodiment, the means for applying includes inserting data in the continuous stream of data indicating boundaries between segments of the data.
According to another aspect, a method is provided for processing a continuous stream of data. The method comprises receiving an indication of transactional semantics; applying the transactional semantics to the continuous stream of data to identify segments of the continuous stream of data; and inserting data in the continuous stream of data indicating boundaries between the identified segments of the continuous stream of data.