The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.
A data pipeline system is a distributed data storage system that may include a plurality of datasets and that is programmed to execute, or is capable of executing, specified transformations of raw datasets or derived datasets into other derived datasets. A dataset is a digital representation of a set of files and metadata regarding the set of files. A dataset may store multiple versions of a file as the file is updated.
To provide version control and efficient access, datasets are updated using a transaction-based system. Transactions allow for discrete updates to a dataset. When a transaction is opened, the server computer system grants permission to one or more computing devices and/or programs to update a dataset. While the transaction is open, updates may be received for the dataset, but are not viewable until the transaction is committed. When the transaction is committed, the state of the dataset is finalized and the updated dataset is viewable to other users and programs. A version number of the dataset is incremented to indicate that a new version of the dataset has been stored. Two-phase commit is a common technique in distributed database systems for ensuring that the changes of a transaction take effect at all participants or at none.
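By way of illustration only, the transaction lifecycle described above may be sketched as follows. The class name `Dataset`, its method names, and the in-memory dictionaries are hypothetical and are chosen solely for this example; the sketch is not a description of any particular implementation.

```python
class Dataset:
    """Toy in-memory dataset with transaction-gated visibility (hypothetical)."""

    def __init__(self):
        self.version = 0        # incremented on every commit
        self._committed = {}    # file name -> contents visible to readers
        self._open = {}         # transaction id -> updates staged so far
        self._next_txn_id = 1

    def open_transaction(self):
        """Grant a caller permission to stage updates; return a transaction id."""
        txn_id = self._next_txn_id
        self._next_txn_id += 1
        self._open[txn_id] = {}
        return txn_id

    def write(self, txn_id, name, contents):
        """Stage an update; it stays invisible to readers until commit."""
        if txn_id not in self._open:
            raise KeyError(f"transaction {txn_id} is not open")
        self._open[txn_id][name] = contents

    def commit(self, txn_id):
        """Finalize the staged updates and increment the version number."""
        staged = self._open.pop(txn_id)
        self._committed.update(staged)
        self.version += 1
        return self.version

    def read(self):
        """Return a copy of the committed state only."""
        return dict(self._committed)
```

In this sketch, a call to `read()` made between `write()` and `commit()` returns only previously committed files, which mirrors the visibility rule described above.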
Datasets may be modified by different types of transactions. A snapshot transaction replaces all files in the dataset with a new version or stores a snapshot of all the current files in the dataset. An append transaction adds a new file to a dataset. An update transaction adds a new version to an existing file or existing files in a dataset. Each transaction is identified by the version number and/or a transaction identifier.
When a query is received for data from the dataset, the server computer system uses the transactional nature of the system to build a dataset. First, the server computer system identifies a snapshot of the dataset. Second, the server computer system identifies each transaction that occurred after the snapshot of the dataset. Finally, the server computer system builds the dataset by applying the identified transactions to the identified snapshot of the dataset. By storing updates as completed transactions, the server computer system can effectively roll back the dataset to any transaction identifier.
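The three steps above may be sketched, under simplifying assumptions, as a replay of a transaction log. The names `Transaction`, `build_view`, and `as_of` are hypothetical. The sketch assumes that the log is ordered by ascending transaction identifier and that every non-snapshot transaction simply merges its files into the view, which is a simplification of the transaction types described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    txn_id: int   # assumed to increase monotonically along the log
    kind: str     # "snapshot", "append", or "update"
    files: dict   # file name -> contents carried by this transaction


def build_view(log, as_of=None):
    """Rebuild the visible files as of transaction `as_of` (inclusive).

    With `as_of=None`, the whole log is used. Passing an earlier
    transaction identifier yields the rolled-back state of the dataset.
    """
    history = [t for t in log if as_of is None or t.txn_id <= as_of]

    # Step 1: locate the most recent snapshot in the selected history.
    start = 0
    for i, txn in enumerate(history):
        if txn.kind == "snapshot":
            start = i

    # Steps 2 and 3: apply every transaction from that snapshot onward.
    view = {}
    for txn in history[start:]:
        if txn.kind == "snapshot":
            view = dict(txn.files)   # a snapshot replaces all files
        else:
            view.update(txn.files)   # later transactions add or replace files
    return view
```

For example, given a log containing a snapshot followed by two later transactions, `build_view(log)` returns the snapshot's files with both later transactions applied, whereas `build_view(log, as_of=...)` with the snapshot's own identifier returns only the snapshot's files.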
While transaction-based updates are useful for responding to queries efficiently and for maintaining version control, a transaction-based system is less effective where low-latency data is required. Some types of datasets, such as financial market pricing datasets, are most useful when they are updated frequently, because prices may change rapidly. Ideally, an application accessing financial market pricing datasets would receive real-time values, so that the values queried from the dataset match the current financial market prices.
While a transaction is open and not fully committed, the updates in the transaction are not visible to other applications or computing devices. Thus, to use a transaction-based system to provide low-latency data, a server computer system must open transactions, update data, and commit the transactions at the rate at which the data changes, which requires opening and committing a relatively large number of transactions in a short period. Opening and committing many transactions to deliver real-time data is computationally expensive and inefficient.
Thus, there is a need for a technique for handling streaming updates while maintaining the benefits of a transaction-based system.