1. Field of the Invention
The present invention generally relates to updating data storage systems. More particularly, the invention relates to updating a data warehouse that maintains databases with complex dependencies by means of an input user-defined relationship associating partitions among the databases. The invention additionally relates to updating such a data warehouse by means of an input user-defined relationship associating multiple temporal values of a database with the partitions of that database.
2. Brief Description of the Related Art
Communications network administrators, particularly in the field of telephony, are known to process data streams such as network traffic traces, system logs, transaction logs, financial tickers, sensor data feeds, and results of scientific experiments. In the streaming model, raw data files are generated continuously in an append-only fashion, with the processing entity having little control over the length, arrival rate, and arrival order of the input data stream containing new data items. Furthermore, each data item is associated with a timestamp, typically representing its generation time as recorded by the source.
In order to handle real-time processing (such as through queries) over high-speed data feeds, Data Stream Management Systems (DSMSs) restrict the amount of accessible memory. Generally, a telecommunications company executes queries over live network traffic by splitting the stream into contiguous and non-overlapping time windows, each spanning no more than a few minutes. When a window ends, the answer is streamed out, the contents of the window and any temporary state are discarded, and computation over the next window begins anew. For instance, a query may track per-client bandwidth usage over each time window. In addition to non-overlapping windows, other DSMSs allow queries to reference “sliding windows” of recently arrived data. At any time, a sliding window of length w contains data whose timestamps are between the current time and the current time minus w. Still, the sliding window size is bounded by the amount of available main memory, as the cost of disk I/O could prevent the system from keeping up with the stream.
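The two windowing models described above can be illustrated with a minimal Python sketch. It is not taken from any particular DSMS; the names `tumbling_window_usage` and `SlidingWindow`, the record layout (timestamp, client, bytes), and the assumption that records arrive in timestamp order are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict, deque


def tumbling_window_usage(records, window_seconds):
    """Yield (window_start, {client: total_bytes}) per non-overlapping window.

    Assumption: `records` is an iterable of (timestamp, client, nbytes)
    tuples that arrive ordered by timestamp.
    """
    current_start = None
    totals = defaultdict(int)
    for ts, client, nbytes in records:
        start = ts - (ts % window_seconds)  # window this record falls into
        if current_start is None:
            current_start = start
        elif start != current_start:
            # The window has ended: stream out the answer, discard its state.
            yield current_start, dict(totals)
            totals = defaultdict(int)
            current_start = start
        totals[client] += nbytes
    if current_start is not None:
        yield current_start, dict(totals)  # flush the last open window


class SlidingWindow:
    """Keep items whose timestamps lie between (now - w) and now, inclusive."""

    def __init__(self, w):
        self.w = w
        self.items = deque()  # (timestamp, item), oldest first

    def add(self, ts, item):
        self.items.append((ts, item))

    def contents(self, now):
        # Evict items that have fallen out of the window.
        while self.items and self.items[0][0] < now - self.w:
            self.items.popleft()
        return [item for ts, item in self.items if ts <= now]
```

In this sketch the tumbling-window function holds state for only one window at a time, while the sliding window holds every item of the last w time units in memory, which is why its size is bounded by available main memory.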
Rather than performing light-weight computations on-line and discarding data shortly thereafter, a data stream warehouse or data storage system accumulates historical data for complex off-line analysis. A telecommunications company will typically collect and store terabytes of IP traffic summaries, records or streaming results of queries over the live network, and system logs produced by network elements, such as reports or router alerts. The method and system of storing such data in a data storage system is commonly referred to as “data warehousing”. Historical data are used for monitoring, troubleshooting, and forecasting, as well as for detecting patterns, correlations and changes in network behavior. For example, a network engineer may want to correlate router error messages with changes in the amount or nature of traffic passing through the router immediately before an error was reported.
Querying and updating massive databases is a fundamental challenge in maintaining a data warehouse (also referred to more generally herein as a data storage system). Typically, raw data files are stored in a raw database, and the results of queries (derived data files) are stored in derived databases. These derived databases can have complex dependencies on one or more other databases, which makes them difficult to update.
There is therefore a need for an efficient method and system of generating, updating and maintaining a data storage system. Such a method and system preferably takes advantage of the timestamps or temporal values associated with the input data stream, the stored raw data files and the derived data files. Preferably, such a method and system is capable of updating raw data files as well as complex dependent derived data files without recompiling entire databases, tables or files.