The present disclosure relates generally to data processing, and more specifically to data processing relating to database aggregation operations.
Data processing has become increasingly important as it is now widely used in industries such as information management, statistics, business, and finance. Database technology is a fundamental part of data processing, especially in supporting information storage and management. Database management is challenging, however, because the amount of stored data and the number of applications associated with data warehousing have increased constantly. One particular challenge is performing aggregation operations on massive amounts of data, particularly data that is labeled as historical data.
The aggregation operation is a basic operation of data warehousing applications and is used in a variety of ways, such as in statistical operations, reporting requests, and data mining. Common types of aggregation operations include SUM, AVG, MAX, MIN, COUNT, and other similar basic functions that are used on a daily basis. For example, a certain bank needs to count the number of transactions relating to individual funds where the transaction amount exceeds 30,000 dollars. The query relates to all such transactions in the past 3 years. The inquiry can be further expanded to cover the maximum transaction amount, the average transaction amount, or both. Traditionally, a common processing method for handling such a request involves generating a query statement and then scanning all the rows of a database table to find the records from the past 3 years, stored in the historical data bank, that satisfy the condition. The maximum value and average value are then calculated from those records to satisfy the expanded inquiry conditions. Since the number of transactions of a bank is extremely large, the amount of stored historical data is necessarily quite large. Consequently, the amount of time and resources that need to be allocated to the task is substantial. Not only does it take a long time to run the data query on the large number of existing records and to extract the ones that satisfy the condition from the massive historical data bank, but the requested calculations need to be performed as well, which adds to the time and effort needed to complete the task. These types of tasks can take anywhere from several hours to several weeks to complete, depending on the amount of data in storage.
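The traditional full-scan processing described above can be sketched as follows. This is only an illustrative sketch, not a method taken from the disclosure: the Transaction record, the scan_aggregate function, the sample rows, and the cutoff date are all hypothetical, and for brevity the sketch aggregates over all funds rather than grouping per fund.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Transaction:
    # Hypothetical record layout, assumed for illustration only.
    fund_id: str
    amount: float
    day: date


def scan_aggregate(rows, threshold, since):
    """Visit every row once; return (count, max, average) of qualifying amounts."""
    count = 0
    total = 0.0
    largest = None
    for row in rows:  # full scan: cost grows with the total number of rows
        if row.amount > threshold and row.day >= since:
            count += 1
            total += row.amount
            if largest is None or row.amount > largest:
                largest = row.amount
    average = total / count if count else None
    return count, largest, average


# Tiny made-up history; a real historical data bank would hold far more rows.
history = [
    Transaction("F1", 45000.0, date(2023, 5, 1)),
    Transaction("F1", 12000.0, date(2023, 6, 1)),   # below the threshold
    Transaction("F2", 31000.0, date(2024, 1, 15)),
    Transaction("F2", 90000.0, date(2019, 3, 9)),   # outside the time window
]
result = scan_aggregate(history, threshold=30000, since=date(2022, 1, 1))
```

Every row is examined even though only two of the four sample rows qualify, which is why the cost of this approach follows the size of the historical data rather than the size of the answer.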
Some tools, such as a multi-dimensional database (MDDB), can be used to help alleviate the above-mentioned issues. An MDDB has some advantages compared to relational databases. For example, when it is known that the key value combinations of data columns will be accessed in a relatively uniform manner, an MDDB can improve query efficiency and thereby improve data processing speed and response time. However, there are tradeoffs when using an MDDB. One disadvantage is that the MDDB needs to store all possible combinations in order to cover all data records to which a query statement may relate. The storage required to accomplish this can be both tremendous and costly. In many instances, therefore, the tradeoff is not worthwhile, because key value combinations are seldom accessed in a uniform manner.
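The storage concern can be made concrete with a rough count of combinations. The sketch below is an assumption-laden illustration: the dimension names, their cardinalities, and the number of actual records are hypothetical figures chosen only to show how the number of possible key value combinations can dwarf the number of records that actually exist.

```python
from math import prod


def dense_cube_cells(cardinalities):
    """Number of key value combinations if every combination is stored."""
    return prod(cardinalities.values())


# Hypothetical dimensions: distinct funds, branches, and days in 3 years.
dims = {"fund": 5000, "branch": 2000, "day": 1095}
cells = dense_cube_cells(dims)

# Hypothetical number of transaction records actually present.
actual_rows = 50_000_000
fill_ratio = actual_rows / cells  # fraction of combinations that hold data
```

Under these assumed figures, well under one percent of the stored combinations would correspond to real records, and adding one more dimension multiplies the cell count again.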
Another possible solution for reducing query time involves calculating certain functions beforehand. The obvious disadvantage of this solution is that an improvement occurs only where the operations or calculations can be performed beforehand with respect to a predefined, particular query statement. Such instances are infrequent, and the resulting improvements are not large. This solution cannot be applied to a variety of random query statements, especially those that require data statistics, analysis, and mining. In addition, it is difficult to predict or predefine all particular query statements for an operation beforehand.
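The limitation of calculating results beforehand can be illustrated with a minimal sketch. The names PREDEFINED_THRESHOLD, precomputed_count, and count_above, as well as the sample amounts, are hypothetical and serve only to show that a result prepared for one predefined query statement does not help a query with a different condition.

```python
# Made-up transaction amounts standing in for the historical data.
amounts = [45000.0, 12000.0, 31000.0, 90000.0]

# Result calculated beforehand for one predefined query:
# "count the transactions whose amount exceeds 30,000".
PREDEFINED_THRESHOLD = 30000
precomputed_count = sum(1 for a in amounts if a > PREDEFINED_THRESHOLD)


def count_above(threshold):
    """Return (count, how it was obtained) for an 'amount > threshold' query."""
    if threshold == PREDEFINED_THRESHOLD:
        # Only the predefined query can reuse the stored result.
        return precomputed_count, "precomputed"
    # Any other condition falls back to scanning all the data again.
    return sum(1 for a in amounts if a > threshold), "full scan"


predefined = count_above(30000)
ad_hoc = count_above(40000)
```

The ad hoc query with a threshold of 40,000 gains nothing from the stored result and must scan the data again, which mirrors the difficulty of predefining every query statement in advance.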