This invention relates to an improved database processor and system architecture.
1. Field of the Invention
Relational database management systems (DBMSs) are used throughout organisations for maintaining data that changes on a daily basis recording individual transactions as they occur. These databases are called operational databases and their data processing applications are known as On-Line Transaction Processing (OLTP) applications. DBMSs have been able to perform these functions satisfactorily for some years.
2. Description of Related Art
Increasingly organizations utilize data processing applications which allow them to compare current and past data, enabling them to highlight trends and thus support decision making. In order that the decision support system is effective, many organizations create consolidated data warehouses which collect the data from several heterogeneous operational databases or other information sources and integrate them for retrieval of historical and summary information. This is known as On-Line Analytical Processing (OLAP) and the computations required are complex, involve large volumes of data and typically involve data aggregation.
The functional and performance requirements of OLAP are quite different from those of the OLTP data processing applications traditionally supported by the operational databases.
OLTP applications typically automate clerical data processing tasks such as order entry and banking transactions that are the day-to-day operations of an organisation. These tasks are structured, repetitive and consist of short, atomic isolated transactions which require detailed, up-to-date data, and read or update, a few (tens of) records accessed typically on their primary keys. Operational databases tend to be hundreds of megabytes to gigabytes in size.
OLAP applications in contrast, are targeted for decision support. Historical summarised and consolidated data is more important than detailed, individual records. Since data warehouses typically contain consolidated data from several databases, potentially over long periods of time (for example as indicated in Table I), they tend to be orders of magnitude larger than the operational databases. The workloads are query intensive with mostly ad-hoc, complex queries that can access millions of records.
Most database management systems to-day are based on the relational data model. The central data description construct in this model is a relation which is a set of records, each of which has the same number (and type) of fields. A database is a collection of one or more relations and includes a schema—a description of the data. In the relational model, the schema for a relation specifies its name, the name of each field (or attribute or column) and the domain (type) of each field. For example, a relation called Product may have the following schema: Product(Product—ID:string, Type:string, Day:string, Month:integer, Year:integer), which denotes that each row in the relation has four fields, with field names and types as indicated. An example instance of the Product relation would be:
Product—IDTypeProduction—YearP1Shampoo1997P2Soap1998P3Conditioner1998
Relational systems allow a wide variety of queries to be posed easily. An example of the type of query the relational system enables are user to find is:                Find the products which have been produced in 1998        
The most widely used commercial relational database language is the Structured Query Language (SQL). The above query can be posed to the database through ANSI/ISO SQL-92 in the following form:    SELECT P.Product—ID, P.Type, P.Day, P.Month, P.Year    FROM Product    WHERE P.Year=1998The abbreviation of P stands for the full relation name (Product).The result relation of the above query is:
Product—IDTypeProduction—YearP2Soap1998P3Conditioner1998
For OLAP applications a recently proposed multidimensional conceptual model may be used in preference to the tabular model (Table I) of the relational system. Data are represented to the users as a multidimensional Cube. The multidimensional model can be associated with the Sales relation shown in Table I. Each domain is classified either as a Dimension or a Measure of Interest. In this example, the measure of interest is Total—Sales and the dimensions are PRODUCT—ID, STORE, MONTH.
The fundamental operation in OLAP applications is aggregating a measure over one or more dimensions.
In the cube those domains considered as dimensions are placed on the sides of the cube and the measure of interest (Total—Sales) is the content of each cell of the multidimensional array as shown in FIG. 1.
The cube data model introduced new conceptual operations. The user can ‘rollup’ the cube to obtain a less detailed more abstract view of the data. In the opposite process the user can ‘drill-down’ expanding the representation from the total to a further level of detail. ‘Slice’ amounts to equality selection and ‘dice’ amounts to a range selection. ‘Rotate’ is when the view is rotated 90 degrees clockwise without any rearrangement of the data.
A dimension may be associated with a hierarchy. For example the TIME dimension could have the following hierarchy:
YEAR -------> QUARTER --------> MONTH --------> DAY
The multidimensional operations can utilize such hierarchies, e.g., Rollup of a TIME dimension can be done to each of the above levels as required by the user.
Aggregation which is the fundamental operation in OLAP transactions is classified into Scalar aggregates and aggregate functions. Scalar aggregates calculate a single scalar value from a unary-input relation, e.g., the max value of an attribute in a relation. An aggregate function takes a binary input relation, e.g., Total of Sales in each country. The aggregate functions take relations as input and return relations as an output. They are classified into Distributive, Algebraic and Holistic.
For example, the users of such a system typically wish to obtain information of much larger granularity than a single sale, e. g, The Stores' performance per month, irrespective of the individual product.
This would require as regards the relation in Table I aggregation of Total—Sales over the STORE and TIME dimensions. It can be expressed using the following SQL statement
SELECTStore, Time as (SUM) SalesFROMSalesGROUP-BYStore, Timeto give the result shown in Table II.
There are three disadvantages in the SQL/relational approach. The first is that many different aggregations are possible. In typical practical cases, the Sales relation of the above example may well have many more domains than shown in the example and each different aggregate requires a distinct SQL program. The second disadvantage is that it is by no means easy to visualise the computations conceptually. Finally, most users to-day expect to manipulate their data graphically.
The user interface to the Multidimensional data model (Cube) allows the user to specify the conceptual multidimensional operators graphically, for example as shown in FIG. 2.
Users in OLAP systems typically require to investigate many of the different possible aggregates. Given a measure of N dimensions there are 2N possible aggregates (the above example with three dimensions giving rise to 8 aggregates). It is important that users can specify and visualize the different levels of aggregation with ease. Typically this would be done in real time using a graphical user interface operating with the Cube-by operator to explore the multidimensional environment. The Cube-by operator is the n-generalization of simple aggregate functions and it computes Group-bys corresponding to all possible combinations of a list of attributes.
With the introduction of the Cube-by operator it became easier for users to specify and visualize different levels of aggregation. However, the processing required to support this operator is enormous in terms of time and storage. Particularly where hierarchies are involved the effective number of possible combinations becomes unmanageable.
Many workers have attempted to approach the implementation of the Cube-by by directly implementing Materialized Views which are precomputed for later retrieval. This is because it has been judged that the computation of aggregates ‘on the fly’ (i.e., on demand) would be too slow. Though significant research has been carried out to optimize the Materialized View approach, the method is inherently unsatisfactory. First there are a large number of Views. This implies a long precomputation time and precludes even modest update. Storage of the Views requires storage capacity which is possibly more than an order of magnitude greater than that required for the original data.
For the highest aggregations, the pre-computation approach is economical in the additional storage resources consumed, while providing gratifyingly rapid response. The low-level aggregates however are large and numerous, resulting in a many-fold expansion of the original relation. They virtually always require further processing, e.g., restriction, etc., to extract the tuples of interest. The most significant disadvantage, however, of the pre-computational approach is that it is not suited to dealing with update, other than as a batch-type operation separate from its interactive use. While there are undoubtedly situations in which this is acceptable, it nevertheless represents a major restriction on the system performance envelope for this approach.
The major disadvantage of a purely direct computational approach (i.e., on the fly) is the work which must be undertaken for each operation, and hence the much greater delay given equivalent systems performance parameters.
A partial-computation approach is possible but has been a subject of research by many workers mainly because to select the materialized views is not trivial. Systems architecture has also been the subject of research to minimize the total query response and the cost of maintaining the selected views, given a limited amount of resources (time, storage).
The above mentioned approaches have their advantages and disadvantages.