The present invention relates to database management systems, and more particularly to database query handling.
Commercial database systems accumulate data through OLTP (On Line Transaction Processing) operations. FIG. 1 illustrates a number of OLTP clients 110, each of which may be, for example, a cash register in a company store. Each sale of a particular product may generate a record such as:

<cashier's name; cash register ID; store location; product ID; price per item; quantity sold; time of sale; . . . >  (1)

This record is temporarily stored in an Operation Data Store (ODS) 120 in a row-oriented structure that reflects the atomicity of the collection method. Thus, the record's different fields are stored sequentially in computer storage, e.g. on a magnetic disk. ODS 120 periodically flushes its records to a persistent storage repository provided by a host database system (HDS) 130. Data may be reorganized on HDS 130 to provide data warehouse (DW) functionality, i.e. speedy generation of responses to complex analytic queries (generation of such responses is referred to as OLAP, or On Line Analytic Processing). The reorganization process may involve normalization to standardize representational references. Normalization may take an ODS structure containing a main transaction table (i.e. the table of records (1) stored in rows) and a small number of lookup tables, and generate dozens of targeted tables. A targeted table contains more than one, but fewer than all, of the attributes of records (1). Each record (1) retains a row structure and maintains its referential integrity in HDS 130, but is distributed over multiple tables in non-atomic form. Multiple joins are later needed to reconstruct an individual record. One factor contributing to the cost (time) of executing join operations is reading the data (IO) from disk into operating memory (e.g. semiconductor memory) and scanning the rows. For a given analytic or reporting query, computational costs are typically an order of magnitude less than the IO cost.
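The normalization and join behavior described above can be sketched as follows. This is a minimal illustration with hypothetical table and attribute names (not taken from the patent figures): a record of form (1) is split across two targeted tables sharing a row identifier, and a join is required to reconstruct it.

```python
# Hypothetical sketch: a transaction record of form (1) normalized into two
# targeted tables keyed by a shared row ID; a join reconstructs the record.

sale = {"row_id": 1, "cashier": "Smith", "register_id": 7,
        "store": "Denver", "product_id": "P100", "price": 2.50,
        "quantity": 3}

# Targeted tables: each holds more than one, but fewer than all, attributes.
table_who = [{"row_id": 1, "cashier": "Smith", "register_id": 7,
              "store": "Denver"}]
table_what = [{"row_id": 1, "product_id": "P100", "price": 2.50,
               "quantity": 3}]

def join(left, right, key):
    """Naive nested-loop join on a shared key column."""
    out = []
    for l in left:
        for r in right:
            if l[key] == r[key]:
                merged = dict(l)
                merged.update(r)
                out.append(merged)
    return out

# Multiple reads plus a join are needed to recover the original atomic row.
reconstructed = join(table_who, table_what, "row_id")[0]
assert reconstructed == sale
```

The nested-loop join here stands in for whatever join algorithm the HDS actually uses; the point is only that reconstruction requires touching several tables.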
Modern HDSs use a variety of techniques to tune the performance of join operations, principally relying on keys and indexes to map value references in a field of one table to their definitions in another table.
The ordinary normalization process leads to inefficiencies for analytical purposes, where it may become necessary to read data from a large number of tables in order to calculate aggregated or fully qualified results. Consequently, many companies create a third representational form of the data by reorganizing the contents of the DW into multiple data marts (DMs) using star schema structures to reduce the number of tables involved in join operations. While this approach has benefits that make it more efficient than attempting to report against the DW, inherent problems remain which limit the usefulness of the DM approach in the presence of divergent reporting requirements and increasing data volumes. Among these are that the rows in star schema dimensional and fact tables tend to be wide, and the number of rows in the fact tables tends to grow over time. Thus, the IO problem remains a constraint limiting the usability of star schema solutions.
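The star schema arrangement described above can be sketched as follows; the table contents and names here are hypothetical. A central fact table references compact dimension tables, so an aggregate report needs only one join per dimension rather than the many joins a fully normalized DW may require.

```python
# Hypothetical star schema: a fact table of sales rows pointing into small
# dimension tables via foreign keys.

dim_store = {1: {"city": "Denver", "state": "CO"},
             2: {"city": "Austin", "state": "TX"}}
dim_product = {10: {"name": "widget"}, 11: {"name": "gadget"}}

fact_sales = [
    {"store_id": 1, "product_id": 10, "quantity": 3, "price": 2.50},
    {"store_id": 2, "product_id": 10, "quantity": 1, "price": 2.50},
    {"store_id": 1, "product_id": 11, "quantity": 2, "price": 5.00},
]

# Revenue per state: each fact row is resolved against one dimension only.
revenue = {}
for row in fact_sales:
    state = dim_store[row["store_id"]]["state"]
    revenue[state] = revenue.get(state, 0.0) + row["quantity"] * row["price"]

# CO: 3*2.50 + 2*5.00 = 17.50 ; TX: 1*2.50 = 2.50
assert revenue == {"CO": 17.5, "TX": 2.5}
```

Note that the fact table still carries one row per transaction, which is why, as the text observes, fact tables grow over time and the IO constraint remains.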
The use of multidimensional representations (OLAP cubes) addresses these limitations to some degree by precalculating aggregated results of interest and storing them in aggregated form. Immediate results may often be acquired from cubes, but cubes introduce additional costs that mitigate their value. First, since the data is stored in the cube as aggregated results, secondary query operations are required to retrieve the detailed atomic data that underlies the cube elements. This may require that DMs still be retained, which expands the number of data representations to four. Second, the cube is limited in the scope of analytics it can return to those calculated when it is constructed. This restricts business intelligence access to timely and changing answers to important business questions, and often leads to a proliferation of cubes within an organization. Third, the maintenance cost to prepare and update the cubes introduces additional overhead within organizations, further increasing costs and latency concerns for timely reporting.
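The first two cube limitations can be illustrated with a toy sketch (hypothetical data): a cube precomputes aggregates along dimensions chosen at build time, and any question outside that scope must fall back to the detailed data the cube omits.

```python
# Hypothetical detail records retained outside the cube (e.g. in a DM).
detail = [
    {"state": "CO", "age": 30, "income": 50000},
    {"state": "CO", "age": 40, "income": 70000},
    {"state": "TX", "age": 20, "income": 40000},
]

# Build time: only AVG(age) by state is precalculated into the cube.
by_state = {}
for rec in detail:
    by_state.setdefault(rec["state"], []).append(rec["age"])
cube = {state: sum(ages) / len(ages) for state, ages in by_state.items()}

assert cube["CO"] == 35.0  # answered immediately from the cube

# AVG(income) by state was never precomputed: a secondary query against the
# retained detail store is required, which is the extra cost noted above.
income_co = [r["income"] for r in detail if r["state"] == "CO"]
assert sum(income_co) / len(income_co) == 60000.0
```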
Data searches can be accelerated using a columnar (column-oriented) data representation. See U.S. Pat. No. 7,024,414 issued Apr. 4, 2006 to Sah et al. In such a representation, each attribute's values are stored sequentially in a file or in physical storage. For example, for the database (1), sequential storage can be used to store the store IDs, cities, states, etc. See FIG. 2. Searches and other operations on each attribute are fast because fewer disk read operations are needed to fetch the attribute's values, and because operating memory references are localized during the search. In addition, high data compression ratios are easier to achieve.
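A minimal sketch of the columnar layout follows, with hypothetical data. Each attribute's values are stored contiguously, so a scan touches only the column it needs; run-length encoding is used here purely as an illustration of why contiguous columns compress well (the source does not name a particular compression scheme).

```python
# Row-oriented view of three hypothetical records.
rows = [("Smith", "Denver", "CO"), ("Jones", "Denver", "CO"),
        ("Brown", "Austin", "TX")]

# Column-oriented view: one contiguous sequence per attribute.
columns = {"cashier": [r[0] for r in rows],
           "city":    [r[1] for r in rows],
           "state":   [r[2] for r in rows]}

# A scan over one attribute reads only that attribute's column.
co_count = sum(1 for s in columns["state"] if s == "CO")
assert co_count == 2

def rle(col):
    """Run-length encode a column; effective when values cluster."""
    out = []
    for v in col:
        if out and out[-1][0] == v:
            out[-1][1] += 1
        else:
            out.append([v, 1])
    return out

# Repeated adjacent values collapse, giving high compression ratios.
assert rle(columns["state"]) == [["CO", 2], ["TX", 1]]
```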
Consider an example that demonstrates the cost savings of columnar versus row-wise data storage. According to the 2000 US census, there are approximately 300 million people in the United States, where each person is represented by a demographic record in a single table in the census database. The data includes many fields, such as age, income, state, whether the person rents or owns his place of residence, the person's family size, zip code, employment information, etc. For this example let us assume that each record is 1000 bytes. The database therefore contains about 300 GB (gigabytes) of detail data. We can calculate and return the average ages of residents of each of the 50 states with the following simple SQL query:

SELECT AVG(Age), State FROM Census GROUP BY State;

In a traditional row-wise HDS, the entire contents of the table may be scanned, and the information in the Age (4 bytes) and State (2 bytes) fields will be extracted. However, in a columnar database (CDB), i.e. a database in which each attribute is stored in columnar form, only the two columns relevant to the query will be read from disk, totaling 0.9 GB of data (assuming a conservative 50% compression), a reduction of 99.7% in physical disk IO. An HDS with a sustained effective IO rate of 500 MB/sec will require 600 seconds to read the 300 GB table for this query. Because CDB data is stored by column, where consecutive row values for each field are contiguous, the physical disk IO time is reduced to less than 2 seconds.
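The IO arithmetic in this example can be worked through explicitly, using the figures assumed above:

```python
# Assumptions taken from the example above.
n_rows = 300_000_000          # ~300 million census records
row_bytes = 1000              # assumed record width

table_gb = n_rows * row_bytes / 1e9
assert table_gb == 300.0      # full-table scan in a row-wise HDS: 300 GB

col_bytes = 4 + 2             # Age (4 bytes) + State (2 bytes)
cdb_gb = n_rows * col_bytes * 0.5 / 1e9   # 50% compression assumed
assert cdb_gb == 0.9          # columnar read: 0.9 GB

# 0.9 GB / 300 GB = 0.3% of the row-wise IO, i.e. a 99.7% reduction.
assert round(100 * (1 - cdb_gb / table_gb), 1) == 99.7

io_rate_mb = 500              # sustained effective IO rate, MB/sec
row_scan_sec = table_gb * 1000 / io_rate_mb
cdb_scan_sec = cdb_gb * 1000 / io_rate_mb
assert row_scan_sec == 600.0              # row-wise scan time
assert round(cdb_scan_sec, 3) == 1.8      # columnar: under 2 seconds
```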
As described in the aforementioned U.S. Pat. No. 7,024,414, CDB data can be divided among different compute nodes, each with its own storage node. Each storage node stores its data in columnar form. Data may be replicated on adjacent storage nodes to provide security against storage or compute node failures. A query master divides query processing among compute nodes as needed. Efficient selection of compute nodes for each query execution is critical for short execution times.
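The division and replication scheme can be sketched as follows. The source does not specify the replication mechanics, so this illustration assumes a ring of nodes in which each node's column slice is replicated on the adjacent node; the data and node count are hypothetical.

```python
# Hypothetical column divided among compute/storage nodes, with each slice
# replicated on the adjacent node (ring order) to survive a node failure.

n_nodes = 4
state_col = ["CO", "CO", "TX", "TX", "NY", "NY", "WA", "WA"]

# Divide the column into contiguous slices, one per node.
slice_len = len(state_col) // n_nodes
primary = {i: state_col[i * slice_len:(i + 1) * slice_len]
           for i in range(n_nodes)}

# Replicate each node's slice on the next node in the ring.
replica = {(i + 1) % n_nodes: primary[i] for i in range(n_nodes)}

def survives(failed_node):
    """Reassemble the column with one node down, using the replica."""
    recovered = []
    for i in range(n_nodes):
        if i != failed_node:
            recovered.append(primary[i])
        else:
            recovered.append(replica[(i + 1) % n_nodes])
    return [v for s in recovered for v in s]

# Any single node failure loses no data.
assert survives(2) == state_col
```

A query master in such a scheme would route each query's work to nodes holding the relevant slices, which is why node selection affects execution time.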