A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
The present invention relates generally to data processing environments and, more particularly, to system and methods for improved indexing and processing of date information present in data records in those environments.
Computers are very powerful tools for storing and providing access to vast amounts of information. Computer databases are a common mechanism for storing information on computer systems while providing easy access to users. A typical database is an organized collection of related information stored as xe2x80x9crecordsxe2x80x9d having xe2x80x9cfieldsxe2x80x9d of information. As an example, a database of employees may have a record for each employee where each record contains fields designating specifics about the employee, such as name, home address, salary, and the like.
One purpose of a database system is to answer decision support queries and support transactions. A query may be defined as a logical expression over the data and the data relationships set forth in the database, and results in identification of a subset of the database. For example, a typical query might be a request, in SQL syntax, for data values corresponding to all customers having account balances above required limits. During query processing, a database system typically utilizes one or more indexes to answer queries. Indexes are organized structures associated with the data to speed up access to particular data values (i.e., answer values). Indexes are usually stored in the database and are accessible to a database administrator as well as end users. The basic operation of database systems, including the syntax of SQL (Structured Query Language), is well documented in the technical, trade, and patent literature; see, e.g., Date, C., An Introduction to Database Systems, Volume I and II, Addison Wesley, 1990; the disclosure of which is hereby incorporated by reference.
xe2x80x9cData warehousexe2x80x9d systems represent a type of database system optimized for supporting management decision making by tracking and processing large amounts of aggregate database informationxe2x80x94that is, the data warehouse. Data warehouses contain a wide variety of data that present a coherent picture of business conditions at a single point in time. Development of a data warehouse includes development of systems to extract data from operating systems plus installation of a warehouse database system that provides managers flexible access to the data. A well known example of a data warehouse system is Sybase(copyright) Adaptive Server(copyright) IQ (ASIQ), available from Sybase, Inc. of Emeryville, Calif.
The desire to support fast, ad hoc query processing in large data warehouse applications has led to the development of novel indexing schemes. Within ASIQ, for instance, there are several special-purpose indexing structures, including FastProjection (FP), LowFast (LF), HighNonGroup (HNG), LowDisk (LD), and HighGroup (HG). Each will be briefly described in turn.
FastProjection methodology entails vertical partitioning of the data into a single column stored as an array of data, where each cell in the array is as wide as the data column and the number of cells in the array matches the number of rows in the table. This index is used to provide fast access to the original data value given a row number. In LowFast methodology, an index is employed which comprises a B-Tree together with a set of bitmaps. The B-Tree has one node for each unique value in the column and each node has a bitmap associated with it. The associated bitmap has the nth bit on if the nth row of the table contains the value in the B-Tree. This index is used to provide fast access to a set of row numbers given a value among a small set of distinct values (under 1000). Without further optimization or compression, the technique requires a fair amount of disk space to do this.
In HighNonGroup methodology, an index which comprises a set of bitmaps is employed. The number of bitmaps in the set (i.e., 8xc3x97width-in-bytes) depends on the maximum width of the data in the column. The value for each row in the table is broken into its component bits. The nth bitmap has the mth bit on if the nth bit of the value (taken left to right) at the mth row of the table is on. This index is used to provide fast access to a set of row numbers given a range of values among a large set of distinct values (over 1000), when query grouping operations are not needed. It uses a moderately small amount of disk space to do this.
In the LowDisk methodology, an index which comprises a B-Tree and an HNG index is employed. The B-Tree has one node for each unique value in the column and each node has a small unique integer assigned to it. The small unique integer assigned to the value for each row in the table is broken into its component bits. The nth bitmap has the mth bit on if the nth bit of the small unique integer (taken left to right) assigned to the value at the mth row of the table is on. This index is used to provide moderately fast access to a set of row numbers given a value among a small set of distinct values (e.g., under 1000). It uses a very small amount of disk space to do this but is typically not as fast as the LowFast index.
With HighGroup methodology, an index is employed comprising a B-Tree, an HNG index, and a variation on the FP index. The B-Tree has one node for each unique value in the column and each node has a location in the modified FP index. The embedded HNG index is a normal HNG index and is used to satisfy wide range queries and aggregation functions (i.e., SUM). The B-Tree and modified FP are used to provide fast access to a set of row numbers given a value among a large set of distinct values (over 1000). They are also used to provide efficient handling of grouping queries. The technique uses a relatively large amount of disk space to do this.
For further description of the above-mentioned indexing methodologies, as well as a description of the ASIQ system and architecture, see, e.g., U.S. Pat. Nos. 5,649,181, 5,852,821, 5,794,229, 5,794,228, and 5,924,091, the disclosures of which are hereby incorporated by reference.
During database system operation, query processing depends on the ability to have an index perform comparison operations (e.g.,  less than ,  less than =, =, !=,  greater than , and  greater than =operations) and range operations (e.g., A less than =xc3x97 less than B). The cost of performing any one of these operations directly affects how well suited the index is for resolving a given query. Often, data warehouses track historical (i.e., date-based) information, such as sales information gathered over a period of time. Expectedly, date processing within queries tends to be quite common in the typical data warehouse. Representative examples include the following, shown as sample SQL xe2x80x9cselectxe2x80x9d queries.
select . . . from table where 1_shipdate greater than =xe2x80x981993-07-01xe2x80x99 and 1_shipdate less than xe2x80x981993-10xe2x80x9901
select . . . from table where o_orderdate less than xe2x80x981993-03-10xe2x80x99
These examples may serve to illustrate some of the indexing operations performed on date columns. Choosing an appropriate index for the date column will directly affect how quickly the query can be resolved. For a low number of date values (i.e., low cardinality data), one would typically select an HNG index type. These indexes are good for a wide range of queries, including aggregations, since query results can be obtained by reducing a given query into a sequence of logical AND, OR, and XOR operations performed on these bit vectors or bitmaps.
In gauging query processing efficiency, a good measure of the performance of any given operation (e.g., comparison operation) can be thought of as the number of bitmap-logical operations the system has to perform: the fewer bitmap operations required, the better the system""s performance. Ignoring for a moment the I/O (Input/Output) costs, one can measure the cost of performing the comparison ( less than ,  less than =, =, !=,  greater than ,  greater than =) and range operations for an HNG index on a date column by calculating the number of bitmap operations required to resolve the query. With ASIQ, a date is a 32-bit quantity (e.g., 32-bit integer). Hence, for this sample datatype, a range query requires at the most 64 bitmap operations, and a comparison operation (i.e.,  less than ,  less than =, =, !=,  greater than , or  greater than =) requires at the most 32 bitmap operations. Of course given a large data warehouse containing millions of records, the cumulative effect of these bitmap operations for a given date query may consume substantial system resources and time.
As there is much interest in improving the performance of data warehouse systems, there is a need for a new index structure specifically designed for date columns that will significantly reduce the number of bitmap operations required to do each of these operations and also significantly reduce the amount of I/O required to do each operation. The advantage of such a solution becomes even more pronounced as the number of rows in a given index space growsxe2x80x94a situation commonly encountered with data warehouses. The present invention fulfills this and other needs.
System and methods are described for improved indexing and processing (e.g., query processing) of date-based information. The system provides a specialized date index, the DateIndex, for indexing date-based information through use of bitmap groups or sets. In particular, individual groups are provided for year, month, and day components, respectively.
Each bitmap within a given group may be defined to be an infinite array (bit[ ]) of bits, with any given bit xe2x80x9cixe2x80x9d having a value of 0 or 1, only. A bitmap may be used to identify which rows in a database table have a particular value X. For example, to track rows that have a value 5, as one inserts values into a given database table for each row xe2x80x9cixe2x80x9d that has value 5, one may set bit[i]=1.
A date value may be decomposed into individual year (y), month (m), and day (d) components. Within a database system, dates have the following characteristics.
Range of 12 months maximum
Range of 31 days maximum
Range of N years (N not necessarily known a priori)
A DateIndex index data structure of the present invention is defined to comprise three distinct groups of bitmaps: one for years, another for months, and another for days, respectively. This index can be created on a column having a datatype of xe2x80x9cdatexe2x80x9d (e.g., standard SQL datatype which stores date literals).
Consider, for instance, a database table with a single column C with an accompanying DateIndex index on that column. The month group contains exactly 12 bitmaps (call them M[1], M[2], . . . , M[12]) to track rows in column C containing month ranges for that column. The month group tracks rows as follows.
M[1] tracks rows in table that have dates containing months 1 through 12, inclusive
M[2] tracks rows in table that have dates containing months 2 through 12, inclusive
M[3] tracks rows in table that have dates containing months 3 through 12, inclusive . . .
M[12] tracks rows in table that have dates containing months 12 through 12, inclusive
As shown, each successive bitmap stores a successively smaller subrange of possible month values (e.g., 1-12, 2-12, 3-12, and so forth and so on). M[1] contains a 0 for row xe2x80x98ixe2x80x99 when the value of row xe2x80x98ixe2x80x99 is NULL.
The day group contains exactly 31 bitmaps (call them D[1], D[2], . . . , D[31]) to track rows in column C containing ranges of days for that column. In a manner similar to that described above for the month group, the day group is employed as follows.
D[1] tracks rows in table that have dates containing days 1 through 31, inclusive
D[2] tracks rows in table that have dates containing days 2 through 31, inclusive
D[3] tracks rows in table that have dates containing days 3 through 31, inclusive
D[31] tracks rows in table that have dates containing days 31 through 31, inclusive
Again, each successive bitmap stores a successively smaller subrange of possible values. D[1] contains a 0 for row xe2x80x98ixe2x80x99 when the value of row xe2x80x98ixe2x80x99 is NULL.
Continuing in this manner, the year group contains N bitmaps (call them Y[1], Y[2], . . . , Y[N]) to track rows in column C containing ranges of years. N is the number of distinct years. Suppose a given database table contains dates with years 1997, 1999, and 2000. In such a case, N=3 and one has the following.
Y[1] tracks rows in table that have dates containing years 1997, 1999, and 2000, inclusive
Y[2] tracks rows in table that have dates containing years 1999, and 2000, inclusive
Y[3] tracks rows in table that have dates containing years 2000, inclusive
As before, each successive bitmap stores a successively smaller subrange of possible values, in this case, year date values. Y[1] contains a 0 for row xe2x80x98ixe2x80x99 when the value of row xe2x80x98ixe2x80x99 is NULL. Note for the example dates above (i.e., 1997, 1999, and 2000), there is no bitmap containing the year range starting with 1998 since there are no rows in column C containing date values consisting of that year. In other words, no bitmap is employed to represent the subrange of 1998-2000, as there are no 1998 year date values.
The above description highlights the current embodiment whereby a bitmap is used for every month subrange (1-12) and every day subrange (1-31) regardless of whether the data set has any date values with months (or days) in the range. For example, if the data set only contains data values with months January, February, and April, then the month subrange bitmaps M[5] through M[12], inclusive, are empty.
Alternatively, one could organize the month group and day group like the year group. For example, instead of having 12-month subrange bitmaps, one employs only N where N is the actual number of distinct months in the data set (i.e., January, February, and April in the preceding example). In other words, there is no bitmap containing the subrange March-December as there are no March date values. In this manner, a similar approach may be adopted for the day group. Nevertheless, it is not believed that organizing the month and day groups like this yields any appreciable gains because of the extra processing to access the appropriate date or month group and the fact that, with the currently-preferred embodiment, empty bitmaps occupy minimal space within the database system.
Finally, in accordance with the present invention, an xe2x80x9cexistence bitmapxe2x80x9d (EBM) is used to track whether particular rows exist (with regards to a particular record identifier (RecordID)) in the column C. Here, EBM[i]=1 means row with RecordID xe2x80x98ixe2x80x99 exists; 0, of course, means the row with RecordID xe2x80x98ixe2x80x99 does not exist.
Armed with this information, the system can now resolve a desired query operation using set algebra. For a query predicate involving an xe2x80x9cequal toxe2x80x9d operation (EQ), for instance, the xe2x80x9cequal toxe2x80x9d operation may be reduced to the following.
EQ=(Y[y]xe2x88x92Y[y+1])∩(M[m]xe2x88x92M[m+1])∩(D[d]xe2x88x92D[d+1])
(where xc3x85 indicates set intersectionxe2x80x94logical xe2x80x9candxe2x80x9d operation)
A xe2x80x9cnot equal toxe2x80x9d operation (NE) may be reduced to the following.
NE=EBMxe2x88x92EQ
A xe2x80x9cgreater thanxe2x80x9d operation (GT) may be reduced to the following.
GT=Y[y+1]∪((Y[y]xe2x88x92Y[y+1])∩M[m+1])∪((Y[y]xe2x88x92Y[y+1])∩(M[m]xe2x88x92M[m+1])∩D[d+1])
(where ∪ indicates set unionxe2x80x94logical xe2x80x9corxe2x80x9d operation)
A xe2x80x9cgreater than or equal toxe2x80x9d operation (GE) may be reduced to the following.
GE=Y[y+1]∪((Y[y]xe2x88x92Y[y+1])∩M[m+1])∪((Y[y]xe2x88x92Y[y+1])∩(M[m]xe2x88x92M[m+1])∩D[d])
This can be simplified to the following using set algebra distributive laws.
GE=Y[y+1]∪((Y[y]xe2x88x92Y[y+1])∩(M[m+1]∪((M[m]xe2x88x92M[m+1])∩D[d])))
The xe2x80x9cless thanxe2x80x9d (LT) operation is defined in terms of the preceding ones, as follows.
LT=EBMxe2x88x92GE
The xe2x80x9cless than or equal toxe2x80x9d (IE) operation is defined in terms of the preceding ones.
LE=EBMxe2x88x92GT
Range operations, such as Date1 greater than X and Date2 less than Y, are simply a combination of the preceding operations as indicated below.
Date1 greater than =X and Date2 less than =Y: GE∩LE
Date1 greater than =X and Date2 less than Y: GE∩LT
Date1 greater than X and Date2 less than Y: GT∩LT
Date1 greater than X and Date2 less than Y: GT∩LE
By providing a new index structure specifically designed for date columns, the present invention significantly reduces the number of bitmap operations required to do each of these operations and also significantly reduces the amount of I/O required to do each operation.