1. Field of the Invention
The present invention concerns databases in general, and, in particular, a method for access and display of data from large data sets in which the large data sets are portioned into a plurality of buckets, each of which contain a small subset of records in the large data set.
2. Background Information
During the past decade, the use of databases that store very large amounts of data has become increasingly prevalent. This is due, in part, to the availability of both computer hardware and database software resources to support these large databases. Prior to the present widespread availability of these resources, data was typically stored using “flat-file” databases or data systems running on mainframes or standalone computers using storage schemes that either supported limited sizes, or provided limited real-time access to the data (e.g., data systems that stored data on tapes). Oftentimes, an enterprise organization had various types of data stored on separate machines that did not provide for easy remote access. In contrast, today's IT environments often involve the use of huge centralized repositories in which data for an entire enterprise are stored, wherein the repositories may be accessed from remote clients connected to them via LANs, WANs, or even over the Internet.
The majority of the large databases in operation today are RDBMS (relational database management system) databases. Furthermore, most of these RDBMS databases are SQL-(structured query language) based databases. These SQL databases run on database software provided by various vendors, including Oracle (Oracle 8i and 9i), Microsoft (SQL Server 7), IBM (DB2), Sybase, and Informix. RDBMS databases are usually run on one or more networked database servers and are accessed by client machines connected to the database server(s) via a computer network. On some n-tier architectures, there are one or more middle tiers (e.g., application servers) that sit between a “backend” database server and the clients. A typical 2-tier architecture is illustrated in FIG. 1, wherein a database 10 hosted by a database server 12 may be accessed by a client machine 14 via a network 16. Modern SQL RDBMS databases enable a multitude of client users to concurrently access (i.e., Insert, Update and Delete) data using appropriate client-side software (or through middleware running on an application server), such as client applications that provide a graphical user interface (GUI) that allows users to interactively access database data.
In RDBMS databases, data are stored in tables in accordance with a database “schema,” which defines storage parameters (metadata) that define structures for each of the tables stored in the database, various data constraints, relationships between tables, and indexes, etc. Generally, the table data are stored in one or more shared storage spaces, each comprising one or more datafiles, although there are some RDBMS databases that store data for each table in a separate file, and others that store all of the data for a given database in a single file. Under Oracle's architecture, these shared storage spaces are called “tablespaces.” Typically, an Oracle database will include a plurality tablespaces (e.g., system, user, rollback, etc.), wherein user data are stored in a plurality of tables in one or more selected tablespaces that may be specifically configured using various configuration parameters during creation or alteration of the tablespaces.
Data are stored in a tablespace in the following manner. First, a plurality of segments are allocated for the tablespace and a datafile is assigned to the tablespace, wherein each segment comprises a plurality of fixed-size (e.g., 2K, 4K or 8K) storage blocks. Each of these storage blocks comprise multiple operating storage blocks, which are the base unit the operating system uses to define where data is physically stored. These storage blocks are (typically) filled with data in a substantially sequential manner until close to all of the storage space provided by the allocated segments is consumed. At this point, the tablespace must be “extended” with one or more extents (similar to segments) so that additional data may be added. Typically, the data is logically stored in a plurality of rows, wherein each row of data includes a set of data (i.e., record) pertaining to various columns defined for the table the data are stored in. In Oracle, additional “row ID” information that uniquely identifies every row in the database is stored for each row. The row ID comprises a string having an encoded format that can be parsed by Oracle to quickly access the row of data corresponding to the row ID. The row IDs are used by primary key and other types of indexes to speed up queries and sorting operations.
As discussed above, the data are stored in a substantially sequential manner based on the approximate order the data are entered into a database. (For example, Oracle uses a background operation to write blocks of data in response to predetermined conditions (memory full, time interval, etc.), wherein the block writes are performed continuously, although they are slightly asynchronous to when data are actually entered.) As a result, the data for a given table are stored in a somewhat randomized order. For instance, suppose that contact information corresponding to various customers are stored in a table that includes various rows for storing the contact information, such as last name, first name, address, city, state, phone number, etc. As new contact information is entered into the database, data pertaining to a new row will be written to the initial datafile (or currently active datafile) corresponding to the tablespace the table is assigned to. Since the contact information will usually not be entered in a sequential manner (e.g., alphabetically be last name), a sequential row-by-row examination of the data in the datafile will appear to not follow any predetermined ordering scheme, as illustrated by a randomized set of all contact records 18 in database 10.
In contrast to the foregoing data storage scheme, database users (both people operating client machines and internal software components that perform batch operations) typically desire to retrieve and/or view data in an ordered (i.e., sorted) manner. To meet this criteria, databases provide various solutions for providing requested data in a sorted configuration based on specified sort criteria, such as last name sorted alphabetically. In general, the data corresponding to a requested data set (i.e., query) must first be retrieved to a temporary sort space (comprising physical storage and/or memory space), whereupon the data are sorted corresponding to the predefined ordering scheme, and then provided to an application, applet, or module that is used to present the data to the user (or provide the data to an internal software component user).
In non-indexed queries (i.e., a query that doesn't use a primary key or other index for a table or set of related tables), a full table scan must be performed to retrieve appropriate rows specified by the query (the result set or recordset), and then a memory or disk sort must be performed on the result set prior to providing the data to the user as an “open” data set. For index-based queries, one or more indexes are used to identify the appropriate rows of data in the result set, whereupon mapping information provided by the index(es) (e.g., row IDs) that identifies where those rows of data are located is used to retrieve appropriate rows of data meeting the query search criteria. In some instances, the rows of data are retrieved in a sorted order. In other cases, this is not possible or impractical, and the rows of data are first retrieved and then sorted in the sort space.
For example, suppose a user wanted to view all of the contacts stored in database 10 sorted by their last names. In the two-tier architecture illustrated in FIG. 1, the user is running a client-side application 20 on client machine 14, which includes business logic 22 that generates a query 24 in response to a user request to view all of the contract records and submits query 24 to database 10 via network 16. Assuming the data are stored in a single CONTACT table that includes a LASTNAME column, the corresponding SQL query would look like:
SELECT*FROM CONTACTS
ORDER BY LASTNAME;
Generally, a request for all records in a table will cause a table scan to be performed, regardless of whether or not indexes are used. The rows in the contact table (stored as randomized set of contact records 18) are retrieved and sent to a sort area 26, which depending on the size of the sort may comprise a memory space and/or a temporary disk storage space. The rows are then sorted, using a sort process 28 and then provided back to client-side application 20 as a sorted full recordset 30.
The retrieval and sorting of data can be very time-consuming, particularly if the sorting has to be performed using a disk sort. This is often the case for large result sets, which may involve millions, or even 10's of millions of rows of data. In such instances, even an optimized query using indexes might take 10's of minutes or even hours. Furthermore, such queries are resource (CPU and memory) intensive, often slowing access by other users to a substantial halt.
In response to this problem, RDBMS database vendors have developed various schemes to provide data to GUI-based applications that function as client-side front ends to enable users to access and view the database data. Typically, these schemes are centered around “virtual” lists of data that are either continually loaded using background operations or implemented through the use of built-in SQL commands, such as the TOP command. These conventional schemes are limited in their ability to retrieve and provide data to end users when large data sets are queried.
In continuance of the foregoing example, in one conventional embodiment, sorted full recordset 30 is logically stored as a virtual list that is managed by a virtual list manager 32. (In instances in which a very large number of rows are to be accessed, virtual list manager 32 may actually manage smaller sets of all of those rows that are continually being retrieved from database 10 using a background process—for convenience, a complete data set is illustrated in FIG. 1.) Virtual list manager 32 interacts with a GUI manager 34 that is used to provide display information to a video subsystem that includes a display device (e.g., monitor) 36 to generate a GUI display screen 38 that enables the user to selectively view contact data stored in database 10. Typically, virtual list manager 32 will implement a filter 40 to pass a small number of rows of data to GUI manager 34, which then displays the rows in GUI display screen 38.
The conventional schemes for retrieving and display data from very large data sets have several problems. Most notably, they are very slow, consume a great deal of resources, and in the case of huge data sets, may not even be possible to perform in a useable manner.