Computers, and in particular, computer database applications, are used by businesses and other organizations to monitor and record information about an organization's activities. Often, the organization will have various processes or activities that must be performed, and which recur frequently. Indeed, it is common for an organization to have numerous instances of an activity in various stages of completion at any given time. As one example, a business may sell goods based on orders received from customers. An activity of interest may be fulfilling those customer orders; each purchase order represents a separate instance of that activity. At any particular time, that business may have multiple instances of the activity (i.e., multiple orders from multiple customers) in various stages of completion. As but another example, a financial institution may loan funds to customers based on applications from those customers. An activity of interest may be the processing of a loan application to completion (i.e., approval or rejection), with each application representing a separate instance of the activity. At any particular time, there may be multiple loan application instances in various stages of processing. As yet another example, a governmental entity responsible for issuing permits may have multiple permit applications in various stages of being processed.
In order to monitor numerous instances of an activity, many organizations store information about those activity instances in a database program. In particular, a record or other data object can be created for each instance of the activity. A separate field or other component of the record is then established to hold a value for some type of information common to each instance. Using one of the previous examples as an illustration, a business selling goods may create a separate database record for each customer order. Within that record may be separate fields for the time the order was received, where the order was received, what was ordered, when the order was shipped, etc. Such use of a database program is often conceptualized as a table. Each instance of the activity is assigned a separate row (or tuple) of the table. Each type of information common to multiple instances is then assigned a separate column of the table.
By placing data for each instance of an activity in a database table, it is then possible to analyze the data in various ways. As more and more records accumulate, however, the usefulness of a database can decrease. For a large business such as a goods seller receiving hundreds or thousands of orders per day, the number of records can reach into hundreds of thousands or millions. Each time the database is queried, a finite amount of time is needed to search a disk drive or other storage device. Similarly, as new records are created and existing records updated, a finite amount of time is needed to create or update each of those records. As the number of records grows, the time needed to find a particular record increases. In a business or organization having hundreds (or thousands) of users and hundreds of thousands (or millions) of database records, the latency for database system access can become quite substantial and the system disk(s) may become full.
FIGS. 1-3 provide a more detailed illustration of this problem, and also provide an example to build upon in the subsequent Detailed Description of the Preferred Embodiments. FIG. 1 is a flow chart showing processing of customer purchase orders by a hypothetical wholesale business which sells goods to customers based on customer purchase orders. For convenience, the business will be referred to herein as “Business A.” At block 1, Business A receives a purchase order and creates a database record for the purchase order; the time of order receipt is also entered. At blocks 2 and 3, additional data is input for record fields corresponding to quantity of product ordered and the purchaser's city. At block 4 a decision is made regarding whether the purchase order will be accepted. If not, an appropriate field of the record is populated at block 5 and the time of denial recorded. If the purchase order is approved, the approval is noted. If the purchase order is approved, additional information may be entered (blocks 6 and 7). When the order is shipped (block 8), another field is populated with the time of shipment. Further data may be input at blocks 9 and 10 (e.g., the type of shipping container and the carrier). When the order is delivered, the time of delivery is input (block 11)
FIG. 2 is a table representing a portion of the database for purchase order instances of Business A. Each order is on a separate row, and each column corresponds to a type of data for an order. For simplicity, FIG. 2 only shows columns for some of the information collected in the flow chart of FIG. 1. Certain fields contain NULL values, indicating (in this example) that the value for a particular event is unknown because it has not yet transpired as to that particular purchase order. A typical query of this database might be “which purchase orders above $1000, and that were submitted last week, have not yet been approved or denied?” Such a query could be implemented via a SQL (structured query language) query on the table. Appendix A shows an implementation of SQL code to create the table of FIG. 2 (“create table PO_InstanceData”) and a stored procedure to update the rows of the table in FIG. 2 (“create procedure PO_PrimaryImport”). This stored procedure accepts a PONum argument that uniquely identifies a record corresponding to one activity instance (in this example, a particular purchase order), and one parameter for each column in the table. Multiple arguments of the stored procedure allow the name-value pairs for the call to be accumulated in memory (based on the transactional logic of the application program collecting the data) before attempting to update or insert a record. Because of the processing overhead required for locking and accessing a database record, it is more efficient to update (or insert) in one command as much of a record as possible. The stored procedure first tries to override the non-null columns of the record by issuing an “update” statement, assuming that some data for a PO exists in the table. Here the function “coalesce” is used, which returns the first non-null argument. If no records were updated (@@rowcount=0), this is the first piece of information related to this purchase order, and a record is inserted with the value of all the arguments (even if they are null).
A stored procedure such as in Appendix A is satisfactory when relatively few users or programming threads are attempting to write to a table and when there are relatively few records. Unfortunately, and as shown in FIG. 3, performance degrades over time as the number of records in a table grows. The solid line represents write performance, or the number of records that can be written per second, and rapidly drops to a low level. Conversely, the average Queue Length for disk I/O (dashed line) soon increases beyond acceptable limits. This performance degradation results from the increase in table size. When the number of records is relatively small, performance is initially limited by the speed with which the database server can perform transactions, which is in turn dependent upon the capacity of the server's central processing unit(s). As the number of records increases over time (e.g., as more and more purchase orders are received and processed), performance drops drastically. For example, a first execution of the update statement in the stored procedure of Appendix A would cause a portion of the table (FIG. 2) to be read from a disk (or other non-volatile memory source) into RAM (or other memory system). As long as the total number of records is small, the server is able to cache most or all of the required data in system memory. If a subsequent update requires access to a record already cached in system memory, the server is not required to read the disk again. When the number of the records exceeds memory capacity, however, each operation can require a physical read of the disk. Depending on the hardware used, other queries against (or updates of) the table may be prevented while the disk is being read. Ultimately, this causes unacceptably slow response times for all users. The problem can be exacerbated when more complex manipulation of data in the table must be performed, such as On-Line Analytical Processing (OLAP) and creation of OLAP cubes.