The present invention relates to methods and systems for the creating database indexes. More particularly, the present invention relates to the creation of such an index in a multiprocessor environment.
In order to manage large quantities of data, computer database applications have been developed to organize and store the data in a logical manner. Typical computer databases comprise a significant number of records of information, wherein each record comprises a predetermined number of fields. Additionally, a database management system is typically used to provide the software tools to manipulate the database more simply. Example database management systems include Microsoft(copyright) Access, Microsoft(copyright) SQL Server, among others. A typical database management system provides the user the ability to add, modify or delete data, and the ability to search, sort, or recombine records in the database. Moreover, the usual system also provides functionality related to maintaining the security and integrity of the database data.
In managing large quantities of data, a database application alone may not provide the speed and flexibility in accessing desired data. Therefore, xe2x80x9cindexesxe2x80x9d are often used to provide a presorted view of the database, or more particularly, a portion of the database known as a table. An index is a representation of a database table that is sorted by a predetermined field or fields within the records. Using these indexes, queries may be resolved more efficiently since each index provides relatively short paths to desired information.
As an example, consider a personnel or employee database table incorporating many records of information and wherein each record relates to a one employee. Each record contains numerous fields that contain specific data or other identifying elements for that one employee, such as last name, age, date of birth, sex, etc. Assume further that the entire database table is sorted on the last name of each employee, e.g., by placing the names in alphabetical order. When a user desires information about employees other than something based on name, such as when a user wants to know all employees under the age of 30 for example, there is no simple way to discern this information. Typically, the entire database table must be sorted or at least evaluated based on age and then the user is able to locate the desired records. Such a sorting or scanning process consumes a significant amount of time. For this reason, table indexes may be created that comprise sorted information based on key fields such as age, birth date, sex, etc. Therefore, when information is needed based on one of these fields, i.e., when performing a query, the most relevant index may be used to quickly locate specific records of interest.
In order to generate a table index, a typical database management system must access each record and analyze the particular field or fields, i.e., key field(s) relevant to the particular index, e.g., the age field for the an age index. Using the values within the key field, the database management system orders, sorts or otherwise positions each record into a particular location as compared with the other records. This process can take considerable time because each record must be accessed from the disk and analyzed and then sorted.
It is with respect to these and other considerations that the present invention has been made.
The present invention relates to the creation of a database table index using more than one microprocessor. In order to use more than one processor to create the index, a pre-processing phase partitions the various database table records amongst the multiple processors. Each partition relates to non-overlapping data such that each processor can operate independently of the others to create a sub-index. Once the various sub-indexes are created, a merge operation merges all the sub-indexes into a final, complete index. An upper or top-level index object is created that relates to the combination of the various indices created by the processors.
In accordance with certain aspects, the present invention relates to a method of creating an index for a table of records in a computer environment having a plurality of processing units wherein each processing unit has access to the database table. The method first determines partition delimiters wherein each partition delimiter separates the table into non-overlapping partitions of records. Each of these partitions is dedicated to one processing unit for index creation. Next, each processing unit independently creates a sub-index, i.e., different processing units create at least two sub-indexes. Last, the method merges the sub-indexes together to create a final index related to the database table.
In accordance with other aspects, the present invention further involves each processing unit accessing the database table records in parallel, i.e., wherein each processing unit access each of the records and then filters the accessed records to keep only relevant records for its sub-index. Each processor then sorts its kept records and generates a data structure based on the sorted records. The data structure is essentially the sub-index and may be a balanced tree, i.e., a B-Tree having multiple levels, and it may or may not be a clustered index.
In accordance with other aspects, the present invention also gathers sub-index statistical information and later stitches sub-index statistical information together to create a final index statistical information data structure. The method of gathering sub-index statistical information involves evaluating record field information; generating a histogram relating to the evaluation of the record field information; creating a linked list of data objects related to the histogram; determining variance values between consecutive data objects and storing the variance values in a heap; and reducing the linked list by combining data objects having a relatively low variance value between the respective data objects. The act of stitching the sub-index statistical information involves merging the heap information for each sub-index; linking the linked lists for each sub-index to create a stitched linked list; determining variance values between linked objects; determining a relatively low variance value; and reducing the stitched linked list by combining linked objects associated with the relatively low variance value to created a combined data object. Additionally, the combined data object is linked within the stitched linked list to a previous data object and a subsequent data object. A first variance value between the combined data object and previous data object and a second variance value between the combined data object and the subsequent data object are then calculated and stored in the merged heap so that the stitched linked list may be reduced. The stitched linked list may be reduced to a predetermined number of objects or until the lowest variance value is above a predetermined threshold value.
In accordance with other embodiments, the present inventive method act of determining partition delimiters involves sampling the database table records to determine an approximate distribution of the values in the key field; creating a histogram based on the sampled information; and evaluating the histogram to determine the partition delimiters. Further, this determination may involve determining a processor goal value based on the number of processors in the computer system; determining a least common multiple value based on the processor goal value; and determining whether the histogram information may be substantially evenly split into the least common multiple value number of partitions. If so, creating the partition delimiters based on the least common multiple value or, if not, adjusting the processor goal to determine a new least common multiple value to determine partition delimiters to eventually assign or allocate substantially equal numbers of records to each partition.
The invention may be implemented as a computer process, a computing system or as an article of manufacture such as a computer program product. The computer program product may be a computer storage medium readable by a computer system and encoding a computer program of instructions for executing a computer process. The computer program product may also be a propagated signal on a carrier readable by a computing system and encoding a computer program of instructions for executing a computer process.
A more complete appreciation of the present invention and its improvements can be obtained by reference to the accompanying drawings, which are briefly summarized below, and to the following detailed description of presently preferred embodiments of the invention, and to the appended claims.