Databases and database management systems are being implemented in more and more businesses, organizations and institutions, and are being used to store and manage increasingly large amounts of data of increasingly diverse types and complexity. It is a continuing objective to increase or maintain the speed of responding to data search queries and data analysis tasks in the face of at least the following trends:
    Increased diversity in user needs
    Increased demands for ad hoc and complex decision support queries vs. standard canned reports
    Need for more “real time” information
    Rapidly growing volumes of data
As a result, there is a need for improved database management solutions to facilitate rapid execution of database queries.
Traditional approaches to database organization fail to effectively deliver a solution to such diverse requirements because of high computational cost and slow query performance against large volumes of data. As evidenced, for example, in Fayyad, et al., U.S. Pat. No. 6,374,251 (“Fayyad”), there is an acute, largely unsolved, challenge to sift through large databases in search of useful information in a reasonable amount of time, especially where the database size may be much larger than the available computer memory.
In U.S. application Ser. No. 11/854,788, filed Sep. 13, 2007, and assigned to the assignee of the present application, metadata parameters of various types (referred to therein as data information units, data pack nodes, knowledge nodes and/or statistical data elements) were disclosed as means for characterizing the content of data units where the data units were, for example, a group of data from a single category (or column) extracted from a group of data records. The disclosed method enabled faster processing of a data query by, inter alia, using the metadata parameters to avoid searching individual data records (or fractions thereof) when the metadata parameter indicated those records (or fractions thereof) would not contain data responsive to the data query. Thus, the metadata parameter improved planning and execution of data query operations, irrespective of how the data records were ordered.
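The pruning idea described above can be illustrated with a minimal sketch. It is not the method of the referenced application. It assumes, purely for illustration, that the metadata parameter for each data unit is a minimum/maximum pair over a single integer column. The names `DataPack`, `build_packs` and `query_greater_than` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DataPack:
    """Hypothetical data unit: values from one column plus summary metadata."""
    values: List[int]
    min_value: int  # assumed metadata parameter: smallest value in the pack
    max_value: int  # assumed metadata parameter: largest value in the pack


def build_packs(column: List[int], pack_size: int) -> List[DataPack]:
    """Split a column into fixed-size packs, in arrival order, with min/max metadata."""
    packs = []
    for start in range(0, len(column), pack_size):
        chunk = column[start:start + pack_size]
        packs.append(DataPack(chunk, min(chunk), max(chunk)))
    return packs


def query_greater_than(packs: List[DataPack], threshold: int) -> Tuple[List[int], int]:
    """Return values > threshold and the number of packs actually scanned."""
    matches: List[int] = []
    packs_scanned = 0
    for pack in packs:
        # The metadata shows that no value in this pack can satisfy the
        # predicate, so the individual values are never read.
        if pack.max_value <= threshold:
            continue
        packs_scanned += 1
        matches.extend(v for v in pack.values if v > threshold)
    return matches, packs_scanned


# The column is not sorted; pruning relies only on per-pack metadata.
column = [3, 7, 1, 4, 52, 48, 60, 55, 9, 2, 8, 5]
packs = build_packs(column, pack_size=4)
matches, packs_scanned = query_greater_than(packs, threshold=50)
print(matches, packs_scanned, len(packs))
```

In this sketch only one of the three packs is scanned, because the maximum values recorded for the other two rule them out. How much is skipped depends on how the values happen to be distributed across packs.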
Some known methods organize databases by partitioning data into smaller blocks to assist in subsequent data operations. Partitioning may provide for a physical data model design aimed at organization and enhancement of the data transparently to the logical data model layout. Such physical data model design approaches are generally based on certain assumptions about the way users will use the system and usually leave the tuning task to the database administrator (DBA). Some known approaches attempt to automatically analyze the data and query samples to assist the DBA as a kind of decision support tool, while others attempt to alert the DBA during the system's operation whenever the users' requirements have changed significantly enough to warrant reconsidering the current design.
However, dynamic changes in the users' requirements are not handled sufficiently flexibly by such existing tools. In particular, there is a lack of flexibility in known RDBMS data partitioning techniques, which split the data into separate groups defined in terms of distinct values or ranges of some columns or functions. Such partitioning criteria are so closely related to specific query and data patterns that any change in the query workload or incoming data may cause a significant decrease in the query performance until the data is re-partitioned.
Furthermore, dynamic changes in the database itself, as a result of the arrival of new data, are handled inefficiently by such existing methods. For example, where a user of an RDBMS wants to run queries immediately after new portions of data are loaded into previously existing data sets, it is not acceptable for recalculation of the data structures over the old data merged with the new data to take a large amount of time. Instead, there is a growing expectation that both old and newly arriving data should be available for efficient querying as the new data arrives and without delay.
Techniques of data mining known in the art as “cluster analysis” look for relations between data elements in a set of data records. Cluster analysis has been defined as “the organization of a collection of patterns (usually represented as a vector of measurements, or a point in a multidimensional space) into clusters based on similarity.” Jain, A. K., et al., Data Clustering: A Review, ACM Computing Surveys, Vol. 31, No. 3, September 1999. Data “clustering” in the context of data mining has the objective of grouping similar data items into clusters that would be practically meaningful to the end users of a data mining system. Such methods seek to extract useful information from an essentially unordered (or randomly ordered) set of records by finding relationships between data items. As noted above, present methods of applying such techniques to large databases are costly in terms of execution time and computer resources.
Accordingly, what are needed are improved methods and systems for optimizing the organization of a relational database so as to improve the efficiency of executing search queries on the database.