This invention relates to the field of database management systems. More particularly, a system and methods are provided for indexing multi-dimensional data, storing such data in a relational database management system and efficiently retrieving the data upon demand.
Various methods of managing collections of data (e.g., databases) have been developed since data was first stored in electronic form. From initial systems and applications that simply collected data in one or more flat database files to present sophisticated database management systems (DBMS), different solutions have been developed to meet different requirements. Early solutions may have had the advantage of simplicity but became obsolete for a variety of factors, such as the need to store largexe2x80x94even vastxe2x80x94quantities of data, a desire for more sophisticated search and/or retrieval techniques (e.g., based on relationships between data), the need to store different types of data (e.g., audio, visual), etc.
A database may be considered distinct from a DBMS, in that a DBMS, as the name implies, includes utilities for accessing, updating and otherwise managing or operating a database. As the amount and complexity of data stored in databases has increased, DBMS design and development efforts have increasingly focused upon the ability to organize, store and access data quickly and efficiently. As a result, today""s database management systems can be very effective in managing collections of linear information such as inventory, customer lists, etc.
With such linear (or uni-dimensional) dataxe2x80x94data that varies in value in a single dimensionxe2x80x94determining and maintaining relationships between data elements is relatively easy. For example, one value or data point (e.g., a price, a quantity) can easily be compared to another to determine which is xe2x80x9cgreaterxe2x80x9d or which ones match a particular query. The ordinal nature of linear data therefore readily lends itself to basic indexing and subsequent storage, search, retrieval and other data management operations. In particular, the appropriate point for insertion, retrieval or deletion of a data element in a database of linear data may be found with great facility by referring to a table or other data index.
In short, today""s database management systems have been designed to manage linear data very effectively. Present database management schemes are still poorly suited, however, for managing data that are multi-dimensional in nature. Geographic data, for example, may be meaningful only when expressed in at least two dimensions (e.g., latitude and longitude) and can thus be considered xe2x80x9cinherentlyxe2x80x9d multi-dimensional Because such data can vary in value in more than one dimension, the relationship between selected geographic points is more complex and, unless a particular reference dimension or other criteria is chosen, one point cannot automatically be considered xe2x80x9cgreaterxe2x80x9d or xe2x80x9clessxe2x80x9d than another. The difficulty in expressing relations among inherently multi-dimensional data makes indexing such data (e.g., for storage and retrieval) more complicated and can also complicate schemes for buffering the data as queries are processed.
Closely related to inherently multi-dimensional data are multi-dimensional data that may also be termed xe2x80x9cmulti-attributexe2x80x9d in nature. Multi-attribute data may be defined as information that possesses multiple defining characteristics that are not inherently related. For example, sales data may be considered multi-attribute in nature if characteristics such as time of sale, region of sale, salesperson, product sold, type/model of product, and so on are recorded for each sale. These data become multi-dimensional or multi-attribute in nature when queries are made or reports are desired that specify range predicates in two or more of the data attributes (e.g., the sales made by Salesperson A during the previous month, the best time of year for selling a particular product). Today""s database management systems are, unfortunately, not designed to organize this data in a manner that enhances the ability to retrieve those data items satisfying a particular multi-dimensional query pattern.
Present techniques for dealing with (e.g., indexing, storing, retrieving) multi-dimensional data often involve attempts to translate the data into a single dimension so that existing systems may be used. These techniques may fail to maintain important spatial relationships, however, thus adversely affecting the ability to respond rapidly to multi-dimensional queries. For example, linear quadtrees and Hilbert R-trees transform multi-dimensional data into a single dimension and then construct B-trees on the linearized data. Although these schemes may be adequate for two-dimensional data, linearizing data having three or more dimensions may result in an unacceptable loss of a spatial or other data relationship.
Meanwhile, the number and types of applications that use multi-dimensional and multi-attribute dataxe2x80x94such as geographic information systems (GIS) and computer-aided design and manufacturing (CAD/CAM) systemsxe2x80x94continue to grow. A GIS application may work with maps or charts in which spatial data are expressed in two or three dimensions (e.g., latitude, longitude and, possibly, altitude). Similarly, in CAD/CAM applications, products such as printed circuit boards for computer systems may be designed using rectangular areas or cubical volumes. A person using one of these applications may select an area of interest that contains or intersects one or more elements. The application must be able to accurately identify those elements and facilitate modification of the data.
Multi-media applications, in which audio and visual elements are combined in one database, are another area that can benefit from efficient storage of multi-dimensional data. For example, an element of a graphical image may require multi-dimensional data for accurate representation. In particular, graphic images may be described in terms of numerous spectral characteristics (all or some of which may be inherently inter-related). Thus, in a graphical element that embodies a combination of colors (e.g., some mixture of red, green and blue pixels on a computer display) the different colors of the element may be represented in different dimensions. Each dimension""s value could, for example, represent the relative proportion of the corresponding color within the element or the number of pixels within the element that have that color. Accurate representation of this data would allow an application to easily identify elements that have similar coloring.
Because effective methods of organizing multi-dimensional data in a DBMS have been generally unknown, applications that use such data have been unable to reap the advantages offered by today""s database management systems, especially relational database management systems (RDMBS). Among those advantages are backup and recovery utilities, robust security, and controls allowing concurrent data access. Developers of these applications have, instead, had to rely upon other methods of indexing and storing such data.
What is needed then is a method of organizing multi-dimensional/multi-attribute data in a DBMS, particularly a relational DBMS, in order to reap the advantages of sophisticated management controls (e.g., concurrent access to the data) without sacrificing spatial relationships. Advantageously, such a DBMS would provide for efficient organization of the data to facilitate rapid retrieval.
Further, an efficient method or system for facilitating retrieval of multi-dimensional/multi-attribute data is also needed. In particular, a suitable technique is required for buffering data as database queries are processed.
In addition to providing for effective management of data that are inherently multi-dimensional (e.g., geographic, multi-media), an embodiment of the invention also provides for the storage and management of linear data having multiple attributes. For example, a database of sales figures may be indexed according to attributes such as product, time, salesperson, geographic region, etc.
In addition to providing for effective management of data that is inherently multi-dimensional (e.g., geographic, multi-media), an embodiment of the invention also provides for the storage and management of linear data having multiple attributes. For example, a database of sales figures may be indexed according to attributes such as product, time, salesperson, geographic region, etc.
In one embodiment, a set of multi-dimensional/multi-attribute data items is indexed by recursively clustering the data items into smaller collections until each cluster can be stored (i.e., indexed) in a single leaf node of a hierarchical (e.g., tree-structured) index. In particular, when the set of data itemsxe2x80x94or a subset thereofxe2x80x94is too large to fit in a single leaf node, a suitable dimension/attribute by which to divide the data items is selected and the set or subset is divided accordingly. The capacity of a node may be specified as a fanout characteristic of the index or may be determined by a parameter of a suitable physical storage device (e.g., the capacity of a disk page).
The selected dimension or attribute in which to divide a data set or subset may be the one having data item values exhibiting the greatest variance or range. Alternatively, a dimension may be selected based upon an expected or specified query pattern. Illustratively, when a dividing dimension is selected, the data items are sorted in that dimension and then divided into two or more subsets that contain equal or nearly equal numbers of data items. After leaf nodes are constructed for clusters of data items, intermediate nodes and, finally, a root node may be constructed to complete the index. Each higher-level node is designed to encompass or contain its children nodes.
A hierarchical index (e.g., an R-tree index) constructed to contain multi-dimensional or multi-attribute data may be stored in a database environmentxe2x80x94such as a relational database management system. In one embodiment of the invention a first object or table in a database is configured to store information concerning the index (e.g., its dimensionality, fanout) and possibly an identifier (e.g., an address or storage location, a unique node identity) of a root node of the index. A second object or table is configured to store a record or row for each node of the index. The multi-dimensional data items may be stored in one or more objects or tables, in the same or a different database.
In the second object or table mentioned above, each record for an index node may consist of items such as: a unique identifier of the corresponding node, an identifier of a parent node, an identifier of a sibling node, a measure of the number of children of the node, and an entry for each child. In one embodiment of the invention each entry for a child node includes an identifier of the child, which may be a data item (if the node is a leaf node) or another node. Illustratively, each index record also includes a bounding region or area that encompasses the data item (if the node is a leaf node) or all data items that descend from the node (i.e., all data items below the node that are connected to the node through one or more intervening nodes).
In an embodiment of the invention in which a user""s query will likely match one of a set of known query patterns (possibly within a range of accuracy), the data items may be clustered for indexing in an appropriate and corresponding manner. Thus, if a possible query pattern expresses a particular order or hierarchy between the dimensions/attributes of the data, the data items may be divided and clustered accordingly in order to create an index tailored to facilitating an efficient response to an actual query. Multiple indexes may thus be created (and stored in a database) for a given set of data items, with each index reflecting a different scheme of clustering.
In one or more embodiments of the invention techniques are provided for buffering index nodes during operations (e.g., additions or deletions of a data item, queries, reorganizations) on multi-dimensional and/or linear data. A first buffering technique buffers nodes in the order of the size of their bounding regions or areas. A node""s bounding region may be stored, with an identifier of the node, in an entry in its parent node. A root node""s bounding region is considered to encompass the entire set of multi-dimensional data items indexed within the hierarchical structure. This technique is particularly well suited for use when the data operations are distributed relatively uniformly across the dataspace.
A second buffering technique involves the collection of statistics concerning the frequency and/or recency with which nodes and/or data items are accessed (e.g., during queries and updates). Nodes may then be buffered based on the frequency or recency with which they have been accessed or the frequency or recency with which data items within their bounding regions have been accessed. The nodes that are buffered may change over time as additional operations are conducted. This second technique is well suited to environments in which data operations are not uniformly distributed within the dataspace.
In one embodiment of the invention index nodes are first buffered according to the size of their bounding regions. Then, after a meaningful number of operations have been performed, nodes are buffered according to the frequency or recency of access.
A third, generic, technique of buffering uses a combination of two or more of the node region, recency of access and frequency of access criteria to determine which buffered node is to be ejected from a buffer in favor of a newly accessed node. In one embodiment of the invention implementing this technique, all three of these criteria are applied. In this embodiment, two buffered nodes are compared at a time to determine which provides less caching or buffering benefit in accordance with the selected criteria. By successively comparing the losing node of one comparison with another buffered node, eventually the node providing the least caching benefit is identified.
In this embodiment, the node region criteria are measured by determining the difference between the two nodes"" bounding regions; this difference may be divided by one of the two bounding regions in order to normalize this factor. The recency of access criteria is measured by calculating the difference in time between the last accesses of the two nodes; this difference may be normalized by dividing it by the size of the buffer or cache. The frequency of access criteria is measured by determining the difference between the number of times each node was accessed during a set of data operations. The values resulting from the three criteria are then accumulated, possibly after multiplying one or more of them by weighting factors. If the result is greater than a certain threshold (e.g., zero), then one node is considered to have less caching benefit and may therefore be ejected; otherwise, the other node will be ejected.