This invention relates to the field of database management systems. More particularly, a system and methods are provided for indexing multi-dimensional data, storing such data in a relational database management system and efficiently retrieving the data upon demand.
Various methods of managing collections of data (e.g., databases) have been developed since data was first stored in electronic form. From initial systems and applications that simply collected data in one or more flat database files to present sophisticated database management systems (DBMS), different solutions have been developed to meet different requirements. Early solutions may have had the advantage of simplicity but became obsolete for a variety of factors, such as the need to store largexe2x80x94even vastxe2x80x94quantities of data, a desire for more sophisticated search and/or retrieval techniques (e.g., based on relationships between data), the need to store different types of data (e.g., audio, visual), etc.
A database may be considered distinct from a DBMS, in that a DBMS, as the name implies, includes utilities for accessing, updating and otherwise managing or operating a database. As the amount and complexity of data stored in databases has increased, DBMS design and development efforts have increasingly focused upon the ability to organize, store and access data quickly and efficiently. As a result, today""s database management systems can be very effective in managing collections of linear information such as inventory, customer lists, etc.
With such linear (or uni-dimensional) dataxe2x80x94data that varies in value in a single dimensionxe2x80x94determining and maintaining relationships between data elements is relatively easy. For example, one value or data point (e.g., a price, a quantity) can easily be compared to another to determine which is xe2x80x9cgreaterxe2x80x9d or which ones match a particular query. The ordinal nature of linear data therefore readily lends itself to basic indexing and subsequent storage, search, retrieval and other data management operations. In particular, the appropriate point for insertion, retrieval or deletion of a data element in a database of linear data may be found with great facility by referring to a table or other data index.
In short, today""s database management systems have been designed to manage linear data very effectively. Present database management schemes are still poorly suited, however, for managing data that are multi-dimensional in nature. Geographic data, for example, may be meaningful only when expressed in at least two dimensions (e.g., latitude and longitude) and can thus be considered xe2x80x9cinherentlyxe2x80x9d multi-dimensional. Because such data can vary in value in more than one dimension, the relationship between selected geographic points is more complex and, unless a particular reference dimension or other criteria is chosen, one point cannot automatically be considered xe2x80x9cgreaterxe2x80x9d or xe2x80x9clessxe2x80x9d than another. The difficulty in expressing relations among sets of inherently multi-dimensional data makes indexing such data (e.g., for storage and retrieval) more complicated.
Closely related to inherently multi-dimensional data is multi-dimensional data that may also be termed xe2x80x9cmulti-attributexe2x80x9d in nature. Multi-attribute data may be defined as information that possesses multiple defining characteristics that are not inherently related. For example, sales data may be considered multi-attribute in nature if characteristics such as time of sale, region of sale, salesperson, product sold, type/model of product, etc. are recorded for each sale. This data becomes multi-dimensional or multi-attribute in nature when queries are made or reports are desired that specify range predicates in two or more of the data attributes (e.g., the sales made by Salesperson A during the previous month, the best time of year for selling a particular product). Unfortunately, today""s database management systems are not designed to organize this data in a manner that enhances the ability to retrieve those data items satisfying a particular multi-dimensional query pattern.
Present techniques for dealing with (e.g., indexing, storing, retrieving) multi-dimensional data often involve attempts to translate the data into a single dimension so that existing systems may be used. These techniques often fail to maintain important spatial relationships, however, thus adversely affecting the ability to respond rapidly to multi-dimensional queries. For example, linear quadtrees and Hilbert R-trees transform multi-dimensional data into a single dimension and then construct B-trees on the linearized data. Although these schemes may be adequate for two-dimensional data, linearizing data having three or more dimensions may result in an unacceptable loss of spatial relationship.
Meanwhile, the number and types of applications that use multi-dimensional and multi-attribute dataxe2x80x94such as geographic information systems (GIS) and computer-aided design and manufacturing (CAD/CAM) systemsxe2x80x94continue to grow. A GIS application may work with maps or charts in which spatial data are expressed in two or three dimensions (e.g., latitude, longitude and, possibly, altitude). Similarly, in CAD/CAM applications products such as printed circuit boards for computer systems may be designed using rectangular areas or cubic volumes. A person using one of these applications may select an area of interest that contains or intersects one or more elements. The application must be able to accurately identify those elements and allow ready modification of the data.
Multi-media applications, in which audio and visual elements are combined in one database, are another area that can benefit from efficient storage of multi-dimensional data. For example, an element of a graphical image may require multi-dimensional data for accurate representation. In particular, graphic images may be described in terms of numerous spectral characteristics (all or some of which may be inherently inter-related). Thus, in a graphical element that embodies a combination of colors (e.g., some mixture of red, green and blue pixels on a computer display) the different colors of the element may be represented in different dimensions. For example, each dimension""s value could represent the relative proportion of the corresponding color within the element or the number of pixels within the element that have that color. Accurate representation of this data would allow an application to easily identify elements that have similar coloring.
Because effective methods of organizing multi-dimensional data in a DBMS have been generally unknown, applications that use such data have been unable to reap the advantages offered by today""s database management systems, especially relational database management systems (RDBMS). Among those advantages are backup and recovery utilities, robust security, and controls allowing concurrent data access. Developers of these applications have had to rely upon other methods of indexing and storing such data.
What is needed then is a method of organizing multi-dimensional/multi-attribute data in a DBMS, particularly a relational DBMS, in order to reap the advantages of sophisticated management controls (e.g., concurrent access to the data) without sacrificing spatial relationships. Advantageously, such a DBMS would provide for efficient organization of the data to facilitate its rapid retrieval. Retrieval of the data may be enhanced by applying an effective buffering technique.
In one embodiment of the invention, a system and methods are provided for storing a hierarchical index of multi-dimensional data in a relational database management system (RDBMS).
In addition to providing for effective management of data that is inherently multi-dimensional (e.g., geographic, multi-media), this embodiment also provides for the storage and management of linear data that has multiple attributes. For example, a database of sales figures may be indexed according to attributes such as product, time, salesperson, geographic region, etc.
In an embodiment of the invention, a set of multi-dimensional/multi-attribute data items is indexed by recursively dividing the data items into smaller clusters until each cluster can be stored (i.e., indexed) in a single leaf node of a hierarchical (e.g., tree-structured) index. In this embodiment, when the set of data items or a subset thereof is too large to fit in a single leaf node, a suitable dimension/attribute in which to divide the data items is selected. The capacity of a node may be specified as a fanout characteristic of the index or may be determined by a parameter of a suitable physical storage device.
A dimension or attribute in which to divide the data may be selected on the basis of which one consists of data item values having the greatest variance or range. Alternatively, a dimension may be selected based upon an expected or specified query pattern. When a dividing dimension is selected, the data items may be sorted in that dimension and then divided into two or more subsets that contain equal or nearly equal numbers of data items. After leaf nodes are constructed for clusters of data items, intermediate and, finally, a root node may be constructed to complete the index.
In one embodiment of the invention, a hierarchical index (e.g., an R-tree index) of multi-dimensional or multi-attribute data may be stored in a database, such as a relational database management system. In this embodiment, a first object or table in the database is configured to store information concerning the index (e.g., its dimensionality, fanout) and possibly an identifier (e.g., an address or storage location, a unique node identity) of a root node of the index. A second object or table is configured to store a record or row for each node of the index. The multi-dimensional data items may be stored in one or more objects or tables, in the same or a different database.
In the second object or table, each record for an index node may consist of items such as: a unique identifier of the corresponding node, an identifier of a parent node, an identifier of a sibling node, a measure of the number of children of the node, and an entry for each child. In one embodiment of the invention, each child entry includes an identifier of the child, which may be a data item (if the node is a leaf node) or another node. Each record also includes a bounding region or area that encompasses the data item (if the node is a leaf node) or all data items that descend from the node (i.e., all data items below the node that are connected to the node through one or more intervening nodes).
In one embodiment of the invention in which a user""s query will likely match (within a range of exactitude) one of a set of known query patterns, the data items may be clustered for indexing in an appropriately corresponding manner. Thus, if one query pattern expresses a particular order of hierarchy between the dimensions/attributes of the data, the data items may be divided and clustered accordingly in order to create an index tailored to providing an efficient response to an actual query. Multiple indexes may thus be created (and stored in a database) for a given set of data items.