Databases are repositories of information stored on a storage medium from which the data can later be retrieved. Databases are found today in nearly every large institution, such as a bank or a large corporation, and can store many millions of records of information, which can be accessed by a user. A record in a database is a clustered set of fields of information, such as an employee's name, address, and serial number, among others. With the volume of these databases ever increasing, accessing the information quickly and effectively, i.e., accurately retrieving all the desired information, continues to be an area of interest.
If the database contains millions of records of information, a query (or request) for all the employees with the identical last name would likely produce multiple records, and require a search of all of the individual records of the database. Clearly, if the database were searched sequentially (in other words, from the first record to the last record), then an unduly long length of time would pass before a complete answer to a user's query was obtained, consolidated, and returned to the requesting user. Slowdowns in the database's response time due to long information access times are not tolerable in any database system, especially those wherein large amounts of data are continuously and simultaneously accessed and retrieved. Consequently, the database arts have developed many schemes to quickly find the qualifying records such that an entire database scan can be avoided. Database indexes are such schemes, used to find a fast path to answer queries on one or more of a record's fields.
Many databases have indexes for the most frequently requested information contained therein. In other words, each such field of information for each record is indexed. An index can be described as a list of distinct attribute values, each associated with pointers (i.e., record ids). Typically the list of values is organized as a B-Tree or as a Hash Table. For instance, all the employees' last names would be indexed such that, when a user query specified the retrieval of all the records with the last name of Smith, the index would be quickly searched for that name, and the pointers would point to the pages in the database in which those matching records resided. Thus by indexing a database's individual fields, queries on the indexed fields enjoy a fast response time.
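The contrast between a sequential scan and an index lookup can be sketched as follows. This is a minimal illustration only: the employee records and the function names `sequential_scan` and `build_index` are hypothetical, and a Python dictionary stands in for the Hash Table organization mentioned above.

```python
from collections import defaultdict

# Hypothetical employee records: (record_id, last_name, address).
records = [
    (0, "Smith", "12 Oak St"),
    (1, "Jones", "4 Elm Ave"),
    (2, "Smith", "9 Pine Rd"),
    (3, "Lee", "77 Main St"),
]

def sequential_scan(records, last_name):
    """Examine every record from first to last; cost grows with database size."""
    return [rid for rid, name, _ in records if name == last_name]

def build_index(records):
    """Map each distinct last name to the ids of the records that hold it."""
    index = defaultdict(list)
    for rid, name, _ in records:
        index[name].append(rid)
    return index

index = build_index(records)

# Both paths return the same record ids; the index avoids touching every record.
print(sequential_scan(records, "Smith"))
print(index.get("Smith", []))
```

The index is built once and reused across queries, which is why it pays off for frequently requested fields.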
With the advent of the multi-media environment, wherein images, audio, and video components are stored in a database, multi-media object retrieval has become an area of increasing development in the arts. Current technology permits one to generate, scan, transmit, store, and manipulate large numbers of multi-media objects. In practice, these objects are typically accessed based on indexed textual associations or captions, such as Tree or Bird, which is quite useful. However, such captions do not permit the flexibility to search on unanticipated features which are not part of the text caption, and are therefore inadequate to serve as an indexing mechanism for dynamic applications, wherein the basis of the search request is unanticipated. Many multi-media applications have the additional requirement that the database be able to select all objects which are like (or similar to) some other object. In image databases, there is the need to search through millions of images using non-text based features such as layout, texture, color, other images, and the like. Similarly, in an audio collection, it is important to search for the music scores which have some similarity characteristics to other music scores. Other examples of similarity queries on time-sequence data are to discover stocks with similar movement in stock prices, and to find all past seismological/meteorological patterns that are similar to patterns from other years' data sets for use in analysis and forecasting.
In order for an object to be retrieved, the object may first be characterized by its features (e.g., for images, color, shape, etc.) before being committed to storage within the database. These individual object attributes, when taken altogether, form a feature vector. Feature extraction techniques for accomplishing this task are known in the arts, especially in database image storage and retrieval. The feature set must admit some similarity measure, and must also be one that can form the basis of an index into the image collection. If the similarity measure is based on, for example, the properties of color, texture and shape, a query based on these would be an approximate (or similarity) query.
Many new and emerging applications require that databases be enhanced specifically with the capability to process similarity queries. A similarity search of an image database would typically be a query by a user requesting all the image objects that are similar to a given picture. In database terminology, this is often referred to as a Query-By-Example (QBE). In other words, the database would have to be capable of retrieving images which are similar to the given image based on previously extracted features.
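A similarity query of this kind can be sketched as a range search over previously extracted feature vectors. The sketch below rests on assumptions not stated in the text: Euclidean distance is taken as the similarity measure, the three-component feature vectors are made up, and `query_by_example` and `epsilon` are hypothetical names. It scans every object, which is exactly the cost an index is meant to avoid.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def query_by_example(features, example, epsilon):
    """Return ids of all objects whose features lie within epsilon of the example."""
    return sorted(
        oid for oid, vec in features.items() if euclidean(vec, example) <= epsilon
    )

# Hypothetical, previously extracted feature vectors (e.g., coarse color features).
features = {
    "sunset": (0.9, 0.4, 0.1),
    "beach": (0.8, 0.5, 0.2),
    "forest": (0.1, 0.8, 0.2),
}

# Features of the example image supplied by the user.
result = query_by_example(features, (0.85, 0.45, 0.15), 0.2)
print(result)
```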
If an image database contains 1 million (or more) objects having approximately 100 features each, the dimensionality creates problems with conventional indexing techniques. Because of the high dimensionality of the feature vectors, it would be enormously expensive in terms of memory and access time to search this database for images with similar features. As such, reducing the dimensionality of the index is critical in the image database art.
A plurality of database indexing methods for multi-dimensional space have been developed in the arts. The prevailing ones can be grouped roughly into three categories. The first is the R*-trees and the rest of the R-tree and k-d-B tree family. The R-tree based methods seem to be more robust for higher dimensions, provided that the fan-out of the R-tree nodes remains greater than two. The second is linear quad-trees, wherein the effort is proportional to the hyper-surface of the query region. The hyper-surface is known to grow exponentially with the dimensionality. The third is grid-files, which often require a directory that grows exponentially with the dimensionality. Typically, these schemes work well for low dimensionalities (2-d and 3-d spaces). However, the response time of most methods explodes exponentially for high dimensionalities, making sequential scanning more efficient.
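The exponential growth of a grid directory can be illustrated with simple arithmetic. The sketch assumes a uniform grid in which every axis is split into the same number of intervals; actual grid-file implementations partition adaptively, so the function `grid_directory_cells` is only a hypothetical model of the trend.

```python
def grid_directory_cells(partitions_per_axis, dimensions):
    """Number of cells in a uniform grid: one factor per dimension."""
    return partitions_per_axis ** dimensions

# Even the coarsest useful split (two intervals per axis) explodes quickly.
for d in (2, 3, 10, 100):
    print(d, grid_directory_cells(2, d))
```

At 2 or 3 dimensions the directory has a handful of cells; at 100 dimensions it has far more cells than the database has objects.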
Thus, one problem in this art is approximate matching in a database for objects with many features, such as images decomposed into multiple features. Looking up high-dimensional objects in a database is slow because, as indicated, it is hard to build good indexes with large dimensionality: most methods reduce quickly to sequential scanning, which consumes considerable time over large sets of high-dimensional feature vectors. If one tries to index on a subset of the attributes (ignoring, for instance, the last half of each data vector), one typically ends up with a very inefficient index, which retrieves far too many false positives.
In addition, a dimension reduction technique must guarantee completeness. In other words, a retrieval in answer to a user's query must at least guarantee that all objects that answer the query have been retrieved. If some of the objects that could answer the query cannot be retrieved in answer to the user's query, then the database's response has not been complete. This is an intolerable situation in the database arts. Thus, completeness is a minimum criterion for any dimensionality reduction technique.
Attempts have been made in this art to solve the indexing problem in high-dimensional spaces. For example, Friedman et al., "An Algorithm for Finding Nearest Neighbors", IEEE Trans. on Computers (TOC), October 1975, Vol. C-24, pp. 1000-1006, discloses a method which truncates the feature vectors. Since the vectors have a high dimensionality, the assumption is that it is permissible to categorically ignore as many of the feature components of the vector as is needed to achieve the desired level of performance and dimensionality. This approach corresponds to a "projection" of the multi-dimensional points on the feature-hyperplane containing the axis, wherein searching on these projections produces multiple additional false positives. The number of false hits increases quickly with the number of dimensions, eventually reducing the method to a simple and slow sequential scan. Thus, in a feature space containing tens, hundreds, or more dimensions, the overall performance suffers as a result.
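The trade-off between completeness and false positives under truncation can be sketched as a filter-and-refine search. This is not the algorithm of the cited reference; it is a minimal illustration assuming Euclidean distance, with hypothetical names (`filter_and_refine`, `k`, `epsilon`) and made-up vectors. Dropping components can only shrink a Euclidean distance, so the filter step never dismisses a true answer, but it may admit objects that differ only in the ignored components.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def filter_and_refine(vectors, query, epsilon, k):
    """Range query using only the first k components as a filter."""
    # Filter: the truncated distance never exceeds the full distance,
    # so every true answer survives this step (completeness).
    candidates = [
        oid for oid, v in vectors.items()
        if euclidean(v[:k], query[:k]) <= epsilon
    ]
    # Refine: check the full vectors to discard false positives.
    answers = [oid for oid in candidates if euclidean(vectors[oid], query) <= epsilon]
    false_positives = [oid for oid in candidates if oid not in answers]
    return candidates, answers, false_positives

# Hypothetical 4-dimensional feature vectors.
vectors = {
    "a": (1.0, 1.0, 1.0, 1.0),
    "b": (1.0, 1.0, 5.0, 5.0),  # matches the query only in the kept components
    "c": (4.0, 4.0, 1.0, 1.0),
}
query = (1.0, 1.0, 1.0, 1.0)

candidates, answers, false_positives = filter_and_refine(vectors, query, 0.5, 2)
print(candidates, answers, false_positives)
```

Here object "b" passes the filter and must be discarded in the refine step; with many ignored dimensions such false positives can come to dominate the candidate set.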
Another work, Faloutsos and Jagadish, "Diamond Tree: An Index Structure for High-Dimensionality Approximate Searching", Technical Report SRC-TR-92-97, Univ. of Maryland, October 1992, mentions that high-dimensional spaces appear often in practice and describes a method to handle points in such a space under the assumption that the feature vectors are sparse, i.e., the majority of the feature vector's entries are zero. By limiting itself to the non-zero entries only, this method reduces the dimensionality problem to a level where traditional database indexing methods (e.g., R-trees) can be used.
Yet another work, Hou et al., "A Content-Based Indexing Technique Using Relative Geometry Features", SPIE Image Storage and Retrieval Systems (1992), Vol. 1662, pp. 59-68, discloses a content-based indexing technique which is based on the theory of weighted center-of-mass. The method, which is domain-specific to medical images, considers the four most important feature vectors out of the original feature vectors as significant, in order to reduce the dimensionality of the index.
Therefore, it is a problem in this art to reduce the dimensionality of a high-dimensional database, while at the same time never missing any objects and reducing false positives when doing a similarity search.