Signal processing architectures are among the foundational components of the modern digital age. As is common in ordinary desktop or mobile computer applications, users are given a plurality of multimedia choices when viewing, listening to, and/or interacting with data that has been processed by such systems. Before users actually utilize such data in a respective application, however, analog information is typically sampled and captured in real time via an analog-to-digital converter and processed via a Fast Fourier Transform (FFT) and/or other signal processing techniques. Sampled data is often stored in a database, whereby subsequent signal processing and/or data manipulation is performed thereon. After the data has been stored, a plurality of database algorithms or techniques, described below, may be employed to retrieve such data. Unfortunately, the form of data storage, such as a floating-point format, is not very conducive to efficient processing and retrieval of the data. Moreover, noise that may be present in any given sample of data may cause significant problems when determining whether another previously stored and/or related sample can be located in the database. For example, if a recently captured data sample were sent to a database of stored samples that are potentially related to the captured data, and the recently captured data was taken in a noisy environment, it may be substantially difficult (or not possible) to determine whether the noisy sample matches or relates to any of the previously stored samples in the database (e.g., large amounts of processing bandwidth may be required to determine a match, if any).
As noted above, many database techniques have evolved to locate and retrieve previously stored data, such as can be provided by various tree lookup procedures. For example, there are many variants of tree lookup processes that attempt to speed up basic nearest neighbor determinations. One of the earliest known is the k-d tree, which is a binary tree wherein the data is split, according to the value of a particular component, such that roughly half of the data falls on either side of the split, whereby the particular component is selected to maximize the variance of the data in a direction perpendicular to a corresponding hyperplane. In a test phase, a rectangle containing a test point is located by descending the tree, wherein backtracking (e.g., a process of retracing a search path) is performed if the closest training point in an associated hyperrectangle is such that points in adjacent rectangles may be closer. It is believed that k-d trees are somewhat limited to applications having lower dimensional structures (e.g., about 10 dimensions). In addition, the k-d tree has the property that rejection (of a point that falls farther than a threshold away from all other points in the database) can be as computationally expensive as finding the nearest neighbor.
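The k-d tree procedure described above can be illustrated with a minimal sketch. It assumes points are stored as equal-length tuples held in memory, that squared Euclidean distance is the distance measure, and that the split is made at the median of the highest-variance component; the names build_kdtree, sq_dist and nearest are hypothetical and chosen only for illustration.

```python
import random
from statistics import pvariance

def build_kdtree(points):
    """Recursively build a k-d tree from a list of equal-length tuples."""
    if not points:
        return None
    dims = len(points[0])
    # Split on the component along which the data has the largest variance.
    axis = max(range(dims), key=lambda a: pvariance([p[a] for p in points]))
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2  # median: roughly half of the data on either side
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid]),
        "right": build_kdtree(points[mid + 1:]),
    }

def sq_dist(a, b):
    """Squared Euclidean distance (an assumed distance measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(node, query, best=None):
    """Return (squared distance, point) for the stored point closest to query."""
    if node is None:
        return best
    d = sq_dist(node["point"], query)
    if best is None or d < best[0]:
        best = (d, node["point"])
    diff = query[node["axis"]] - node["point"][node["axis"]]
    if diff < 0:
        near, far = node["left"], node["right"]
    else:
        near, far = node["right"], node["left"]
    # Descend first into the side of the split that contains the query.
    best = nearest(near, query, best)
    # Backtrack: the far side can hold a closer point only if the splitting
    # hyperplane is nearer to the query than the best match found so far.
    if diff ** 2 < best[0]:
        best = nearest(far, query, best)
    return best

if __name__ == "__main__":
    random.seed(0)
    data = [tuple(random.random() for _ in range(3)) for _ in range(200)]
    tree = build_kdtree(data)
    print(nearest(tree, (0.5, 0.5, 0.5)))
```

In this sketch a query that lies far from every stored point still descends and backtracks in the same way as any other query, which is consistent with the observation that rejection can cost as much as finding the nearest neighbor.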
More recently, a variety of trees, for example the R-tree, its R* variant, and the S-S tree, have been proposed. In these trees, the nodes correspond to regions in space into which the data falls, so if a test point falls in a node, the other points in that node are known or assumed to be close to the test point. This does not obviate the need for backtracking, but it makes early rejection possible, a property that k-d trees do not have. In R-trees, the nodes are populated by rectangles. R*-trees are a variant that tends to minimize the area, margin and overlap of the rectangles (whereby the ‘margin’ of a rectangle may be defined as the sum of the lengths of its sides), which generally results in faster lookup, and that also introduces ‘forced reinsertion’ for providing a more balanced tree.
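The area, margin and overlap quantities mentioned above, together with the early-rejection test that bounding rectangles make possible, may be sketched as follows. The sketch assumes axis-aligned rectangles given as a pair of corner tuples (lo, hi) and squared Euclidean distance; all function names are hypothetical, and the margin is computed as the total length of all edges, which equals the perimeter in two dimensions.

```python
from math import prod

def area(lo, hi):
    """Area (volume in higher dimensions) of an axis-aligned rectangle."""
    return prod(h - l for l, h in zip(lo, hi))

def margin(lo, hi):
    """Sum of the lengths of all edges of the rectangle.

    A d-dimensional box has 2**(d - 1) parallel edges per dimension,
    so in two dimensions this is the ordinary perimeter.
    """
    d = len(lo)
    return 2 ** (d - 1) * sum(h - l for l, h in zip(lo, hi))

def overlap(lo_a, hi_a, lo_b, hi_b):
    """Area of the intersection of two rectangles (0 if they are disjoint)."""
    sides = [min(ha, hb) - max(la, lb)
             for la, ha, lb, hb in zip(lo_a, hi_a, lo_b, hi_b)]
    return prod(sides) if all(s > 0 for s in sides) else 0

def min_sq_dist_to_box(query, lo, hi):
    """Smallest squared distance from query to any point inside the box."""
    total = 0
    for x, l, h in zip(query, lo, hi):
        nearest_coord = min(max(x, l), h)  # clamp the query into the box
        total += (x - nearest_coord) ** 2
    return total

def can_reject(query, lo, hi, threshold):
    """Early rejection: no point in the box lies within threshold of query."""
    return min_sq_dist_to_box(query, lo, hi) > threshold ** 2
```

A lookup could call can_reject on a node before descending into it; when the test succeeds, none of the points in that node needs to be examined.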
The S-S (similarity search) tree approach may even out-perform R*-trees on high dimensional data. In this approach, leaves of the tree correspond to ellipsoids, in which a center and radius are defined by the data enclosed (generally, the principal axes of the ellipsoid are selected beforehand, and represent the relative importance of different dimensions). The center of the ellipsoid is thus the centroid of the data, wherein the radius is selected to enclose the data. Again, forced reinsertion is employed to balance the tree. Other approaches have focused on how approximate matching (that is, given a query q and some set of points P, find a point p ∈ P such that ∀p′ ∈ P, d(p,q)<(1+ε)d(p′,q), for some small ε, wherein d(p,q) is a distance measure between p and q) can yield better bounds on preprocessing and lookup times than exact matching provides. However, the lookup time scales as (1/ε)^d, wherein d is the dimension of the space, which may cause an impractical computational expense for many applications that employ higher dimensional data sets.
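The approximate matching condition and the growth of the lookup cost with dimension can be made concrete with a short sketch. It assumes Euclidean distance as the distance measure, and the function names are hypothetical. The comparison is written as non-strict (<=) rather than strict, which is an assumption made here so that a stored point at zero distance from the query still qualifies.

```python
from math import dist

def is_approx_nearest(p, q, points, eps):
    """Check that p is within a factor (1 + eps) of every stored point's
    distance to the query q, and hence of the true nearest neighbor's."""
    return all(dist(p, q) <= (1 + eps) * dist(other, q) for other in points)

def lookup_cost_factor(eps, dim):
    """Scaling term (1/eps)**dim of the lookup time, up to constant factors."""
    return (1.0 / eps) ** dim

if __name__ == "__main__":
    stored = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
    query = (0.4, 0.0)
    for candidate in stored:
        print(candidate, is_approx_nearest(candidate, query, stored, eps=1.0))
    # The same eps becomes rapidly more expensive as the dimension grows.
    for dim in (2, 10, 50):
        print(dim, lookup_cost_factor(0.5, dim))
```

Halving ε doubles the base of the exponential term, so a tighter approximation and a higher dimension both increase the cost multiplicatively.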