1. Field of the Invention
The present invention relates to a content-addressable and searchable storage system to provide effective capabilities to access, search, explore and manage massive amounts of diverse feature-rich data.
2. Description of the Related Art
The world is moving into the age where all information is digitized and where the world is interconnected by digital means. Recent studies suggest that the volume of digital data on magnetic disks as well as the capacity of a disk have been doubling every year in the past decade. If this trend continues, the capacity of a single disk will reach 1 terabyte in 2007 and 1 petabyte by 2022. As data volume and storage capacity continue to increase exponentially, storage systems, as part of the operating system, must provide new abilities to access, search, explore, and manage massive amounts of data.
A key challenge in building next-generation storage systems is to manage massive amounts of feature-rich (non-text) data, which has dominated the increasing volume of digital information. Feature-rich data are typically sensor data such as audio, images, video, genomics, or scientific data; they are noisy and high-dimensional. Current file systems are designed for named text files, and they do not have mechanisms to manage feature-rich data.
In current systems, the user must name each file and find a place to store it, and then she must know the name in order to access it later. For example, today's digital cameras automatically generate meaningless file names for their images. These file names are difficult to remember, they often are duplicative of names of files previously downloaded from the camera, and they have no correlation with the image content. To find a specific image file, the user has to look through the image thumbnails instead of the file names.
Further, current file systems use directories to organize files. Directories emulate the management of paper files and have been helpful in managing paper-like documents. Some recent file systems attempt to provide content-based search tools, but they are limited to exact searches for text and annotations of non-text data. Manual annotation, however, is not practical for feature-rich data because such data are massive, noisy and high dimensional.
Pattern matching tools, document viewers, image thumbnail generators, and directory browsers are already integral components of modern operating systems. However, such tools are limited to exploring text documents or viewing simple images; they are not useful for exploring noisy, high-dimensional data.
The management of digital data calls for a fundamentally different paradigm. A disk in the future will store significantly more data than the amount of paper data one can handle in one's lifetime; in fact, much more data than the entire Library of Congress. A paper document is inherently tied to a physical location, but this is not true for digital data. Paper management systems force users to put a file into a fixed category, and current file systems follow a similar paradigm. In contrast, feature-rich data can be organized in multiple ways and thus have many attributes, most of which are unknown at the time the data is created.
Since searching in high dimensional spaces is a challenging problem, practical proposed search solutions such as the Google search engine have been limited to searching for exact matches—they tend to work only for text documents and text annotations. Search engines such as Google index documents by building an inverted index. A number of data structures have been devised for nearest neighbor searching such as R-Trees, k-d trees, ss-trees, and SR-trees. These are capable of supporting similarity queries, but they do not scale satisfactorily to large high-dimensional data sets. Several constructions of nearest neighbor search data structures have recently been devised in the theory community, but practical implementations of those theoretical ideas for high dimensional data do not exist yet.
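The inverted index mentioned above can be illustrated with a minimal Python sketch. It is not the implementation of any particular search engine; the function names and the whitespace tokenization are assumptions made for illustration. The sketch shows why such an index answers exact term matches well but offers nothing for similarity queries over noisy, high-dimensional data.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive whitespace tokenization (assumption)
            index[term].add(doc_id)
    return index

def exact_search(index, query):
    """Return the ids of documents containing every query term exactly."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

docs = {1: "red balloon festival", 2: "red ball on grass", 3: "blue sky"}
index = build_inverted_index(docs)
print(exact_search(index, "red ball"))  # {2}
```

A query such as "crimson ball" returns nothing here, even though document 2 is arguably relevant, because the lookup succeeds only on identical terms.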
Similarity searching on time series or sequence data has been investigated recently. Range searches and nearest neighbor searches in whole matching and subsequence matching have been the principal queries of interest for time series data. For whole matching, several techniques have been proposed to transform the time sequence to the frequency domain by using DFT (Discrete Fourier Transform) and wavelets to reduce dimensions. For subsequence matching, solutions include the I-adaptive index to solve the matching problem for searches of pre-specified lengths, the PAA (Piecewise Aggregate Approximation) technique to average values of equal-size windows of the time sequence, the APCA (Adaptive Piecewise Constant Approximation) technique to average values of variable-size windows of the time sequence, and a multi-resolution index data structure. These techniques focus on the specifics of time series and do not provide a general-purpose similarity search engine.
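As an illustration of the PAA idea described above (averaging values over equal-size windows), the following minimal Python sketch reduces a sequence to one mean per window. The function name and the requirement that the length be an exact multiple of the number of segments are simplifying assumptions, not details taken from the cited work.

```python
def paa(series, n_segments):
    """Piecewise Aggregate Approximation: the mean of each equal-size window."""
    n = len(series)
    if n == 0 or n_segments <= 0 or n % n_segments != 0:
        raise ValueError("series length must be a positive multiple of n_segments")
    width = n // n_segments
    return [sum(series[i:i + width]) / width for i in range(0, n, width)]

print(paa([1.0, 3.0, 2.0, 4.0, 10.0, 12.0], 3))  # [2.0, 3.0, 11.0]
```

The six-point sequence is reduced to three values, which is the dimension reduction that makes indexing the sequence cheaper.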
Thus, to date, there is no practical file system with the ability to do similarity searches for noisy, high-dimensional data and there is no index engine designed for efficient similarity searches.
Recently, the theory research community has made advances in areas such as compact data structures (sketches) and dimension reduction techniques. For example, a distance function on pairs of data items can be estimated by only examining the sketches of the data items. The existence of a sketch depends crucially on the function one desires to estimate. The successful construction of a small sketch as the metadata to estimate the distance between two points in high-dimensional space has significant implications on solving the efficient similarity search problem because it can provide significant savings in space and running time.
Sketching techniques for documents (represented as sets) have been developed. The construction, based on min-wise independent permutations, was used to compute compact sketches for eliminating near-duplicate documents in the Altavista search engine. Other research introduced the notion of locality-sensitive hashing, which is a family of hash functions where the collision probability is higher for objects that are closer. Such hash functions are very useful in the construction of data structures for nearest neighbor search. A variant of locality-sensitive hashing, called similarity-preserving hashing, was investigated by a co-inventor of the present invention, Moses Charikar. He developed a sketch construction for the earth mover's distance (EMD), which had been investigated and used before in the context of determining image similarity and navigating image databases. A closely related idea for sketching EMD was devised and used for image retrieval and was evaluated using exact EMD as ground truth; that is, its authors were not concerned with how well their method performed compared to perceptual similarity of images.
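The set-sketching idea above can be illustrated with a small MinHash-style example in Python. This is a sketch under stated assumptions, not the construction used in the Altavista search engine: salted BLAKE2b hashes stand in for min-wise independent permutations, and the names `minhash_sketch` and `estimate_jaccard` are hypothetical. The fraction of sketch positions on which two sets agree estimates their Jaccard similarity, so the similarity can be estimated by examining only the sketches.

```python
import hashlib

def _hash(index, item):
    """Deterministic 64-bit hash of `item` under the index-th hash function."""
    data = f"{index}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash_sketch(items, num_hashes=256):
    """For each hash function, keep the minimum hash value over the set."""
    items = set(items)
    if not items:
        raise ValueError("cannot sketch an empty set")
    return [min(_hash(i, x) for x in items) for i in range(num_hashes)]

def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of agreeing positions; estimates |A & B| / |A | B|."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same length")
    matches = sum(1 for a, b in zip(sketch_a, sketch_b) if a == b)
    return matches / len(sketch_a)

doc_a = {f"w{i}" for i in range(0, 100)}
doc_b = {f"w{i}" for i in range(50, 150)}
# The true Jaccard similarity of these two sets is 50 / 150 = 1/3.
print(estimate_jaccard(minhash_sketch(doc_a), minhash_sketch(doc_b)))
```

Each sketch holds 256 integers regardless of the size of the underlying set, which is the source of the space and time savings the text refers to.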
Many other techniques have been proposed for image similarity search. One technique may be referred to as region based image retrieval (RBIR). Most RBIR systems use a combination of color, texture, shape, and spatial information to represent a region.
In C. Carson, M. Thomas, S. Belongie, J. M. Hellerstein, and J. Malik, “Blobworld: A system for region-based image indexing and retrieval,” In Proc. of 3rd Intl. Conf. on Visual Information and Information Systems, pages 509-516 (1999), the authors describe a technique in which each region is represented by a 218-bin color histogram, mean texture contrast and anisotropy, centroid, area, eccentricity and orientation, which is a very complicated representation.
In W. Ma and B. S. Manjunath, “NETRA: A toolbox for navigating large image databases,” Multimedia Systems, 7(3):184-198 (1999), the authors describe another technique that uses a complicated region representation. It quantizes the RGB color space into 256 colors, and each region's color is represented by {(c1, p1), . . . , (cn, pn)}, where ci is the color code and pi is the fraction of that color in the region. Texture is represented by normalized mean and standard deviation of a set of Gabor wavelet transformations with different scales and directions.
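The color representation {(c1, p1), . . . , (cn, pn)} described above can be sketched in Python as follows. The quantizer is an assumption made for illustration: it maps 8-bit RGB to 256 codes with a simple 3-3-2 bit split, which is not necessarily the quantization used by NETRA, and the function names are hypothetical.

```python
from collections import Counter

def quantize_332(r, g, b):
    """Map 8-bit RGB to one of 256 codes (3 bits red, 3 bits green, 2 bits blue)."""
    return ((r >> 5) << 5) | ((g >> 5) << 2) | (b >> 6)

def region_color_descriptor(pixels):
    """Return sorted (color code, fraction of region) pairs for a list of RGB pixels."""
    counts = Counter(quantize_332(r, g, b) for r, g, b in pixels)
    total = len(pixels)
    return sorted((code, n / total) for code, n in counts.items())

region = [(255, 0, 0)] * 3 + [(0, 0, 255)]
print(region_color_descriptor(region))  # [(3, 0.25), (224, 0.75)]
```

The fractions pi sum to 1 over the region, so the descriptor is a compact color distribution rather than a full histogram.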
In J. R. Smith and S. F. Chang, “VisualSEEk: A fully automated content-based image query system,” In Proc. of ACM Multimedia'96, pages 87-98 (1996), the authors describe a technique that extracts salient color regions using a back-projection technique and supports joint color-spatial queries. A selection of 166 colors in the HSV color space is used. Each region is represented by a color set, region centroid, area, and the width and height of the minimum bounding rectangle.
In A. Natsev, R. Rastogi, and K. Shim, “WALRUS: A similarity retrieval algorithm for image databases,” In Proc. of ACM SIGMOD'99, pages 395-406 (1999), the authors describe a technique that segments each image by computing wavelet based signatures for sliding windows of various sizes and then clusters them based on the proximity of their signatures. Each region is then represented by the average signature.
In S. Ardizzoni, I. Bartolini, and M. Patella, “Windsurf: Region-based image retrieval using wavelets,” In DEXA Workshop, pages 167-173 (1999) and I. Bartolini, P. Ciaccia, and M. Patella, “A sound algorithm for region-based image retrieval using an index,” In DEXA Workshop, pages 930-934 (2000), the authors describe a technique that performs 3-level Haar wavelet transformation in the HSV color space and the wavelet coefficients of the 3rd level LL subband are used for clustering. Each region is represented by its size, centroid and corresponding covariance matrices.
In J. Z. Wang, J. Li, and G. Wiederhold, “SIMPLIcity: Semantics-sensitive integrated matching for picture libraries,” IEEE Trans. on Pattern Analysis and Machine Intelligence, 23(9):947-963 (2001), the authors describe a system that partitions an image into 4×4 blocks and computes average color and wavelet coefficients in high frequency bands.
Current region-based image similarity measures can be roughly divided into three categories: (1) independent best match; (2) one-to-one match; and (3) EMD match. Independent best match systems such as Blobworld and NETRA find the best matched region for each query region and calculate the overall similarity score using fuzzy-logic operations or a weighted sum. Since each query region is matched independently, multiple regions in the query image might be matched to the same region in a target image, which is undesirable in many cases. As an extreme example, consider an image A full of red balloons and a very different image B with a red ball in it. Since each red balloon in A matches the red ball in B very well, these two images will be considered very similar by independent best match.
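The balloon example above can be reproduced with a toy Python sketch. It assumes each region is summarized by a single mean RGB color and scores a query by averaging the distance from each query region to its closest target region. The actual systems use richer region features and fuzzy-logic or weighted scoring, so the names and the scoring rule here are illustrative assumptions only.

```python
import math

def best_match_distance(query_regions, target_regions):
    """Average, over query regions, of the distance to the closest target region."""
    total = sum(min(math.dist(q, t) for t in target_regions) for q in query_regions)
    return total / len(query_regions)

RED, GREEN, BLUE = (255, 0, 0), (0, 128, 0), (0, 0, 255)
image_a = [RED, RED, RED]     # an image full of red balloons
image_b = [RED, GREEN, BLUE]  # a red ball on grass under a blue sky

print(best_match_distance(image_a, image_b))  # 0.0
```

Every balloon matches the single red ball, so the score reports a perfect match even though two of the three regions in image B have no counterpart in image A.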
One-to-one match systems like Windsurf and WALRUS consider matching one set of regions to another set of regions and require that each region can only be matched once. For example, Windsurf uses the Hungarian Algorithm to assign regions based on region distance. Region size is then used to adjust two matching regions' similarity. Image similarity is defined as the sum of the adjusted region similarity. One-to-one match assumes good image segmentation so that there is good correspondence between two similar images' regions. But current segmentation techniques are not perfect and regions do not always correspond to objects. Moreover, it is hard to define an optimal segmentation, as one image may need different segmentations when compared to different images.
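The one-to-one constraint can be illustrated with a small Python sketch. For clarity it finds the minimum-cost assignment by brute force over permutations rather than by the Hungarian Algorithm; both yield the same optimum, but brute force is practical only for a handful of regions. Regions are again reduced to a single mean RGB color, and the size-based adjustment described above is omitted; these are simplifying assumptions.

```python
import itertools
import math

def one_to_one_distance(query_regions, target_regions):
    """Minimum total distance when each region may be matched at most once."""
    if len(query_regions) > len(target_regions):
        query_regions, target_regions = target_regions, query_regions
    best = math.inf
    indices = range(len(target_regions))
    for perm in itertools.permutations(indices, len(query_regions)):
        cost = sum(math.dist(q, target_regions[j]) for q, j in zip(query_regions, perm))
        best = min(best, cost)
    return best

RED, GREEN, BLUE = (255, 0, 0), (0, 128, 0), (0, 0, 255)
image_a = [RED, RED, RED]     # three red balloons
image_b = [RED, GREEN, BLUE]  # a red ball, grass, and sky

print(one_to_one_distance(image_a, image_b))  # greater than zero
```

Only one balloon can claim the red ball; the other two must be paired with the green and blue regions, so the two images no longer appear identical.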
EMD match systems use similarity measures based on the Earth Mover's Distance (EMD). Although EMD is a good measure for region matching, its effectiveness is closely linked to the underlying distance function used for pairs of regions as well as the weight given to each region. Since these systems directly use the region distance function as the ground distance for EMD and use normalized region size as the region weight, this creates problems such as regions being weighted inappropriately. As a result, these systems do not use EMD very well.
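To make the dependence on ground distance and weights concrete, the following Python sketch computes EMD in the simplest special case: two one-dimensional histograms of equal total weight, with the ground distance between bins taken as the number of bins apart. In that case EMD equals the sum of absolute differences of the cumulative sums. This special case is an assumption chosen for illustration; the systems discussed above use region distances as the ground distance, which in general requires solving a transportation problem.

```python
import math
from itertools import accumulate

def emd_1d(p, q):
    """EMD between equal-mass 1-D histograms with unit distance between adjacent bins."""
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    if not math.isclose(sum(p), sum(q)):
        raise ValueError("histograms must have equal total weight")
    return sum(abs(cp - cq) for cp, cq in zip(accumulate(p), accumulate(q)))

print(emd_1d([1, 0, 0], [0, 1, 0]))  # 1
print(emd_1d([1, 0, 0], [0, 0, 1]))  # 2
```

Moving the same weight twice as far doubles the result, and changing the weights assigned to the bins changes it as well, which is why inappropriate region weights or ground distances degrade an EMD-based measure.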
There are no commercial systems for automatic audio query with the complexity or capabilities desired for a general purpose search engine. Websites such as Findsounds.com rely on text-based searching of sound file names. The technology of Comparisonics Inc. (the developer of Findsounds.com) allows the colorized display of sound feature data once the sound is found by name, but the features are not used for the indexing/query. Other music websites such as Moodlogic.com combine filenames with user preference rankings to generate similarities for music recommendation. The largest and most popular available research system for audio segmentation, classification, and query is MARSYAS, developed by George Tzanetakis and Co-PI Perry Cook at Princeton University. This software is publicly available, and recent conferences such as the International Symposium on Music Information Retrieval, the Conference on Digital Audio Effects, and the International Computer Music Conference revealed that MARSYAS is now the basis of approximately 80% of the current research in music information retrieval.
Most research in audio query has focused on the music domain. Some recent research projects include identifying the passages within a song when a singing voice is present and identifying the singer in a complex recorded song. Another recent project is the WinPitch Corpus, which automatically aligns speech recordings with text files.
The closest related work to similarity searches for genomic data is work in clustering of gene expression matrices to identify related patterns. Many different clustering algorithms have been proposed for microarray analysis. The general goal of such algorithms is to find biologically relevant groupings of genes and/or experiments from microarray data. Hierarchical clustering using average or complete linkage is probably most widely applied. Self organizing maps (SOM) are another commonly used technique.
Other authors have suggested using mutual information relevance networks, clustering by simulated annealing, model-based clustering, graph-theoretic approaches, as well as other methods. A recent promising trend in clustering algorithms has been an emergence of methods that are probabilistic in nature, thus allowing one gene to be a member of more than one cluster. However, all these algorithms have one common and serious limitation: they define similarity over the whole gene expression vector, thus making it impossible to successfully apply these techniques to large diverse databases of expression information that cover thousands of experiments, with different sets of genes coexpressed in different subsets of experiments. This problem can be addressed by bi-clustering algorithms, but finding an exact solution to this problem for microarray data is NP-complete. Some approximation methods have been developed recently. These include a two-sided clustering algorithm called plaid models, a biclustering method in which low-variance submatrices of the complete data matrix are found, and a bi-graph based biclustering method. However, all these algorithms are very slow and have various limitations on bicluster size and quality. They cannot be realistically applied to databases of thousands of microarray experiments.