It has become commonplace to use computer systems to facilitate searches of large collections of content. As content collections have become larger, and the types of content in the collections have become richer and more varied, search facility designers are facing a growing array of problems. For example, larger collections of content tend to take longer to search, and attempts to reduce search time can reduce search accuracy. Similarly, it can take longer to search through collections of more complex content types, and attempts to reduce search time in this respect can also lower search accuracy. Conventional search facility implementations have shortcomings with respect to such problems.
For some content types, such as images, one approach has been to characterize pieces of content with sets of content descriptors. The content descriptor sets may be designed to enable fast search with relatively low loss of accuracy with respect to content features in which users of the search facility are interested. For example, a piece of content may be characterized with a set of feature vectors in a vector space, and distance in the vector space may be used as a basis to cluster and index the vectors and, ultimately, the content. Vector spaces with a relatively high number of dimensions (e.g., 64-dimensional and 128-dimensional vector spaces are not uncommon) may enable fine discernment with respect to features of interest. However, conventional fast search of higher dimensional spaces (e.g., aided by various indexing structures) can incur a relatively high rate of error, such as “false positive” matches, which can be harmful to search accuracy.
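By way of illustration only, the use of distance in a vector space as a basis for matching content may be sketched as follows. The sketch assumes Euclidean distance and an exhaustive comparison, neither of which is specified above; the names `euclidean`, `nearest_content`, and `descriptors`, and the two-dimensional sample data, are hypothetical and chosen for brevity.

```python
import math

def euclidean(a, b):
    # Euclidean distance between two feature vectors (assumed metric).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_content(query, descriptors):
    """Return (content_id, distance) for the closest feature vector.

    descriptors maps a content id to that content's set of feature vectors.
    This is an exhaustive scan, with no indexing structure involved.
    """
    best_id, best_dist = None, math.inf
    for content_id, vectors in descriptors.items():
        for v in vectors:
            d = euclidean(query, v)
            if d < best_dist:
                best_id, best_dist = content_id, d
    return best_id, best_dist

# Hypothetical 2-dimensional descriptors; practical spaces may have
# 64 or 128 dimensions.
descriptors = {
    "image_a": [[0.0, 0.0], [1.0, 0.0]],
    "image_b": [[5.0, 5.0], [6.0, 5.0]],
}
match_id, match_dist = nearest_content([5.5, 5.0], descriptors)
```

An exhaustive scan of this kind compares the query against every feature vector, which is why indexing structures may be introduced to reduce search time for larger collections.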
One conventional indexing structure is an index tree built using hierarchical k-means clustering. The feature vectors characterizing the collection of content may be clustered into sufficiently many clusters so that individual clusters may be searched rapidly. These “lowest level” clusters may themselves be characterized by vectors in the vector space, for example, by determining a mean or center vector for each cluster. These vectors may then be clustered in turn to form a next layer of the indexing hierarchy, and so on until there is a single cluster that may serve as a root node of the index tree. However, conventional building procedures for the index tree can be relatively taxing on computational resources. Shortages of high quality computational resources, such as high speed random access memory, can result in inconvenient and even prohibitive index tree build times. The size of content collections and/or associated content descriptor sets can become large enough that a shortage of high quality computational resources is of practical concern.
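By way of illustration only, the bottom-up building procedure described above may be sketched as follows. This is a minimal sketch under stated assumptions rather than a definitive implementation: it assumes a basic Lloyd-style k-means with a fixed iteration count, squared Euclidean distance, and a greedy descent at search time. The names `Node`, `kmeans`, `build_index_tree`, `search`, and `collect_items`, and the parameters `leaf_size` and `branch`, are hypothetical. The sketch holds all vectors in memory, which reflects the kind of resource demand noted above.

```python
import random

def mean(vectors):
    # Component-wise mean (center) of a non-empty list of vectors.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iters=10, seed=0):
    """Basic Lloyd-style k-means; returns non-empty clusters as index lists."""
    rng = random.Random(seed)
    centers = rng.sample(vectors, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for i, v in enumerate(vectors):
            j = min(range(k), key=lambda c: sq_dist(v, centers[c]))
            groups[j].append(i)
        # Recompute each center; an empty cluster keeps its previous center.
        centers = [mean([vectors[i] for i in g]) if g else centers[j]
                   for j, g in enumerate(groups)]
    return [g for g in groups if g]

class Node:
    def __init__(self, center, children=None, items=None):
        self.center = center            # mean vector characterizing the cluster
        self.children = children or []  # child nodes (internal nodes only)
        self.items = items or []        # indices of feature vectors (leaves only)

def build_index_tree(vectors, leaf_size=2, branch=2):
    # Lowest level: cluster the raw feature vectors into small clusters.
    k = max(1, len(vectors) // leaf_size)
    level = [Node(mean([vectors[i] for i in g]), items=g)
             for g in kmeans(vectors, k)]
    # Cluster the cluster centers in turn until a single root remains.
    while len(level) > 1:
        centers = [n.center for n in level]
        k = max(1, len(level) // branch)
        level = [Node(mean([centers[i] for i in g]),
                      children=[level[i] for i in g])
                 for g in kmeans(centers, k)]
    return level[0]

def search(root, query, vectors):
    # Greedy descent: follow the closest center at each layer, then scan
    # the leaf. The result is approximate and can miss the true nearest vector.
    node = root
    while node.children:
        node = min(node.children, key=lambda c: sq_dist(query, c.center))
    return min(node.items, key=lambda i: sq_dist(query, vectors[i]))

def collect_items(node):
    # All feature-vector indices reachable from a node.
    if not node.children:
        return list(node.items)
    return [i for child in node.children for i in collect_items(child)]

vectors = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
root = build_index_tree(vectors)
hit = search(root, [10.2, 10.1], vectors)
```

Because the descent commits to one branch at each layer, a search of this kind may examine only a small fraction of the vectors, at the cost of possibly returning a vector other than the true nearest one.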
The same numbers are used throughout the disclosure and figures to reference like components and features, but such repetition of numbers is for purposes of simplicity of explanation and understanding, and should not be viewed as a limitation on the various embodiments.