This invention relates generally to media identification systems, and in particular to the management of a database of reference fingerprints used by a media identification system to match unknown test samples.
Digital fingerprinting is a process that can be used to identify unknown digital media samples, such as audio or video samples. In an example media identification system, digital fingerprints are generated for each of a number of known media samples, which may be obtained from data files, broadcast programs, streaming media, or any of a variety of other media sources. Each digital fingerprint may comprise a data segment that contains characteristic information about a sample of the media from which it was generated. U.S. Pat. No. 7,516,074, which is incorporated by reference in its entirety, describes embodiments for generating characteristic digital fingerprints from a data signal.
The reference fingerprints are then stored in a database, or repository, and indexed in a way that associates the reference fingerprints with their corresponding media samples and/or metadata related to the media samples. U.S. Pat. No. 7,516,074 also discloses embodiments for indexing reference fingerprints in a database. The database of reference fingerprints can be used to identify an unknown media sample. To identify an unknown media item, a test fingerprint is generated from a sample of the media item. The test fingerprint is then matched against the database of reference fingerprints and, if a match is found, the unknown media sample is declared to be the media sample associated with the matching reference fingerprint. Various exact matching and fuzzy matching algorithms and criteria for declaring a valid match may be used.
Reference fingerprints are typically indexed in the database according to a common characteristic of the fingerprints, such as a coordinate of the fingerprint vector or some other portion of the data contained in the fingerprint. This type of indexing scheme allows for a multi-staged matching process. For example, the test fingerprint may be examined to determine a preliminary match with one or more candidate sets of reference fingerprints in the database, based on the indexing scheme. Then, each of the identified candidates is compared to the test fingerprint (e.g., bitwise) to determine if there is a match. By narrowing to a list of candidates before the more computationally intensive fingerprint comparison, this multi-staged matching process avoids the necessity of accessing each and every reference fingerprint in the database and then comparing each reference fingerprint to the test fingerprint.
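The multi-staged matching process described above can be illustrated with a minimal sketch. It is not the method of U.S. Pat. No. 7,516,074. It assumes, purely for illustration, 32-bit integer fingerprints, an index key taken from the top 8 bits of each fingerprint, and a Hamming-distance threshold as the match criterion. The names `index_key`, `build_index`, `match`, `FP_BITS`, `INDEX_BITS`, and `max_distance` are hypothetical.

```python
from collections import defaultdict

# Illustrative assumptions: 32-bit fingerprints, indexed by their top 8 bits.
FP_BITS = 32
INDEX_BITS = 8


def index_key(fp: int) -> int:
    """Return the portion of the fingerprint used for indexing (top bits)."""
    return fp >> (FP_BITS - INDEX_BITS)


def hamming(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def build_index(references: dict) -> dict:
    """Group reference fingerprints into candidate sets by index key."""
    index = defaultdict(list)
    for media_id, fp in references.items():
        index[index_key(fp)].append((media_id, fp))
    return index


def match(index: dict, test_fp: int, max_distance: int = 2):
    """Stage 1: look up the candidate set. Stage 2: bitwise comparison."""
    candidates = index.get(index_key(test_fp), [])
    best = None
    for media_id, ref_fp in candidates:
        distance = hamming(test_fp, ref_fp)
        if distance <= max_distance and (best is None or distance < best[1]):
            best = (media_id, distance)
    return best[0] if best else None


references = {"a": 0xAB000001, "b": 0xAB00FF00, "c": 0x12000001}
index = build_index(references)
result = match(index, 0xAB000003)  # compared only against "a" and "b"
```

In this sketch only the references that share the index key of the test fingerprint are compared bitwise, so reference "c" is never examined for the query shown. One limitation of this simplified key is that a test fingerprint whose index bits differ from those of its true reference will miss the candidate set entirely.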
In some applications of a media matching system, unknown media samples are matched against an expanding set of known media samples. For example, the unknown media samples may be video clips from online video sharing websites, and these may be tested against known media samples, such as broadcast programming. As the set of known media samples grows, new reference fingerprints are generated from those samples and are then added to the reference fingerprint database.
In applications where the database of reference fingerprints is very large, the database may be implemented across a number of physical and/or logical partitions, also referred to as “silos.” When the reference database comprises multiple partitions, the reference fingerprints are typically distributed across the partitions substantially evenly based on the amount of data contained in each partition. The particular algorithm for storing the reference fingerprints may depend on the source of the media samples from which the reference fingerprints are derived. When obtained from broadcast programming, for example, the reference fingerprints may be added to the partitions according to the broadcast channel from which the samples were obtained, or according to any other meta-property of the samples.
Although this algorithm might tend to balance the amount of data stored in each partition, it may not lead to an optimal situation for the intended use of the database. This is because, in practice, there is often a correlation between the meta-properties of the media samples and their popularity. For example, in a media matching system, test samples often originate from one particular source more commonly than from another. Since the indexing system would group the candidates for a test sample into partitions, this would tend to place a greater access load (e.g., read requests) on some partitions than on others. The resulting overloading of some partitions by the media matching system would likely result in suboptimal performance of the system.
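The imbalance described above can be illustrated with a small sketch. It assumes, hypothetically, that broadcast channels are identified by integers, that a channel is assigned to a partition by taking the channel number modulo the number of partitions, and that the fingerprint counts and query counts shown are invented solely for illustration. The names `partition_for` and `storage_and_read_load` are likewise hypothetical.

```python
def partition_for(channel: int, num_partitions: int) -> int:
    """Assumed placement rule: assign each channel to one partition."""
    return channel % num_partitions


def storage_and_read_load(fingerprints_per_channel: dict,
                          queries_per_channel: dict,
                          num_partitions: int):
    """Return per-partition stored fingerprint counts and read counts."""
    storage = [0] * num_partitions
    reads = [0] * num_partitions
    for channel, count in fingerprints_per_channel.items():
        storage[partition_for(channel, num_partitions)] += count
    for channel, count in queries_per_channel.items():
        reads[partition_for(channel, num_partitions)] += count
    return storage, reads


# Invented numbers: equal data per channel, but channel 0 is far more popular.
fingerprints_per_channel = {0: 1000, 1: 1000, 2: 1000, 3: 1000}
queries_per_channel = {0: 900, 1: 50, 2: 30, 3: 20}

storage, reads = storage_and_read_load(
    fingerprints_per_channel, queries_per_channel, num_partitions=2
)
print(storage)  # [2000, 2000]
print(reads)    # [930, 70]
```

With these assumed numbers the two partitions hold the same amount of data, yet one of them serves 930 of the 1000 read requests, which is the kind of skew the preceding paragraph describes.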